
Python office automation - 90 - File automation management - Cleaning up duplicate files and batch renaming files

2022-05-15 05:08:03 · Husky eager for power

Man'yōshū
A faint clap of thunder, clouded skies.
But should the wind and rain come, I could keep you here.

Preface
About the author: a Husky who yearns for power; you can call me Siberian Husky. A blogger dedicated to explaining technical knowledge in plain language.
CSDN certified blog expert, Nova Plan Season 3 full-stack track top 1, Huawei Cloud sharing expert, Alibaba Cloud expert blogger.
If anything in this article is wrong, please point it out! Let's learn and improve together.
Life motto: there is nothing noble in being superior to some other man; true nobility is being superior to one's former self.
If you find the blogger's articles helpful, please follow, like, and bookmark to support the blogger.



Today we'll learn how to use Python to clean up duplicate files. Without further ado, let's get straight to the point.

Cleaning up duplicate files

Known conditions:

None; we only know we are dealing with files.


Implementation method:

Starting from a specified path (or the top-level path), read each folder with glob; for each file, record its name and contents. Before each check, test whether a file with the same name has already been read; if it has, and the contents are identical, treat the file as a duplicate and delete it.

The code example is as follows:

# coding:utf-8

import glob
import os.path

data = {}       # empty dict that temporarily maps each file name to its contents


def clear(path):
    result = glob.glob(path)    # expand the incoming path pattern

    for _data in result:        # check each match: folder or file?
        if glob.os.path.isdir(_data):   # folder: recurse into it with clear()
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:                           # file: extract the file name
            name = glob.os.path.split(_data)[-1]

            if 'zip' in name:   # the test path contains ".zip" archives; skip them here,
                continue        # otherwise reading them as text raises an error

            f = open(_data, 'r')    # read the contents before checking the file name
            content = f.read()      # assign the contents to content

            if name in data:        # the name was seen before: check for a duplicate
                _content_dict = data[name]

                if _content_dict == content:    # same name and same contents: delete it
                    print('File "{}" deleted...'.format(_data))     # debug output
                    os.remove(_data)
            else:
                data[name] = content


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
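As an aside, the same traversal can be written with os.walk, which visits every folder for us and removes the need for explicit recursion. Below is a minimal sketch of the same name-plus-contents check, called with the directory itself rather than a glob pattern; it is an alternative, not the script used in this article, and the clear_walk name is made up here:

import os

def clear_walk(path):
    seen = {}   # file name -> contents of the first copy encountered
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            if 'zip' in name:               # skip archives, as in the script above
                continue
            full = os.path.join(dirpath, name)
            with open(full, 'r') as f:
                content = f.read()
            if seen.get(name) == content:   # same name, same contents: duplicate
                print('File "{}" deleted...'.format(full))
                os.remove(full)
            else:
                seen.setdefault(name, content)  # remember only the first copy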

PS: one point deserves real caution here: if path is set to the root directory, the results will be spectacularly bad. It is strongly recommended to create a dedicated folder just for testing. (Stepping on this landmine hurt me...)
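A small guard can make that mistake harder to commit. The sketch below (the safe_clear wrapper and its dry_run flag are illustrative additions, not part of the original script) refuses obviously broad paths and defaults to printing instead of deleting:

import os

def safe_clear(path, dry_run=True):
    target = os.path.abspath(path)
    # refuse the filesystem root and the home directory outright
    if target in (os.path.abspath(os.sep), os.path.expanduser('~')):
        raise ValueError('Refusing to clean overly broad path: {!r}'.format(target))
    if dry_run:
        print('Dry run only, nothing deleted:', target)
    else:
        clear(os.path.join(target, '*'))    # the clear() defined above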


Optimizing duplicate-file cleanup - handling same-named files with different contents under different paths

There is a case worth thinking about here: the same file name may exist under different folders, but with different contents. The script above remembers only one copy of the contents per file name, so deleting same-named files that way is a sloppy operation.

That brings us to the next round of script optimization.

First, let's look at what the value of data should be in this situation: data = {'name': {'path/name': 'content', 'path2/name': 'content'}}

  • name is the file name found under the incoming root path
  • path/name is the secondary path, i.e. the file's actual location
  • content is the file's contents

Checking the secondary paths stored under each name is the rigorous way to delete duplicate content.
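To make the structure concrete, here is a tiny illustration with made-up paths and contents (hypothetical values, purely for demonstration):

# hypothetical paths and contents, purely for illustration
data = {
    'notes.txt': {
        'test_file/a/notes.txt': 'hello',
        'test_file/b/notes.txt': 'world',   # same name, different contents: kept
    }
}

# a newly read file counts as a duplicate only if its contents match
# one of the copies already stored under the same file name
is_duplicate = 'hello' in data['notes.txt'].values()    # True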


The sample code is as follows:

# coding:utf-8

import glob
import os.path

data = {}       # empty dict; maps file name -> {full path: contents}


def clear(path):
    result = glob.glob(path)    # expand the incoming path pattern

    for _data in result:        # check each match: folder or file?
        if glob.os.path.isdir(_data):   # folder: recurse into it with clear()
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:                           # file: extract the file name
            name = glob.os.path.split(_data)[-1]

            if 'zip' in name:   # skip the ".zip" archives in the test path,
                continue        # otherwise reading them as text raises an error

            f = open(_data, 'r')    # read the contents before checking the file name
            content = f.read()

            if name in data:    # the name was seen before: compare against every stored copy;
                                # if it was not seen, store the path and contents in data
                sub_name = data[name]   # the {path: contents} mapping for this name

                is_delete = False   # records whether this file was deleted; if not,
                                    # its path still has to be added to data

                for k, v in sub_name.items():    # k is a stored path, v is its contents
                    print('Secondary path "{}", {}, contents \'{}\''.format(k, name, v))   # debug output
                    if v == content:             # same name and same contents: delete it
                        print('File "{}" deleted...'.format(_data))     # debug output
                        os.remove(_data)
                        is_delete = True         # mark the file as deleted

                if not is_delete:       # not a duplicate: store this copy as well
                    data[name][_data] = content
            else:
                data[name] = {
                    _data: content
                }


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
    print(data)


Optimizing duplicate-file cleanup - using the hashlib module to avoid storing large file contents

Now there is another problem. As the debug output shows, the files under the test_file path are small, so the script runs without trouble. But imagine large files: reading them in full and storing the contents in the dictionary could easily exhaust memory, so storing raw contents directly is clearly inappropriate.

The fix is the hashlib module we studied earlier: digest the contents into an MD5 value, which is just a short string, and as long as the contents do not change, the MD5 value does not change either. (This technique is also commonly used for file integrity checks in operations work.)

So the only part of the code that needs to change is what we store for content.

The code example is as follows:

# coding:utf-8

import glob
import hashlib
import os.path

data = {}       # empty dict; maps file name -> {full path: MD5 digest}
# data = {'name': {'path/name': 'md5', 'path2/name': 'md5'}}


def clear(path):
    result = glob.glob(path)    # expand the incoming path pattern

    for _data in result:        # check each match: folder or file?
        if glob.os.path.isdir(_data):   # folder: recurse into it with clear()
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:                           # file: extract the file name
            name = glob.os.path.split(_data)[-1]

            if 'zip' in name:   # skip the ".zip" archives in the test path,
                continue        # otherwise reading them as text raises an error

            f = open(_data, 'r')    # read the contents before checking the file name
            content = f.read()

            hash_content_obj = hashlib.md5(content.encode('utf-8'))    # MD5 object for the contents
            hash_content = hash_content_obj.hexdigest()                # its hex digest string; from
                                                                       # here on, data stores hash_content

            if name in data:    # the name was seen before: compare against every stored digest;
                                # if it was not seen, store the path and digest in data
                sub_name = data[name]   # the {path: digest} mapping for this name

                is_delete = False   # records whether this file was deleted; if not,
                                    # its path still has to be added to data

                for k, v in sub_name.items():    # k is a stored path, v is its digest
                    print('Secondary path "{}", {}, digest \'{}\''.format(k, name, v))   # debug output
                    if v == hash_content:        # same name and same digest: delete it
                        print('File "{}" deleted...'.format(_data))     # debug output
                        os.remove(_data)
                        is_delete = True         # mark the file as deleted

                if not is_delete:       # not a duplicate: store this copy's digest as well
                    data[name][_data] = hash_content
            else:
                data[name] = {
                    _data: hash_content
                }


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
    print(data)
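One caveat: the script above still calls f.read() on the whole file before hashing it, so a single huge file is still loaded into memory at once. If that is the concern, MD5 can be fed in chunks; here is a minimal sketch (the file_md5 helper is an illustration, not part of the article's script):

import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        # feed MD5 one chunk at a time so memory use stays bounded
        # no matter how large the file is
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()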


Optimizing duplicate-file cleanup - handling the read error from unreadable "zip" files

Above, when we hit the unreadable "zip" archives, we simply skipped them with continue. Calling them "unreadable" was actually not very rigorous, because they can be read in binary mode. But in the script above, the content that gets read is then passed through encode(), so switching to the binary 'rb' reading mode still needs some rework.

The sample code is as follows:

# coding:utf-8

import glob
import hashlib
import os.path

# data = {'name': {'path/name': 'md5', 'path2/name': 'md5'}}
data = {}       # empty dict; maps file name -> {full path: MD5 digest}


def clear(path):
    result = glob.glob(path)    # expand the incoming path pattern

    for _data in result:        # check each match: folder or file?
        if glob.os.path.isdir(_data):   # folder: recurse into it with clear()
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:                           # file: extract the file name
            name = glob.os.path.split(_data)[-1]

            is_byte = False     # switch for byte-mode reads (text files with Chinese also
                                # need an explicit encoding, and files now get closed)

            if 'zip' in name:   # ".zip" archives cannot be read as text,
                is_byte = True  # so read them as bytes instead
                f = open(_data, 'rb')
            else:
                f = open(_data, 'r', encoding='utf-8')  # read contents before checking the name
            content = f.read()
            f.close()

            if is_byte:
                hash_content_obj = hashlib.md5(content)    # bytes go straight into MD5
            else:
                hash_content_obj = hashlib.md5(content.encode('utf-8'))

            hash_content = hash_content_obj.hexdigest()    # hex digest string;
                                                           # data stores hash_content

            if name in data:    # the name was seen before: compare against every stored digest;
                                # if it was not seen, store the path and digest in data
                sub_name = data[name]   # the {path: digest} mapping for this name

                is_delete = False   # records whether this file was deleted; if not,
                                    # its path still has to be added to data

                for k, v in sub_name.items():    # k is a stored path, v is its digest
                    print('Secondary path "{}", {}, digest \'{}\''.format(k, name, v))   # debug output
                    if v == hash_content:        # same name and same digest: delete it
                        print('File "{}" deleted...'.format(_data))     # debug output
                        os.remove(_data)
                        is_delete = True         # mark the file as deleted

                if not is_delete:       # not a duplicate: store this copy's digest as well
                    data[name][_data] = hash_content
            else:
                data[name] = {
                    _data: hash_content
                }


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)

    for name, sub in data.items():
        for _path, digest in sub.items():
            print('File path "{}", digest \'{}\''.format(_path, digest))
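Incidentally, since MD5 operates on bytes anyway, another way to sidestep the text-versus-binary branch is to open every file in 'rb' and hash the raw bytes directly; then zip archives need no special case at all. A condensed sketch of that variant (an alternative, not the author's script):

import glob
import hashlib
import os.path

data = {}

def clear(path):
    for _data in glob.glob(path):
        if glob.os.path.isdir(_data):
            clear(glob.os.path.join(_data, '*'))
        else:
            name = glob.os.path.split(_data)[-1]
            with open(_data, 'rb') as f:        # bytes work for text and zip alike
                hash_content = hashlib.md5(f.read()).hexdigest()
            sub = data.setdefault(name, {})
            if hash_content in sub.values():    # same name, same digest: duplicate
                print('File "{}" deleted...'.format(_data))
                os.remove(_data)
            else:
                sub[_data] = hash_content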


Batch renaming files

This one is also very simple; it still uses the shutil and glob modules we studied recently (refer to the file search and recursion in the previous chapter).

Known conditions:

We know the specified string in the file names that needs to be modified (that is, which file names to change).

Implementation method:

Loop over the files, and add or modify the specified target string in the file names.

The code example is as follows:

# coding:utf-8


import glob
import os.path
import shutil


'''
Use a for loop plus recursion: glob reads everything under test_file,
and enumerate supplies an index for each entry of every directory.
'''

def filename_update(path):
    result = glob.glob(path)

    for index, data in enumerate(result):   # enumerate: recurse into folders,
        if glob.os.path.isdir(data):        # prefix files with their index
            _path = glob.os.path.join(data, '*')
            filename_update(_path)
        else:
            path_list = glob.os.path.split(data)
            name = path_list[-1]
            new_name = '{}_{}'.format(index, name)      # e.g. "report.txt" -> "3_report.txt"
            new_data = glob.os.path.join(path_list[0], new_name)
            shutil.move(data, new_data)                 # rename by moving within the folder


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    filename_update(path)


You may have noticed that the test_file folder itself did not get a "0_*"-style prefix. That is not a missing index: the script only renames files, and folders are filtered out by the isdir() branch.
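The known conditions at the start of this part mentioned renaming files that contain a specified string, while the script above prefixes every file with an index. A variant that touches only matching names could look like the minimal sketch below (rename_matching, old and new are illustrative names, not part of the original script):

import glob
import shutil

def rename_matching(path, old, new):
    for data in glob.glob(path):
        if glob.os.path.isdir(data):            # recurse into folders as before
            rename_matching(glob.os.path.join(data, '*'), old, new)
        else:
            head, name = glob.os.path.split(data)
            if old in name:                     # only touch names containing `old`
                shutil.move(data, glob.os.path.join(head, name.replace(old, new)))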

Copyright notice
Author: Husky eager for power. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/131/202205111302321995.html
