
[Python applet] De-duplicating files in 8 lines of code

2022-01-31 04:49:55 Dream, killer

「This is day 8 of my participation in the November writing challenge. For details of the event, see: 2021 Last Writing Challenge」.

Requirements

Last week I was suddenly handed a task: export the data between year XX and year XX from the XX website. Each exported file is named after its date. After the export I noticed that some files had the same size but different names. Opening a few of them showed that files with different names had identical contents. The cause is not clear yet; my guess is that the problem is on the website's side. In the end, only about 30% of the data turned out to be free of duplicates. Ugh!

Enough said. The first job was to pick out the files that have no duplicates, or equivalently to delete the duplicate files. With hundreds of files, deleting them one by one would probably have meant working overtime. Then it occurred to me that Python has a built-in filecmp module that can compare files. Without further ado, let's get to it.

Writing the code

All exported files are saved in the same folder and share the same format. So I looked up the usage of filecmp.cmp() in the official documentation. It can be summarized as follows:

filecmp.cmp(f1, f2, shallow=True)

  • f1/f2: the paths of the two files to compare.
  • shallow: defaults to True. In that case, two files whose os.stat() signatures (file type, size, and modification time) are identical are taken to be equal without their contents being read. When set to False, the contents of the files are compared as well.
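The difference between the two shallow settings can be seen with two files that have the same size but different contents. The following is a minimal sketch, not part of the original post: the function name demo_shallow and the file names a.txt and b.txt are made up for illustration, and the timestamps are forced to the same value with os.utime() so that the os.stat() signatures match.

```python
import filecmp
import os
import tempfile
from pathlib import Path


def demo_shallow() -> tuple:
    """Return (shallow_result, deep_result) for two same-size files with different contents."""
    with tempfile.TemporaryDirectory() as tmp:
        a = Path(tmp) / "a.txt"  # hypothetical file names
        b = Path(tmp) / "b.txt"
        a.write_text("12345")
        b.write_text("54321")  # same size as a.txt, different contents
        # Give both files the same access/modification time so that
        # file type, size and modification time are all identical.
        for p in (a, b):
            os.utime(p, (1_000_000_000, 1_000_000_000))
        shallow_result = filecmp.cmp(a, b, shallow=True)   # compares the stat signatures only
        deep_result = filecmp.cmp(a, b, shallow=False)     # reads and compares the contents
        return shallow_result, deep_result


if __name__ == "__main__":
    print(demo_shallow())
```

With matching signatures the shallow comparison reports the files as equal even though their contents differ, which is why shallow=False is the safer choice before deleting anything.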

To guard against bugs in the code, I created a test folder and manually created 6 files in it: files 1 to 5 contain only the corresponding digit (1, 2, 3, 4, 5), and the 6th is an empty file. Then I made a copy of every file.
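The same test folder can also be generated with a short script instead of by hand. This is only a sketch: the function name make_test_folder, the file names 1.txt to 6.txt, and the _copy suffix are assumptions for illustration, since the post does not say what the files were called.

```python
from pathlib import Path
import shutil
import tempfile


def make_test_folder(root: Path) -> list:
    """Create 6 small files plus one copy of each inside root (hypothetical names)."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for n in range(1, 7):
        original = root / f"{n}.txt"
        # Files 1-5 contain their own number; file 6 is left empty.
        original.write_text(str(n) if n <= 5 else "")
        duplicate = root / f"{n}_copy.txt"
        shutil.copy(original, duplicate)  # byte-for-byte copy of the original
        created += [original, duplicate]
    return created


if __name__ == "__main__":
    # Build the folder in a temporary location so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        files = make_test_folder(Path(tmp) / "test")
        print(len(files), "files created")
```

Building the folder in a temporary directory keeps the experiment away from real data; pointing root at a permanent path works the same way.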

Test code

from pathlib import Path
import filecmp

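# Collect the files directly inside the test folder (subfolders are skipped)
path_list = [path for path in Path(r'C:\Users\pc\Desktop\test').iterdir() if path.is_file()]

# Compare each file with every file that comes after it in the list
for front in range(len(path_list) - 1):
    for later in range(front + 1, len(path_list)):
        if filecmp.cmp(path_list[front], path_list[later], shallow=False):
            path_list[front].unlink()    # Delete the earlier file; its duplicate stays
            break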

The overall logic of the code is very simple. First, it gets "all the files" in the target folder. Here "all the files" means the first-level file paths under the test directory: if test contains subfolders, the files inside them are not collected. And because of the path.is_file() filter, path_list contains only the paths of files (txt, xlsx, csv, zip, etc.). A nested loop then compares the contents of the files at each pair of paths, and if they are the same, the earlier file of the pair is deleted.
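Because unlink() deletes files permanently, it can be reassuring to see what would be removed before running the real thing. The sketch below reuses the same nested loop but only collects the paths instead of deleting them. The function name find_duplicates is an assumption for illustration, and the sorted() call is an addition so that the result does not depend on the order in which iterdir() happens to return the files.

```python
from pathlib import Path
import filecmp


def find_duplicates(folder: Path) -> list:
    """Return the files the deletion loop would remove, without deleting anything."""
    # Only first-level files are considered, as in the original script.
    path_list = sorted(p for p in folder.iterdir() if p.is_file())
    to_delete = []
    for front in range(len(path_list) - 1):
        for later in range(front + 1, len(path_list)):
            # shallow=False forces a comparison of the file contents.
            if filecmp.cmp(path_list[front], path_list[later], shallow=False):
                to_delete.append(path_list[front])  # a later identical file exists
                break
    return to_delete
```

Once the returned list looks right, calling unlink() on each path in it has the same effect as the original script.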

It is not much code, but it really does cut down on the time spent on manual processing. OK, that's a wrap!


That's all I wanted to share today. Search WeChat for "Python New horizons" to learn more useful things every day.

copyright notice
Author: [Dream, killer]. Please include the original link when reprinting. Thank you.
https://en.pythonmana.com/2022/01/202201310449527453.html
