Python office automation (90) - file automation management: cleaning up duplicate files and batch renaming files
2022-05-15, by Husky eager for power
From the Man'yōshū (Wanyeji):
"Faint thunder, clouded skies; but I hope the wind and rain will come, so that I may keep you here."
Preface:
About the author: a blogger known as "Husky eager for power" (you can call me Siberian Husky), dedicated to explaining technical topics in plain language. CSDN certified Blog Expert, top 1 in the full-stack track of Nova Plan Season 3, Huawei Cloud Sharing Expert, Alibaba Cloud Expert Blogger.
If anything in this article is wrong, please correct me! Let's learn and improve together.
Life motto: being superior to some other man is not nobility; true nobility is being superior to one's former self.
If you find this article helpful, please follow, like, and bookmark to support the blogger.


List of articles
- Cleaning up duplicate files
- Optimization: handling files with the same name but different contents under different paths
- Optimization: using the hashlib module to handle overly large files
- Optimization: handling the error when reading unreadable ".zip" files
- Batch renaming files
Today, let's learn how to use Python to clean up duplicate files. Without further ado, let's get straight to it.
Cleaning up duplicate files

Known conditions:
- None, other than that the targets are files.

Approach:
Start from a specified path (or the top-level path), use `glob` to walk every folder and read every file, recording each file's name and content. Before processing each file, check whether a file with the same name has already been seen; if it has, and the contents are identical, treat it as a duplicate and delete it.

The code example is as follows:
```python
# coding:utf-8
import glob
import os.path

data = {}  # temporary store: file name -> file content


def clear(path):
    result = glob.glob(path)  # expand the pattern into a list of paths
    for _data in result:
        if glob.os.path.isdir(_data):  # folder: recurse into it
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:  # file: extract the bare file name
            name = glob.os.path.split(_data)[-1]
            if 'zip' in name:  # our test path contains ".zip" files; reading
                continue       # them as text raises an error, so skip them
            f = open(_data, 'r')  # read the content before comparing names
            content = f.read()
            f.close()
            if name in data:  # name seen before: compare contents
                _content = data[name]
                if _content == content:
                    print('File "{}" deleted...'.format(_data))  # debug output
                    os.remove(_data)
            else:
                data[name] = content


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
```
PS: A word of caution here: if `path` is the root directory, you will get wildly unintended results. It is strongly recommended to create a dedicated test folder. (I stepped on this landmine myself, and it hurt...)
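To guard against that pitfall, a small safety check can be run before calling `clear()`. This helper is not part of the original script; the blacklist of forbidden paths is illustrative and should be adapted to your environment:

```python
import os


def safe_clear(path):
    """Refuse to clean a filesystem root or the home directory.

    The forbidden set below is only a sketch; extend it for your own setup.
    """
    abspath = os.path.abspath(path)
    forbidden = {os.path.abspath(os.sep), os.path.expanduser('~')}
    if abspath in forbidden:
        raise ValueError('Refusing to clean {!r}: path is too dangerous'.format(abspath))
    return abspath  # safe to hand to clear()
```

Calling it first and passing the returned absolute path onward costs one line and can save an entire home directory.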
The operation results are as follows:

Optimization: handling files with the same name but different contents under different paths

A question worth considering: the same file name may exist under different folders with different contents. The script above keeps only one content per file name, so deleting files on a name match alone is not a rigorous operation. This leads to our next round of script optimization.

Let's first look at what `data` should actually hold:

data = {'name': {'path/name': 'content', 'path2/name': 'content'}}

- `name` is the bare file name under the root path we pass in
- `path/name` is the secondary path, i.e. the file's full path
- `content` is the file's content

With this structure, a duplicate is deleted only when a same-named file under another path also has identical content, which is the reasonable behavior. The sample code is as follows:
```python
# coding:utf-8
import glob
import os.path

# structure: data = {'name': {'path/name': 'content', 'path2/name': 'content'}}
data = {}


def clear(path):
    result = glob.glob(path)
    for _data in result:
        if glob.os.path.isdir(_data):  # folder: recurse into it
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:  # file: extract the bare file name
            name = glob.os.path.split(_data)[-1]
            if 'zip' in name:  # skip ".zip" files; reading them as text errors out
                continue
            f = open(_data, 'r')
            content = f.read()
            f.close()
            if name in data:  # name seen before: compare against every stored path
                sub_name = data[name]  # {path: content} for this name
                is_delete = False  # if nothing was deleted, store this path too
                for k, v in sub_name.items():  # k is the path, v the content
                    print('Secondary path "{}",'.format(k), name, 'content \'{}\''.format(v))  # debug
                    if v == content:  # same name and same content: delete
                        print('File "{}" deleted...'.format(_data))  # debug
                        os.remove(_data)
                        is_delete = True
                if not is_delete:
                    data[name][_data] = content
            else:
                data[name] = {_data: content}


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
    print(data)
```
The operation results are as follows:

Optimization: using the hashlib module to handle overly large files

There is still another problem. As the debug printout shows, the files under the `test_file` path are small, so the script runs without trouble. But imagine large files: once their full contents are read and stored in the dictionary, memory could easily run out. Storing raw content is clearly inappropriate.

The solution uses the encryption material covered earlier: hash the content with the `hashlib` module into an md5 digest. The digest is just a short string, and as long as the original content is unchanged, its md5 value does not change (this trick is also commonly used for file-integrity checks in operations and maintenance environments).
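A quick check (not from the original post) illustrates that stability: hashing the same input always yields the same 32-character hex digest, no matter how large the input is.

```python
import hashlib

# md5 of the same bytes is always the same short hex string
digest = hashlib.md5('hello'.encode('utf-8')).hexdigest()
print(digest)  # → 5d41402abc4b2a76b9719d911017c592
```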
So the place in the code where `content` is read needs to change: the content must be hashed before it is stored. The code example is as follows:
```python
# coding:utf-8
import glob
import hashlib
import os.path

# structure: data = {'name': {'path/name': 'md5 digest', ...}}
data = {}


def clear(path):
    result = glob.glob(path)
    for _data in result:
        if glob.os.path.isdir(_data):  # folder: recurse into it
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:  # file: extract the bare file name
            name = glob.os.path.split(_data)[-1]
            if 'zip' in name:  # skip ".zip" files; reading them as text errors out
                continue
            f = open(_data, 'r')
            content = f.read()
            f.close()
            # hash the content; data now stores the digest instead of raw content
            hash_content_obj = hashlib.md5(content.encode('utf-8'))
            hash_content = hash_content_obj.hexdigest()  # 32-char hex string
            if name in data:  # name seen before: compare digests per path
                sub_name = data[name]  # {path: digest} for this name
                is_delete = False
                for k, v in sub_name.items():  # k is the path, v the digest
                    print('Secondary path "{}",'.format(k), name, 'digest \'{}\''.format(v))  # debug
                    if v == hash_content:  # same name, same digest: delete
                        print('File "{}" deleted...'.format(_data))  # debug
                        os.remove(_data)
                        is_delete = True
                if not is_delete:
                    data[name][_data] = hash_content
            else:
                data[name] = {_data: hash_content}


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
    print(data)
```
The operation results are as follows:

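One caveat worth noting: the script above still reads each file fully into memory before hashing, so the digest only saves space in the dictionary, not during the read. For genuinely large files, hashing in fixed-size chunks keeps memory bounded. A minimal sketch (the helper name and chunk size are this sketch's own choices, not the author's):

```python
import hashlib


def file_md5(path, chunk_size=8192):
    """Compute a file's md5 without loading the whole file into memory."""
    h = hashlib.md5()
    with open(path, 'rb') as f:  # binary mode works for text and zip alike
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)  # feed the hash one chunk at a time
    return h.hexdigest()
```

Reading in binary mode here also sidesteps the ".zip" problem entirely.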
Optimization: handling the error when reading unreadable ".zip" files

Above, we skipped the unreadable ".zip" archives with `continue`. Calling them "unreadable" was actually not rigorous: they can be read in binary mode. But since the script above `encode`s everything it reads, switching to binary reading with `rb` still requires some adjustments. The sample code is as follows:
```python
# coding:utf-8
import glob
import hashlib
import os.path

# structure: data = {'name': {'path/name': 'md5 digest', ...}}
data = {}


def clear(path):
    result = glob.glob(path)
    for _data in result:
        if glob.os.path.isdir(_data):  # folder: recurse into it
            _path = glob.os.path.join(_data, '*')
            clear(_path)
        else:  # file: extract the bare file name
            name = glob.os.path.split(_data)[-1]
            is_byte = False  # switch for byte-mode reading
            if 'zip' in name:  # ".zip" files must be read in binary mode
                is_byte = True
                f = open(_data, 'rb')
            else:
                # text files containing Chinese also need an explicit encoding
                f = open(_data, 'r', encoding='utf-8')
            content = f.read()
            f.close()  # always close the opened file
            if is_byte:
                hash_content_obj = hashlib.md5(content)  # bytes: hash directly
            else:
                hash_content_obj = hashlib.md5(content.encode('utf-8'))
            hash_content = hash_content_obj.hexdigest()  # 32-char hex string
            # from here on, data stores hash_content instead of raw content
            if name in data:  # name seen before: compare digests per path
                sub_name = data[name]  # {path: digest} for this name
                is_delete = False
                for k, v in sub_name.items():  # k is the path, v the digest
                    print('Secondary path "{}",'.format(k), name, 'digest \'{}\''.format(v))  # debug
                    if v == hash_content:  # same name, same digest: delete
                        print('File "{}" deleted...'.format(_data))  # debug
                        os.remove(_data)
                        is_delete = True
                if not is_delete:
                    data[name][_data] = hash_content
            else:
                data[name] = {_data: hash_content}


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    clear(path)
    for name, sub in data.items():
        for _path, _digest in sub.items():
            print('File path "{}",'.format(_path), 'digest \'{}\''.format(_digest))
```
The operation results are as follows:

Batch renaming files

This is also very simple. We again use the `shutil` and `glob` modules covered recently (see the file search and recursion material in the previous chapter).

Known conditions:
- We know the string contained in the file names to be modified (i.e., which files need renaming).

Approach:
Loop over the files and prepend or modify the specified target string in each file name.

The code example is as follows:
```python
# coding:utf-8
import glob
import os.path
import shutil

'''
Using a for loop and recursion, read everything under test_file via glob,
then prefix each file name with its index in the loop.
'''


def filename_update(path):
    result = glob.glob(path)
    for index, data in enumerate(result):  # recurse into folders, rename files
        if glob.os.path.isdir(data):
            _path = glob.os.path.join(data, '*')
            filename_update(_path)
        else:
            path_list = glob.os.path.split(data)
            name = path_list[-1]
            new_name = '{}_{}'.format(index, name)  # prefix the loop index
            new_data = glob.os.path.join(path_list[0], new_name)
            shutil.move(data, new_data)  # rename by moving in place


if __name__ == '__main__':
    path = glob.os.path.join(glob.os.getcwd(), 'test_file')
    filename_update(path)
```
The operation results are as follows:

You may have noticed that the folders under "test_file" were not given a "0_*"-style index prefix. That is not because they lack an index: the script renames only files, and folders are filtered out (they are recursed into instead).
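Because `shutil.move` renames files irreversibly, it can help to preview the new names first. This dry-run helper mirrors the index-prefix scheme above but only returns the planned (old, new) pairs; the function name and the sorting are additions of this sketch, not part of the original script:

```python
import glob
import os


def preview_renames(path):
    """Return (old, new) pairs the rename loop would produce, without moving anything."""
    pairs = []
    # sort for a deterministic order; glob alone does not guarantee one
    for index, entry in enumerate(sorted(glob.glob(os.path.join(path, '*')))):
        if os.path.isdir(entry):
            pairs.extend(preview_renames(entry))  # recurse into subfolders
        else:
            head, name = os.path.split(entry)
            pairs.append((entry, os.path.join(head, '{}_{}'.format(index, name))))
    return pairs
```

Once the preview looks right, the same pairs can be fed to `shutil.move`.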
Copyright notice
Author: Husky eager for power. Please include the original link when reprinting, thank you.
https://en.pythonmana.com/2022/131/202205111302321995.html