Analysis of several implementations of Python crawler data deduplication
2022-02-01 12:01:42 【Salted fish Science】
Crawler deduplication scenarios
1. Preventing duplicate requests
2. Preventing duplicate data from being stored
The basic principle of crawler data deduplication
Given a judgment basis and a deduplication container, check each piece of original data in turn to see whether its judgment basis is already in the container. If it is not, add the judgment basis to the container and mark the data as non-duplicate. If it is, add nothing and mark the data as duplicate.
- The judgment basis (the original data itself, or a feature value computed from it): how do you decide that two pieces of data are duplicates?
- The deduplication container: stores the judgment basis of every piece of data seen so far
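The principle above can be sketched in a few lines of Python. This is only an illustration: the function name mark_duplicates, the key parameter (the judgment basis) and the example URLs are assumptions made for this sketch, not something the original article defines.

```python
from typing import Callable, Hashable, Iterable, Iterator, Tuple


def mark_duplicates(
    items: Iterable[str],
    key: Callable[[str], Hashable],
    seen: set,
) -> Iterator[Tuple[str, bool]]:
    """Yield (item, is_duplicate) for each item, updating `seen` as it goes."""
    for item in items:
        basis = key(item)      # judgment basis: the raw data or a value derived from it
        if basis in seen:
            yield item, True   # basis already in the container: duplicate
        else:
            seen.add(basis)    # first occurrence: remember its basis
            yield item, False


# Hypothetical example data; here the raw URL itself is the judgment basis.
seen_urls = set()
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
results = list(mark_duplicates(urls, key=lambda u: u, seen=seen_urls))
```

Passing a different key function (for example a hash of the page body) changes what counts as "the same data" without touching the loop.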
What the deduplication container stores
The original data itself
Feature values of the original data, which take up much less space
Temporary and persistent deduplication containers
1. Temporary deduplication container
Uses an in-memory data structure of the programming language, such as list or set, to store the deduplication data. Once the program exits or restarts, the data in the container is lost.
Advantage: simple to use and implement.
Disadvantage: cannot be shared and cannot be persisted.
2. Persistent deduplication container
Uses a database such as Redis or MySQL to store the deduplication data.
Advantage: persistent and shareable.
Disadvantage: more complex to use and implement.
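To show what a persistent container looks like without requiring a running server, the sketch below uses SQLite from the Python standard library as a stand-in for the Redis or MySQL options named above. The class name SqliteDedupContainer, the table name seen and the method add_if_new are hypothetical names chosen for this example.

```python
import sqlite3


class SqliteDedupContainer:
    """Deduplication container backed by a SQLite table (illustrative stand-in)."""

    def __init__(self, path: str = ":memory:") -> None:
        # ":memory:" keeps the demo self-contained; pass a file path instead
        # to keep the stored data across program restarts.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (basis TEXT PRIMARY KEY)"
        )

    def add_if_new(self, basis: str) -> bool:
        """Store `basis`; return True if it was new, False if it was a duplicate."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (basis) VALUES (?)", (basis,)
        )
        self.conn.commit()
        # One row changed means the insert happened; zero means it was ignored.
        return cur.rowcount == 1


container = SqliteDedupContainer()
first = container.add_if_new("https://example.com/a")
second = container.add_if_new("https://example.com/a")
```

The primary key does the duplicate check inside the database, so several crawler processes pointing at the same file would see each other's entries, which is the "share" advantage in miniature.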
Commonly used feature-value computations
1. Message digest (hash) algorithms, i.e. fingerprints
2. The SimHash algorithm, for fuzzy (near-duplicate) text
3. Bloom filters, for deduplicating data at the scale of hundreds of millions of items
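The Bloom filter is only named in this list, so here is a minimal sketch of the idea. The class BloomFilter, its default sizes, and the choice of salted MD5 to derive bit positions are all assumptions made for illustration; a production crawler would more likely use a tuned library or a Redis-backed bitmap.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # The item may have been added only if every one of its bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add("https://example.com/a")
```

The filter stores only bits, never the items themselves, which is why it scales to very large data sets. The price is that a lookup can occasionally report an item as seen when it was not.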
Deduplication based on message digest algorithms
A message digest (hash) algorithm turns text or byte data of any length into a fixed-length value. Examples are MD5 (128 bits) and SHA1 (160 bits).
Feature: if the source differs, the computed result (the digest) is for practical purposes different as well. Collisions are possible in principle, and neither MD5 nor SHA1 resists deliberately constructed collisions, but accidental collisions are extremely unlikely.
Digest: the algorithm is mainly used to check whether two sources are identical, because any change in the source almost certainly changes the result. The result is usually much shorter than the source, hence the name "digest".
Using a message digest algorithm therefore greatly reduces the storage space the deduplication container needs and speeds up the check, and because digests are so close to unique, false matches almost never occur.
Note: the result of a hash algorithm is essentially a number. MD5's 128 bits is its length in binary; written in hexadecimal it is 32 digits, because one hexadecimal digit represents four binary bits.
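A short sketch with the standard hashlib module makes the lengths concrete. The helper name fingerprint and the sample string are assumptions for this example.

```python
import hashlib


def fingerprint(text: str, algorithm: str = "md5") -> str:
    """Return the hexadecimal digest of `text` under the given hashlib algorithm."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


md5_hex = fingerprint("hello world")
sha1_hex = fingerprint("hello world", "sha1")

# Each hexadecimal digit carries 4 bits, so length * 4 gives the size in bits.
print(len(md5_hex), len(md5_hex) * 4)    # 32 digits, 128 bits
print(len(sha1_hex), len(sha1_hex) * 4)  # 40 digits, 160 bits
```

Storing these fixed-length strings in the deduplication container, instead of whole pages or long URLs, is what keeps the container small.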
Deduplication based on the SimHash algorithm
SimHash is a locality-sensitive hashing algorithm that makes it possible to deduplicate similar text content.
Message digest algorithm: if two inputs differ by only one byte, the resulting signatures are very likely to be completely different.
SimHash: if two inputs differ by only one byte, the resulting signatures differ very little.
Comparing SimHash values: the number of bit positions in which two SimHash values differ reflects how different the original texts are. This number of differing bits is called the Hamming distance.
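The following is a simplified SimHash sketch, not the author's code and not the API of any particular SimHash package. It assumes tokens are words matched by a regular expression, every token has weight 1, and each token is hashed with MD5 truncated to the fingerprint width (so bits must be a multiple of 8 and at most 128). Real implementations usually weight tokens and use n-grams.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute an unweighted SimHash fingerprint over the word tokens of `text`."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 for every bit it sets and -1 for every bit it clears.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the votes for that bit are positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
distance = hamming_distance(simhash(doc_a), simhash(doc_b))
```

Because every token contributes a vote to every bit, changing one token shifts only some of the per-bit totals, which is why similar texts tend to end up with fingerprints that are close in Hamming distance.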
Note
SimHash is better suited to long text (500+ words); for short text the results may deviate considerably.
According to the data in Google's paper, two documents with 64-bit SimHash values can be considered similar or duplicate when their Hamming distance is at most 3. This is only a reference value; the right threshold for your own application may differ and should be tested.
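Applying the threshold could look like the sketch below. The function name find_near_duplicate and the linear scan over stored fingerprints are assumptions made to keep the example short; with many fingerprints, an index over fingerprint segments is the usual next step.

```python
from typing import Iterable, Optional


def find_near_duplicate(
    fp: int, stored: Iterable[int], max_distance: int = 3
) -> Optional[int]:
    """Return the first stored fingerprint within `max_distance` bits of `fp`."""
    for other in stored:
        # XOR leaves a 1 in every position where the two fingerprints differ.
        if bin(fp ^ other).count("1") <= max_distance:
            return other
    return None


# Hypothetical fingerprints, written in hex so the bit differences are easy to see.
stored_fingerprints = [0x0000, 0xFFFF]
match = find_near_duplicate(0xFFFE, stored_fingerprints)    # 1 bit from 0xFFFF
no_match = find_near_duplicate(0xFFF0, stored_fingerprints)  # 4 bits from 0xFFFF
```

Keeping max_distance as a parameter makes it easy to try other thresholds against your own data, as the note above recommends.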
copyright notice
author[Salted fish Science],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202011201393543.html