
Several Implementations of Python Crawler Data Deduplication

2022-02-01 12:01:42 Salted fish Science


Crawler deduplication scenarios

1. Avoiding duplicate requests

2. Avoiding the storage of duplicate data

The basic principle of crawler data deduplication

Given a judgment basis and a dedup container, each piece of raw data is checked in turn: if its judgment basis is not yet in the container, the basis is added to the container and the data is marked as non-duplicate; if it is already there, nothing is added and the data is marked as a duplicate.

  • Judgment basis (the raw data itself, or a feature value derived from it): how do we decide that two pieces of data are duplicates?
  • Dedup container: stores the judgment bases of the data seen so far.
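The flow described above can be sketched in Python. This is a minimal illustration, assuming the raw item itself serves as the judgment basis and an in-memory set as the container (the class and method names here are hypothetical):

```python
# Minimal sketch of the dedup flow: check the judgment basis against the
# container; if absent, record it and mark the item as non-duplicate.
class DedupFilter:
    def __init__(self):
        self._seen = set()  # the dedup container

    def is_duplicate(self, item):
        """Return True if item was seen before; otherwise record it."""
        basis = item  # judgment basis: raw data (could be a feature value instead)
        if basis in self._seen:
            return True
        self._seen.add(basis)
        return False

f = DedupFilter()
print(f.is_duplicate("http://example.com/a"))  # False: first time seen
print(f.is_duplicate("http://example.com/a"))  # True: duplicate
```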

The dedup container can be:

Based on the raw data itself

Based on a feature value of the raw data, which takes up far less space

Temporary and persistent dedup containers

1. Temporary dedup container

Uses a programming-language data structure such as a list or set to store the dedup data; once the program exits or restarts, the data in the container is lost.

Advantage: simple to use and implement;

Disadvantage: cannot be shared across processes and cannot be persisted.

2. Persistent dedup container

Uses a database such as redis or mysql to store the dedup data.

Advantages: persistent and shareable;

Disadvantage: more complex to use and implement.
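As a self-contained sketch of a persistent container, the example below uses the stdlib sqlite3 module in place of the redis/mysql mentioned above (the class name is hypothetical); a PRIMARY KEY column plays the role of the set, so a failed INSERT signals a duplicate:

```python
# Persistent dedup container sketch using sqlite3 (stdlib) instead of
# redis/mysql. Pass a file path instead of ":memory:" to survive restarts.
import sqlite3

class PersistentDedup:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (fingerprint TEXT PRIMARY KEY)"
        )

    def is_duplicate(self, fingerprint):
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(
                    "INSERT INTO seen (fingerprint) VALUES (?)", (fingerprint,)
                )
            return False  # insert succeeded: first time seen
        except sqlite3.IntegrityError:
            return True   # primary-key conflict: already stored

d = PersistentDedup()
print(d.is_duplicate("abc123"))  # False
print(d.is_duplicate("abc123"))  # True
```

With a real redis backend, the same idea is usually expressed with a set: adding a member reports whether it was new.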

Commonly used ways to compute a feature value from raw data

1. Information-digest (hash) algorithms (fingerprints)
2. The SimHash algorithm, for near-duplicate (fuzzy) text
3. Bloom filters, for deduplication at the scale of hundreds of millions of records

Deduplication based on information-digest algorithms

An information-digest (hash) algorithm maps text or byte data of any length to a fixed-length digest, e.g. MD5 (128 bits) or SHA1 (160 bits).

Property: different source texts produce, with overwhelming probability, different digests.

Digest: the algorithm is mainly used to check whether two information sources are identical, because any change in the source changes the resulting digest; and since the result is usually much shorter than the source, it is called a "digest".

Therefore, using an information-digest algorithm greatly reduces the storage space used by the dedup container and speeds up lookups; and because collisions are extremely rare, misjudgments almost never occur.

Note: the result of a hash algorithm is essentially a number. MD5's 128 bits refer to the binary length; written in hexadecimal it is 32 characters long, since one hexadecimal digit encodes four binary bits.
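A minimal fingerprint-based dedup sketch using hashlib.md5: the container stores 32-character hex digests instead of the full pages (the sample pages below are made up for illustration):

```python
# Fingerprinting with an information-digest algorithm: hashlib.md5 maps input
# of any length to a fixed 128-bit digest (32 hex characters), so the dedup
# container stores short fingerprints rather than whole documents.
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

seen = set()
for page in ["<html>page A</html>", "<html>page B</html>", "<html>page A</html>"]:
    fp = fingerprint(page)
    if fp in seen:
        print("duplicate:", fp)
    else:
        seen.add(fp)

print(len(fingerprint("any length of input")))  # 32 hex chars = 128 bits
```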

Deduplication based on the SimHash algorithm

SimHash is a locality-sensitive hashing algorithm that can deduplicate similar (near-duplicate) text content.

Information-digest algorithms: if the original contents differ by even a single byte, the resulting signatures are likely to be completely different.

SimHash: if the original contents differ by only a single byte, the resulting signatures differ only slightly.

Comparing SimHash values: the number of differing binary bits between two simhash values reflects how much the original texts differ. That count of differing bits is called the Hamming distance.

Note

SimHash is best suited to long text (roughly 500+ words); on short text the results can deviate significantly.

According to the data in Google's paper, for 64-bit simhash values, a Hamming distance of 3 or less indicates that two documents are similar or duplicates. This is only a reference value; your own application may call for a different threshold.
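A toy 64-bit simhash for illustration, assuming whitespace tokenization and equal token weights (real implementations weight tokens, e.g. by frequency); the Hamming distance is computed by counting differing bits:

```python
# Toy simhash: hash each token to 64 bits, accumulate +1/-1 per bit position,
# then take the sign of each position to form the final signature.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    v = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two simhash values."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("completely unrelated text about databases")
print("identical:", hamming(a, a))  # 0
print("near-duplicate distance:", hamming(a, b))
print("unrelated distance:", hamming(a, c))
```

Because only one token differs between the first two sentences, their signatures should be much closer than those of unrelated texts.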

Copyright notice
Author: Salted fish Science. Please include a link to the original when reprinting, thank you.
