current position:Home>Python and Bloom filters

Python and Bloom filters

2022-02-01 17:14:50 Cheng Xuyuan Xiaozhuang

This is my participation 11 The fourth of the yuegengwen challenge 30 God , Check out the activity details :2021 One last more challenge

What is a bloon filter

In essence, the bloom filter is a data structure , A more ingenious probabilistic data structure (probabilistic data structure), Features efficient insertion and query , It can be used to tell you “ Something must not exist or may exist ”.

Compared to traditional List、Set、Map And so on , It's more efficient 、 Take up less space , But the disadvantage is that the returned result is probabilistic , Not exactly .

The bloom filter can be used to retrieve whether an element is in a collection . Its advantage is that the space efficiency and query time are far more than the general algorithm , The disadvantage is that it has certain error recognition rate and deletion difficulty .

Realization principle

HashMap The problem of

Before we talk about the principle of Bloom filter , Let's think about , Usually you judge whether an element exists by what ? Many people should answer HashMap Well , You can actually map values to HashMap Of Key, Then you can go to O(1) Return results within the time complexity of , Extremely efficient . however HashMap The implementation also has disadvantages , For example, the proportion of storage capacity is high , Considering the existence of load factor , Usually space cannot be used up , And once you are worth a lot, such as hundreds of millions , that HashMap The amount of memory occupied becomes considerable .

For example, your data set is stored on a remote server , Local services accept input , The data set is too large to be read into memory at one time HashMap When , There will be problems .

Data structure of Bloom filter

The bloom filter is a bit Vector or bit Array , Long like this :

img

If we want to map a value to a bloom filter , We need to use Multiple different hash functions Generate ** Multiple hash values ,** And for each generated hash value that points to bit Location 1, For example, for values “baidu” And three different hash functions generate hash values 1、4、7, Then the figure above changes to :

img

Ok, Let's save another value now “tencent”, If the hash function returns 3、4、8 Words , Figure goes on to :

img

It is worth noting that ,4 This bit Bit because the hash function for both values returns this bit position , So it's covered . Now if we want to check “dianping” Does this value exist , Hash function returned 1、5、8 Three values , As a result, we found that 5 This bit The value on bit is 0, It means that no value is mapped to this bit On a , So we can say for sure “dianping” This value does not exist . And when we need to query “baidu” If this value exists , The hash function must return 1、4、7, And then we checked and found these three bit All values on the bit are 1, So we can say “baidu” Does it exist ? The answer is no , Can only be “baidu” This value may exist .

Why is that ? The answer is simple , Because as the value increases, more and more , Be set to 1 Of bit More and more , Such a value “taobao” Even if it's not stored , But in case the hash function returns three bit Bits are set by other values 1 , Then the program will still judge “taobao” This value exists .

python Implement the bloon filter - Define bloom manually

import mmh3
from bitarray import bitarray


# Implement a simple bloom filter with murmurhash algorithm.
# Bloom filter is used to check whether an element exists in a collection,
# and it has a good performance in big data situation.
# It may has positive rate depend on hash functions and elements count.


BIT_SIZE = 5000000


class BloomFilter:

    def __init__(self, bit_size=BIT_SIZE):
        # Initialize bloom filter, set size and all bits to 0
        self.bit_size = bit_size
        self.bit_array = bitarray(self.bit_size)
        self.bit_array.setall(0)

    def add(self, url):
        # Add a url, and set points in bitarray to 1 (Points count is equal to hash funcs count.)
        # Here use 7 hash functions.
        point_list = self.get_positions(url)

        for b in point_list:
            self.bit_array[b] = 1

    def is_contain(self, url):
        # Check if a url is in a collection
        point_list = self.get_positions(url)
        is_contain = True
        for b in point_list:
            is_contain = is_contain and self.bit_array[b]
        return is_contain

    def get_positions(self, url):
        # Get points positions in bit vector.
        return [mmh3.hash(url, 41+i) % self.bit_size for i in range(7)]


if __name__ == '__main__':
    bl = BloomFilter()
    bl.add('baidu')
    result = bl.is_contain('baidu11')
    print(result)
 Copy code 

Python Using a bloom filter - Based on thread module

Install the module :bitarray、pybloom_live

#python3.6  install 
# You need to install bitarray
pip3 install bitarray-0.8.1-cp36-cp36m-win_amd64.whl(pybloom_live Rely on this package , You need to install )
# Download address :https://www.lfd.uci.edu/~gohlke/pythonlibs/
pip3 install pybloom_live
 Copy code 

Example 1

#ScalableBloomFilter  It can automatically expand capacity 
from pybloom_live import ScalableBloomFilter

bloom = ScalableBloomFilter(
    initial_capacity=100, 
    error_rate=0.001, 
    mode=ScalableBloomFilter.LARGE_SET_GROWTH
)

url = "www.cnblogs.com"
url2 = "www.sougou.com"

bloom.add(url)

print(url in bloom)		#  There is returned True
print(url2 in bloom)	#  There is no return False
 Copy code 

Example 2

Copy#BloomFilter  It's fixed length 
from pybloom_live import BloomFilter

bf = BloomFilter(capacity=1000)
url='www.baidu.com'
bf.add(url)

print(url in bf)

print("www.sougou.com" in bf)
 Copy code 

Conclusion

The article begins with the official account of WeChat Cheng Xuyuan Village , Synchronize on Nuggets .

It's not easy to code words , Reprint please explain the source , Let's go after passing by with our lovely little fingers (╹▽╹)

copyright notice
author[Cheng Xuyuan Xiaozhuang],Please bring the original link to reprint, thank you.
https://en.pythonmana.com/2022/02/202202011714486648.html

Random recommended