Principle of the Apriori Algorithm in Python (11)

2022-01-31 12:37:10 FizzH

「This is day 11 of my participation in the November Gengwen (more-text) challenge. Check out the event details: 2021 Last Gengwen Challenge」 The Apriori algorithm is a well-known association rule mining algorithm.

Suppose we run a grocery store that stocks only a small variety of goods, and we are very interested in which goods are often bought together. We carry only four goods: goods 0, 1, 2, and 3. What are all the combinations of goods that might be purchased together? A combination may contain a single item, such as item 0, or it may contain two, three, or all four items. Note that we do not care whether someone bought, say, two units of item 0 and four units of item 2; we only care whether he bought one or more of a given good.

The figure below shows all possible combinations of items:

• In the figure, the item number 0 denotes item 0 itself.
• The first set from top to bottom is $\phi$, the empty set, which contains no items.
• A line between itemsets indicates that two or more sets can be combined to form a larger set.
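The lattice of combinations described above can be enumerated directly. The sketch below is illustrative (it is not part of the article's code) and simply lists every non-empty itemset that can be formed from the four goods:

```python
from itertools import combinations

# Every non-empty combination of the four goods 0..3.
# With 4 items there are 2**4 - 1 = 15 such itemsets.
items = [0, 1, 2, 3]
itemsets = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]
print(len(itemsets))  # 15
```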

The goal: we want to find sets of items that are frequently bought together, and we use the support of an itemset to measure how frequent it is.

The support of an itemset is the proportion of transactions that contain that itemset.

Question: given an itemset such as {0,3}, how do we compute its support?

• We can traverse every record and check whether it contains both 0 and 3; if it does, we increment a counter. After scanning all the data, dividing the count by the total number of transactions gives the support.
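The counting procedure just described can be sketched in a few lines. This is a minimal illustration, not the article's implementation (that follows below as `scanDataset`); the helper name `support` and the sample transaction log are made up for the example:

```python
def support(itemset, transactions):
    # Fraction of transactions containing every item of `itemset`
    hits = sum(1 for record in transactions if itemset.issubset(record))
    return hits / len(transactions)

# Hypothetical transaction log over goods 0-3
transactions = [{0, 3}, {1, 2}, {0, 1, 3}, {2, 3}]
print(support({0, 3}, transactions))  # 0.5: 2 of the 4 records contain both 0 and 3
```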

Note: the process above yields the support of only the single itemset {0,3}. To obtain the support of every possible itemset, the process must be repeated many times. If we count the sets in the figure, we find that even with only 4 items we must traverse the data 15 times, and the number of traversals grows sharply as the number of items increases. A dataset with N items has $2^N - 1$ possible itemset combinations. Stores selling 10,000 or more items are not uncommon; even a store selling only 100 kinds of goods faces $1.26 * 10^{30}$ possible itemset combinations. Even for many of today's computers, that much computation would take a very long time to complete.
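The combinatorial explosion is easy to verify with the formula itself; the function name below is illustrative:

```python
def numItemsets(n):
    # A dataset with n distinct items has 2**n - 1 non-empty itemset combinations
    return 2 ** n - 1

print(numItemsets(4))    # 15 scans for the 4-item store
print(numItemsets(100))  # roughly 1.27e30 for a 100-item store
```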

The Apriori principle helps us shrink the set of itemsets we might be interested in, reducing the required computation time.

The Apriori principle:

• If an itemset is frequent, then all of its subsets are also frequent. For example, if {1,2} is frequent, then {1} and {2} must be frequent.

• Applying the principle in reverse: if an itemset is infrequent, then all of its supersets are also infrequent.

If the itemset {2,3} is known to be infrequent, we can immediately conclude that the itemsets {0,2,3}, {1,2,3}, and {0,1,2,3} are infrequent as well, so their supports never need to be computed.
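The pruning step this principle enables can be sketched as follows. This is a standalone illustration (the helper name `pruneSupersets` is invented for the example), not part of the article's implementation:

```python
def pruneSupersets(candidates, infrequent):
    # Discard any candidate that contains a known infrequent itemset:
    # by the Apriori principle its support cannot reach the threshold
    return [c for c in candidates
            if not any(c >= bad for bad in infrequent)]

infrequent = [frozenset({2, 3})]
candidates = [frozenset(s) for s in ({0, 1}, {0, 2, 3}, {1, 2, 3}, {0, 1, 2, 3})]
print(pruneSupersets(candidates, infrequent))  # only frozenset({0, 1}) survives
```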

The general workflow of the Apriori algorithm:

1. Collect data: any method will do.
2. Prepare data: any data type will do, since we only store sets.
3. Analyze data: any method will do.
4. Train the algorithm: use the Apriori algorithm to find frequent itemsets.
5. Test the algorithm: no test procedure is needed.
6. Use the algorithm: to discover frequent itemsets and association rules between items.

Implementation of the dataset-scanning step:

from numpy import *

def loadDataset():
    '''
    Load a simple sample dataset.
    :return: dataset
    '''
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    '''
    Build C1, the list of all candidate itemsets of size 1.
    :param dataSet:
    :return: C1
    '''
    # C1 is the list of all candidate itemsets of size 1
    C1 = []
    # Traverse the dataset, adding each distinct item to C1
    for record in dataSet:
        for item in record:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # Store each candidate set as an immutable frozenset so it can later be
    # used as a dictionary key; a plain list cannot serve as a dictionary key
    return list(map(frozenset, C1))

def scanDataset(dataset, ck, minSupport):
    '''
    Scan the dataset and determine the frequent itemsets.
    :param dataset:
    :param ck: ck is the list of all candidate itemsets of size k
    :param minSupport: minimum support threshold
    :return: itemsets meeting the threshold, and the support of each itemset
    '''
    # Count the number of records containing each candidate itemset
    selectedSetCount = {}
    for record in dataset:    # traverse every record
        for candidateSet in ck:
            # Check whether the candidate itemset is a subset of the record
            if candidateSet.issubset(record):
                if candidateSet not in selectedSetCount:
                    selectedSetCount[candidateSet] = 1
                else:
                    selectedSetCount[candidateSet] += 1
    # Total number of records
    numItems = float(len(dataset))
    # Itemsets that meet the minimum support
    retList = []
    # Support of every counted itemset
    supportData = {}
    for key in selectedSetCount:
        # Compute the support
        support = selectedSetCount[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

if __name__ == '__main__':
    from pprint import pprint
    dataset = loadDataset()
    C1 = createC1(dataset)
    # Keep itemsets that appear in at least half of the records
    pprint(scanDataset(dataset, C1, 0.5))