
Principle of Python Apriori algorithm (11)

2022-01-31 12:37:10 FizzH

「This is day 11 of my participation in the November Gengwen Challenge. Check out the event details: The Last Gengwen Challenge of 2021」

The Apriori algorithm is a well-known association rule mining algorithm.

Suppose we run a grocery store with a small variety of goods, and we are interested in which items are often bought together. We stock only four items: item 0, item 1, item 2, and item 3. What are all the combinations of items that might be purchased together? A combination may contain a single item, such as item 0, or it may contain two, three, or all four items. We do not care how many units of each item a customer buys (say, two of item 0 and four of item 2); we only care whether they bought one or more of an item.

The figure below shows all possible combinations between items :

  • In the figure, the item number 0 is used to represent item 0 itself.
  • The first set from top to bottom is ϕ (\phi), the empty set, which contains no items.
  • A line between itemsets indicates that two or more sets can be combined to form a larger set.
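As a quick check of the figure's enumeration, all non-empty combinations of the four items can be listed with `itertools` (a sketch for illustration, not part of the original post's code):

```python
from itertools import combinations

items = [0, 1, 2, 3]

# Every non-empty combination of the four items, from size 1 up to size 4
itemsets = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

print(len(itemsets))  # 15 possible itemsets for 4 items
```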

Our goal is to find collections of items that are often bought together. We use the support of a set to measure how frequently it occurs.

The support of a set is the proportion of transactions that contain that set.

Problem: given a set, such as {0,3}, how do we calculate its support?

  • We can traverse every record and check whether it contains both 0 and 3. If it does, we increment a counter. After scanning all the data, we divide the count by the total number of transactions to obtain the support.
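That counting loop can be sketched in a few lines; the transaction list below is made up purely for illustration:

```python
# Hypothetical transaction data for illustration
transactions = [{0, 1, 3}, {0, 3}, {1, 2}, {0, 2, 3}, {1, 3}]
target = {0, 3}

# Count the transactions that contain every item of the target set
count = sum(1 for t in transactions if target.issubset(t))
support = count / len(transactions)
print(support)  # 3 of 5 transactions contain {0, 3} -> 0.6
```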

Note: the process above yields the support of only a single set, {0,3}. To obtain the support of every possible set, we must repeat it many times. Counting the sets in the figure shows that even for a collection of just 4 items, we need to traverse the data 15 times, and the number of traversals grows sharply as the number of items increases. A dataset with N items has 2^N − 1 possible itemset combinations. Stores selling 10,000 or more items are not uncommon; even a store selling only 100 kinds of goods faces about 1.26 × 10^30 possible itemset combinations. Even on many of today's computers, that much computation would take a very long time to complete.
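These counts are easy to sanity-check: a collection of N items has 2^N − 1 non-empty itemsets.

```python
# Number of non-empty itemsets for N items is 2**N - 1
assert 2**4 - 1 == 15           # the 15 itemsets in the figure

# For 100 items the count is astronomically large
print(f"{2**100 - 1:.2e}")      # roughly 1.27e+30 combinations
```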

The Apriori principle helps us shrink the set of itemsets we need to consider, which reduces the required computation time.

The Apriori principle:

  • If an itemset is frequent, then all of its subsets are frequent. For example, if {1,2} is frequent, then {1} and {2} must also be frequent.

  • Stated in reverse, the principle says: if an itemset is infrequent, then all of its supersets are also infrequent.

If we know that the itemset {2,3} is infrequent, we can immediately conclude that the itemsets {0,2,3}, {1,2,3}, and {0,1,2,3} are infrequent as well, so their supports never need to be computed.
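This pruning step could be written as a small helper; a sketch, assuming the candidate itemsets are already represented as frozensets (the function name is ours, not from the post's code):

```python
def prune_candidates(candidates, infrequent):
    """Drop every candidate that contains a known-infrequent itemset.

    By the Apriori principle, a superset of an infrequent itemset
    cannot be frequent, so its support never needs to be counted.
    """
    return [c for c in candidates
            if not any(bad.issubset(c) for bad in infrequent)]

infrequent = [frozenset({2, 3})]
candidates = [frozenset({0, 2, 3}), frozenset({1, 2, 3}),
              frozenset({0, 1, 2, 3}), frozenset({0, 1})]
print(prune_candidates(candidates, infrequent))  # only {0, 1} survives
```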

The general workflow of the Apriori algorithm:

  1. Collect data: any method.
  2. Prepare the data: any data type will do, since we only store sets.
  3. Analyze the data: any method.
  4. Train the algorithm: use the Apriori algorithm to find frequent itemsets.
  5. Test the algorithm: no testing step is required.
  6. Use the algorithm: to discover frequent itemsets and association rules between items.

Implementation of the dataset-scanning step:

def loadDataSet():
    """
    Load a toy transaction dataset
    :return: dataset
    """
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def createC1(dataSet):
    """
    Create C1, the list of all candidate itemsets of size 1
    :param dataSet:
    :return: C1
    """
    C1 = []
    # Traverse the dataset, adding each new item to C1
    for record in dataSet:
        for item in record:
            if not [item] in C1:
                C1.append([item])
    C1.sort()
    # Store each candidate set in C1 as an immutable frozenset so it can be
    # used as a dictionary key; a list cannot be used directly as a key
    return list(map(frozenset, C1))

def scanDataset(dataset, ck, minSupport):
    """
    Scan the dataset and determine the frequent itemsets
    :param dataset:
    :param ck: ck is the list of all candidate itemsets of size k
    :param minSupport: minimum support threshold
    :return: qualifying itemsets, support of each itemset
    """
    # Count the occurrences of each candidate itemset
    selectedSetCount = {}
    for record in dataset:    # traverse each record
        for candidateSet in ck:
            # Check whether the candidate itemset is a subset of the record
            if candidateSet.issubset(record):
                if candidateSet not in selectedSetCount:
                    selectedSetCount[candidateSet] = 1
                else:
                    selectedSetCount[candidateSet] += 1
    # Total number of transactions
    numItems = float(len(dataset))
    # Itemsets that meet the minimum support
    retList = []
    # Support of each itemset
    supportData = {}
    for key in selectedSetCount:
        # Compute support
        support = selectedSetCount[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData

if __name__ == '__main__':
    from pprint import pprint
    dataset = loadDataSet()
    c1 = createC1(dataset)
    pprint(scanDataset(dataset, c1, 0.5))

copyright notice
author[FizzH],Please bring the original link to reprint, thank you.
