Data Mining | Association Analysis#

Association Analysis#

Concept#

In data mining, association analysis is a commonly used unsupervised learning method. Unlike classification and clustering algorithms, its main purpose is to discover associations hidden in the inherent structure of the data. Association analysis covers a range of topics, outlined in the figure below; this article mainly introduces simple association analysis and the Apriori algorithm.

(Figure: overview of association analysis. Source: THU Data Mining)

In simpler terms, association rules are used to find frequent patterns, associations, correlations, or causal structures that exist between sets of items or objects in transaction data, relational data, or other information carriers. In other words, association analysis is about discovering the relationships between different products (items) in a transaction database.

These relationships can take two forms: frequent itemsets or association rules.

  • Frequent itemsets are collections of items that frequently appear together.
  • Association rules imply that there may be a strong relationship between two items.

Here is an example to illustrate these two concepts. The table below shows the transaction list of a grocery store:

| Transaction Code | Products |
| --- | --- |
| 0 | Soy milk, Lettuce |
| 1 | Lettuce, Diapers, Wine, Beets |
| 2 | Soy milk, Diapers, Wine, Orange juice |
| 3 | Lettuce, Soy milk, Diapers, Wine |
| 4 | Lettuce, Soy milk, Diapers, Orange juice |

Frequent itemsets refer to collections of items that frequently appear together. The set $\{\text{Wine, Diapers, Soy milk}\}$ in the table is an example of a frequent itemset. From the dataset, we can also find association rules such as $Diapers \rightarrow Wine$. This means that if someone buys diapers, they are likely to also buy wine. By using frequent itemsets and association rules, businesses can better understand their customers.

How should these interesting relationships be defined? Who defines what is interesting? What is the definition of frequent when searching for frequent itemsets? There are many concepts that can answer the above questions, but the most important ones are support and confidence.

Validity Indicators#

Support#

The support of an itemset is defined as the proportion of records in the dataset that contain that itemset, i.e., the frequency of the event occurring.

$$\operatorname{support}(\{A, B\}) = \frac{\operatorname{num}(A \cup B)}{\operatorname{num}(\text{all records})} = P(AB)$$

From the table above, the support of $\{\text{Soy milk}\}$ is 4/5. Among the 5 transaction records, 3 contain $\{\text{Soy milk, Diapers}\}$, so the support of $\{\text{Soy milk, Diapers}\}$ is 3/5. Because support is defined on itemsets, we can set a minimum support threshold and retain only those itemsets that meet it.

Confidence#

Confidence is defined for an association rule, revealing the probability that event B occurs given that event A has occurred.

$$\operatorname{confidence}(A \rightarrow B) = \frac{\operatorname{support}(\{A, B\})}{\operatorname{support}(\{A\})} = P(B \mid A) = \frac{P(AB)}{P(A)}$$

For example, the confidence of the rule $\{\text{Diapers}\} \rightarrow \{\text{Wine}\}$ is defined as $\operatorname{support}(\{\text{Diapers, Wine}\}) / \operatorname{support}(\{\text{Diapers}\})$. From the table, the support of $\{\text{Diapers, Wine}\}$ is 3/5 and the support of $\{\text{Diapers}\}$ is 4/5, so the confidence of $Diapers \rightarrow Wine$ is $3/4 = 0.75$. This means that our rule applies to 75% of all records containing "Diapers."
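To make the two measures concrete, here is a minimal sketch (not part of the post's code listings below) that computes support and confidence directly over the grocery-store transactions in the table above; the names `transactions`, `support`, and `confidence` are illustrative only.

# Illustrative sketch: support and confidence over the grocery-store table above.
transactions = [
    {"Soy milk", "Lettuce"},
    {"Lettuce", "Diapers", "Wine", "Beets"},
    {"Soy milk", "Diapers", "Wine", "Orange juice"},
    {"Lettuce", "Soy milk", "Diapers", "Wine"},
    {"Lettuce", "Soy milk", "Diapers", "Orange juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A), i.e. an estimate of P(B | A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"Soy milk", "Diapers"}, transactions))   # 0.6  (3/5)
print(confidence({"Diapers"}, {"Wine"}, transactions))  # 0.75 (3/4)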

Support and confidence are used to quantify whether association analysis is successful. If we want to find all itemsets with support greater than 0.8, how should we do it? One way is to generate a list of all possible combinations of items and then count the frequency of each combination, but when there are thousands of items, this approach is very slow. The next section will analyze this situation in detail and discuss the Apriori principle, which reduces the computational burden required for learning association rules.
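To see why brute-force enumeration does not scale: a store carrying $N$ distinct items admits

$$2^{N} - 1$$

possible non-empty itemsets, so even a modest $N = 100$ already gives roughly $1.27 \times 10^{30}$ candidate itemsets whose frequencies would have to be counted.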

Apriori Algorithm for Discovering Frequent Itemsets#

As mentioned above, the goals of association analysis include two aspects: discovering frequent itemsets and discovering association rules. First, frequent itemsets need to be found before obtaining association rules. This section will focus solely on discovering frequent itemsets.

Theorems of Frequent Itemsets#

Frequent itemsets have the following two theorems, which will be applied in our search for frequent itemsets and their association rules later:

Theorem 1: Any subset of a frequent itemset must also be a frequent itemset (if the itemset $\{A,C\}$ is a frequent itemset, then $\{A\}$ and $\{C\}$ are also frequent itemsets).

Theorem 2: Any superset of a non-frequent itemset must also be non-frequent (if the itemset $\{D\}$ is not a frequent itemset, then $\{A,D\}$ and $\{C,D\}$ are also not frequent itemsets).

The proof of Theorem 1 is relatively straightforward, as the support of a subset of an itemset cannot be less than that of the itemset itself. Theorem 2 is the contrapositive of Theorem 1.
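In symbols, both theorems follow from one monotonicity inequality (a standard restatement, added here for clarity):

$$X \subseteq Y \;\Rightarrow\; \operatorname{support}(X) \ge \operatorname{support}(Y)$$

since every record containing $Y$ necessarily contains $X$. If $\operatorname{support}(Y)$ clears the minimum support, so does every subset $X$ (Theorem 1); if $\operatorname{support}(X)$ falls below it, so does every superset $Y$ (Theorem 2).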

(Figure: the Apriori principle)

The figure shows all possible itemsets, with non-frequent itemsets shown in gray. Since the set $\{2,3\}$ is non-frequent, $\{0,2,3\}$, $\{1,2,3\}$, and $\{0,1,2,3\}$ are also non-frequent, and their support does not need to be calculated.

Generating Frequent Itemsets#

The Apriori algorithm is a method for discovering frequent itemsets. Its two input parameters are the minimum support and the dataset. The algorithm first generates a list of itemsets containing all single items. It then scans the transaction records to see which of these itemsets meet the minimum support requirement, discarding those that do not. Next, it combines the remaining sets to generate itemsets containing two elements. This process is repeated with larger and larger itemsets until no candidate itemsets remain.

In summary, the Apriori algorithm searches level by level: the frequent itemsets found at each level are combined into candidate itemsets one element larger, and the iteration continues until the maximal frequent itemsets are obtained at the top level. Each round of iteration includes the following two steps:

  1. Generate candidate set $C_k$, which is a collection of items that may become frequent itemsets.
  2. Prune candidate set $C_k$: compute the support of each candidate in $C_k$ and remove those below the minimum support $S_{min}$, which yields the frequent itemsets $L_k$; $L_k$ is then used to generate the next candidate set $C_{k+1}$. This process repeats until no new candidate itemsets can be generated, and the frequent itemsets of the last round are the maximal frequent itemsets that Apriori seeks.

The pseudocode for the entire Apriori algorithm is as follows:

While the number of itemsets in the candidate list is greater than 0:
    Construct a list of candidate itemsets of size k
    Scan the dataset to check which candidate itemsets are frequent:
        For each transaction record tran in the dataset:
            For each candidate itemset can:
                If can is a subset of tran, increase the count of can
        For each candidate itemset:
            If its support is not less than the minimum support, retain the itemset
        Return the list of frequent itemsets of size k
    Use the retained frequent itemsets to construct the list of candidate itemsets of size k+1
        (by Theorem 2, supersets of non-frequent itemsets can be skipped)

Mining Association Rules from Frequent Itemsets#

Obtaining the maximum frequent itemsets is not the final goal. Earlier, when assessing the validity of association rules, we learned about the two indicators of confidence and support. Among them, support has already played a role in the process of finding maximum frequent itemsets, and now it is time for confidence to shine in generating association rules.

How many association rules can be generated from a frequent itemset? The grid diagram below shows all association rules generated from the itemset $\{0,1,2,3\}$. To find the rules of interest, we first generate a list of possible rules and then test the confidence of each rule. If the confidence does not meet the minimum requirement, the rule is discarded.

Grid diagram of association rules for the frequent itemset {0,1,2,3}. The shaded area indicates rules with low confidence. If the rule 0,1,2→3 is found to have low confidence, then all other rules whose consequent contains 3 will also have low confidence.

Similar to the generation of frequent itemsets in the previous section, many association rules can be generated from each frequent itemset. If we can reduce the number of rules to keep the problem tractable, the calculations become much easier. It can be observed that if a rule does not meet the minimum confidence requirement, then any rule whose consequent is a superset of that rule's consequent will not meet it either. For example, if the rule $0,1,2 \rightarrow 3$ does not meet the minimum confidence requirement, then any rule built from $\{0,1,2,3\}$ whose antecedent is a subset of $\{0,1,2\}$ (and whose consequent therefore contains 3) will also fail to meet it. These rules are shaded in the diagram.
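The inequality behind this pruning can be written down in one line (a short derivation not spelled out in the original): for a frequent itemset $I$ and antecedents $A' \subseteq A \subset I$,

$$\operatorname{confidence}(A' \rightarrow I \setminus A') = \frac{\operatorname{support}(I)}{\operatorname{support}(A')} \le \frac{\operatorname{support}(I)}{\operatorname{support}(A)} = \operatorname{confidence}(A \rightarrow I \setminus A)$$

because $\operatorname{support}(A') \ge \operatorname{support}(A)$ by Theorem 1. Shrinking the antecedent (equivalently, enlarging the consequent) can only lower the confidence.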

This property of association rules can be used to reduce the number of rules that need to be tested. As in the Apriori algorithm, we start from a frequent itemset and create a list of rules whose right-hand side contains only one element, then test these rules. The consequents of the rules that survive are then merged to create a new rule list whose right-hand side contains two elements, and so on. This level-wise procedure is also known as the hierarchical method.

Code Listings#

Python#

# Auxiliary functions in the Apriori algorithm

def loadDataSet():
    """
    Create a simple dataset for testing
    """
    return [[1,3,4], [2,3,5], [1,2,3,5], [2,5]]

def creatC1(dataSet):
    """
    Input: dataset dataSet
    Output: C1 is the collection of all candidate itemsets of size 1
    Process: First create an empty list C1 to store all unique item values. Then traverse all transaction records in the dataset; for each record, iterate through each of its items, and if an item is not yet in C1, add it. Note that we do not add the bare item itself but a list containing only that item, because each item must later be wrapped in a set for the set operations used throughout the Apriori algorithm, and frozenset() expects an iterable, so a bare integer cannot be passed to it directly. This is why C1 is a large list composed of single-item lists. Finally, sort C1 and map each single-element list through frozenset(), returning a list of frozensets.
    Details: Since the algorithm starts by extracting the candidate itemset list from the input data, a special function is needed to handle this, while subsequent itemset lists are stored in a specific format. The format used here is the frozenset type in Python. A frozenset is a "frozen" set, meaning it is immutable and cannot be modified by the user. It is necessary to use frozenset instead of set type because these sets must be used as dictionary keys later, which frozenset allows but set does not.
    """
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if not [item] in C1:
                C1.append([item])
                
    C1.sort()
    return list(map(frozenset, C1))  # materialize the map so the list can be reused (Python 3)

def scanD(D, Ck, minSupport):
    """
    Input: dataset, candidate itemset list Ck, minimum support minSupport of interest
    Output: This function generates Lk from Ck, and also returns a dictionary containing support values for later use.
    Process: The function first creates an empty dictionary ssCnt, then iterates over all transaction records in the dataset and all candidate sets in Ck; whenever a candidate set is contained in a record, its count in the dictionary is incremented (the dictionary keys are the sets themselves). After the whole dataset and all candidate sets have been scanned, support values are computed. An empty list retList is constructed to hold the sets that meet the minimum support requirement: the next loop goes through each entry of the dictionary and calculates its support, and if the support meets the minimum the set is added to retList. The function finally returns retList together with the support data of the candidate itemsets.
    Details: You can use the statement retList.insert(0,key) to insert any new set at the beginning of the list. Of course, it doesn't have to be at the beginning; this is just to make the list look organized.
    """
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt:
                    ssCnt[can] = 1
                else:
                    ssCnt[can] += 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key]/numItems
        if support >= minSupport:
            retList.insert(0, key)
        supportData[key] = support
    return retList, supportData


# Apriori Algorithm

def aprioriGen(Lk, k):
    """
    Input: frequent itemset list Lk, number of elements in itemset k
    Output: candidate itemset Ck
    Example: Input {0}, {1}, {2} will generate {0,1}, {0,2}, and {1,2}
    Process: First create an empty list, then calculate the number of elements in Lk. Next, compare each element in Lk with other elements, which can be done using two for loops. Then, take two sets from the list for comparison. If the first k-2 elements of these two sets are equal, merge these two sets into a set of size k.
    Details: The k-2 part may be confusing. Further discussion will clarify the details. When constructing {0,1}, {0,2}, {1,2} from {0}, {1}, {2}, we are actually combining single items together. Now, if we want to use {0,1}, {0,2}, {1,2} to create three-element itemsets, how should we do it? If we merge every two sets, we would get {0,1,2}, {0,1,2}, {0,1,2}. In other words, the same resulting set would be repeated three times. We need to scan the three-element itemset list to get non-repeating results, and we want to ensure the number of iterations is minimized. Now, if we compare the first element of the sets {0,1}, {0,2}, {1,2} and only perform the merge operation on sets with the same first element, what result do we get? {0,1,2}, and only one operation! This way, we do not need to traverse the list to find non-repeating values.
    """
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i+1, lenLk):
            L1 = list(Lk[i])[:k-2]
            L2 = list(Lk[j])[:k-2]
            L1.sort()
            L2.sort()
            if L1==L2:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataSet, minSupport = 0.5):
    """
    Input: dataset dataSet, support minSupport
    Output: L contains the list of frequent itemsets that meet the minimum support, suppData is a dictionary containing the support values of the itemsets
    Process: The function will generate a list of candidate itemsets, which is done by first creating C1 and then reading the dataset to convert it into D (a list of sets). The program uses the map function to map set() to each item in the dataSet list. Next, the scanD() function is used to create L1, and L1 is placed in the list L. L will contain L1, L2, L3, etc. Now that we have L1, we will continue to find L2, L3, etc., which can be done using a while loop that creates a larger list of larger itemsets until the next larger itemset is empty. If this sounds a bit confusing, you will see its workflow shortly. First, use aprioriGen() to create Ck, and then use scanD() based on Ck to create Lk. Ck is a candidate itemset list, and then scanD() will traverse Ck, discarding itemsets that do not meet the minimum support requirement. The Lk list is added to L, while increasing the value of k, and repeating the above process. Finally, when Lk is empty, the program returns L and exits.
    """
    C1 = creatC1(dataSet)
    D = list(map(set, dataSet))  # materialize so D can be scanned repeatedly and len() works (Python 3)
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Ck = aprioriGen(L[k-2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData
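
# Hedged usage sketch (not part of the original listings): run the Apriori
# functions above on the toy dataset from loadDataSet().
dataSet = loadDataSet()                      # [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
L, supportData = apriori(dataSet, minSupport=0.5)
# L[0] holds the frequent 1-itemsets, L[1] the frequent 2-itemsets, and so on;
# with this toy data the largest frequent itemset is frozenset({2, 3, 5}).
for k, Lk in enumerate(L, start=1):
    print(k, Lk)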


# Association Rule Generation Function

def generateRules(L, supportData, minConf=0.7):
    """
    Input: frequent itemset list, dictionary containing support data of those frequent itemsets, minimum confidence threshold
    Output: generates a list of rules containing confidence, which can later be sorted based on confidence
    Process: This function iterates through each frequent itemset in L and creates a list H1 containing single-element sets for each frequent itemset. Since association rules cannot be constructed from single-element itemsets, the rule construction process starts from itemsets containing two or more elements. If starting from the set {0,1,2}, then H1 should be [{0},{1},{2}]. If the number of elements in the frequent itemset exceeds 2, further merging will be considered. The specific merging can be done through the function rulesFromConseq(), which will be discussed in detail later. If the itemset contains only two elements, the function calcConf() will be used to calculate the confidence value.
    """
    bigRuleList = []
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    """
    Purpose: Evaluate rules, calculate the confidence of the rules, and find rules that meet the minimum confidence requirement
    Process: The function will return a list of rules that meet the minimum confidence requirement. To store these rules, an empty list prunedH is created. Next, iterate through all itemsets in H and calculate their confidence values. The confidence calculation uses the support data from supportData. By importing this support data, a lot of computation time can be saved. If a rule meets the minimum confidence value, these rules will be output to the screen. The checked rules will also be returned and used in the next function rulesFromConseq(). At the same time, the list brl will also be filled, which is the bigRuleList checked earlier.
    """
    prunedH = []
    for conseq in H:
        conf = supportData[freqSet]/supportData[freqSet-conseq]
        if conf >= minConf:
            print(freqSet-conseq,'-->',conseq,'conf',conf)
            brl.append((freqSet-conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    """
    Purpose: Generate a candidate rule set
    Process: To generate more association rules from the initial itemset, the rulesFromConseq() function can be used. This function has two parameters: one is the frequent itemset, and the other is the list of elements that can appear on the right side of the rule H. The function first calculates the size m of the frequent set in H. Next, it checks whether the frequent itemset is large enough to remove subsets of size m. If it can be removed, it will be. The aprioriGen() function can be used to generate non-repeating combinations of elements in H. The result will be stored in Hmp1, which is also the H list for the next iteration. Hmp1 contains all possible rules. The calcConf() function can be used to test their confidence to determine whether the rules meet the requirements. If more than one rule meets the requirements, the function rulesFromConseq() will be called iteratively with Hmp1 to determine whether these rules can be further combined.
    """
    m = len(H[0])
    if (len(freqSet) > (m+1)):
        Hmp1 = aprioriGen(H, m+1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 1):
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
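
# Hedged usage sketch (not part of the original listings): mine association
# rules from the frequent itemsets at a 70% minimum-confidence threshold.
dataSet = loadDataSet()
L, supportData = apriori(dataSet, minSupport=0.5)
rules = generateRules(L, supportData, minConf=0.7)
# For this toy dataset, calcConf() prints rules such as
#   frozenset({1}) --> frozenset({3}) conf 1.0
#   frozenset({5}) --> frozenset({2}) conf 1.0
#   frozenset({2}) --> frozenset({5}) conf 1.0
print(rules)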

R#

See Association Analysis with R

References:#

  1. Association Analysis - This entry was written and reviewed by the "Science Popularization China" scientific encyclopedia project
  2. Peter Harrington, Machine Learning in Action (Chinese edition, trans. Li Rui et al.), People's Posts and Telecommunications Press, 2013, pp. 201-222.