What is Apriori Algorithm in Data Mining? Implementation with Examples


Ramya Shankar
Last updated on September 8, 2024

    The Apriori algorithm is an unsupervised learning algorithm used for association rule mining. It searches a dataset for frequent itemsets and derives associations between the items in them. The ‘frequently bought together’ and ‘you may also like’ sections of online shopping platforms are typical applications of this kind of analysis.

    The name Apriori comes from the fact that the algorithm uses prior knowledge of frequent itemset properties. If the frequent itemsets of size 2 are known, the algorithm uses them to find the frequent itemsets of size 3. In general, it uses the frequent itemsets of size n (think of n items frequently bought together) to find the frequent itemsets of size n+1.

    Apriori considers each transaction as an itemset and uses a bottom-up approach in which frequent subsets are extended item by item. It was developed in 1994 by Agrawal and Srikant.

    A Little About Data Mining

    Data mining is a part of the data science process, where huge datasets are explored and analyzed to find useful patterns, trends, and rules. We use data mining techniques to develop machine learning models and help predict future outcomes.

    For example, suppose 10 customers visit a supermarket on a particular day, 8 of them buy bread, and 6 of them buy bread and milk. Suppose this trend continues on the following days too.

    Through data mining, we will be able to identify such combinations of products that users frequently buy. Another example is shopping on Amazon. When you buy a certain product, Amazon suggests a few more related items to buy, displayed under “frequently bought together.”

    So, if you bought a laptop, it would show you a laptop bag or laptop holder based on similar purchases from other buyers. The gist is that data mining allows businesses to understand what buyers usually prefer, customize promotions, and segment the market into groups.

    Important Terms

    • Apriori property

    The Apriori property states that all the non-empty subsets of a frequent itemset must also be frequent. Equivalently, if an itemset is infrequent, all of its supersets are infrequent too. The Apriori property helps reduce the search space, thereby improving the efficiency of level-wise generation of frequent itemsets.

    • Itemset

    An itemset is a set of one or more items. An itemset that occurs frequently in the database is called a frequent itemset. In a huge database, there could be many itemsets, and the search space of these itemsets is exponential. The 2 most important search mechanisms for searching frequent itemsets are breadth-first search and depth-first search. Example of itemsets:

    itemsets
    {2,3}
    {1,3,4}
    {2,3,4}
    {2,3,4}
    {1,2,3,4}

    • Frequent pattern mining

    It is the extraction of frequent itemsets from the database. Frequent pattern mining forms the basis for association rule mining, which is what the Apriori algorithm is used for. For example, in the above itemsets, {2,3,4} is a frequent itemset. Through mining, machines can find such patterns.

    • Association rules

    Through association rules, we can find various associations and relationships between items in a huge dataset based on how frequently an itemset occurs in the transactions. The best-known application of association rules is market basket analysis. Based on a set of transactions, supermarkets and stores can identify the association between 2 or more items that most people frequently buy together.

    For example, {Bread, Milk}, {Bread, Eggs}, {Bread, Milk, Eggs}

    • Support

    Support indicates how frequently an itemset appears in the dataset. The support count is the number of transactions that contain the itemset, and it is represented by σ. In the above example, the support count σ({2,3,4}) = 3 (the itemset {1,2,3,4} also contains {2,3,4}) and the support count σ({3,4}) = 4.
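As a minimal sketch, the support count can be computed by checking how many transactions contain a given itemset. The `transactions` list reuses the five example itemsets listed above; the helper names `support_count` and `support` are assumptions made for illustration only.

```python
# The five example itemsets from above, treated as transactions.
transactions = [
    {2, 3},
    {1, 3, 4},
    {2, 3, 4},
    {2, 3, 4},
    {1, 2, 3, 4},
]


def support_count(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Relative support: the fraction of transactions containing `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


print(support_count({3, 4}, transactions))     # 4
print(support_count({2, 3, 4}, transactions))  # 3, because {1, 2, 3, 4} also contains {2, 3, 4}
print(support({3, 4}, transactions))           # 0.8
```

Note that a transaction counts toward an itemset whenever it contains all of its items, not only when it matches the itemset exactly.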

    • Minimum support count

    It is the minimum number of transactions in which an itemset must appear in order to be considered frequent.

    • Candidate set

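    It is the set of itemsets of a given size whose support count still has to be checked against the minimum support count. The candidate k-itemsets are generated from the frequent itemsets of size k-1.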

    • Confidence

    Confidence tells us how often item B is bought when item A is bought. It is the number of times that A and B are bought together, divided by the number of times A is bought.

    Confidence c = frequency(A and B)/frequency(A)
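The formula above can be sketched in a few lines. The data is again the set of five example itemsets from the "Itemset" entry, and the function names are hypothetical. The sketch assumes the antecedent A occurs at least once; otherwise the division fails.

```python
transactions = [{2, 3}, {1, 3, 4}, {2, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}]


def support_count(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = frequency(A and B) / frequency(A)."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)


print(confidence({3, 4}, {2}, transactions))  # 0.75
print(confidence({2}, {3}, transactions))     # 1.0
```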

    • Frequent itemset mining

    When we know that certain items are bought together by most users, we can club those items together and offer better discounts and ease of purchase for the users. For example, in a supermarket, bread and eggs are kept together so that it is easy for users to pick them.

    Amazon shows ‘frequently bought together’ or ‘you may also like’ sections and also offers discounts if particular items are bought together, for example, {Oil+rice+pulses} or {tea+biscuits+sugar}. Such an itemset is called a frequent itemset, as it occurs frequently in the transaction database. Searching for such itemsets is called frequent itemset mining.

    • Lift

    Lift measures how much more likely item B is to be purchased when item A is purchased, compared with how often B is purchased overall. In this way it takes the popularity of item B into account: Lift(A=>B) = confidence(A=>B)/support(B). If Lift(A=>B) = 1, the items are not correlated. If Lift(A=>B) > 1, the items are bought together more often than would be expected by chance. If Lift(A=>B) < 1, the items are bought together less often than would be expected by chance.

    • Conviction

    Conviction is another way of finding an association, wherein,

    Conviction(A=>B) = (1-support(B))/(1-confidence(A=>B))

    If the Conviction value is 1, it indicates that the items are not correlated.
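Lift and conviction can be sketched in the same way, using the standard definition Lift(A=>B) = confidence(A=>B)/support(B) and the conviction formula above. The data is the set of five example itemsets from the "Itemset" entry, the function names are hypothetical, and `fractions.Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

transactions = [{2, 3}, {1, 3, 4}, {2, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))


def confidence(a, b):
    return support(set(a) | set(b)) / support(a)


def lift(a, b):
    # confidence(A => B) divided by the overall popularity of B
    return confidence(a, b) / support(b)


def conviction(a, b):
    # Undefined (division by zero) when confidence(A => B) is exactly 1.
    return (1 - support(b)) / (1 - confidence(a, b))


print(lift({3, 4}, {2}))        # 15/16
print(conviction({3, 4}, {2}))  # 4/5
```

Both values are slightly below 1 here, which suggests that buying {3, 4} makes item 2 marginally less likely than its overall popularity would predict.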

    The Apriori Rule

    So now that we are done with the background of the algorithm and relevant terms, we can move ahead with understanding what the actual algorithm is. The Apriori algorithm uses a breadth-first search and a hash tree structure to mine the transactional database for frequent itemsets and forms association rules and relationships between the items. The 2 main steps in the algorithm are join and prune, both of which are performed iteratively to obtain the frequent itemsets.

    1. Join – In this step, the set of frequent K-itemsets is joined with itself to generate the candidate (K+1)-itemsets.
    2. Prune – This step eliminates the candidates that contain an infrequent subset (by the Apriori property) or that have a support count less than the minimum threshold value. This reduces the size of the candidate set.
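The two steps can be sketched as follows. This is a simplified illustration, not a reference implementation: it joins every pair of frequent k-itemsets rather than using the prefix-based join of the original paper, and the frequent 2-itemsets in `L2` are made up for the example. Both functions assume a non-empty input set.

```python
from itertools import combinations


def join(frequent_k):
    """Join step: merge pairs of frequent k-itemsets into (k+1)-item candidates."""
    k = len(next(iter(frequent_k)))
    return {a | b for a in frequent_k for b in frequent_k if len(a | b) == k + 1}


def prune(candidates, frequent_k):
    """Prune step (Apriori property): drop any candidate with an infrequent k-subset."""
    k = len(next(iter(frequent_k)))
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_k for s in combinations(c, k))
    }


# Hypothetical frequent 2-itemsets, used only for illustration.
L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

C3 = join(L2)
print(sorted(sorted(c) for c in C3))
# [['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'C', 'D']]
print(sorted(sorted(c) for c in prune(C3, L2)))
# [['A', 'B', 'C']]
```

The candidates that survive this subset check still need their support counted against the database before they can be called frequent.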

    If item A is frequent, then,

    • The probability P(A) ≥ minimum support threshold.
    • If item B also belongs to the itemset and P(A∪B) ≥ minimum support threshold, where P(A∪B) is the fraction of transactions that contain both A and B, then {A, B} is also frequent.

    Step-by-step Understanding of Apriori Algorithm

    The Apriori algorithm is simple to learn and implement. It follows a set of processes that help us determine the frequent itemset from the dataset or database. Usually, the minimum support threshold is set (assumed) by the analysis team that is going to apply the algorithm. Here are the steps:

    • 1st Step: Scan the whole transaction database to get the support value S of each item.
    • 2nd Step: If the support S of an item is more than or equal to the minimum threshold, add the item to the frequent 1-itemsets (L1); otherwise, discard it.
    • 3rd Step: Join Lk-1 with itself to generate the set of candidate k-itemsets.
    • 4th Step: For each candidate k-itemset, get the support S and compare it with the minimum support threshold.
    • 5th Step: If S is less than the minimum value, discard the candidate.
    • 6th Step: If S is more than or equal to the minimum value, add the candidate to the frequent k-itemsets (Lk).
    • 7th Step: Repeat steps 3 to 6 for the next value of k. If no frequent k-itemsets are found, stop generating itemsets.
    • 8th Step: For each frequent itemset L found so far, get all the non-empty proper subsets.
    • 9th Step: For each such subset A of L, find the confidence C of the rule A => (L - A).
    • 10th Step: If Confidence C is more than or equal to the minimum required Confidence, add the rule to the strong rules; else, move to the next subset.
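Steps 1 to 7 can be sketched as a short level-wise loop. This is a minimal, unoptimized illustration (no hash tree, one full scan per candidate), and the function name `apriori_frequent_itemsets` is an assumption. The data is the set of five example itemsets from the "Itemset" entry, with a minimum support count of 3 chosen for the example.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Steps 1-2: count single items and keep the frequent ones (L1).
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = dict(current)
    k = 1

    # Steps 3-7: join, prune, count, filter; stop when nothing is left.
    while current:
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(current)
    return frequent


data = [{2, 3}, {1, 3, 4}, {2, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}]
result = apriori_frequent_itemsets(data, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With these inputs, item 1 is dropped at the first level because it appears in only 2 transactions, and {2, 3, 4} is the only frequent 3-itemset.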

    Supermarket Data Analysis – A Simple Example

    This is the most popular example and the main use case of this unsupervised learning algorithm – market basket analysis – i.e., analyzing the items that most customers buy frequently.

    For example, if you are a student, you would buy stationery items like notebooks, pencils, erasers, glue, colored papers, and so on. Most students would buy all these items together! In the same way, most bachelors would buy milk, eggs, and bread together! Let us take the above example, with a limited (self-created) dataset, to understand how the algorithm is actually applied.

    Suppose in our database there are 6 transactions, and we choose the support threshold to be 50%. So, our minimum support value (sup_threshold) = 6*50/100 = 3. This means that only an item with a count of 3 or more will be considered a frequent item. The table below shows all the transactions:

    Transaction    Item list
    T1             Bread, Milk, Eggs
    T2             Bread, Milk, Eggs, Butter
    T3             Milk, Horlicks
    T4             Bread, Milk, Butter
    T5             Bread, Milk, Horlicks
    T6             Bread, Milk, Eggs, Butter

    Now, let us calculate the count of each item:

    Item        Count
    Bread       5
    Milk        6
    Eggs        3
    Butter      3
    Horlicks    2

    We see that the count of Horlicks is less than 3, so Horlicks will be removed, while the other items will be added to the frequent itemset. This is called the pruning step. Now, we will find the combinations of items that occur together. This is the join step. Since there are 4 items, there will be 6 unique combinations:

    Itemset          Count
    Bread, Milk      5
    Bread, Eggs      3
    Bread, Butter    3
    Milk, Eggs       3
    Milk, Butter     3
    Eggs, Butter     2
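The two count tables above can be reproduced with a short script. This is only a sketch for checking the numbers by hand: it counts pairs directly with `collections.Counter` instead of running a full Apriori implementation, and the variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

transactions = [
    ["Bread", "Milk", "Eggs"],
    ["Bread", "Milk", "Eggs", "Butter"],
    ["Milk", "Horlicks"],
    ["Bread", "Milk", "Butter"],
    ["Bread", "Milk", "Horlicks"],
    ["Bread", "Milk", "Eggs", "Butter"],
]
sup_threshold = 3

# Count single items, then keep those that meet the threshold.
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, n in item_counts.items() if n >= sup_threshold}

# Count every pair of frequent items that occurs together in a transaction.
# Items are sorted so that each pair always appears in the same order.
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(frequent_items & set(t)), 2)
)

print(item_counts["Horlicks"])          # 2, so Horlicks is pruned
print(pair_counts[("Bread", "Milk")])   # 5
print(pair_counts[("Butter", "Eggs")])  # 2, below the threshold
```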

    Again, we apply the pruning step and eliminate the last itemset {Eggs, Butter} because its count is less than sup_threshold. Now, with the above, we will create the 3-itemsets. For this, we will need the first table and the 2-itemset table, and we combine both to see which itemsets pass the minimum threshold test.

    Transaction    Item list                    Itemset          Count
    T1             Bread, Milk, Eggs            Bread, Milk      5
    T2             Bread, Milk, Eggs, Butter    Bread, Eggs      3
    T3             Milk, Horlicks               Bread, Butter    3
    T4             Bread, Milk, Butter          Milk, Eggs       3
    T5             Bread, Milk, Horlicks        Milk, Butter     3
    T6             Bread, Milk, Eggs, Butter    Eggs, Butter     2

    We see from above that {Bread, Milk, Eggs} and {Bread, Milk, Butter} are the only 2 combinations of 3 items that pass the minimum threshold value of support (3); each of them occurs in exactly 3 transactions. This means that only these 2 3-itemsets are frequent. Now, for each of these frequent itemsets, we have to create the association rules.

    Let us create the same for {Bread, Milk, Eggs}:

    {Bread, Milk} => {Eggs}: Confidence Cbm = Support(Bread, Milk, Eggs)/Support(Bread, Milk) = 3/5 = 60%
    {Bread, Eggs} => {Milk}: Confidence Cbe = Support(Bread, Milk, Eggs)/Support(Bread, Eggs) = 3/3 = 100%
    {Milk, Eggs} => {Bread}: Confidence Cme = Support(Bread, Milk, Eggs)/Support(Milk, Eggs) = 3/3 = 100%

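    This shows that the rules {Bread, Eggs} => {Milk} and {Milk, Eggs} => {Bread} are strong at any confidence threshold, while {Bread, Milk} => {Eggs} is strong only if the minimum confidence is 60% or lower. For example, in this dataset, everyone who buys bread and eggs also buys milk! Similarly, we can find the association rules for the other itemset too.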
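As a sketch of how the rules for the other frequent itemset, {Bread, Milk, Butter}, could be enumerated, the snippet below splits the itemset into every antecedent/consequent pair and computes the confidence of each. The function names `count` and `rules` are hypothetical, and the transactions are the six from the table above.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Milk", "Eggs", "Butter"},
    {"Milk", "Horlicks"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Milk", "Horlicks"},
    {"Bread", "Milk", "Eggs", "Butter"},
]


def count(itemset):
    """Support count: transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rules(itemset):
    """Yield (antecedent, consequent, confidence) for every split of `itemset`."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            yield antecedent, itemset - antecedent, count(itemset) / count(antecedent)


for a, c, conf in rules({"Bread", "Milk", "Butter"}):
    print(sorted(a), "=>", sorted(c), f"{conf:.0%}")
```

Unlike the three rules written out above, this sketch also lists rules with a single item on the left-hand side, such as {Butter} => {Bread, Milk}.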

    Python Implementation of the Apriori Algorithm

    Implementing the Apriori algorithm in Python is simple, as there are libraries already in place. So, all we need to do is import the libraries, load the dataset and build the model with the support and confidence threshold values. Here is how to do it:

    # import the necessary libraries
    import pandas as pd
    from apyori import apriori

    # load the dataset
    mydata = pd.read_csv('<path to the data file>')

    # shape is an attribute that holds (number of rows, number of columns)
    n_rows, n_cols = mydata.shape

    # build a list of lists to store the data, one inner list of item strings per row
    record_set = []
    for i in range(n_rows):
        record_set.append([str(mydata.values[i, j]) for j in range(n_cols)])

    # build the model
    assoc_rules = apriori(record_set, min_support=0.50, min_confidence=0.70, min_lift=1.2, min_length=2)
    assoc_results = list(assoc_rules)

    # print the results
    print(assoc_results)

    Pros and Cons of Apriori

    Some advantages of Apriori are:

    • One of the easiest association rule learning algorithms to understand and implement.
    • Exhaustive and finds all the rules with the specified confidence and support value.
    • Intuitive results that are easy to comprehend and communicate.
    • Fully unsupervised and thus can be used for many different situations as it doesn’t need labeling of data.

    Some shortcomings of Apriori are:

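    • If there are a large number of transactions and the minimum support value is low, the algorithm generates a very large number of candidate itemsets, which may not fit in limited memory, and many of the resulting associations have low support and little practical value.
    • Because of the huge candidate set and the repeated database scans, the algorithm could be slow and inefficient for transactions with many items.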

    Approaches to Increase the Efficiency of the Apriori Algorithm

    • Hash-Based Technique: In this method, we use a hash table (hash-based structure) for generating the k-itemsets and the corresponding count. A hash function is used for table generation.
    • Transaction Reduction: A major bottleneck of the Apriori algorithm is the time taken to scan the transaction database. The performance can be greatly enhanced if this time and the database size are both reduced. This is done by removing from later scans the transactions that do not contain any frequent k-itemset, since such transactions cannot contain any frequent (k+1)-itemset. This process iteratively reduces the size of the database and, thus, the time taken to scan it.
    • Partitioning: In this method, only 2 database scans are done to mine the frequent itemsets. The idea is that any itemset that is frequent in the whole database must be frequent in at least one of the partitions.
    • Sampling: Unlike the others on the list, this is not an exact method, because frequent itemsets can be missed. A random sample S is picked from the database, and the frequent itemsets are searched for in S. Lowering the min_sup (minimum support) value for the sample reduces the chance of missing frequent itemsets.
    • Dynamic Itemset Counting: In this method, new candidate itemsets can be added at any marked start point of the database during the database scanning.
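As one illustration from the list above, transaction reduction can be sketched as a filter applied between levels. The function name `reduce_transactions` is hypothetical, and the data reuses the supermarket example with its five frequent 2-itemsets.

```python
def reduce_transactions(transactions, frequent_k):
    """Keep only transactions that contain at least one frequent k-itemset.

    A transaction without any frequent k-itemset cannot contain a frequent
    (k+1)-itemset, so later scans can skip it.
    """
    return [t for t in transactions if any(f <= t for f in frequent_k)]


transactions = [
    frozenset({"Bread", "Milk", "Eggs"}),
    frozenset({"Bread", "Milk", "Eggs", "Butter"}),
    frozenset({"Milk", "Horlicks"}),
    frozenset({"Bread", "Milk", "Butter"}),
    frozenset({"Bread", "Milk", "Horlicks"}),
    frozenset({"Bread", "Milk", "Eggs", "Butter"}),
]

# Frequent 2-itemsets from the supermarket example.
frequent_2 = [
    frozenset(p)
    for p in [
        ("Bread", "Milk"),
        ("Bread", "Eggs"),
        ("Bread", "Butter"),
        ("Milk", "Eggs"),
        ("Milk", "Butter"),
    ]
]

reduced = reduce_transactions(transactions, frequent_2)
print(len(transactions), "->", len(reduced))  # 6 -> 5
```

Only {Milk, Horlicks} is dropped here; on a large database with many short or unusual transactions, the saving per scan would be correspondingly larger.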

    Applications of the Apriori Algorithm

    Although there are several applications of the Apriori algorithm, the most popular one is the supermarket analysis (market basket analysis) example we saw above. Some other important applications of the unsupervised learning algorithm are:

    • Discovering the social status of diabetics.
    • Analyzing the probability of a forest fire.
    • Recommendation system (Amazon).
    • Google autocomplete feature.
    • Analysis of patient records to suggest relevant tests and health plans to patients.

    Conclusion

    That was all about the basics of the Apriori algorithm and also its general implementation using Python. Apriori is one of the simplest algorithms to work with. It is also a very common algorithm, the applications of which we can see in our daily lives. Remember that Apriori is an iterative algorithm that uses Support to build the frequent itemsets and Confidence to derive association rules from them. These are important terms that form the core of the algorithm.


    FAQs


    It is referred to as the Apriori algorithm because it uses prior knowledge of frequent itemset properties. In this algorithm, we use an iterative approach where we use frequent k-itemsets to find frequent (k+1)-itemsets.

    The Apriori algorithm is used in data mining for mining frequent itemsets and relevant association rules.

    Apriori is an unsupervised learning approach, as it is used to discover interesting patterns and relationships in data.

    The major limitation of the Apriori algorithm is the time and memory spent handling a vast number of candidate sets, which happens with a low minimum support, large itemsets, or many frequent itemsets.

    The Apriori algorithm is used to discover itemsets that occur frequently in a transaction database.
