What is Apriori Algorithm in Data Mining? Implementation with Examples

August 26, 2020

Introduction

The Apriori algorithm is an unsupervised learning algorithm used for association rule mining. It searches for frequent items in a dataset and identifies the correlations and associations among the itemsets. We often see ‘frequently bought together’ and ‘you may also like’ in the recommendation sections of online shopping platforms – that’s the Apriori algorithm at work!

The name Apriori comes from the fact that the algorithm uses ‘prior’ knowledge of the properties of frequent itemsets. If the frequent itemsets of size k are known, the algorithm uses them to find the frequent itemsets of size k+1. In general, starting from the frequent k-itemsets (think frequently bought item combinations), Apriori generates the frequent (k+1)-itemsets.

Apriori considers each transaction as an itemset and uses a bottom-up approach in which frequent subsets are extended item by item. It was developed in 1994 by Agrawal and Srikant.

A little about data mining

Data mining is a part of the data science process in which huge datasets are explored and analyzed to find useful patterns, trends and rules. Data mining techniques are used to build machine learning models and help predict future outcomes. For example, suppose ten customers visited a supermarket on a particular day; eight of them bought bread, and six bought both bread and milk. The trend continued the next day, and so on. Through data mining, we can discover such combinations of products that customers frequently buy together.

Another example is shopping on Amazon. When you buy a certain product, Amazon suggests a few related items under “frequently bought together”. So, if you bought a laptop, it would show you a laptop bag or laptop holder based on similar purchases from other buyers.

The gist is that data mining allows businesses to understand what buyers usually prefer, customize promotions accordingly, and segment different market groups.

Important terms

Apriori property

The Apriori property states that all non-empty subsets of a frequent itemset must also be frequent. For example, if {2,3,4} is frequent, then {2,3}, {2,4} and {3,4} must be frequent as well. This property reduces the search space, thereby improving the efficiency of the level-wise generation of frequent itemsets.

Itemset

An itemset is a set of items; an itemset that occurs frequently in the database is called a frequent itemset. In a huge database there can be very many itemsets, and the search space over them grows exponentially with the number of items. The two most important search mechanisms for finding frequent itemsets are breadth-first search and depth-first search. Example of itemsets:

itemsets
{2,3}
{1,3,4}
{2,3,4}
{2,3,4}
{1,2,3,4}

Frequent pattern mining

Frequent pattern mining is the extraction of frequent itemsets from the database. It forms the basis for the association rules on which the Apriori algorithm is built. For example, in the itemsets above, {2,3,4} is a frequent itemset. Through such mining, machines can discover these patterns automatically.

Association rules

Through association rules, we can find associations and relationships between items in a huge dataset, based on how frequently itemsets occur across transactions. The best-known example of association rules is market basket analysis: based on a set of transactions, supermarkets and stores can identify combinations of two or more items that many customers like to buy together – frequently. For example, {Bread, Milk}, {Bread, Eggs}, {Bread, Milk, Eggs}.

Support

The support of an itemset is how frequently it occurs in the dataset. The support count is the number of transactions containing the itemset, and is denoted by σ. In the itemset table above, σ({2,3,4}) = 3 and σ({3,4}) = 4.
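
To make this concrete, here is a minimal Python sketch (the function name support_count is our own) that counts how often an itemset is contained in the itemsets listed above:

# Example data mirroring the itemset table above.
itemsets = [{2, 3}, {1, 3, 4}, {2, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if set(itemset) <= t)

print(support_count({2, 3, 4}, itemsets))  # 3
print(support_count({3, 4}, itemsets))     # 4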

Minimum support count

It is the minimum number of occurrences an itemset must have in the entire dataset to be considered frequent.

Candidate set

It is the set of candidate itemsets generated in the join step; the support count of each candidate is then computed, and candidates below the minimum support threshold are pruned.

Confidence

Confidence tells us how often items A & B are bought together, relative to the number of times A is bought. In other words, out of all the purchases that include A, confidence measures the proportion in which B was also bought.

Confidence c = frequency(A & B)/frequency(A)
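
As a quick sketch (with hypothetical transaction data of our own), confidence is just a ratio of two support counts:

# Minimal sketch of the confidence calculation (hypothetical transactions).
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def confidence(a, b):
    # confidence(A => B) = frequency(A & B) / frequency(A)
    return support_count(a | b) / support_count(a)

# Of the 3 transactions containing bread, 2 also contain milk.
print(confidence({"bread"}, {"milk"}))  # 0.666...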

Frequent itemset mining

When we know that certain items are bought together by most customers, we can club those items together and offer better discounts and an easier purchase experience. For example, in a supermarket, bread and eggs are kept together so that it is easy for shoppers to pick up both. Amazon shows ‘frequently bought together’ or ‘you may also like’ and also offers discounts if particular items are bought together, for example {oil, rice, pulses} or {tea, biscuits, sugar}. Such an itemset is called a frequent itemset, as it occurs frequently in the database. Searching for such itemsets is called frequent itemset mining.

Lift

Lift is the likelihood of item B being purchased when item A is purchased, while also accounting for the overall popularity of item B. If Lift(A=>B) = 1, the items are independent; if Lift(A=>B) > 1, the items are likely to be purchased together; if Lift(A=>B) < 1, the items are negatively associated and unlikely to be bought together.

Conviction

Conviction is another way of measuring association:

Conviction(A=>B) = (1-support(B))/(1-confidence(A=>B))

A Conviction value of 1 indicates that the items are independent; the higher the value, the stronger the rule.
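
Since both metrics reduce to ratios of supports, here is a minimal sketch (reusing the same hypothetical transactions as above) showing lift and conviction side by side:

# Minimal sketch of lift and conviction (hypothetical transactions).
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

def lift(a, b):
    # lift(A=>B) = confidence(A=>B) / support(B)
    return support(a | b) / (support(a) * support(b))

def conviction(a, b):
    conf = support(a | b) / support(a)
    # Conviction is infinite for a perfect rule (confidence = 1).
    return float("inf") if conf == 1 else (1 - support(b)) / (1 - conf)

print(lift({"bread"}, {"milk"}))        # ~0.89: slight negative association
print(conviction({"bread"}, {"milk"}))  # 0.75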

The Apriori rule

With the background and the relevant terms clear, we can move on to the algorithm itself.

The Apriori algorithm uses breadth-first search and a hash tree structure to mine the transaction database for frequent itemsets, and from these it forms association rules and relationships between the items. The two main steps in the algorithm are join and prune, both of which are performed iteratively to obtain the frequent itemsets.

Join – In this step, the set of frequent k-itemsets is joined with itself to generate the candidate (k+1)-itemsets.

Prune – This step eliminates the candidates whose support count is less than the minimum threshold value. This way, the size of the candidate set is reduced.

If an item A is frequent, then:

  • Its probability P(A) is at least the minimum support threshold.
  • If item B also belongs to the itemset and P(A ∪ B) is at least the minimum support threshold, then the itemset {A, B} is frequent as well.

Step by step understanding of Apriori algorithm

The Apriori algorithm is simple to learn and implement. It follows a fixed set of steps to determine the frequent itemsets in a dataset or database. The minimum support threshold is usually set (assumed) by the analysis team applying the algorithm. Here are the steps, with a code sketch following the list:

  • Step 1: Scan the whole transaction database and compute the support count S of every item.
  • Step 2: Add each item whose support count meets the minimum threshold to the set of frequent 1-itemsets (L1); discard the rest.
  • Step 3: Join Lk-1 with itself to generate the set of candidate k-itemsets (Ck).
  • Step 4: For each candidate k-itemset, compute its support count S.
  • Step 5: Discard the candidates whose support count is below the minimum threshold.
  • Step 6: Add the candidates whose support count meets the threshold to the frequent k-itemsets (Lk).
  • Step 7: Repeat from step 3 until no further frequent itemsets are found.
  • Step 8: For each frequent itemset L, generate all of its non-empty subsets.
  • Step 9: For each non-empty subset of L, compute the confidence C of the corresponding rule.
  • Step 10: If the confidence C is at least the minimum required confidence, add the rule to the strong rules; otherwise move on to the next subset.
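
The steps above translate almost directly into code. Below is a minimal from-scratch Python sketch of the frequent-itemset part (steps 1–8); it is illustrative rather than optimized, and all the names are our own:

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Steps 1-2: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    levels = [{frozenset([i]) for i in items
               if support_count(frozenset([i])) >= min_support_count}]

    k = 2
    while levels[-1]:  # Step 7: stop when a level yields no frequent itemsets.
        prev = levels[-1]
        # Step 3 (join): merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Steps 4-6 (prune): keep only candidates meeting the support threshold.
        levels.append({c for c in candidates
                       if support_count(c) >= min_support_count})
        k += 1

    return [s for level in levels for s in level]

Running this function on the supermarket transactions of the next section with a minimum support count of 3 reproduces the frequent itemsets derived there by hand.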

Supermarket data analysis – a simple example for illustration

This is the most popular example and the main purpose for which the algorithm is widely used – market basket analysis, i.e. analyzing the items that most customers buy together frequently. For example, if you are a student, you would buy stationery items like notebooks, pencils, erasers, glue and colour paper, and most students would buy all these items together! In the same way, most bachelors would buy milk, eggs and bread together! Let us take the above example, with a small (self-created) dataset, to understand how the algorithm is actually applied.

Suppose our database contains six transactions, and we choose the support threshold to be 50%. So, our minimum support count (sup_threshold) = 6*50/100 = 3. This means that only itemsets occurring 3 or more times will be considered frequent.

The table below shows all the transactions:

Transaction Item list
T1 Bread, Milk, Eggs
T2 Bread, Milk, Eggs, Butter
T3 Milk, Horlicks
T4 Bread, Milk, Butter
T5 Bread, Milk, Horlicks
T6 Bread, Milk, Eggs, Butter

Now, let us calculate the count of each item:

Item Count
Bread 5
Milk 6
Eggs 3
Butter 3
Horlicks 2

We see that the count of Horlicks is less than 3, so Horlicks will be removed, while the other items will be added to the frequent itemset. This is called the pruning step.

Now, we will find the combinations of items that occur together. This is called the join step. Since there are four items, there are six unique 2-item combinations:

Itemset Count
Bread, Milk 5
Bread, Eggs 3
Bread, Butter 3
Milk, Eggs 3
Milk, Butter 3
Eggs, Butter 2

Again, we apply pruning and eliminate the last itemset {Eggs, Butter}, because its count is below sup_threshold.

Now, with the above, we create the 3-itemsets. For this, we join the frequent 2-itemsets with each other and check the candidates against the transaction table. Joining also produces candidates such as {Bread, Eggs, Butter} and {Milk, Eggs, Butter}, but these contain the infrequent pair {Eggs, Butter}, so the Apriori property prunes them immediately. The remaining candidates and their counts are:

Itemset Count
Bread, Milk, Eggs 3
Bread, Milk, Butter 3

We see from the above that {Bread, Milk, Eggs} and {Bread, Milk, Butter} are the only two combinations that pass the minimum support threshold of 3. This means that only these two itemsets are frequent.

Now, for each of these frequent itemsets, we have to create the association rule. Let us create the same for {Bread, Milk, Eggs}:

{Bread, Milk} => {Eggs}

Confidence Cbm = Support(Bread, Milk, Eggs)/Support(Bread, Milk) = 3/5 = 60%

{Bread, Eggs} => {Milk}

Confidence Cbe = Support(Bread, Milk, Eggs)/Support(Bread, Eggs) = 3/3 = 100%

{Milk, Eggs} => {Bread}

Confidence Cme = Support(Bread, Milk, Eggs)/Support(Milk, Eggs) = 3/3 = 100%

This shows that the last two rules are strong; for example, those who buy bread and eggs also buy milk!

Similarly, we can find the association rules for the other itemset too.
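
To sanity-check these numbers, a short script over the six transactions above reproduces the three confidence values:

# Verify the rule confidences for {Bread, Milk, Eggs}.
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Milk", "Eggs", "Butter"},
    {"Milk", "Horlicks"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Milk", "Horlicks"},
    {"Bread", "Milk", "Eggs", "Butter"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

full = {"Bread", "Milk", "Eggs"}
for antecedent in ({"Bread", "Milk"}, {"Bread", "Eggs"}, {"Milk", "Eggs"}):
    conf = support_count(full) / support_count(antecedent)
    print(f"{sorted(antecedent)} => {sorted(full - antecedent)}: {conf:.0%}")
# ['Bread', 'Milk'] => ['Eggs']: 60%
# ['Bread', 'Eggs'] => ['Milk']: 100%
# ['Eggs', 'Milk'] => ['Bread']: 100%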

Python implementation of the Apriori algorithm

Implementing the algorithm in Python is simple, as ready-made libraries exist. All we need to do is import the libraries, load the dataset, and build the model with the support and confidence threshold values.

# import the necessary libraries
import pandas as pd
from apyori import apriori

# load the dataset (header=None assumes the file has no header row and
# each row is one transaction)
mydata = pd.read_csv('<path to the data file>', header=None)

# mydata.shape gives the numbers of rows and columns, so we need not
# hard-code them
rows, cols = mydata.shape

# build a list of lists: apyori expects each transaction as a list of items
record_set = []
for i in range(rows):
    record_set.append([str(mydata.values[i, j]) for j in range(cols)
                       if pd.notna(mydata.values[i, j])])  # skip empty cells

# build the model; apyori's parameters are min_support, min_confidence,
# min_lift and max_length (the min_length seen in some tutorials is not
# an apyori parameter)
assoc_rules = apriori(record_set, min_support=0.50, min_confidence=0.70,
                      min_lift=1.2)
assoc_results = list(assoc_rules)

# print the results
print(assoc_results)
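
print(assoc_results) dumps raw tuples that are hard to read. Assuming apyori's usual result structure (each record carries items, support and a list of ordered_statistics), a small loop prints the rules more readably:

# pretty-print the rules from assoc_results above
for record in assoc_results:
    for stat in record.ordered_statistics:
        print(f"{set(stat.items_base)} => {set(stat.items_add)} "
              f"(support={record.support:.2f}, "
              f"confidence={stat.confidence:.2f}, lift={stat.lift:.2f})")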

Pros and Cons of Apriori

Some advantages of Apriori are:

  • Easy to understand and implement amongst all the association rule learning algorithms
  • Exhaustive and finds all the rules with the specified confidence and support value
  • Intuitive results that are easy to comprehend and communicate
  • Fully unsupervised, thus can be used for many different situations as it doesn’t need labelling of data

Some shortcomings of Apriori are:

  • The algorithm scans the whole transaction database repeatedly, which becomes expensive when there are a large number of transactions and memory is limited.
  • Because of the huge candidate sets it generates, the algorithm can be slow and inefficient for transactions with many items or for low support thresholds.

Approaches to increase Apriori algorithm efficiency

  • Hash-Based Technique: In this method, a hash table (hash-based structure) is used for generating the k-itemsets and the corresponding count. A hash function is used for table generation.
  • Transaction Reduction: The bottleneck of the Apriori algorithm is the time taken to scan the transaction database. Performance improves greatly if both this time and the database size are reduced, which is done by deleting transactions that can no longer contain any frequent itemset. This process iteratively shrinks the database and thus the time taken to scan it (see the sketch after this list).
  • Partitioning: In this method, only two database scans are done to mine the frequent itemsets. The idea is that any frequent itemset should be present in at least one of the partitions.
  • Sampling: Not as accurate as the other methods, but it works well if the min_sup (minimum support) value is lowered. A random sample S is picked from the database, and frequent itemsets are searched for within S.
  • Dynamic Itemset Counting: In this method, new candidate itemsets can be added at any marked start point of the database during database scanning.
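
As an illustration of the transaction-reduction idea above, here is a small sketch (our own naming) that drops transactions which can no longer contribute after pass k:

from itertools import combinations

# A transaction containing no frequent k-itemset cannot contain any frequent
# (k+1)-itemset, so it can be removed before the next database scan.
def reduce_transactions(transactions, frequent_k, k):
    frequent_k = {frozenset(s) for s in frequent_k}
    return [t for t in transactions
            if any(frozenset(c) in frequent_k for c in combinations(t, k))]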

Applications of Apriori algorithm

There are several applications of the Apriori algorithm, the most popular one being the supermarket analysis (market basket analysis) example we saw above. Some other applications are:

  • Discovering the social status of diabetics
  • Analyzing the probability of forest fires
  • Recommendation systems (e.g. Amazon)
  • Google's auto-complete feature
  • Analysis of patient records to suggest relevant tests and health plans

Conclusion

In this article, we discussed the basics of the Apriori algorithm and saw a simple implementation in Python. Apriori is one of the simplest algorithms to understand and implement, and we encounter its applications in our daily lives. Remember that Apriori is an iterative algorithm that uses support to build the frequent itemsets and confidence to derive strong rules from them – these terms form the core of the algorithm.
