What is Apriori Algorithm in Data Mining? Implementation with Examples

By Ramya Shankar | February 26, 2022

The Apriori algorithm is an unsupervised learning algorithm used for association rule mining. It searches a dataset for frequent itemsets and derives associations and correlations between the items.

We often see ‘frequently bought together’ and ‘you may also like’ on the recommendation section of online shopping platforms. That’s possible because of the apriori algorithm!

The name apriori comes from the fact that the algorithm uses 'a priori' (prior) knowledge of the frequent itemset properties. If the frequent 2-itemsets are known, the algorithm uses them to find the frequent 3-itemsets. In general, the Apriori algorithm derives the frequent (k+1)-itemsets from the frequent k-itemsets (think frequently bought items).

Apriori considers each transaction as an itemset and uses a bottom-up approach in which frequent subsets are extended item by item. It was developed in 1994 by Agrawal and Srikant.

A Little About Data Mining

Data mining is a part of the data science process, where huge datasets are explored and analyzed to find useful patterns, trends and rules. We use data mining techniques to develop machine learning models and help predict future outcomes.

For example, suppose 10 customers visit a supermarket on a particular day; 8 of them buy bread, and 6 of them buy both bread and milk. This trend continues the next day, and so on. Through data mining, we can discover such combinations of products that users frequently buy together.

Another example is shopping on Amazon. When you buy a certain product, Amazon suggests a few more related items to buy, displayed under “frequently bought together.” So, if you bought a laptop, it would show you a laptop bag or laptop holder based on similar purchases from other buyers.


The gist is that data mining allows businesses to understand what buyers usually prefer, customize promotions accordingly, and segment various market groups.

Important Terms

  • Apriori property

The Apriori property states that all the non-empty subsets of a frequent itemset must also be frequent. For example, if {Bread, Milk, Eggs} is frequent, then {Bread, Milk} must be frequent too. The Apriori property helps reduce the search space, thereby improving the efficiency of level-wise generation of frequent itemsets.

  • Itemset

An itemset is a collection of one or more items that occur together in the database. In a huge database, there can be many itemsets, and the search space of these itemsets is exponential in the number of items. The 2 most important search mechanisms for finding frequent itemsets are breadth-first search and depth-first search. Example itemsets:

{2,3}
{1,3,4}
{2,3,4}
{2,3,4}
{1,2,3,4}
  • Frequent pattern mining

Frequent pattern mining is the extraction of frequent itemsets from the database. It forms the basis for the association rules on which the Apriori algorithm is built. For example, among the itemsets above, {2,3,4} is a frequent itemset. Through mining, machines can find such patterns.

  • Association rules

Through association rules, we can find various associations and relationships between items in a huge dataset, based on how frequently an itemset occurs in a transaction. The best example of an association rule is the market basket analysis.

Based on a set of transactions, supermarkets and stores can identify the association between 2 or more items that most people like to buy together frequently, for example, {Bread, Milk}, {Bread, Eggs}, and {Bread, Milk, Eggs}.

  • Support

It tells about the items that are frequently bought together. Support count is the frequency of occurrence of an itemset, and it is represented by σ. In the above example, the support count σ({2,3,4}) = 3 (the itemsets {2,3,4}, {2,3,4} and {1,2,3,4} all contain it) and the support count σ({3,4}) = 4.
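To make the counting concrete, here is a minimal Python sketch that computes these two support counts over the example itemsets shown earlier:

# Count the support of an itemset over the example itemsets above.
itemsets = [{2, 3}, {1, 3, 4}, {2, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}]

def support_count(target, itemsets):
    """Number of itemsets that contain every element of `target`."""
    return sum(1 for s in itemsets if set(target) <= s)

print(support_count({2, 3, 4}, itemsets))  # 3
print(support_count({3, 4}, itemsets))     # 4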

  • Minimum support count

It is the minimum number of times an itemset must occur in the entire dataset to be considered frequent.

  • Candidate set

It is the set of candidate itemsets generated in the join step. Each candidate's support count is compared against the minimum threshold to decide whether it is frequent.

  • Confidence

Confidence tells us how often item B is bought when item A is bought: it is the number of transactions in which A and B are bought together, divided by the number of transactions in which A is bought.

Confidence c = frequency(A and B)/frequency(A)

  • Frequent itemset mining

When we know that certain items are bought together by most users, we can club those items together and offer better discounts and ease of purchase for the users. For example, in a supermarket, bread and eggs are kept together, so that it is easy for users to pick them.

Amazon shows ‘frequently bought together’ or ‘you may also like’ sections and also offers discounts if particular items are bought together, for example, {Oil+rice+pulses} or {tea+biscuits+sugar}. Such itemset is called a frequent itemset, as it frequently occurs when the database is searched upon. Searching for such itemsets is called frequent itemset mining.

  • Lift

Lift is the likelihood of purchase of item B when item A is purchased, while also taking into account the popularity of item B: Lift(A=>B) = Confidence(A=>B)/Support(B). If Lift(A=>B) = 1, the items are not correlated; if Lift(A=>B) > 1, the items are likely to be purchased together; if Lift(A=>B) < 1, the items are unlikely to be bought together.

  • Conviction

Conviction is another way of finding association, where-in,

Conviction(A=>B) = (1-support(B))/(1-confidence(A=>B))

If the conviction value is 1, it indicates that the items are not correlated; higher values indicate a stronger association between A and B.
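To tie these three metrics together, here is a minimal Python sketch that computes them for a rule A => B. The transaction list is hypothetical, invented purely for illustration:

# Hypothetical transactions, invented purely to illustrate the metrics.
transactions = [
    {"bread", "milk"}, {"bread", "eggs"}, {"bread", "milk", "eggs"},
    {"milk"}, {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / n

def confidence(a, b):
    # frequency(A and B) / frequency(A)
    return support(a | b) / support(a)

def lift(a, b):
    # Confidence of A => B, corrected for the popularity of B.
    return confidence(a, b) / support(b)

def conviction(a, b):
    # (1 - support(B)) / (1 - confidence(A => B)); infinite if confidence is 1.
    c = confidence(a, b)
    return float("inf") if c == 1 else (1 - support(b)) / (1 - c)

A, B = {"bread"}, {"milk"}
print(confidence(A, B), lift(A, B), conviction(A, B))  # 0.75 0.9375 0.8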

The Apriori Rule

So now that we are done with the background of the algorithm and relevant terms, we can move ahead with understanding what the actual algorithm is.

The Apriori algorithm uses breadth-first search and a hash tree structure to mine the transactional database for frequent itemsets, and forms association rules and relationships between the items. The 2 main steps in the algorithm are join and prune, both of which are performed iteratively to get the most frequent itemsets.

  1. Join – In this step, the frequent k-itemsets are joined with each other to generate the candidate (k+1)-itemsets.
  2. Prune – This step eliminates the candidates whose support count is less than the minimum threshold value. This reduces the size of the candidate set. (Both steps are sketched in code below.)

If an itemset A is frequent, then,

  • Its probability of occurrence P(A) is greater than the minimum support threshold.
  • If B is another item and P(A ∪ B) is greater than the minimum support threshold, then the itemset A ∪ B is also frequent.
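Here is a minimal, self-contained Python sketch of one join + prune pass, assuming the frequent 1-itemsets are already known. The toy transactions and the minimum support count of 3 mirror the supermarket example worked through later in this article:

# One join + prune pass over toy transactions (minimum support count 3).
transactions = [
    {"Bread", "Milk", "Eggs"}, {"Bread", "Milk", "Eggs", "Butter"},
    {"Milk", "Horlicks"}, {"Bread", "Milk", "Butter"},
    {"Bread", "Milk", "Horlicks"}, {"Bread", "Milk", "Eggs", "Butter"},
]
min_count = 3
frequent_1 = {frozenset({i}) for i in ("Bread", "Milk", "Eggs", "Butter")}

# Join: union pairs of frequent 1-itemsets to form candidate 2-itemsets.
candidates = {a | b for a in frequent_1 for b in frequent_1 if len(a | b) == 2}

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Prune: keep only candidates that meet the minimum support count.
frequent_2 = {c for c in candidates if support_count(c) >= min_count}
print(sorted(tuple(sorted(c)) for c in frequent_2))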

Step-by-step Understanding of Apriori Algorithm

Apriori algorithm is simple to learn and implement. It follows a set of processes that help us determine the frequent itemset from the dataset or database. Usually, the minimum support threshold is set (assumed) by the analysis team that is going to apply the algorithm. Here are the steps:

  • 1st Step: Scan the whole transaction database to get the support count S of each item.
  • 2nd Step: If the support S is greater than or equal to the minimum threshold, add the item to the frequent 1-itemsets (L1); otherwise discard it.
  • 3rd Step: Join Lk-1 with Lk-1 to generate the set of candidate k-itemsets.
  • 4th Step: Scan the database to get the support S of each candidate k-itemset.
  • 5th Step: Discard the candidates whose support is below the minimum threshold.
  • 6th Step: Add the remaining candidates to the frequent k-itemsets (Lk).
  • 7th Step: Repeat steps 3 to 6 until no new frequent itemsets are found.
  • 8th Step: For each frequent itemset L, generate all of its non-empty subsets.
  • 9th Step: For each rule formed from a frequent itemset L and one of its subsets, find the confidence C.
  • 10th Step: If the confidence C is greater than or equal to the minimum required confidence, add the rule to the strong rules; otherwise move on to the next subset.
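Putting steps 1 to 7 together, here is a compact, illustrative Python sketch of the level-wise loop. It counts candidate support by scanning the transactions directly, so it favors clarity over performance; the toy data and threshold of 3 match the worked example below:

from collections import Counter

def apriori_frequent_itemsets(transactions, min_count):
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items and keep the frequent ones (steps 1-2).
    counts = Counter(item for t in transactions for item in t)
    level = {frozenset({i}) for i, c in counts.items() if c >= min_count}
    frequent = set(level)
    k = 1
    while level:
        k += 1
        # Join (step 3): build candidate k-itemsets from the previous level.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune (steps 4-6): keep candidates meeting the minimum count.
        level = {
            c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count
        }
        frequent |= level
    return frequent

transactions = [
    {"Bread", "Milk", "Eggs"}, {"Bread", "Milk", "Eggs", "Butter"},
    {"Milk", "Horlicks"}, {"Bread", "Milk", "Butter"},
    {"Bread", "Milk", "Horlicks"}, {"Bread", "Milk", "Eggs", "Butter"},
]
for itemset in sorted(apriori_frequent_itemsets(transactions, 3), key=len):
    print(sorted(itemset))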

Supermarket Data Analysis – A Simple Example

This is the most popular example and the main use-case of the unsupervised learning algorithm – market basket analysis – i.e. analyzing the items that most customers buy frequently. For example, if you are a student, you would buy stationery items like notebooks, pencils, erasers, glue, colored papers, and so on.

Most students would buy all these items together! In the same way, most bachelors would buy milk, eggs, and bread together! Let us take the above example, with a limited (self-created) dataset, to understand how the algorithm is actually applied.

Suppose in our database, there are 6 transactions, and we choose the support threshold to be 50%. So, our minimum support count (sup_threshold) = 6*50/100 = 3. This means that only itemsets with a count of 3 or more will be considered frequent. The table below shows all the transactions:

Transaction | Item list
T1 | Bread, Milk, Eggs
T2 | Bread, Milk, Eggs, Butter
T3 | Milk, Horlicks
T4 | Bread, Milk, Butter
T5 | Bread, Milk, Horlicks
T6 | Bread, Milk, Eggs, Butter

Now, let us calculate the count of each item:

Item | Count
Bread | 5
Milk | 6
Eggs | 3
Butter | 3
Horlicks | 2

We see that the count of Horlicks is less than 3, so Horlicks will be removed, while the other items will be added to the frequent itemset. This is called the pruning step.
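A quick way to verify this pruning step is to reproduce the item-count table with Python's collections.Counter:

from collections import Counter

# Reproducing the item-count table above from the six transactions;
# a quick sanity check rather than a full implementation.
transactions = [
    ["Bread", "Milk", "Eggs"],
    ["Bread", "Milk", "Eggs", "Butter"],
    ["Milk", "Horlicks"],
    ["Bread", "Milk", "Butter"],
    ["Bread", "Milk", "Horlicks"],
    ["Bread", "Milk", "Eggs", "Butter"],
]
counts = Counter(item for t in transactions for item in t)
print(counts)  # Horlicks has count 2 (< 3), so it gets pruned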

Now, we will find the combinations of items that occur together. This is the join step. Since there are 4 items, there will be 6 unique combinations:

Itemset | Count
Bread, Milk | 5
Bread, Eggs | 3
Bread, Butter | 3
Milk, Eggs | 3
Milk, Butter | 3
Eggs, Butter | 2

Again, we apply pruning and eliminate the last itemset, {Eggs, Butter}, because its count is less than sup_threshold.

Now, with the above, we will create the candidate 3-itemsets. For this, we join the frequent 2-itemsets and count each candidate's occurrences in the transaction table to see which pass the minimum threshold test.

Itemset | Count
Bread, Milk, Eggs | 3
Bread, Milk, Butter | 3
Bread, Eggs, Butter | 2
Milk, Eggs, Butter | 2

We see from above that {Bread, Milk, Eggs} and {Bread, Milk, Butter} are the only two combinations that pass the minimum support threshold of 3. This means that only these 2 itemsets are frequent.

Now, for each of these frequent itemsets, we have to create the association rule. Let us create the same for {Bread, Milk, Eggs}:

{Bread, Milk} => {Eggs}

Confidence Cbm = Support(Bread, Milk, Eggs)/Support(Bread, Milk) = 3/5 = 60%

{Bread, Eggs} => {Milk}

Confidence Cbe = Support(Bread, Milk, Eggs)/Support(Bread, Eggs) = 3/3 = 100%

{Milk, Eggs} => {Bread}

Confidence Cme = Support(Bread, Milk, Eggs)/Support(Milk, Eggs) = 3/3 = 100%

This shows that the above rules are strong; for example, those who buy bread and eggs are very likely to buy milk! Similarly, we can find the association rules for the other itemset too.
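The three confidence values can be verified with a few lines of Python over the same six transactions:

# Verifying the three confidence values from the transaction table above.
transactions = [
    {"Bread", "Milk", "Eggs"}, {"Bread", "Milk", "Eggs", "Butter"},
    {"Milk", "Horlicks"}, {"Bread", "Milk", "Butter"},
    {"Bread", "Milk", "Horlicks"}, {"Bread", "Milk", "Eggs", "Butter"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

full = {"Bread", "Milk", "Eggs"}
for antecedent in ({"Bread", "Milk"}, {"Bread", "Eggs"}, {"Milk", "Eggs"}):
    consequent = full - antecedent
    conf = support_count(full) / support_count(antecedent)
    print(f"{sorted(antecedent)} => {sorted(consequent)}: {conf:.0%}")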

Python Implementation of the Apriori Algorithm

Implementing the Apriori algorithm in Python is simple, as there are libraries already in place, such as apyori. All we need to do is import the libraries, load the dataset, and build the model with the support and confidence threshold values. Here is how to do it:

# importing the necessary libraries
import pandas as pd
from apyori import apriori

# load the dataset; header=None assumes the CSV has no header row
mydata = pd.read_csv('<path to the data file>', header=None)

# check the number of rows and columns
print(mydata.shape)

# build a list of lists: apyori expects each transaction as a list of
# item strings; skip empty cells (NaN) left by short transactions
record_set = []
for i in range(len(mydata)):
    record_set.append([str(v) for v in mydata.values[i] if pd.notna(v)])

# build the model; the thresholds mirror the 50% support and 70%
# confidence used in the worked example above - tune them for your data
assoc_rules = apriori(record_set, min_support=0.50, min_confidence=0.70,
                      min_lift=1.2)
assoc_results = list(assoc_rules)

# print the results
print(assoc_results)
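The raw print output is hard to read. As of recent apyori versions, each result is a RelationRecord holding the itemset, its support, and a list of ordered statistics, which can be unpacked into readable rules like this:

# Unpack the apyori results: each RelationRecord exposes .items,
# .support and .ordered_statistics (each statistic holds .items_base,
# .items_add, .confidence and .lift).
for record in assoc_results:
    for stat in record.ordered_statistics:
        print(
            f"{set(stat.items_base)} => {set(stat.items_add)} "
            f"(support={record.support:.2f}, "
            f"confidence={stat.confidence:.2f}, lift={stat.lift:.2f})"
        )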

Pros and Cons of Apriori

Some advantages of Apriori are:

  • The easiest of the association rule learning algorithms to understand and implement.
  • Exhaustive: it finds all the rules that satisfy the specified confidence and support values.
  • Produces intuitive results that are easy to comprehend and communicate.
  • Fully unsupervised, so it can be used in many different situations because it doesn't need labeled data.

Some shortcomings of Apriori are:

  • It requires repeated scans of the whole transaction database, which becomes expensive when there are a large number of transactions but limited memory; with a low minimum support value, it can also produce many weak, uninteresting rules.
  • Because of the huge candidate set, the algorithm can be slow and inefficient for transactions with many items.

Approaches to Increase the Efficiency of the Apriori Algorithm

  • Hash-Based Technique: In this method, we use a hash table (hash-based structure) for generating the k-itemsets and the corresponding count. A hash function is used for table generation.
  • Transaction Reduction: The main cost of the Apriori algorithm is the time taken to scan the transaction database. The performance can be greatly enhanced if both this time and the database size are reduced. This is done by deleting transactions that can no longer contain any frequent itemset.
    This process iteratively reduces the size of the database and, thus, the time taken to scan it, making it one of the most effective ways to increase the efficiency of the algorithm (a minimal sketch appears after this list).
  • Partitioning: In this method, only 2 database scans are done to mine the frequent itemsets. The idea is that any frequent itemset should be present in at least one of the partitions.
  • Sampling: Unlike the others on this list, this is not an exact method: a random sample S is picked from the database, and frequent itemsets are searched for within the sample, usually with a lowered min_sup (minimum support) value so that itemsets frequent in the full database are not missed.
  • Dynamic Itemset Counting: In this method, new candidate itemsets can be added at any marked start point of the database during the database scanning.
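As an illustration of the transaction reduction idea above, here is a minimal Python sketch (toy data and a support count threshold of 3, both assumed for the example) that shrinks the database after the first scan:

from collections import Counter

# Transaction reduction: after the first scan, drop items that cannot
# appear in any frequent itemset, shrinking the database that later
# scans must read.
transactions = [
    ["Bread", "Milk", "Eggs"], ["Bread", "Milk", "Eggs", "Butter"],
    ["Milk", "Horlicks"], ["Bread", "Milk", "Butter"],
    ["Bread", "Milk", "Horlicks"], ["Bread", "Milk", "Eggs", "Butter"],
]
min_count = 3

counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in counts.items() if c >= min_count}

# Keep only frequent items; drop transactions that become too short to
# contain any candidate 2-itemset.
reduced = [[i for i in t if i in frequent_items] for t in transactions]
reduced = [t for t in reduced if len(t) >= 2]
print(reduced)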

Applications of the Apriori Algorithm

Although there are several applications of the Apriori algorithm, the most popular one is the supermarket analysis (market basket analysis) example we saw above. Some other important applications of the unsupervised learning algorithm are:

  • Discovering the social status of diabetics.
  • Analyzing the probability of a forest fire.
  • Recommendation system (Amazon).
  • Google auto-complete feature.
  • Analysis of patient records to suggest relevant tests and health plans.

Conclusion

That was all about the basics of the Apriori algorithm and also its general implementation using Python. Apriori is one of the simplest algorithms to work with. It is also a very common algorithm, the applications of which we can see in our daily lives.

Remember that Apriori is an iterative algorithm that uses Support and Confidence to build the frequent itemset. These are important terms that form the core of the algorithm.
