Apriori Algorithm in Python (Recommendation Engine)
The Apriori algorithm works on the principle of association rule mining.
Association rule mining is a technique for identifying underlying relations between different items, for example how frequently items are bought together or how similarly different users buy them.
In this article, we will look at how the Apriori algorithm works, along with a Python example.
In a supermarket, the Apriori algorithm can be used to keep related items together. For example, shaving foam, shaving cream, and other men's grooming products can be placed adjacent to each other based on how often they are bought together. This makes it easier for customers to find the products and can bring more business and profit to the supermarket.
Now that we have a basic idea of the Apriori algorithm, let's look into its theory.
Theory of the Apriori Algorithm
The Apriori algorithm was proposed by Agrawal and Srikant in 1994.
There are three major components of the Apriori algorithm:
1) Support
2) Confidence
3) Lift
We will explain these concepts with the help of an example.
Suppose we have a record of 1000 customer transactions and we want to find the support, confidence, and lift for milk and diapers. Out of the 1000 transactions, 120 contain milk and 150 contain diapers. Of the 150 transactions that contain diapers, 30 also contain milk. We will use this data to calculate support, confidence, and lift.
Support
Support refers to the popularity of an item and can be calculated as the number of transactions containing a particular item divided by the total number of transactions.
Support(diaper) = (Transactions containing diaper) / (Total transactions)
Support(diaper) = 150 / 1000 = 15%
Confidence
Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by the total number of transactions where A is bought. Mathematically, it can be represented as:
Confidence(A → B) = (Transactions containing both (A and B))/(Transactions containing A)
For example, the likelihood of purchasing a diaper given that a customer purchases milk is:
Confidence(milk → diaper) = (Transactions containing both milk and diaper) / (Transactions containing milk)
Confidence(milk → diaper) = 30 / 120 = 25%
Confidence is similar to the conditional probability used in the Naive Bayes algorithm.
Lift
Lift refers to the increase in the ratio of the sale of B when A is sold.
Lift(A → B) can be calculated by dividing Confidence(A → B) by Support(B).
Mathematically it can be represented as:
Lift(A→B) = (Confidence (A→B))/(Support (B))
Lift(milk → diaper) = (Confidence (milk → diaper))/(Support (diaper))
Lift(milk → diaper) = 25 / 15 = 1.66
So, by the lift measure, customers who buy milk are 1.66 times more likely to also buy diapers than customers in general.
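The three measures can be checked with a few lines of Python using the numbers from the example above:

```python
# Worked example: 1000 transactions, 120 contain milk,
# 150 contain diapers, and 30 contain both.
total_transactions = 1000
milk = 120
diaper = 150
both = 30

support_diaper = diaper / total_transactions                 # 150 / 1000 = 0.15
confidence_milk_diaper = both / milk                         # 30 / 120 = 0.25
lift_milk_diaper = confidence_milk_diaper / support_diaper   # 0.25 / 0.15 ≈ 1.66

print(f"Support(diaper)            = {support_diaper:.0%}")
print(f"Confidence(milk -> diaper) = {confidence_milk_diaper:.0%}")
print(f"Lift(milk -> diaper)       = {lift_milk_diaper:.2f}")
```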
Association rules by lift:
lift = 1 → There is no association between A and B.
lift < 1 → A and B are unlikely to be bought together.
lift > 1 → A and B are likely to be bought together; the greater the lift, the greater the likelihood.
Steps Involved in Apriori Algorithm
The Apriori algorithm tries to extract rules for every possible combination of items. For instance, lift can be calculated for item A and item B, item A and item C, item A and item D, then item B and item C, item B and item D, and then for larger combinations of items, e.g. items A, B, and C; similarly items A, B, and D, and so on.
For larger datasets, this computation can make the process extremely slow.
To speed up the process, we need to perform the following steps:
- Set a minimum value for support and confidence. This means that we are only interested in finding rules for items that have a certain minimum presence in the data (support) and a minimum value for co-occurrence with other items (confidence).
- Extract all the subsets having a higher value of support than a minimum threshold.
- Select all the rules from the subsets with confidence value higher than the minimum threshold.
- Order the rules by descending order of Lift.
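These pruning steps can be sketched in plain Python on a toy set of transactions. The items and thresholds below are invented for illustration, and a full Apriori implementation would generate candidate itemsets level by level; this sketch only scores pairs of items:

```python
from itertools import combinations

transactions = [
    {"milk", "diaper", "bread"},
    {"milk", "diaper"},
    {"milk", "bread"},
    {"diaper", "bread"},
    {"milk", "diaper", "beer"},
]
min_support = 0.4
min_confidence = 0.6
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

# Steps 1-2: keep only itemsets (here, pairs) above the support threshold.
items = sorted({item for t in transactions for item in t})
frequent_pairs = [frozenset(p) for p in combinations(items, 2)
                  if support(frozenset(p)) >= min_support]

# Step 3: keep rules A -> B whose confidence clears the threshold.
rules = []
for pair in frequent_pairs:
    for a in pair:
        b = next(iter(pair - {a}))
        conf = support(pair) / support(frozenset({a}))
        if conf >= min_confidence:
            lift = conf / support(frozenset({b}))
            rules.append((a, b, conf, lift))

# Step 4: order the rules by descending lift.
rules.sort(key=lambda r: r[3], reverse=True)
for a, b, conf, lift in rules:
    print(f"{a} -> {b}: confidence={conf:.2f}, lift={lift:.2f}")
```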
Apriori Algorithm in Python
Now that we know how the Apriori algorithm works, we will apply it to a real dataset.
You can download the dataset here.
This dataset contains 7500 transactions made over the course of a week at a French retail store.
We will not implement the algorithm from scratch; instead, we will use an existing Apriori implementation in Python. The library can be installed by following the documentation here.
I will be using a Jupyter notebook to write the code.
4 Steps to Implement the Apriori Algorithm
1. Importing the Libraries
We will import numpy, pandas, matplotlib, and apriori.
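A minimal import cell might look like the following, assuming the Apriori implementation comes from the apyori package (installable with `pip install apyori`):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # optional, only needed if you plot results
from apyori import apriori
```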
2. Importing the Dataset
Now let's import the dataset and see what it looks like: how many transactions there are and what the shape of the dataset is.
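Assuming the file is saved as `store_data.csv` (the filename here is an assumption) with no header row, it can be loaded like this:

```python
import pandas as pd

# header=None: the file has no column names, just items per transaction
dataset = pd.read_csv('store_data.csv', header=None)
print(dataset.shape)    # (number of transactions, number of columns)
print(dataset.head())
```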
So we have 20 columns and 7500 transactions.
3. Data Preprocessing
The Apriori library we are going to use requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction is an inner list within it. Currently, our data is in the form of a pandas DataFrame. To convert the DataFrame into a list of lists, execute the following code.
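One way to do the conversion is shown below on a tiny stand-in DataFrame (two transactions instead of 7500); the same loop applies unchanged to the full `dataset`, and the NaN padding of short transactions is skipped:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real DataFrame loaded from the CSV
dataset = pd.DataFrame([["milk", "bread", np.nan],
                        ["diaper", np.nan, np.nan]])

# Convert the DataFrame into a list of lists: one inner list per transaction
records = []
for i in range(len(dataset)):
    records.append([str(item) for item in dataset.iloc[i] if pd.notna(item)])

print(records)
```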
4. Using Apriori
The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori class that we imported from the apriori library.
The apriori class requires some parameter values to work. The first parameter is the list of lists that you want to extract rules from. The second parameter is the min_support parameter. This parameter is used to select the items with support values greater than the value specified by the parameter. Next, the min_confidence parameter filters those rules that have confidence greater than the confidence threshold specified by the parameter. Similarly, the min_lift parameter specifies the minimum lift value for the shortlisted rules. Finally, the min_length parameter specifies the minimum number of items that you want in your rules.
min_length is 2 since we want at least two products in our rules.
For this dataset, the minimum support can be estimated as (minimum number of times a product is purchased per day × 7 days) / (number of transactions in the week). If we want products purchased at least 6 times a day:
support = (6 × 7) / 7500 = 0.0056
The minimum confidence for the rules is 20% or 0.2.
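Assuming the preprocessed transactions are stored in a list of lists called `records` (the variable name is an assumption), the call might look like the sketch below. The min_lift value of 3 is an assumed threshold not stated above, apyori's `apriori` returns a generator that we materialize with `list`, and note that recent apyori versions document `max_length` rather than `min_length`, so the `min_length` argument may be silently ignored by the library:

```python
from apyori import apriori

# records: the list of lists produced in the preprocessing step
association_rules = apriori(records,
                            min_support=0.0056,   # ~6 purchases per day over the week
                            min_confidence=0.2,
                            min_lift=3,           # assumed lift threshold
                            min_length=2)
association_results = list(association_rules)  # materialize the generator
print(len(association_results))                # number of rules mined
```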
Let's see the number of rules mined.
So we have in all 26 rules mined from the 7500 transactions.
Let us see the first rule.
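Assuming `association_results` holds the mined rules, each apyori result is a `RelationRecord` with `items`, `support`, and `ordered_statistics` fields, so the first rule can be inspected like this:

```python
first_rule = association_results[0]
print("items:  ", list(first_rule.items))   # items that appear together
print("support:", first_rule.support)

# Each ordered statistic carries one directed rule A -> B
for stat in first_rule.ordered_statistics:
    print("rule:      ", list(stat.items_base), "->", list(stat.items_add))
    print("confidence:", stat.confidence)
    print("lift:      ", stat.lift)
```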
The first rule consists of a list of items that can be bought together. The support of 0.0057 is calculated by dividing the number of transactions containing mushroom cream sauce by the total number of transactions. The confidence of 0.30 tells us that 30% of the transactions containing mushroom cream sauce also contain escalope. Finally, the lift of 3.79 tells us that escalope is 3.79 times more likely to be bought by customers who buy mushroom cream sauce than by customers in general.
So now we have a basic idea of how to build a product recommendation system for a small retail store. But if you have complex data, like Amazon or Netflix, you should build a recommendation system using filtering techniques like collaborative filtering and content-based filtering.
Thanks for following the article. Please hit 👏 if you liked it,
and follow me for more such articles on recommender systems, ML, and AI.
Follow for more like on LinkedIn: https://www.linkedin.com/in/deepak6446r/