Market Basket Analysis using Association Rule Mining

Ishika Tailor
Apr 15, 2021 · 6 min read

Hello data enthusiasts! In this blog, I am going to implement Market Basket Analysis using Association Rule Mining on grocery data.

Have you ever wondered why you come back from DMart or the market with so many items that were not on your shopping list? This happens because the chances of buying products that are highly correlated with each other are high. When you are buying bread, the butter on the next shelf catches your eye. These items are not similar, but there is an association between them that tends to increase the probability of buying both. We're going to mine these association rules using the Apriori algorithm, and we'll do all of this in Python.

Association Rule Mining

Association Rule Mining is used when you want to 1) find associations between different objects in a set, or 2) find frequent patterns in a transaction database, relational database or any other information repository. Applications of Association Rule Mining are found in marketing, basket data analysis (or Market Basket Analysis) in retailing, clustering and classification. It can tell you which items customers frequently buy together by generating a set of rules called Association Rules. In simple words, it gives you output as rules of the form "if this, then that".

APRIORI Algorithm

Association Rule Mining is viewed as a two-step approach:

  1. Frequent Itemset Generation: Find all frequent item-sets with support >= pre-determined min_support count
  2. Rule Generation: List all Association Rules from frequent item-sets. Calculate Support and Confidence for all rules. Prune rules that fail min_support and min_confidence thresholds.

Frequent Itemset Generation is the most computationally expensive step because it requires repeated scans of the full database.
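Before implementing anything, here is a minimal toy example (the numbers are made up, not from the Groceries data) showing how support, confidence and lift are computed for a hypothetical rule {bread} → {butter}:

# Hypothetical counts, only to illustrate the metric definitions
n_transactions = 1000        # total number of baskets
n_bread = 200                # baskets containing bread
n_butter = 150               # baskets containing butter
n_both = 60                  # baskets containing both bread and butter

support = n_both / n_transactions                # 0.06 -> the rule covers 6% of baskets
confidence = n_both / n_bread                    # 0.30 -> 30% of bread buyers also buy butter
lift = confidence / (n_butter / n_transactions)  # 2.0  -> bread buyers are 2x as likely to buy butter
print(support, confidence, lift)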

We are going to implement each task step by step:


Step: 1 Import Libraries and Dataset

Import the necessary libraries and modules that will be required for Market Basket Analysis. You can pip install these libraries if you get a ModuleNotFoundError:
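The import cell was shown as an image in the original post; a likely set of imports, assumed from the functions used later (Orange3 plus the Orange3-Associate add-on, pandas, NumPy and Matplotlib):

import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import Orange
from Orange.data import Domain, DiscreteVariable
# frequent_itemsets, association_rules, rules_stats and OneHot come from the
# Orange3-Associate add-on (pip install Orange3 Orange3-Associate)
from orangecontrib.associate.fpgrowth import (
    frequent_itemsets, association_rules, rules_stats, OneHot)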

Step: 2 Exploratory Data Analysis

→ Get the shape
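The cell for this step was shown as an image; one simple way to get the number of transactions (assuming the same CSV path used below):

# Count the transactions (rows) in the raw CSV; the Groceries dataset used here has 9835
with open("documents/data_csv/groceries.csv") as f:
    n_transactions = sum(1 for _ in csv.reader(f))
print(n_transactions)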

→ Find the top 20 “sold items” that occur in the dataset

# set of all unique items present in the dataset
items = set()
with open("documents/data_csv/groceries.csv") as f:
    reader = csv.reader(f)
    for row, data in enumerate(reader):
        items.update(data)

# build one 0/1 row per transaction
icount = list()
with open("documents/data_csv/groceries.csv") as f:
    reader = csv.reader(f, delimiter=",")
    for i, line in enumerate(reader):
        # initialize every item with 0
        row = {item: 0 for item in items}
        # assign value 1 if the item is present in this transaction
        row.update({item: 1 for item in line})
        # append the updated row
        icount.append(row)
Converting icount to a dataframe:
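The conversion itself was shown as an image; a one-line sketch, using grocerydf as the name the rest of the post relies on:

# Each dict in icount becomes one row: one row per transaction, one 0/1 column per unique item
grocerydf = pd.DataFrame(icount)
grocerydf.shape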
item_sum = grocerydf.sum().sort_values(ascending=False).reset_index().head(n=21)
item_sum.rename(columns={item_sum.columns[0]: 'Item_name',
                         item_sum.columns[1]: 'Item_count'}, inplace=True)
item_sum.drop([0], axis=0, inplace=True)   # drop the first row so that 20 items remain
item_sum.head(20)
Top 20 items with the highest count

→ Find how much of the total sales they account for.
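The cell computing this was shown as an image; a hedged sketch, assuming Tot_percent is the cumulative share of total item sales (which is how it is used in the pruning step below):

# Cumulative percentage of all items sold that is covered by the top items (assumed definition)
total_items_sold = grocerydf.sum().sum()
item_sum['Tot_percent'] = item_sum['Item_count'].cumsum() / total_items_sold
item_sum.head(20)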

Only the top 20 items are responsible for over 50% of the sales! This is important, because we don't want to find association rules for items which are bought very infrequently. With this information we can limit the items we explore when creating our association rules, which also keeps the number of possible itemsets at a manageable figure.


Step: 3 Data Visualization

Graph 1
Graph 2
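The two graphs were embedded as images. A minimal sketch of one typical plot for this step, assuming a horizontal bar chart of the top-20 item counts from item_sum:

plt.figure(figsize=(10, 6))
plt.barh(item_sum['Item_name'], item_sum['Item_count'], color='steelblue')
plt.gca().invert_yaxis()   # most frequent item on top
plt.xlabel('Number of transactions containing the item')
plt.title('Top 20 sold items')
plt.tight_layout()
plt.show()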

Step: 4 Create a prune_dataset function, which will help us reduce the size of our dataset based on our requirements by removing infrequent items. The function performs pruning based on the percentage of total sales, as shown in the sketch below.
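The definition of prune_dataset was shown as an image in the original post. A hedged sketch of what it might look like, reconstructed from how it is called below and from the manual pruning steps in Step 5 (the parameter names match the call; the internals are an assumption):

def prune_dataset(input_df, length_trans=2, total_sales_perc=0.4):
    # Rank items by how many transactions they appear in
    item_count = input_df.sum().sort_values(ascending=False).reset_index()
    item_count.rename(columns={item_count.columns[0]: 'Item_name',
                               item_count.columns[1]: 'Item_count'}, inplace=True)
    # Cumulative share of total sales contributed by each item
    item_count['Tot_percent'] = item_count['Item_count'].cumsum() / item_count['Item_count'].sum()
    # Keep only the items that together account for up to total_sales_perc of sales
    selected = list(item_count[item_count['Tot_percent'] <= total_sales_perc]['Item_name'])
    output_df = input_df[selected]
    # Keep only transactions containing at least length_trans of the selected items
    output_df = output_df[output_df.sum(axis=1) >= length_trans].reset_index(drop=True)
    return output_df, item_count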

output_df, item_counts = prune_dataset(input_df=grocerydf, length_trans=2, total_sales_perc=0.4)

Here, length_trans=2 indicates that we are interested in transactions with at least two items (2 is the threshold value).

essential_item = item_sum[item_sum['Tot_percent'] <= 0.4]
essential_item = list(essential_item['Item_name'].values)
essential_item
Items whose cumulative share of total sales is ≤ 0.4 (40%)

Step: 5 Creating the pruned dataset

temp_df = grocerydf[essential_item]            # keep only the essential items
temp_df = temp_df[temp_df.sum(axis=1) >= 2]    # keep transactions with at least 2 of these items
temp_df.reset_index(drop=True, inplace=True)
temp_df = temp_df.astype(int)

Before pruning, the data has shape (9835, 32). After this minimization and removal of unnecessary data, it becomes (4585, 13).

Step: 6 Now we create our rules. We need to specify two pieces of information for generating them: support and confidence.

→ An important point: start with a higher support, as a lower support will produce a larger number of frequent itemsets and hence a longer execution time.

Counting the frequency of itemsets
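The cell that counts the frequent itemsets was shown as an image. A hedged sketch, assuming Orange3-Associate's frequent_itemsets is run on the one-hot encoded transaction array data_gro_1_en (built in the encoding step shown further below) with a 1% minimum support:

# 1% relative support; on roughly 4585 transactions this means an itemset must
# appear in about 45 or more transactions to be kept
min_support = 0.01
itemsets = dict(frequent_itemsets(data_gro_1_en, min_support=min_support))
print(len(itemsets))   # number of frequent itemsets found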

→ Generating sample rules using the transactions that explain 40% of total sales, a min-support of 1% (required number of transactions >= 45) and a confidence greater than 30%.

confidence = 0.3
rules_df = pd.DataFrame()
# association_rules yields (antecedent, consequent, support, confidence);
# we keep only rules whose consequent contains a single item
rules = [(P, Q, supp, conf)
         for P, Q, supp, conf in association_rules(itemsets, confidence)
         if len(Q) == 1]
rules

→ Generating eligible_antecedent for further use.

We create a one-hot encoded dataset from the pruned data, where each item is expanded into two indicator columns — one for "item present" (1/True) and one for "item absent" (0/False) — and then build a NumPy array from that.

input_assoc_rules = temp_df
domain_grocery = Domain([DiscreteVariable.make(name=item, values=['0', '1'])
                         for item in input_assoc_rules.columns])
data_gro_1 = Orange.data.Table.from_numpy(domain=domain_grocery,
                                          X=input_assoc_rules.values, Y=None)
data_gro_1_en, mapping = OneHot.encode(data_gro_1, include_class=False)

Reference: https://orange3.readthedocs.io/projects/orange-data-mining-library/en/latest/reference/data.domain.html


names = {item: '{}={}'.format(var.name, val)
for item, var, val in
OneHot.decode(mapping, data_gro_1, mapping)}
eligible_antecedent = [v for k,v in names.items()
if v.endswith("1")]
eligible_antecedent

→ Let's create rule_stats, which generates the antecedent, consequent, support, confidence, coverage, strength, lift and leverage of each rule from the rules, the item-sets and N. For easier reading, we then convert it into a dataframe called rules_df, as follows:

N = input_assoc_rules.shape[0]   # 4585 transactions
# rules_stats yields (antecedent, consequent, support, confidence, coverage, strength, lift, leverage)
rule_stats = list(rules_stats(rules, itemsets, N))
rule_stats

Step: 7 Converting rule_stats to a dataframe

rule_list_df = []
for ex_rule_from_rule_stat in rule_stats:
    ante = ex_rule_from_rule_stat[0]   # antecedent
    cons = ex_rule_from_rule_stat[1]   # consequent

    # decoded name of the (single-item) consequent
    named_cons = names[next(iter(cons))]
    if named_cons in eligible_antecedent:
        # build the antecedent item list, stripping the "=1" suffix from the names
        rule_lhs = [names[i][:-2] for i in ante
                    if names[i] in eligible_antecedent]
        ante_rule = ', '.join(rule_lhs)

        if ante_rule and len(rule_lhs) > 1:
            rule_dict = {'support': ex_rule_from_rule_stat[2],
                         'confidence': ex_rule_from_rule_stat[3],
                         'coverage': ex_rule_from_rule_stat[4],
                         'strength': ex_rule_from_rule_stat[5],
                         'lift': ex_rule_from_rule_stat[6],
                         'leverage': ex_rule_from_rule_stat[7],
                         'antecedent': ante_rule,
                         'consequent': named_cons[:-2]}
            rule_list_df.append(rule_dict)

rules_df = pd.DataFrame(rule_list_df)
rules_df

Step: 8 By performing a grouping operation on the antecedent and consequent values, we get our final output as follows:
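The cell that builds pruned_rules_df was shown as an image; a hedged guess, assuming it simply keeps the strongest rule per antecedent/consequent pair from rules_df, as described above:

# Assumed construction: group rules_df by antecedent and consequent and keep the maximum metrics
pruned_rules_df = rules_df.groupby(['antecedent', 'consequent']).max().reset_index()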

(pruned_rules_df[['antecedent', 'consequent', 'support', 'confidence', 'lift']]
    .groupby('consequent')
    .max()
    .reset_index()
    .sort_values(['lift', 'support', 'confidence'], ascending=False))
Final Output

Let’s understand the above metrics.

The support of the rule is 228, which means that all of its items (left side + right side) appear together in 228 transactions in the dataset. The confidence of the rule is 46%, which means that in 46% of the transactions where the antecedent items occurred, the consequent was also present (i.e. 46% of the time, customers who bought the left-hand-side items yogurt and tropical fruit (the antecedent) also bought root vegetables (the consequent)).

Another important metric is lift.
A lift of 2.23 means that the probability of finding root vegetables in transactions that contain yogurt, whole milk, and tropical fruit is 2.23 times the overall probability of finding root vegetables in a transaction (we are looking at the probability of the consequent given that the antecedent was bought).

Typically, a lift value of 1 indicates that the antecedent and the consequent occur independently of each other. Hence, the idea is to look for rules with a lift much greater than 1.

Github Link: Market_Basket_Analysis.ipynb

Reference: Beginner Tutorial for Apriori algorithm
