December 2018

Volume 33 Number 12

[Artificially Intelligent]

Market Basket Analysis

By Frank La Vigne

There is a bit of artificial intelligence (AI) that you have, no doubt, encountered, especially as the holiday shopping season is underway: the recommended purchase. Nearly every online retailer will display additional product recommendations, sometimes under the header of “Frequently bought together,” or “Customers who purchased X also purchased Y.” According to one study by McKinsey in 2013 (bit.ly/2yK3Bu8), 35 percent of what consumers purchase on Amazon comes from product recommendation algorithms. What’s more, this tactic is no longer limited to retailers; online streaming services like Netflix and YouTube use sophisticated recommendation algorithms to keep viewers tuned in longer.

Clearly, recommendation systems have an impact on our daily lives. You could argue that they’re the most prominent form of AI that consumers encounter. In this column, I’ll explore a basic form of recommendation system known as a Market Basket Analysis.

Market Basket Analysis

Market Basket Analysis, a form of Affinity Analysis, is a modeling technique based on the theory that if you buy a certain group of items, you’re more likely to purchase another group of items. For example, someone purchasing peanut butter and bread is far more likely to also want to purchase jelly. However, not all relationships are as immediately obvious. Foreknowledge of consumer behavior can increase sales and give the retailer a significant edge against competitors. Strictly speaking, Market Basket Analysis is just one application of association analysis techniques, although many online articles and tutorials conflate the two. To put it in the perspective of other machine learning techniques I’ve written about before, Market Basket Analysis is an unsupervised learning tool that requires little in the way of feature engineering and a limited amount of data cleaning and preparation. In practice, insights gleaned from Market Basket Analysis can be further explored with other AI or data science tools.

Despite its ability to uncover hidden patterns, Market Basket Analysis is relatively easy to explain and doesn’t require knowledge of advanced statistics or calculus. However, there are a few terms and conventional notations to review. First, the notions of cause and effect are referred to as antecedent and consequent. In the example I mentioned previously, peanut butter and bread are the antecedent and jelly is the consequent. The formal notation for this relationship would be {Peanut Butter, Bread} -> {Jelly} indicating that there’s a connection between these items. Also take note that both antecedents and consequents can consist of multiple items.

There are three important mathematical measures required for Market Basket Analysis: Support, Confidence and Lift. Support represents how often the antecedent and consequent appear together in the data, expressed as a fraction of all transactions. To simplify the example, imagine the following relationship: {Peanut Butter} -> {Grape Jelly}. Given 100 customers (and one transaction per customer), consider the following scenario:

  • 15 customers bought Peanut Butter
  • 13 bought Grape Jelly
  • 11 bought Peanut Butter and Grape Jelly

Support represents the fraction of transactions in which the items appear together, which in this example is 11 out of 100, or 0.11. To use statistical terms, there’s a probability of 11 percent that any given transaction will include both Peanut Butter and Grape Jelly. Confidence takes the value of Support (0.11) and divides it by the probability of a transaction containing the antecedent, Peanut Butter (0.15), equating to a value of 0.733. This means that about 73 percent of the time that Peanut Butter was purchased, Grape Jelly was purchased along with it. Finally, there’s Lift, which takes Confidence (0.733) and divides it by the probability of the consequent, Grape Jelly (0.13). This equates to 5.64 (rounded to two decimal places). A Lift greater than 1 indicates that the two items appear together more often than they would if the purchases were independent of each other.

All this might be clearer in a simple chart, as shown in Figure 1.

Figure 1 Support, Confidence and Lift Values

Measure      Formula                           Value
Support      P(Peanut Butter & Grape Jelly)    0.11
Confidence   Support / P(Peanut Butter)        0.733
Lift         Confidence / P(Grape Jelly)       5.64 (rounded)
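
The arithmetic is easy to check in a few lines of Python. What follows is only a sketch built from the counts in the scenario above; the variable names are mine. It assumes the common convention that the Confidence of a rule divides Support by the probability of the rule's antecedent, which means the value depends on the direction of the rule, while Lift comes out the same in either direction.

```python
# Counts taken from the scenario above: 100 transactions in total.
total = 100
peanut_butter = 15   # transactions containing Peanut Butter
grape_jelly = 13     # transactions containing Grape Jelly
both = 11            # transactions containing both items

support = both / total
p_peanut_butter = peanut_butter / total
p_grape_jelly = grape_jelly / total

# Confidence is directional: divide Support by the antecedent's probability.
conf_pb_to_gj = support / p_peanut_butter   # {Peanut Butter} -> {Grape Jelly}
conf_gj_to_pb = support / p_grape_jelly     # {Grape Jelly} -> {Peanut Butter}

# Lift divides Confidence by the consequent's probability; it is symmetric.
lift = conf_pb_to_gj / p_grape_jelly

print(round(support, 2))        # 0.11
print(round(conf_pb_to_gj, 3))  # 0.733
print(round(conf_gj_to_pb, 3))  # 0.846
print(round(lift, 2))           # 5.64
```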

 

Market Basket Analysis in Action

Keeping the previous metrics in mind, it’s time to try out Market Basket Analysis on a real data set. The first step is to get retail data to analyze. Fortunately, the University of California, Irvine, provides a dataset that contains transactions for a U.K.-based Web site. More information about the dataset is available at bit.ly/2DgATFl. Create a Python 3 notebook on your preferred platform (I covered Jupyter notebooks in a previous column at msdn.com/magazine/mt829269). Create an empty cell, enter the following code to download the sample data, and execute the cell:

! curl https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx -o retail.xlsx

Once this completes, enter the following code into a new cell to load the Excel spreadsheet into a Pandas DataFrame and print out the columns of the data set:

import pandas as pd
df = pd.read_excel('retail.xlsx')
print(df.columns)

The output will look something like this:

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

While Market Basket Analysis doesn’t require rigorous data munging, it does make sense to remove extraneous records, such as those with null invoice numbers and canceled orders. It’s also useful to remove extraneous spaces in the product descriptions and convert all the invoice numbers to string. You can do that by executing the following code:

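# Trim stray whitespace from the product descriptions
df['Description'] = df['Description'].str.strip()
# Drop rows that have no invoice number
df.dropna(axis = 0, subset=['InvoiceNo'], inplace = True)
# Treat invoice numbers as strings, then drop canceled orders (those containing 'C')
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]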

Now enter the following code to aggregate the data and view it from a country level:

df.groupby('Country').count().reset_index().sort_values(
  'InvoiceNo', ascending = False).head()

Next, I’ll rearrange the data with each product one hot encoded and one transaction per row. One hot encoding is a data transformation technique where categorical values are converted into columns, with the value of 1 entered where a categorical value is present. I will also limit the scope of the dataset to one country, in this case France, to compare consumer behavior in an individual market. Enter and execute the following code to do that (notice the shape of the basket_fr data frame in the cell output; the one hot encoding expands the columns from 8 to 4175):

basket_fr = (df[df['Country']=="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
basket_fr.head(10)
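
If the chain of groupby, sum and unstack calls is hard to picture, it can help to run it on a handful of rows first. The data below is made up purely for illustration (the invoice numbers, product names and the toy and toy_basket variables are not part of the retail dataset), but the chain of calls is the same one used above.

```python
import pandas as pd

# Hypothetical toy data: three invoices, one row per invoice line.
toy = pd.DataFrame({
    'InvoiceNo':   ['1001', '1001', '1002', '1003', '1003'],
    'Description': ['CUP', 'PLATE', 'CUP', 'NAPKIN', 'CUP'],
    'Quantity':    [6, 6, 12, 24, 1],
})

# Same chain as above: total quantity per (invoice, product), then pivot
# the products into columns and fill the missing combinations with 0.
toy_basket = (toy
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
print(toy_basket)
```

Each invoice becomes one row and each product one column. The cells hold summed quantities rather than ones and zeros.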

A quick glance at the results reveals an issue with my one hot encoding. The sixth item down has a value of 24.0 in the second column. My intention was to have either a one or a zero in the data, not quantities. Therefore, I will need to locate any non-zero values and convert them to 1. To fix this, run the following code and note that the 24.0 has been converted to a 1:

def sum_to_boolean(x):
  if x<=0:
    return 0
  else:
    return 1
basket_fr_final = basket_fr.applymap(sum_to_boolean)
basket_fr_final.head(10)
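
As an aside, the same conversion can be written without a cell-by-cell function. The sketch below is one possible alternative rather than a required change: it compares the whole frame to zero in a single step and casts the result to integers. The toy_basket frame is a hypothetical stand-in for basket_fr, with one negative quantity included to show that it maps to 0, just as it does in sum_to_boolean. This may also be worth knowing because newer pandas releases flag applymap as deprecated.

```python
import pandas as pd

# Hypothetical stand-in for basket_fr: summed quantities per invoice.
toy_basket = pd.DataFrame(
    {'CUP': [6.0, 12.0, 1.0],
     'NAPKIN': [0.0, 0.0, 24.0],
     'PLATE': [6.0, -2.0, 0.0]},
    index=pd.Index(['1001', '1002', '1003'], name='InvoiceNo'))

# Compare the whole frame at once: True where quantity > 0, then cast to 1/0.
toy_final = (toy_basket > 0).astype(int)
print(toy_final)
```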

I will use MLXTEND (rasbt.github.io/mlxtend) to further analyze the data. MLXTEND is a Python library of useful tools for common data science tasks, including Market Basket Analysis. To install this library from within the notebook, execute the following code:

! pip install mlxtend

With the MLXTEND package installed, it’s time to import the relevant libraries from MLXTEND, like this:

from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori

In a new cell, enter the following code to view sets of items with at least 6 percent support:

frequent_itemsets_fr = apriori(basket_fr_final, min_support = 0.06,
  use_colnames = True)
frequent_itemsets_fr.sort_values('support', ascending = False).head()
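
To make the min_support parameter concrete, here is a brute-force sketch in plain Python. It is not the Apriori algorithm and it doesn't use MLXTEND; the transactions, the frequent_itemsets function and the 0.6 threshold are all hypothetical, chosen only so that the toy data produces a readable result.

```python
from itertools import combinations

# Hypothetical toy transactions, one set of items per invoice.
transactions = [
    {'CUP', 'PLATE', 'NAPKIN'},
    {'CUP', 'PLATE'},
    {'CUP', 'NAPKIN'},
    {'CLOCK'},
    {'CUP', 'PLATE', 'NAPKIN'},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets meeting min_support.

    Brute force: every candidate combination is counted. Apriori reaches
    the same answer faster by only extending itemsets already known to
    be frequent.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

itemsets = frequent_itemsets(transactions, min_support=0.6)
for itemset, support in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Raising the threshold shrinks the list quickly, which is why a value as low as 6 percent is reasonable for a catalog with thousands of products.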

With the key sets of items identified, I can now apply the association_rules function to discover customers’ purchase behaviors. Enter the following code and execute it:

a_rules = association_rules(frequent_itemsets_fr, metric = "lift",
  min_threshold = 1)
a_rules.sort_values('lift',ascending = False)
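
The effect of filtering on Lift can also be seen on a small scale. The following sketch scores every single-item rule in a handful of made-up transactions and keeps those with a Lift of at least 1, mirroring the metric and min_threshold arguments in the call above. The data and the support helper are hypothetical, and this is plain Python rather than MLXTEND's implementation.

```python
from itertools import permutations

# Hypothetical toy transactions, one set of items per invoice.
transactions = [
    {'CUP', 'PLATE', 'NAPKIN'},
    {'CUP', 'PLATE'},
    {'CUP', 'NAPKIN'},
    {'CLOCK', 'CUP'},
    {'PLATE', 'NAPKIN'},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if set(items) <= t) / len(transactions)

# Score every single-item rule A -> C and keep those with lift >= 1.
rules = []
for a, c in permutations(['CUP', 'PLATE', 'NAPKIN', 'CLOCK'], 2):
    joint = support([a, c])
    if joint == 0:
        continue  # the two items never appear together
    confidence = joint / support([a])
    lift = confidence / support([c])
    if lift >= 1:
        rules.append((a, c, round(confidence, 3), round(lift, 3)))

for rule in sorted(rules, key=lambda r: -r[3]):
    print(rule)
```

In this toy data the rules involving CLOCK have the highest Lift even though CLOCK appears in a single transaction, so a high Lift says nothing about how often an item sells.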

The results, sorted by Lift, should look similar to those in Figure 2, revealing the purchasing patterns of French customers on the site. A quick glance at the results shows that customers who buy CHILDRENS CUTLERY SPACEBOY tend to also purchase CHILDRENS CUTLERY DOLLY GIRL, and that customers who buy an alarm clock in one color tend to also purchase an alarm clock in another color. As far as actionable insights go, I could recommend that the site owners offer bundle deals on gender-specific cutlery, as well as multi-color alarm clock bundles.

Figure 2 Association Rules from French Customers

However, keep in mind that this list is sorted by Lift and not occurrence. It may not make sense to introduce a new bundle or product offering if it isn’t popular enough. To get a view of how popular these cutlery items are, enter the following Python code:

print(basket_fr_final['CHILDRENS CUTLERY SPACEBOY'].sum())
print(basket_fr_final['CHILDRENS CUTLERY DOLLY GIRL'].sum())

The results aren’t promising: only 27 for SPACEBOY and 28 for DOLLY GIRL. With a little exploration, I find an association rule with some more promise. It turns out that in the association rule at index 50, SET/20 RED RETROSPOT PAPER NAPKINS is the antecedent for red paper cups and red paper plates. Because the data is now encoded as ones and zeros, the sum counts transactions rather than units. Enter the following code to see how many transactions include the napkins:

basket_fr_final['SET/20 RED RETROSPOT PAPER NAPKINS'].sum()

While the number is low, it stands to reason that customers purchasing disposable cups want matching paper plates and napkins. A savvy retailer could easily package these into a bundle offer to entice the customer to purchase.

Sharp-eyed readers will notice that there are two other metrics in the resulting data frame: Leverage and Conviction. These are additional values to be considered when performing Market Basket Analysis. More information about these measures can be found by exploring so-called “alternative measures of interestingness.” Wikipedia is a handy place to start (bit.ly/2AECRNh).
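
For a rough sense of what these two measures capture, here is a sketch that applies their commonly cited definitions to the earlier Peanut Butter and Grape Jelly scenario, for the rule {Peanut Butter} -> {Grape Jelly}. The variable names are mine, and I'm assuming these formulas match what MLXTEND reports, so treat the numbers as illustrative.

```python
# Probabilities from the earlier Peanut Butter and Grape Jelly scenario.
support_both = 0.11        # P(Peanut Butter & Grape Jelly)
support_antecedent = 0.15  # P(Peanut Butter)
support_consequent = 0.13  # P(Grape Jelly)

# Confidence of {Peanut Butter} -> {Grape Jelly}.
confidence = support_both / support_antecedent

# Leverage: observed co-occurrence minus what independence would predict.
leverage = support_both - support_antecedent * support_consequent

# Conviction: how much more often the rule would fail if the two sides
# were independent, relative to how often it actually fails.
conviction = (1 - support_consequent) / (1 - confidence)

print(round(leverage, 4))    # 0.0905
print(round(conviction, 2))  # 3.26
```

A Leverage of 0 or a Conviction of 1 would indicate that the two items are independent.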

Recall that when I aggregated the data from a country level, there were vastly more invoices from the United Kingdom than any other country. Perhaps more could be learned by examining customer behavior with more raw data available. Let’s explore this by entering the following code into a new cell and executing it, like so:

basket_uk = (df[df['Country']=="United Kingdom"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
basket_final_uk = basket_uk.applymap(sum_to_boolean)
frequent_itemsets_uk = apriori(basket_final_uk, min_support = 0.06,
  use_colnames = True)
a_rules_uk = association_rules(frequent_itemsets_uk, metric = "lift",
  min_threshold = 1)
a_rules_uk.sort_values('lift',ascending = False).head()
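
Repeating the same chain of calls for each country makes it easy to mistype a variable name along the way. One option, sketched here as a suggestion rather than as part of the original walkthrough, is to wrap the basket-building steps in a small function. The build_basket name and the toy rows are hypothetical; the function relies only on pandas and assumes the DataFrame has the same InvoiceNo, Description, Quantity and Country columns as the retail data.

```python
import pandas as pd

def build_basket(df, country):
    """Return a 1/0 basket (one row per invoice) for a single country."""
    quantities = (df[df['Country'] == country]
              .groupby(['InvoiceNo', 'Description'])['Quantity']
              .sum().unstack().fillna(0))
    # Any positive quantity becomes 1; zero or negative becomes 0.
    return (quantities > 0).astype(int)

# Hypothetical toy rows standing in for the cleaned retail DataFrame.
toy = pd.DataFrame({
    'InvoiceNo':   ['1', '1', '2', '3'],
    'Description': ['CUP', 'PLATE', 'CUP', 'CUP'],
    'Quantity':    [6, 6, 12, 3],
    'Country':     ['France', 'France', 'France', 'United Kingdom'],
})

basket_toy_fr = build_basket(toy, 'France')
basket_toy_uk = build_basket(toy, 'United Kingdom')
print(basket_toy_fr)
print(basket_toy_uk)
```

With a helper like this, analyzing another market comes down to changing the country name passed in.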

For the United Kingdom, the execution time is much longer due to the larger dataset. Also note that the results are quite different. Could this be a result of more data changing the analysis, or simply a function of different customer preferences in a different market? Or could this be that the retailer offers different products for sale in different markets? These are the kinds of variables you have to understand as you work through your analysis. In this case, we have little context beyond this being an online retailer based in the United Kingdom. However, in a real-world project, engagement with the business’ subject-matter experts is a critical element for success in data analytics projects.

Wrapping Up

In this article, I explored the use of Market Basket Analysis to uncover patterns in consumer behavior. Market Basket Analysis belongs to the larger field of Affinity Analysis, which major companies use to get customers to spend more money on products and more time on streaming platforms.

Market Basket Analysis provides a great entry point for persons and organizations looking to explore data science. The barrier to entry is low in terms of the mathematical skill. In fact, the mathematics doesn’t go beyond simple division and basic probability theory. It’s an easy problem space to explore for beginners and offers a great place to start in applying AI to the enterprise. That said, do not be fooled into thinking that this isn’t a powerful means to conduct data science or show value to company leadership. The impacts on the bottom line can be significant.


Frank La Vigne works at Microsoft as an AI Technology Solutions professional where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).

Thanks to the following technical expert for reviewing this article: Andy Leonard (Enterprise Data & Analytics)

