Association rule mining

Association rule mining allows us to look for items that are associated in some way. For instance, we may have heard that whoever buys diapers is very likely to buy beer as well. The association rule is {diapers} -> {beer}. But this doesn't necessarily mean that whoever buys beer is going to buy diapers as well. The arrow is valid only in one direction.

Opportunities to look for such associations are all around us. Is drinking coffee related to increased blood pressure? Does eating carrots improve eyesight? All we need is plenty of the right data if we want to be reasonably sure of any claim we make.

There are a few metrics that allow us to look for association rules. The first one is called support and measures how frequently an itemset appears in the data. It is defined as the number of transactions containing itemset X divided by the total number of transactions in the database; it can be computed for a single item or for a combination of items. Another measure is confidence, which refers to the predictive power, or accuracy, of an association rule. While support is defined even for a single item, confidence is only measured on a combination: confidence(X, Y) indicates how certain we are that {X} -> {Y}, and is computed as the support of X and Y together divided by the support of X alone. What we are looking for are strong rules having both high support and high confidence. This can be used, for instance, to recommend products to stock based on recent purchases. In the literature you may also find the term market basket analysis.
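
To make both measures concrete, here is a minimal sketch that computes them directly from their definitions on a made-up set of transactions (the items and numbers are purely illustrative):

    # Toy example: support and confidence computed from their definitions
    transactions = [
        {'diapers', 'beer', 'bread'},
        {'diapers', 'beer'},
        {'bread', 'milk'},
        {'diapers', 'milk'},
    ]
    total = len(transactions)

    def support(itemset):
        # fraction of transactions that contain every item of the itemset
        return sum(1 for t in transactions if itemset <= t) / total

    def confidence(x, y):
        # how often transactions containing X also contain Y
        return support(x | y) / support(x)

    print(support({'diapers'}))               # 3/4 = 0.75
    print(support({'diapers', 'beer'}))       # 2/4 = 0.5
    print(confidence({'diapers'}, {'beer'}))  # 0.5 / 0.75 ≈ 0.67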

We can compute support and confidence for every possible combination of products in a dataset, but this can be slow if we have captured a large number of transactions. This is why specialized algorithms like Apriori and Eclat exist; they are much more efficient than the brute-force approach introduced next.
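
If you would rather not roll your own, a library such as mlxtend ships an Apriori implementation. The following is only a rough sketch on a toy dataset (the library is not used in the rest of this article, so treat the exact calls as an assumption you should verify against its documentation):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # toy transactions; in practice these would come from the real dataset
    transactions = [
        ['diapers', 'beer', 'bread'],
        ['diapers', 'beer'],
        ['bread', 'milk'],
        ['diapers', 'milk'],
    ]

    # one-hot encode the transactions into a boolean DataFrame
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    frequent = apriori(onehot, min_support=0.25, use_colnames=True)
    rules = association_rules(frequent, metric='confidence', min_threshold=0.6)
    print(rules[['antecedents', 'consequents', 'support', 'confidence']])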

Here we will try to illustrate association rule mining on a sample dataset. As a starting point we obtain the "Online retail" dataset from the UCI machine learning repository (you can download it here). This dataset contains over 500000 transactions, which is too much to process comfortably on a slow machine, so we will focus on a reduced version of it (also available). If we look at the data, we see two columns of interest: "CustomerID" and "Description". To obtain an individual transaction, we could group by CustomerID and then extract all product descriptions from the group.
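
This grouping idea can be sanity-checked with just a few lines, assuming the reduced CSV keeps the same column names as the original dataset:

    import pandas as pd

    df = pd.read_csv('online_retail_small.csv')
    for customer_id, group in df.groupby('CustomerID'):
        # one "transaction" = the set of product descriptions bought by this customer
        transaction = set(group['Description'])
        print(customer_id, transaction)
        break  # show just the first transaction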

To look for association rules, we could use the following code:

import pandas as pd
from operator import itemgetter
from decimal import Decimal, getcontext

getcontext().prec = 5

df = pd.read_csv('online_retail_small.csv')  # 52558 entries having a description
groups = df.groupby('CustomerID')
total_transactions = len(groups)

products_set = set()
transactions = []
product_counts = {}

def count_occurences(products):
    # Count in how many transactions all given products appear together,
    # caching the result so the same itemset is not counted more than once.
    if len(products) == 1:
        p = products[0]
        if p in product_counts:
            return product_counts[p]
        hash_pos = p
    else:
        p1, p2 = products
        if products in product_counts:
            return product_counts[products]
        rev = (p2, p1)
        if rev in product_counts:
            return product_counts[rev]
        hash_pos = products
    product_set = set(products)
    count = sum(1 for transaction_set in transactions if product_set <= transaction_set)
    product_counts[hash_pos] = count
    return count

def support(products):
    return Decimal(count_occurences(products) / float(total_transactions))

def confidence(products):
    return Decimal(support(products) / support(products[:1]))

# One transaction = the set of products bought by one customer
for _, group in groups:
    trs = group['Description'].str.lower().str.capitalize()
    trs_set = set(trs)
    transactions.append(trs_set)
    products_set |= trs_set  # add this customer's products to the overall product set

# Sort all two-product combinations by highest confidence, then highest support
combinations = sorted((
    (prod1, prod2, support((prod1, prod2)), confidence((prod1, prod2)))
    for prod1 in products_set
    for prod2 in products_set
    if prod1 != prod2
), key=itemgetter(3, 2), reverse=True)

for p1, p2, supp, conf in combinations[:15]:
    print((p1, p2, supp + 0, conf + 0))  # adding 0 applies the configured Decimal precision

The code produces the following output:

('Christmas gingham star', 'Christmas gingham heart', Decimal('0.013928'), Decimal('1'))
('Herb marker parsley', 'Herb marker basil', Decimal('0.012071'), Decimal('1'))
('Herb marker parsley', 'Herb marker thyme', Decimal('0.012071'), Decimal('1'))
('Herb marker basil', 'Herb marker parsley', Decimal('0.012071'), Decimal('1'))
('Herb marker basil', 'Herb marker thyme', Decimal('0.012071'), Decimal('1'))
('Herb marker thyme', 'Herb marker parsley', Decimal('0.012071'), Decimal('1'))
('Herb marker thyme', 'Herb marker basil', Decimal('0.012071'), Decimal('1'))
('Herb marker mint', 'Herb marker parsley', Decimal('0.011142'), Decimal('1'))
('Herb marker mint', 'Herb marker basil', Decimal('0.011142'), Decimal('1'))
('Herb marker mint', 'Herb marker thyme', Decimal('0.011142'), Decimal('1'))
('Herb marker rosemary', 'Herb marker parsley', Decimal('0.010214'), Decimal('1'))
('Herb marker rosemary', 'Herb marker basil', Decimal('0.010214'), Decimal('1'))
('Herb marker rosemary', 'Herb marker thyme', Decimal('0.010214'), Decimal('1'))
('Herb marker chives ', 'Herb marker parsley', Decimal('0.010214'), Decimal('1'))
('Herb marker chives ', 'Herb marker basil', Decimal('0.010214'), Decimal('1'))

This seems interesting, as many product combinations appear with a confidence of 1. The first association rule is {'Christmas gingham star'} -> {'Christmas gingham heart'}, which has the highest support on this list. This means that we can be reasonably sure that whoever buys the star is also going to buy the ❤. Interestingly, the other association rules in this dataset are mainly about herb markers. Whoever buys parsley seems to be interested in either basil or thyme as well. Whoever buys mint is equally interested in parsley, basil or thyme.

The code is neither optimal nor perfect; it can certainly be improved. The goal here was to illustrate in the simplest possible way how we could look for plausible association rules. If you execute the script, you will see that it is quite memory-intensive (it considers every possible two-product combination) and could easily fill up your entire memory. The alternative would have been to skip the caching done in count_occurences and recompute the same counts repeatedly, which would likely have been even slower. If you prefer seeing instant results while you test your code, you could also create a small fake dataset (like this one) and check that you are happy with the results before switching to the bigger one. If you choose this route, you may have to swap the pandas module for the csv module as well.
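
If you go down that route, reading the transactions with the standard csv module could look roughly like this (the file name and column names are assumptions; adjust them to match your fake dataset):

    import csv
    from collections import defaultdict

    transactions_by_customer = defaultdict(set)
    with open('fake_retail.csv', newline='') as f:
        for row in csv.DictReader(f):
            # same idea as the pandas groupby: one transaction per customer
            transactions_by_customer[row['CustomerID']].add(row['Description'].lower().capitalize())

    transactions = list(transactions_by_customer.values())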

Now you know how you could mine your own association rules. A quick look at Amazon reveals that whoever buys a Kindle buys a power adapter for it as well...