January 2014

Volume 29 Number 1

Test Run : Frequent Item-Sets for Association Rule Learning

James McCaffrey

James McCaffreyAn important task in the areas of machine learning and natural user interfaces (NUIs) is extracting frequent item-sets from a list of transactions. The problem is best explained by example. Suppose you have a list of items purchased by different customers in a supermarket. For instance, one customer’s transaction might be (apples, bread, celery) and another’s might be (bread, donuts, eggs, lettuce). There could be many thousands of these transactions. An item-set is a subset of all possible items, such as (apples, bread). The problem is to extract from the list of all transactions just those item-sets that meet some minimum count. Suppose the minimum count is 20 and item-set (apples, bread) occurs in the list of transactions 25 times; therefore, (apples, bread) would be a frequent item-set.

Identifying frequent item-sets can be useful in many ways. In the case of a supermarket, knowledge of which items are often purchased together could be used for product placement, targeted marketing, and so on. Additionally, once all frequent items sets have been identified, these item-sets could be analyzed further to extract rules such as, “if customers purchase both apples and bread, then there’s high probability they will also purchase lettuce and milk.” This overall process of first extracting frequent item-sets and then harvesting if-then rules is called association rule learning.

This article explains in detail the frequent item-set extraction process. The problem is surprisingly difficult and has been the subject of quite a bit of research. The best way to see where this article is headed is to take a look at the demo program in Figure 1.

Extracting Frequent Item-Sets from Transactions
Figure 1 Extracting Frequent Item-Sets from Transactions

The demo program sets up a dummy list of 10 transactions. Behind the scenes, there are a total of 12 items in a hypothetical supermarket: apples, bread, celery, donuts, eggs, flour, grapes, honey, icing, jelly, kale and lettuce. The first transaction is (apples, bread, celery, flour). The demo then shows how raw transactions can be encoded into 0-based integers for efficient processing. Because apples maps to 0, bread to 1, celery to 2, and so on, the first transaction is encoded as (0, 1, 2, 5).

The algorithm to extract frequent item-sets requires three input parameters. The first parameter is a minimum threshold value necessary for an item-set to be classified as frequent. This is usually called the support value. In the demo, the support value is set to 0.30, which means that 0.30 * 10 = 3 is the minimum number of times an item-set must occur in the list of all transactions for the item-set to be added to the result list of frequent item-sets. Different problem domains will have different meaningful support values.

The second input parameter is the minimum size of an item-set. In the demo that value is set to 2, so an item-set of (apples, donuts) is examined but an item-set of just (apples) is not. The third input parameter is the maximum size of an item set. Here that value is set to 4, so only item-sets with up to four items, such as (apples, celery, eggs, grapes), are examined while larger item-sets such as (apples, bread, donuts, flour, honey) are not.

The demo program calls the method to extract frequent item-sets from the list of transactions and finds a total of 12 frequent item-sets. The first frequent item-set is (0, 1), which occurs five times in the transaction list. The last frequent item-set is (0, 1, 2, 5), which occurs three times. The demo concludes by displaying the frequent item-sets in string form.

This article assumes you have at least intermediate-level programming skills, but does not assume you know anything about association rule learning. The complete demo program code is presented in this article, and it’s also available as a download from msdn.microsoft.com/magazine/msdnmag0114. The demo is in C# but you shouldn’t have too much trouble refactoring to another language, such as Python. All normal error-checking has been removed to keep the size of the code smaller.

Why Is Frequent Item-Set Extraction a Difficult Problem?

At first thought, extracting frequent item-sets from a list of transactions doesn’t seem so difficult. A simple brute-force approach would be to simply generate all possible item-sets and iterate through the transactions for each candidate, counting the number of times the candidate appears to see if that item-set meets the minimum support count. Unfortunately, in all but artificially small problems, the number of possible item-sets is unimaginably large. For example, a typical supermarket might have far more than 10,000 distinct items in stock. The number of item-sets of size 2 is Choose(10,000, 2) = 49,995,000, where Choose(n, k) is the number of ways to select k items from n items. The number of possible item-sets of size 9 is Choose(10,000, 9) = 2,745,826,321,280,434,929,668,521,390,000, which is … well, a lot.

There are several clever algorithms to efficiently extract frequent item-sets from a list of transactions, including the Apriori, Eclat, and FP-growth algorithms. Each algorithm has many variations. The demo program uses a variation of the Apriori algorithm, which is illustrated in Figure 2.

The Apriori Algorithm in Action
Figure 2 The Apriori Algorithm in Action

Briefly, the Apriori algorithm starts with all frequent item-sets of size k = 1—individual items that meet the support threshold in the list of transactions. Then the construction process iteratively adds frequent item-sets of size k = 2, 3 and so on until no new frequent item-sets are found.

The image in Figure 2 shows how item-sets of size k = 4 are constructed. A single list of frequent item-sets of all sizes is maintained. In the figure, there are currently three frequent item-sets with size k = 3: (0, 2, 3), (0, 2, 8) and (0, 3, 6). The algorithm also maintains a list of items that are valid at any given point in time. Here there are five valid items: {0, 2, 3, 6, 8}. Valid items to construct frequent item-sets of size k = 4 are the distinct items in all frequent item-sets of size k-1 = 3.

The algorithm scans each frequent item-set of size k = 3. For each existing item-set, a new candidate frequent item-set of size k = 4 is generated. So, the first candidate is (0, 2, 3, ?). The ? can be filled in with a valid item. The algorithm assumes that the items within an item-set are always stored in order, so the ? can be only 6 or 8 in this case. Each new candidate is examined to count how many times it occurs in the transactions list. If the transaction count meets the minimum support count, the candidate is added to the list of frequent item-sets.

The algorithm greatly reduces the amount of computation needed, compared to the brute-force generation approach. For example, notice that the second existing frequent item-set with size k = 3 is (0, 2, 8). The potential candidates will have form (0, 2, 8, ?). But because no valid item is greater than 8, there are no possible candidates generated.

Overall Program Structure

The overall structure of the demo program, with some WriteLine statements removed and a few minor edits, is presented in Figure 3. To create the demo, I launched Visual Studio 2012 and created a new console application program named FreqItemSets. The demo has no significant .NET dependencies, and any relatively recent version of Visual Studio will work fine. After the template code loaded into the editor, in the Solution Explorer window I renamed file Program.cs to FreqItemSetProgram.cs, and Visual Studio automatically renamed class Program for me.

Figure 3 Demo Program Structure

using System;
using System.Collections.Generic;
namespace FreqItemSets
{
  class FreqItemSetProgram
  {
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin frequent item-set extraction demo\n");
        string[] rawItems = new string[] { "apples", "bread ", "celery", "donuts",
          "eggs  ", "flour ", "grapes", "honey ", "icing ", "jelly ",
             "kale  ", "lettuce" };
        int N = rawItems.Length; // total number of items to deal with ( [0..11] )
        string[][] rawTransactions = new string[10][];
        rawTransactions[0] = new string[] { "apples", "bread ", "celery",
           "flour " };
        rawTransactions[1] = new string[] { "bread ", "eggs  ", "flour " };
        rawTransactions[2] = new string[] { "apples", "bread ", "donuts",
           "eggs  " };
        rawTransactions[3] = new string[] { "celery", "donuts", "flour ",
           "grapes" };
        rawTransactions[4] = new string[] { "donuts", "eggs  " };
        rawTransactions[5] = new string[] { "donuts", "eggs  ", "jelly " };
        rawTransactions[6] = new string[] { "apples", "bread ", "donuts",
           "icing " };
        rawTransactions[7] = new string[] { "bread ", "grapes", "honey " }; 
        rawTransactions[8] = new string[] { "apples", "bread ", "celery",
           "flour ", "kale  " };
        rawTransactions[9] = new string[] { "apples", "bread ", "celery",
           "flour " };
        for (int i = 0; i < rawTransactions.Length; ++i) {
          Console.Write("[" + i + "] : ");
          for (int j = 0; j < rawTransactions[i].Length; ++j)
            Console.Write(rawTransactions[i][j] + "   ");
          Console.WriteLine("");
        }
        List<int[]> transactions = new List<int[]>();
        transactions.Add(new int[] { 0, 1, 2, 5 });
        transactions.Add(new int[] { 1, 4, 5 });
        transactions.Add(new int[] { 0, 1, 3, 4 });
        transactions.Add(new int[] { 2, 3, 5, 6 });
        transactions.Add(new int[] { 3, 4 });
        transactions.Add(new int[] { 3, 4, 9 });
        transactions.Add(new int[] { 0, 1, 3, 8 });
        transactions.Add(new int[] { 1, 6, 7 });
        transactions.Add(new int[] { 0, 1, 2, 5, 10 });
        transactions.Add(new int[] { 0, 1, 2, 5 });
        for (int i = 0; i < transactions.Count; ++i) {
          Console.Write("[" + i + "] : ");
          for (int j = 0; j < transactions[i].Length; ++j)
            Console.Write(transactions[i][j].ToString() + " ") ;
          Console.WriteLine("");
        }
        Console.WriteLine("");
        double minSupportPct = 0.30;
        int minItemSetLength = 2;
        int maxItemSetLength = 4;
        List<ItemSet> frequentItemSets =
          GetFrequentItemSets(N, transactions, minSupportPct,
          minItemSetLength, maxItemSetLength);
        Console.WriteLine("\nFrequent item-sets in numeric form are:");
        for (int i = 0; i < frequentItemSets.Count; ++i)
          Console.WriteLine(frequentItemSets[i].ToString());
        Console.WriteLine("\nFrequent item-sets in string form are:");
        for (int i = 0; i < frequentItemSets.Count; ++i) {
          for (int j = 0; j < frequentItemSets[i].data.Length; ++j) {
            int v = frequentItemSets[i].data[j];
            Console.Write(rawItems[v] + "   ");
          }
          Console.WriteLine("");
        }
        Console.WriteLine("\nEnd frequent item-set extraction demo\n");
        Console.ReadLine();
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);  Console.ReadLine();
      }
    }
    static List<ItemSet> GetFrequentItemSets(int N, List<int[]> transactions,
      double minSupportPct, int minItemSetLength, int maxItemSetLength) { . . }
    static int CountTimesInTransactions(ItemSet itemSet,
      List<int[]> transactions) { . . }
  }
  public class ItemSet { . . }
} // ns

At the top of the source code I deleted all unnecessary references to .NET namespaces, leaving just System and Collections.Generic. The demo begins by setting up a collection of all items in a hypothetical supermarket:

string[] rawItems = new string[] { "apples", "bread ", "celery",
  "donuts", "eggs  ", "flour ", "grapes", "honey ", "icing ",
  "jelly ", "kale  ", "lettuce" };
int N = rawItems.Length; // total items in inventory

Next the program sets up 10 hard-coded transactions. In most scenarios, your transactions will be stored in a text file or SQL database. Notice duplicate transactions are allowed and not all items in inventory are necessarily in a transaction:

string[][] rawTransactions = new string[10][];
rawTransactions[0] = new string[] { "apples", "bread ", "celery", "flour "};
rawTransactions[1] = new string[] { "bread ", "eggs  ", "flour " };
...
rawTransactions[9] = new string[] { "apples", "bread ", "celery", "flour "};

After displaying the raw transactions, the demo sets up a List of 0-based encoded transactions. Instead of hard-coding, in most situations you’d programmatically create the numeric transactions from the raw transactions. Notice that the items in each transaction are distinct and sorted:

List<int[]> transactions = new List<int[]>();
transactions.Add(new int[] { 0, 1, 2, 5 });
transactions.Add(new int[] { 1, 4, 5 });
...
transactions.Add(new int[] { 0, 1, 2, 5 });

After displaying the transactions list, the three input parameter value are set up. Here, the maximum size of a frequent item-set is set to 4. You may want to scan the transactions list and set the value to the length of the longest transaction:

double minSupportPct = 0.30;
int minItemSetLength = 2;
int maxItemSetLength = 4;

All the work is performed by a call to method GetFrequentItemSets:

List<ItemSet> frequentItemSets =
  GetFrequentItemSets(N, transactions, minSupportPct,
  minItemSetLength, maxItemSetLength);

Notice the return result uses a program-defined ItemSet class. The demo concludes by displaying the frequent item-sets in both numeric and string form.

The ItemSet Class

In essence, an item-set is just an array of integers, so no program-defined class is necessary. But, in my opinion, using a class in this case simplifies the code. The ItemSet class is defined in Figure 4.

Figure 4 The ItemSet Class

public class ItemSet
{
  public int N; // data items are [0..N-1]
  public int k; // number of items
  public int[] data; // ex: [0 2 5]
  public int hashValue; // "0 2 5" -> 520 (for hashing)
  public int ct; // num times this occurs in transactions
  public ItemSet(int N, int[] items, int ct) { . . }
  private static int ComputeHashValue(int[] data) { . . }
  public override string ToString() { . . }
  public bool IsSubsetOf(int[] larger) { . . }
  private static int IndexOf(int[] array, int item, int startIdx) { . . }
}

Member field N is the total number of items in inventory. Field k is the size of the item-set. Array data holds the item values. Field hashValue is a unique integer to identify the item-set so duplicate item-sets can be avoided. Field ct is the number of times the item-set appears in the transactions list. The ItemSet constructor is defined:

public ItemSet(int N, int[] items, int ct)
{
  this.N = N;
  this.k = items.Length;
  this.data = new int[this.k];
  Array.Copy(items, this.data, items.Length);
  this.hashValue = ComputeHashValue(items);
  this.ct = ct;
}

The helper method to compute the hashValue is:

private static int ComputeHashValue(int[] data)
{
  int value = 0;
  int multiplier = 1;
  for (int i = 0; i < data.Length; ++i) {
    value = value + (data[i] * multiplier);
    multiplier = multiplier * 10;
  }
  return value;
}

The helper converts the item values to a single integer in reverse. For example, if the items are (0, 2, 5), the hash value is 520. The method works in reverse to deal with leading 0-items, because otherwise (0, 2, 5) and (2, 5) would both hash to 25.

Method IsSubsetOf returns true if the item-set object is a subset of a transaction:

public bool IsSubsetOf(int[] trans)
{
  int foundIdx = -1;
  for (int j = 0; j < this.data.Length; ++j) {
    foundIdx = IndexOf(trans, this.data[j], foundIdx + 1);
    if (foundIdx == -1) return false;
  }
  return true;
}

The method is short but slightly subtle. Because transactions and item-sets are ordered, after one item value of an item-set has been found inside a transaction, the search for the next item value does not have to begin at index 0—it can start at the next index following the location where the previous item value was found. Helper IndexOf is defined:

private static int IndexOf(int[] array, int item, int startIdx)
{
  for (int i = startIdx; i < array.Length; ++i) {
    if (i > item) return -1; // i is past target loc
    if (array[i] == target) return i;
  }
  return -1;
}

Method IndexOf also takes advantage of ordering. If the current search index is larger than the target item value being searched for, the search has gone too far and will never find the item. For example, suppose a transaction is (0, 2, 4, 5, 6, 8) and the target item being searched for is 3. Because transactions are ordered, the very latest value 3 can occur would be in a transaction that has form (0, 1, 2, 3, x, x). The ToString method uses string concatenation for simplicity:

public override string ToString()
{
  string s = "{ ";
  for (int i = 0; i < data.Length; ++i)
    s += data[i] + " ";
  return s + "}" + "   ct = " + ct; ;
}

The GetFrequentItemSets Method

The definition of method GetFrequentItemSets begins:

static List<ItemSet> GetFrequentItemSets(int N, List<int[]> transactions,
  double minSupportPct, int minItemSetLength, int maxItemSetLength)
{
  int minSupportCount = (int)(transactions.Count * minSupportPct);
...

Instead of specifying an input parameter for the support threshold as a percentage of transactions, you may want to use an absolute count. Next, three important collections are instantiated:

Dictionary<int, bool> frequentDict = new Dictionary<int, bool>();
List<ItemSet> frequentList = new List<ItemSet>();
List<int> validItems = new List<int>();
...

Collection frequentList holds all frequent item-sets found. Instead of storing frequent item-sets of all sizes in a single list, an important alternative is to use an array of lists where there are separate lists for each item-set size. Collection frequentDict holds the IDs of those item-sets that have been added to frequentList so duplicates can be avoided. Collection validItems holds item values that, at any given point in the algorithm, can be added to an existing frequent item-set of size k-1 to generate a candidate frequent item-set of size k. Next, the individual item values in the transaction list are counted:

int[] counts = new int[N];
for (int i = 0; i < transactions.Count; ++i)
{
  for (int j = 0; j < transactions[i].Length; ++j) {
    int v = transactions[i][j];
    ++counts[v];
  }
}
...

Then those item values that meet the minimum support count are used to create frequent ItemSet objects of size k = 1, which are then added to the list of frequent items:

for (int i = 0; i < counts.Length; ++i)
{
  if (counts[i] >= minSupportCount) {
    validItems.Add(i); // i is the item-value
    int[] d = new int[1]; // ItemSet ctor wants an array
    d[0] = i;
    ItemSet ci = new ItemSet(N, d, 1); // size 1, ct 1
    frequentList.Add(ci); // it's frequent
    frequentDict.Add(ci.hashValue, true); // record
  } // else skip this item
}
...

The main processing loop, in pseudo-code, is:

loop for size k = 2, 3, until done
  foreach existing freq item-set of size k-1
    foreach valid item
      create a candidate item-set
      count times candidate in transactions
      if candidate meets support threshold
        if candidate not already in frequent list
          add candidate to frequent, rec list
    end each valid item
    update valid items
  end each freq item-set
end main loop

The main processing loop is set up like so:

bool done = false;
for (int k = 2; k <= maxItemSetLength && done == false; ++k)
{
  done = true;
  int numFreq = frequentList.Count;
...

The main loop will exit when all specified sizes have been exam­ined, or when no new item-sets of the current size are found. Because all frequent item-sets are stored in a single list (and that list is added to), the initial size of the list is stored for use by the inner loops, which are set up like so:

for (int i = 0; i < numFreq; ++i)
{
  if (frequentList[i].k != k - 1) continue;
  for (int j = 0; j < validItems.Count; ++j)
  {
    int[] newData = new int[k]; // data for a candidate item-set
...

Two important features of the algorithm are that the algorithm uses only frequent item-sets from the previous iteration to construct new candidate item-sets and it examines only valid item values to complete the candidates. The candidate frequent item-sets are created:

for (int p = 0; p < k - 1; ++p)
  newData[p] = frequentList[i].data[p];
if (validItems[j] <= newData[k - 2]) continue;
newData[k - 1] = validItems[j];
ItemSet ci = new ItemSet(N, newData, -1);
...

This code is mildly tricky. First, data from the current frequent item-set is copied into the candidate. The candidate is completed with the current valid item value and, as described earlier, candidates may be eliminated based on the ordering property of item-sets. The ItemSet constructor accepts a dummy value of -1 for the ct field, because the number of times the candidate appears in the transactions is not yet known.

The two inner loops are tied up with the code shown in Figure 5.

Figure 5 Tying up the Two Inner Loops

...
    if (frequentDict.ContainsKey(ci.hashValue) == true)
      continue;
    int ct = CountTimesInTransactions(ci, transactions);
    if (ct >= minSupportCount)
    {
      ci.ct = ct;
      frequentList.Add(ci);
      frequentDict.Add(ci.hashValue, true);
      done = false;
    }
  } // j
} // i
...

If the candidate already appears in the frequent item-set list, there’s no need to analyze it further. If not, and if the candidate meets the minimum support count, the candidate is a winner and it’s added to the list of frequent item-sets. Boolean done tracks whether or not any new item-sets have been found. For any given value of k, if no new frequent item-sets are found, there’s no possibility that a frequent item-set will ever be found.

After all candidates for the current size k have been constructed and examined, the list of valid items for the next iteration is updated, as shown in Figure 6.

Figure 6 Updating the List of Valid Items

...
  validItems.Clear();
  Dictionary<int, bool> validDict = new Dictionary<int, bool>();
  for (int idx = 0; idx < frequentList.Count; ++idx) {
    if (frequentList[idx].k != k) continue;
    for (int j = 0; j < frequentList[idx].data.Length; ++j) {
      int v = frequentList[idx].data[j]; // item
      if (validDict.ContainsKey(v) == false) {
        validItems.Add(v);
        validDict.Add(v, true);
      }
    }
  }
        validItems.Sort();
} // next k
...

Although not entirely obvious at first, it turns out that when constructing new candidate frequent item-sets of size k using existing frequent item-sets of size k-1, the only valid items to complete the candidates are item values that occur in the frequent item-sets of size k-1. This update process is time-consuming, and in some scenarios you’ll get better performance by skipping it and instead using the original list of valid items created for size k = 1.

The method concludes by filtering the results by minimum item-set length:

...
  List<ItemSet> result = new List<ItemSet>();
  for (int i = 0; i < frequentList.Count; ++i)
  {
    if (frequentList[i].k >= minItemSetLength)
    result.Add(new ItemSet(frequentList[i].N,
      frequentList[i].data, frequentList[i].ct));
  }
  return result;
}

Helper method CountTimesInTransactions, used earlier, is defined:

static int CountTimesInTransactions(ItemSet itemSet,
  List<int[]> transactions)
{
  int ct = 0;
  for (int i = 0; i < transactions.Count; ++i) {
    if (itemSet.IsSubsetOf(transactions[i]) == true)
      ++ct;
  }
  return ct;
}

Wrapping Up

The code and explanation presented in this article should get you up and running if you ever need to determine frequent item-sets from a list of transactions using the Apriori algorithm. Although it’s unlikely you’ll ever be working directly with supermarket transactions, there are many scenarios where extracting frequent item-sets can be useful. In the case of NUI, you can think of transactions as a list of user commands—for example, the text typed into a search text box—and the items as the words that make up the command. Identifying those words that frequently occur together can help generate better search results.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. He can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Richard Hughes (Microsoft Research)