Microsoft Association Algorithm
The Microsoft Association algorithm is an association algorithm provided by Microsoft SQL Server 2005 Analysis Services (SSAS) that is useful for recommendation engines. A recommendation engine recommends products to customers based on items they have already bought, or in which they have indicated an interest. The Microsoft Association algorithm is also useful for market basket analysis. For more information about market basket analysis, see Lesson 4: Building the Market Basket Scenario in the Data Mining Tutorial.
Association models are built on datasets that contain identifiers both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. An association model is made up of a series of itemsets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items that already exist in the customer's shopping cart. The following diagram shows a series of rules in an itemset.
As the diagram illustrates, the Microsoft Association algorithm can potentially find many rules within a dataset. The algorithm uses two parameters, support and probability, to describe the itemsets and rules that it generates. For example, if X and Y represent two items that could be in a shopping cart, the support parameter is the number of cases in the dataset that contain both X and Y. By using the support parameter in combination with the user-defined MINIMUM_SUPPORT and MAXIMUM_SUPPORT parameters, the algorithm controls the number of itemsets that are generated. The probability parameter, also called confidence, represents the fraction of the cases containing X that also contain Y. By using the probability parameter in combination with the MINIMUM_PROBABILITY parameter, the algorithm controls the number of rules that are generated.
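The two measures can be illustrated with a small calculation. The following Python sketch is only an illustration of the definitions above, not the Analysis Services implementation; the sample carts and the function names `support` and `probability` are hypothetical.

```python
# Hypothetical shopping carts: each case is the set of items in one cart.
cases = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X"},
    {"Y", "Z"},
    {"X", "Y"},
]


def support(itemset, cases):
    """Count the cases that contain every item in the itemset."""
    return sum(1 for case in cases if itemset <= case)


def probability(antecedent, consequent, cases):
    """Fraction of the cases containing the antecedent that also contain the consequent."""
    base = support(antecedent, cases)
    if base == 0:
        return 0.0  # No case contains the antecedent, so no rule can be supported.
    return support(antecedent | consequent, cases) / base


support_xy = support({"X", "Y"}, cases)           # 3 carts contain both X and Y
prob_x_to_y = probability({"X"}, {"Y"}, cases)    # 3 of the 4 carts with X also have Y
```

In this sample, the itemset {X, Y} has a support of 3, and the rule "if X, then Y" has a probability of 0.75.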
The Adventure Works Cycles company is redesigning the functionality of its Web site. The goal of the redesign is to increase sell-through of products. Because the company records each sale in a transactional database, it can use the Microsoft Association algorithm to identify sets of products that tend to be purchased together. The company can then predict additional items that a customer may be interested in, based on items that are already in the customer's shopping basket.
How the Algorithm Works
The Microsoft Association algorithm traverses a dataset to find items that appear together in a case. The algorithm then groups into itemsets any associated items that appear in at least the number of cases that is specified by the MINIMUM_SUPPORT parameter. For example, an itemset could be "Mountain 200=Existing, Sport 100=Existing", and could have a support of 710. The algorithm then generates rules from the itemsets. These rules are used to predict the presence of an item in the database, based on the presence of other specific items that the algorithm identifies as important. For example, a rule could be "if Touring 1000=Existing and Road bottle cage=Existing, then Water bottle=Existing", and could have a probability of 0.812. In this example, the algorithm identifies that the presence in the basket of the Touring 1000 and the Road bottle cage predicts that a water bottle would also likely be in the basket.
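The two stages described above (collect itemsets that meet a minimum support, then derive rules that meet a minimum probability) can be sketched in Python. This is a simplified brute-force illustration under stated assumptions, not the algorithm that Analysis Services actually runs: the function names, the sample baskets, and the restriction to single-item conclusions are all hypothetical choices made for the example.

```python
from itertools import combinations


def find_itemsets(cases, minimum_support, maximum_itemset_size=3):
    """Return {itemset: support} for every itemset found in at least
    minimum_support cases (an absolute count in this sketch)."""
    items = sorted(set().union(*cases))
    frequent = {}
    for size in range(1, maximum_itemset_size + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for case in cases if candidate <= case)
            if count >= minimum_support:
                frequent[candidate] = count
    return frequent


def generate_rules(frequent, minimum_probability):
    """Derive (antecedent, predicted item, probability) rules from the itemsets."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            # Every subset of a frequent itemset is at least as frequent,
            # so the antecedent is always present in the dictionary.
            prob = count / frequent[antecedent]
            if prob >= minimum_probability:
                rules.append((antecedent, item, prob))
    return rules


# Hypothetical baskets.
baskets = [
    {"Touring 1000", "Road bottle cage", "Water bottle"},
    {"Touring 1000", "Road bottle cage", "Water bottle"},
    {"Touring 1000", "Road bottle cage"},
    {"Water bottle"},
    {"Touring 1000"},
]

frequent = find_itemsets(baskets, minimum_support=2)
rules = generate_rules(frequent, minimum_probability=0.6)
```

With these baskets, the itemset "Touring 1000, Road bottle cage" has a support of 3, and two of those three baskets also contain a water bottle, so the rule predicting the water bottle passes the 0.6 threshold. Enumerating every combination is only practical for tiny inputs; it is used here to keep the sketch short.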
Using the Algorithm
An association model must contain a key column, input columns, and one predictable column. The input columns must be discrete. The input data for an association model is often contained in two tables. For example, one table may contain customer information while another table contains customer purchases. You can input this data into the model by using a nested table. For more information about nested tables, see Nested Tables.
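The shape of such two-table input can be pictured with a short Python sketch. It only mimics how purchase rows are attached to their parent case; it is not how a nested table is declared in Analysis Services, and the table contents, the column names (`CustomerKey`, `Region`, `Products`), and the `build_cases` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical case table: one row per customer, identified by the key column.
customers = [
    {"CustomerKey": 1, "Region": "Europe"},
    {"CustomerKey": 2, "Region": "Pacific"},
]

# Hypothetical purchase table: many rows per customer.
purchases = [
    (1, "Mountain 200"),
    (1, "Sport 100"),
    (2, "Touring 1000"),
    (2, "Road bottle cage"),
    (2, "Water bottle"),
]


def build_cases(customers, purchases):
    """Attach each customer's purchases to the customer row as a nested set."""
    items_by_customer = defaultdict(set)
    for customer_key, product in purchases:
        items_by_customer[customer_key].add(product)
    return [
        {**customer, "Products": items_by_customer.get(customer["CustomerKey"], set())}
        for customer in customers
    ]


cases = build_cases(customers, purchases)
```

Each resulting case carries one key value and a variable-length group of items, which is the structure that the association model treats as an itemset source.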
The Microsoft Association algorithm supports specific input column content types, predictable column content types, and modeling flags, which are listed in the following table.
Input column content types
Cyclical, Discrete, Discretized, Key, Table, and Ordered
Predictable column content types
Cyclical, Discrete, Discretized, Table, and Ordered
Modeling flags
MODEL_EXISTENCE_ONLY and NOT NULL
All Microsoft algorithms support a common set of functions. However, the Microsoft Association algorithm supports additional functions, listed in the following table.
For a list of the functions that are common to all Microsoft algorithms, see Data Mining Algorithms. For more information about how to use those functions, see Data Mining Extensions (DMX) Function Reference.
The Microsoft Association algorithm does not support using the Predictive Model Markup Language (PMML) to create mining models.
The Microsoft Association algorithm supports several parameters that affect the performance and accuracy of the resulting mining model. The following table describes each parameter.
MINIMUM_SUPPORT
Specifies the minimum number of cases that must contain the itemset before the algorithm generates a rule. Setting this value to less than 1 specifies the minimum number of cases as a percentage of the total cases. Setting this value to a whole number greater than 1 specifies the minimum number of cases as the absolute number of cases that must contain the itemset. The algorithm may increase the value of this parameter if memory is limited.
The default is 0.03.
MAXIMUM_SUPPORT
Specifies the maximum number of cases in which an itemset can have support. If this value is less than 1, the value represents a percentage of the total cases. Values greater than 1 represent the absolute number of cases that can contain the itemset.
The default is 1.
MINIMUM_ITEMSET_SIZE
Specifies the minimum number of items that are allowed in an itemset.
The default is 1.
MAXIMUM_ITEMSET_SIZE
Specifies the maximum number of items that are allowed in an itemset. Setting this value to 0 specifies that there is no limit to the size of the itemset.
The default is 3.
MAXIMUM_ITEMSET_COUNT
Specifies the maximum number of itemsets to produce. If no number is specified, the default is used. Itemsets are ranked by support only. Among itemsets that have the same support, ordering is arbitrary.
The default is 200000.
MINIMUM_PROBABILITY
Specifies the minimum probability that a rule is true. For example, setting this value to 0.5 specifies that no rule with less than fifty percent probability is generated.
The default is 0.4.
OPTIMIZED_PREDICTION_COUNT
Defines the number of items to be cached or optimized for prediction.
The default value is 0. When the default is used, the algorithm will produce as many predictions as requested in the query.
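The support settings described above accept either a fractional value or an absolute count. The following Python sketch shows one way to read that convention; it is an interpretation of the descriptions, not the Analysis Services implementation. The helper names are hypothetical, and treating a value of exactly 1 as 100 percent of the cases is an assumption, because the descriptions only cover values below and above 1.

```python
def resolve_case_threshold(value, total_cases):
    """Convert a support setting into an absolute number of cases.

    Values below 1 are read as a fraction of all cases, and values above 1
    as an absolute count. Assumption: exactly 1 means 100 percent of cases.
    """
    if value <= 1:
        return value * total_cases
    return value


def within_support(count, total_cases, minimum_support=0.03, maximum_support=1):
    """Check an itemset's case count against both support settings.

    The keyword defaults mirror the default values given in the
    descriptions above (0.03 and 1).
    """
    lower = resolve_case_threshold(minimum_support, total_cases)
    upper = resolve_case_threshold(maximum_support, total_cases)
    return lower <= count <= upper


# With 10,000 cases, a minimum of 0.03 corresponds to about 300 cases,
# so an itemset found in 710 cases is kept and one found in 20 is not.
kept = within_support(710, 10_000)
dropped = within_support(20, 10_000)
```

Passing a whole number such as `minimum_support=500` makes the same check use an absolute count of 500 cases instead of a fraction.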
17 November 2008
15 September 2007