Cross-Validation (Analysis Services - Data Mining)
Cross-validation is a standard tool in analytics and is an important feature for helping you develop and fine-tune data mining models. After you have created a mining structure and related mining models, you can use cross-validation to ascertain the validity of the models. Cross-validation has the following applications:
Validating the robustness of a particular mining model.
Evaluating multiple models from a single statement.
Building multiple models and then identifying the best model based on statistics.
This section describes how to use the cross-validation features provided in SQL Server 2008 and how to interpret the results of cross-validation for a particular model or data set. You can execute cross-validation as a set of stored procedures. You can also use cross-validation from the Data Mining Designer, in Business Intelligence Development Studio.
Overview of Cross-Validation Process
Cross-validation consists of two phases: training and result generation. These phases include the following steps:
Selecting a target mining structure.
Specifying the models you want to test.
Specifying the number of folds into which to partition the structure data.
Analysis Services creates and trains as many models as there are folds.
To generate results, you must specify parameters for testing the trained models.
Specifying the source of the testing data. (This feature is available only when you use the stored procedures.)
Specifying the predictable attribute, predicted value, and accuracy threshold.
Analysis Services then returns a set of accuracy metrics for each fold in each model. You can also return accuracy metrics for the data set as a whole.
Using Cross-Validation in Data Mining Designer
If you perform cross-validation by using the Cross-Validation tab of the Mining Accuracy Chart view in Business Intelligence Development Studio, you can configure the training and accuracy results parameters in a single form. This makes it easier to set up and view results. You can measure the accuracy of all mining models that are related to a single mining structure and then immediately view the results in an HTML report.
For more information about the report format, and the accuracy metrics that are provided by cross-validation, see Cross-Validation Report (Analysis Services - Data Mining).
For information about how to configure cross-validation parameters in Business Intelligence Development Studio, see Cross-Validation Tab (Mining Accuracy Chart View).
Using Cross-Validation Stored Procedures
For advanced users, cross-validation is also available as four system stored procedures. You can run the stored procedures by connecting to an instance of Analysis Services 2008 from SQL Server Management Studio, or from any managed code application.
The stored procedures are grouped by mining model type. One pair of procedures works with clustering models only; the other pair works with all other types of mining models.
Note
Cross-validation cannot be used with any model that contains a KEY TIME column or a KEY SEQUENCE column.
For each type of mining model, there are two stored procedures. The first procedure creates as many partitions as you specify within the data set, and returns accuracy results for each partition. For each metric, Analysis Services calculates the mean and standard deviation for the partitions.
The second stored procedure does not partition the data set, but generates accuracy results for the specified data set as a whole. You can also use the second stored procedure if the mining structure and its models have already been partitioned and processed.
Partition data and generate metrics for partitions
SystemGetCrossValidationResults (Analysis Services - Data Mining)
SystemGetClusterCrossValidationResults (Analysis Services - Data Mining)
Generate metrics for entire data set
SystemGetAccuracyResults (Analysis Services - Data Mining)
SystemGetClusterAccuracyResults (Analysis Services - Data Mining)
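For example, the following DMX statement is a minimal sketch of what a call to SystemGetCrossValidationResults might look like when run from a DMX query window in SQL Server Management Studio. The structure name, the model names, and the parameter values are hypothetical, and the parameter order shown is an assumption; confirm the exact syntax in the SystemGetCrossValidationResults reference topic before you use it.

```
// Hypothetical sketch only: the structure name, model names, values, and
// parameter order are assumptions. See SystemGetCrossValidationResults
// (Analysis Services - Data Mining) for the authoritative syntax.
CALL SystemGetCrossValidationResults(
    [Targeted Mailing],     // mining structure to validate
    [TM Decision Tree],     // mining models in that structure to test
    [TM Naive Bayes],
    10,                     // number of folds
    5000,                   // maximum number of cases to use
    'Bike Buyer',           // predictable attribute
    1,                      // target state (the value to predict)
    0.5                     // state threshold (accuracy bar)
)
```

The result set contains one row of accuracy metrics per fold for each model, and for each metric Analysis Services also calculates the mean and standard deviation across the partitions, which makes it easy to compare how stable each model is.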
Configuring Cross-Validation
You can customize the way that cross-validation works to control the number of cross-sections, the models that are tested, and the accuracy bar for predictions. If you use the cross-validation stored procedures, you can also specify the data set that is used for validating the models. Because there are so many options, you can easily produce many different sets of results that must then be compared and analyzed.
This section provides information to help you configure cross-validation appropriately.
Setting the Number of Partitions
When you specify the number of partitions, you determine how many temporary models will be created. For each partition, a cross-section of the data is flagged for use as the test set, and a new model is created by training on the data that is not in that partition. This process is repeated until Analysis Services has created and tested the specified number of models. The data that you made available for cross-validation is distributed evenly among all partitions.
The example in the diagram illustrates the usage of data if three folds are specified.
In the scenario in the diagram, the mining structure contains a holdout data set that is used for testing, but the test data set has not been included for cross-validation. As a result, all the data in the training data set, 70 percent of the data in the mining structure, is used for cross-validation. The cross-validation report shows the total number of cases used in each partition.
You can also specify the amount of data that is used during cross-validation by specifying the overall number of cases to use. The cases are distributed evenly across all folds.
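As a rough sketch of the arithmetic, if $n$ cases are available for cross-validation and you specify $k$ folds, then:

$$\text{cases per fold} \approx \frac{n}{k}, \qquad \text{training cases per temporary model} \approx n - \frac{n}{k}$$

For example, in the three-fold scenario described above, each fold holds about one third of the available cases, and each of the three temporary models is trained on the other two thirds.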
If the mining structure is stored in an instance of SQL Server Analysis Services, the maximum value that you can set for the number of folds is 256, or the number of cases, whichever is less. If you are using a session mining structure, the maximum number of folds is 10.
Note
As you increase the number of folds, the time required to perform cross-validation increases accordingly, because a model must be generated and tested for each fold. You may experience performance problems if the number of folds is too high.
Defining the Testing Data
When you run the stored procedures that calculate accuracy, SystemGetAccuracyResults (Analysis Services - Data Mining) or SystemGetClusterAccuracyResults (Analysis Services - Data Mining), you can specify the source of the data that is used for testing during cross-validation, by using combinations of the following options:
Use only the training data.
Include an existing testing data set.
Use only the testing data set.
Apply existing filters to each model.
Any combination of the training set, testing set, and model filters.
You control the composition of the testing data set by providing a value for the DataSet parameter.
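The following DMX statement is a minimal sketch of supplying the DataSet parameter to SystemGetAccuracyResults. As before, the structure and model names are hypothetical; in particular, the numeric DataSet value shown is only an assumed placeholder for "training plus test cases", and the actual flag values and parameter order must be taken from the SystemGetAccuracyResults reference topic.

```
// Hypothetical sketch only: names, values, parameter order, and especially the
// DataSet flag value are assumptions. See SystemGetAccuracyResults (Analysis
// Services - Data Mining) for the actual flag encoding and syntax.
CALL SystemGetAccuracyResults(
    [Targeted Mailing],     // mining structure
    [TM Decision Tree],     // mining model to evaluate
    3,                      // DataSet flags (assumed here to mean training + test cases)
    'Bike Buyer',           // predictable attribute
    1,                      // target state
    0.5                     // state threshold
)
```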
If you perform cross-validation by using the Cross-Validation report in the Data Mining Designer, you cannot change the data set that is used. By default, the training cases for each model are used. If a filter is associated with a model, the filter is applied.
Cross-Validation of Filtered Mining Models
If you are testing multiple mining models, and the models have filters, each model is filtered separately. You cannot add a filter to a model or change the filter for a model during cross-validation.
Because cross-validation by default tests all mining models that are associated with a structure, you may receive inconsistent results if some models have a filter and others do not. To ensure that you compare only models that have the same filter, use the stored procedures and specify an explicit list of mining models. Alternatively, use only the mining structure test set with no filters, so that a consistent set of data is used for all models.
Setting the Accuracy Threshold
The state threshold lets you set the accuracy bar for predictions. For each case, the model calculates the predict probability, which is the probability that the predicted state is correct. If the predict probability exceeds the accuracy bar, the prediction is counted as correct; if not, the prediction is counted as incorrect. You control this value by setting State Threshold to a number between 0.0 and 1.0, where values closer to 1 require a high level of confidence before a prediction is counted as correct, and values closer to 0 accept predictions even when the predict probability is low. The default value for state threshold is NULL, which means that the predicted state with the highest probability is considered the target value.
Note
You can set a value of 0.0, but it is meaningless, because every prediction will be counted as correct, even those with zero probability. Be careful not to accidentally set State Threshold to 0.0.
For example, you have three models that predict the column [Bike Buyer], and the value that you want to predict is 1, meaning "yes, will buy." The three models return predictions for this value with predict probabilities of 0.05, 0.15, and 0.8. If you set the state threshold to 0.10, two of the predictions are counted as correct. If you set the state threshold to 0.5, only one model is counted as having returned a correct prediction. If you use the default value, NULL, a prediction is counted as correct whenever the model's most probable state matches the actual value, regardless of the associated predict probability.
Metrics Used in Cross-Validation
Different accuracy metrics are generated depending on the particular type of mining model, the data type of the predictable attribute, and the predictable attribute value, if any. This section defines the principal metrics for reference. For a list of the accuracy metrics that are returned in the report for each model, grouped by type, see Cross-Validation Report (Analysis Services - Data Mining).
| Measure | Applies To | Implementation |
|---|---|---|
| Classification: true positive, false positive, true negative, false negative | Discrete attribute, value is specified | Count of rows or values in the partition where the predict probability is greater than the state threshold, and where the predicted state matches the target state. Cases with missing values are excluded. |
| Classification: pass/fail | Discrete attribute, no specified target | Count of rows or values in the partition where the predicted state matches the target state, and where the predict probability value is greater than 0. Cases with missing values are excluded. |
| Lift | Discrete attribute. Target value can be specified but is not required. | The mean log likelihood for all rows that have values for the target attribute, where the log likelihood for each case is calculated as Log(ActualProbability/MarginalProbability). To compute the mean, the sum of the log likelihood values is divided by the number of rows in the input data set, excluding rows with missing values for the target attribute. Lift can be either negative or positive. A positive value indicates an effective model that outperforms a random guess. |
| Log score | Discrete attribute. Target value can be specified but is not required. | Log of the actual probability for each case, summed, and then divided by the number of rows in the input data set, excluding rows with missing values for the target attribute. Because probability is expressed as a decimal fraction, log scores are always negative numbers. A score closer to 0 is better. |
| Case likelihood | Cluster | Sum of the cluster likelihood scores for all cases, divided by the number of cases in the partition, excluding rows with missing values for the target attribute. |
| Mean absolute error | Continuous attribute | Sum of the absolute error for all cases in the partition, divided by the number of cases in the partition, excluding rows with missing values. |
| Root mean square error | Continuous attribute | Square root of the mean squared error for the partition. |
| Root mean squared error | Discrete attribute, value is specified | Square root of the sum of the squares of the complement of the predict probability (1 minus the predict probability for the actual value), divided by the number of cases in the partition, excluding rows with missing values for the target attribute. |
| Root mean squared error | Discrete attribute, no specified target | Square root of the sum of the squares of the complement of the predict probability (1 minus the predict probability for the actual value), divided by the number of cases in the partition, excluding cases with missing values for the target attribute. |
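Expressed as formulas, these descriptions can be read as follows (this is an interpretation of the table above, not an excerpt from the product documentation). Here $p_i$ is the predict probability of the actual value for case $i$, $\bar{p}_i$ is the marginal probability of that value, $y_i$ and $\hat{y}_i$ are the actual and predicted values of a continuous target, and $n$ is the number of cases that have a value for the target attribute:

$$\text{Log score} = \frac{1}{n}\sum_{i=1}^{n}\log p_i \qquad\qquad \text{Lift} = \frac{1}{n}\sum_{i=1}^{n}\log\frac{p_i}{\bar{p}_i}$$

$$\text{RMSE (discrete target)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(1 - p_i\right)^2} \qquad\qquad \text{MAE (continuous target)} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Because each $p_i$ lies between 0 and 1, $\log p_i$ is negative, which is why log scores are always negative and values closer to 0 indicate a better model.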