Validating Data Mining Models (Analysis Services - Data Mining)
Validation is the process of assessing how well your mining models perform against real data. It is important that you validate your mining models by understanding their quality and characteristics before you deploy them into a production environment.
There are several approaches for assessing the quality and characteristics of a data mining model. The first includes the use of various measures of statistical validity to determine whether there are problems in the data or in the model. Second, you might separate the data into training and testing sets to test the accuracy of predictions. Finally, you might ask business experts to review the results of the data mining model to determine whether the discovered patterns have meaning in the targeted business scenario. All of these methods are useful in data mining methodology and are used iteratively as you create, test, and refine models to answer a specific problem.
This section introduces some basic concepts related to model quality, and introduces the strategies for model validation that are provided in Microsoft SQL Server 2008 Analysis Services. For an overview of how model validation fits into the larger data mining process, see Data Mining Projects (Analysis Services - Data Mining).
Criteria for Measuring Data Mining Models
Most methods for validating a data mining model do not answer business questions directly, but provide the metrics that can be used to guide a business or development decision. There is no comprehensive rule that can tell you when a model is good enough, or when you have enough data.
Measures of data mining generally fall into the categories of accuracy, reliability, and usefulness.
Accuracy, Reliability, and Usefulness
Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has been provided. There are various measures of accuracy, but all measures of accuracy are dependent on the data that is used. In reality, values might be missing or approximate, or the data might have been changed by multiple processes. Particularly in the phase of exploration and development, you might decide to accept a certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a particular store based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong accounting method. Therefore, measurements of accuracy must be balanced by assessments of reliability.
Reliability assesses the way that a data mining model performs on different data sets. A data mining model is reliable if it generates the same type of predictions or finds the same general kinds of patterns regardless of the test data that is supplied. For example, the model that you generate for the store that used the wrong accounting method would not generalize well to other stores, and therefore would not be reliable.
Usefulness includes various metrics that tell you whether the model provides useful information. For example, a data mining model that correlates store location with sales might be both accurate and reliable, but might not be useful, because you cannot generalize that result by adding more stores at the same location. Moreover, it does not answer the fundamental business question of why certain locations have more sales. You might also find that a model that appears successful in fact is meaningless, because it is based on cross-correlations in the data.
Microsoft Data Mining Framework
CRISP-DM is a well-known methodology that describes the steps in defining, developing, and implementing a data mining project. However, CRISP-DM is a conceptual framework that does not provide specific guidance in how to scope and schedule a project. To better meet the particular needs of business users who are interested in data mining but do not know where to begin planning, and the needs of developers who might be skilled in .NET application development but are new to data mining, Microsoft has developed a method for implementing a data mining project that includes a comprehensive system of evaluation.
For more information, see the Microsoft Data Mining Resources page.
Approaches to Mining Model Validation in SQL Server Analysis Services
SQL Server 2008 supports multiple approaches to validation of data mining solutions that support all phases of the data mining development methodology.
Partitioning Data into Training and Testing Sets
Partitioning data into training and testing sets is an established technique for preparing data for evaluation. Some portion of data from the training data set is reserved for testing, and the rest of the data is used for training. After the model is complete, the model is used to make predictions against the test set. Because the data in the training set is randomly selected from the same data that is used for training, the metrics of accuracy that you derive from testing are less likely to be affected by data discrepancies and therefore better reflect the characteristics of the model.
For more information, see Partitioning Data into Training and Testing Sets (Analysis Services - Data Mining).
Cross-Validation of Mining Models
Cross-validation lets you partition a dataset into many smaller cross-sections, and create multiple models on the cross-sections to test the validity of the complete data set. Analysis Services then generates detailed accuracy metrics for each partition. By using this information, you can improve the quality of an individual model, or identify the best models for a particular set of data.
For more information, see Cross-Validation (Analysis Services - Data Mining).
Charting Mining Model Accuracy
Microsoft SQL Server Analysis Services provides tools to help you graph prediction accuracy, test the model against new or existing data, or compare multiple models in charts and reports.
A lift chart is a method of visualizing the improvement that you get from using a data mining model, when you compare it to random guessing. You can also create profit charts that associate financial gain or costs with the use of a mining model, and scatter plots for regression models. A classification matrix is a method of sorting good and bad guesses into a table so that you can quickly and easily gauge how accurately the model predicts the target value.
For more information, Tools for Charting Model Accuracy (Analysis Services - Data Mining).
You can also filter models in various ways to train and test different combinations of the same source data. For more information, see Measuring Mining Model Accuracy (Analysis Services - Data Mining).
Browsing and Querying Model Content and Cases
Analysis Services provides a set of data mining viewers for browsing and exploring the model. You can also create content queries that help you understand the model better and find unexpected problems in your approach or in the data. When you create a content query by using Data Mining Extensions (DMX), you can get statistical information about the patterns discovered by the mining model, or retrieve cases that support specific patterns found by the model. You can also drill through to detail in the underlying mining structure, to find or present detailed information that was not included in the model, or take action on patterns found in the data.
For more information about how to query model content, see Querying Data Mining Models (Analysis Services - Data Mining).
For more information about browsing model content, see Viewing a Data Mining Model.
For more information about how to interpret the model content for specific algorithms, see Data Mining Algorithms (Analysis Services - Data Mining).