Processing Structures and Models (Analysis Services - Data Mining)
A data mining object is only an empty container until it has been processed. Processing a data mining model is also called training.
Processing mining structures: A mining structure reads data from an external data source, as defined by the column bindings and usage metadata. The data is read in full and then analyzed to extract various statistics. Analysis Services stores a compact representation of the data, suitable for analysis by data mining algorithms, in a local cache. You can either keep this cache or delete it after your models have been processed. By default, the cache is kept. For more information, see How to: Process a Mining Structure.
Processing mining models: A mining model is empty, containing definitions only, until it is processed. To process a mining model, the mining structure that it is based on must have been processed. The mining model gets the data from the mining structure cache, applies any filters that may have been created on the model, and then passes the data set through the algorithm to detect patterns. After the model is processed, the model stores only the results of processing, not the data itself. For more information, see How to: Process a Mining Model.
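The two processing steps can be pictured with a small Python sketch. It is a conceptual illustration only, not Analysis Services code: the function names, the sample rows, and the counting "algorithm" are all hypothetical stand-ins chosen for this example.

```python
def process_structure(source_rows, bound_columns):
    """Read the source in full and cache only the bound columns."""
    return [{col: row[col] for col in bound_columns} for row in source_rows]


def process_model(structure_cache, row_filter=None):
    """Train from the structure cache; return patterns only, never the rows."""
    if structure_cache is None:
        raise ValueError("the mining structure must be processed first")
    # Apply the model's filter (if any) to the cached cases.
    cases = [r for r in structure_cache if row_filter is None or row_filter(r)]
    # Placeholder "algorithm" (an assumption): count cases per Buyer value.
    counts = {}
    for case in cases:
        counts[case["Buyer"]] = counts.get(case["Buyer"], 0) + 1
    return {"case_count": len(cases), "buyer_counts": counts}


# Hypothetical source data; "Notes" is not bound to the structure.
source = [
    {"Id": 1, "Age": 34, "Buyer": "yes", "Notes": "unbound column"},
    {"Id": 2, "Age": 51, "Buyer": "no", "Notes": "unbound column"},
    {"Id": 3, "Age": 29, "Buyer": "yes", "Notes": "unbound column"},
]

cache = process_structure(source, ["Id", "Age", "Buyer"])
model = process_model(cache, row_filter=lambda r: r["Age"] < 50)
print(model)  # {'case_count': 2, 'buyer_counts': {'yes': 2}}
```

The model result holds only summary patterns, while the case rows stay in the structure cache, which mirrors the division of responsibilities described above.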
The following diagram illustrates the flow of data when a mining structure is processed, and when a mining model is processed.
Queries on the Relational Store during Processing
There are three phases to processing: querying the source data, determining raw statistics, and using the model definition and algorithm to train the mining model.
The Analysis Services server issues queries to the database that provides the raw data. This database might be an instance of SQL Server 2008 or an earlier version of the SQL Server database engine. When you process a data mining structure, the data in the source is transferred to the mining structure and persisted on disk in a new, compressed format. Not every column in the data source is processed: only the columns that are included in the mining structure, as defined by the bindings.
Using this data, Analysis Services builds an index of all data and discretized columns, and creates a separate index for continuous columns. One query is issued for each nested table to create the index, and an additional query per nested table is generated to process the relationship between that nested table and the case table. Multiple queries are needed in order to process a special internal Online Analytical Processing (OLAP) cube. You can limit the number of queries that Analysis Services sends to the relational store by setting the DatabaseConnectionPoolMax server property. For more information, see OLAP Properties.
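As a rough illustration of the nested-table rule, the following Python sketch lists the queries that the description implies: two per nested table. The table names and the function are hypothetical, and the sketch covers only the nested-table queries named in the text, not any other queries the server may issue.

```python
def nested_table_queries(nested_tables, case_table="Cases"):
    """List the nested-table queries implied by the description (two per table)."""
    plan = []
    for table in nested_tables:
        plan.append(f"index: {table}")
        plan.append(f"relationship: {table} <-> {case_table}")
    return plan


# Hypothetical nested tables attached to a hypothetical case table.
plan = nested_table_queries(["Purchases", "Ratings"], case_table="Customers")
for query in plan:
    print(query)
print(len(plan))  # 4
```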
When you process the model, the model does not reread the data from the data source, but instead gets the summary of the data from the mining structure. Using the cube that was created, together with the cached index and case data, the server creates independent threads to train the models.
In SQL Server Enterprise, all processing takes place in parallel. In SQL Server Standard, processing is serialized.
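The difference between parallel and serialized processing can be sketched in Python as follows. This is an analogy rather than a description of the server's internals: `train`, `process_models`, and the model names are hypothetical, and a thread pool merely stands in for the server's independent training threads.

```python
from concurrent.futures import ThreadPoolExecutor


def train(model_name):
    """Placeholder for training one model from the shared structure cache."""
    return f"{model_name}: trained"


def process_models(model_names, parallel):
    """Train all models either concurrently or one after another."""
    if parallel:
        # One worker per model, comparable to independent training threads.
        with ThreadPoolExecutor(max_workers=max(1, len(model_names))) as pool:
            return list(pool.map(train, model_names))
    # Serialized: each model is trained only after the previous one finishes.
    return [train(name) for name in model_names]


names = ["Clustering", "Decision Trees", "Naive Bayes"]
print(process_models(names, parallel=True))
print(process_models(names, parallel=False))
```

Both calls produce the same results in the same order; only the scheduling differs.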
Viewing the Results of Processing
After a mining structure has been processed, it contains a compact representation of the data for use in statistical analysis. If the cache has not been cleared, you can access the data in this cache in the following ways:
Creating a Data Mining Extensions (DMX) query on the model and drilling through to the structure. For more information, see SELECT FROM <model>.CASES (DMX).
Browsing a model based on the structure, and using one of the options in the user interface to drill through to structure cases. For more information, see Viewing a Data Mining Model, or How to: Drill Through to Case Data from a Mining Model.
Creating a DMX query on the structure cases. For more information, see SELECT FROM <structure>.CASES.
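The DMX statements behind these options can be sketched with a small Python helper that only builds the query text. The helper functions and the object names are hypothetical. The statement shapes follow the SELECT FROM <model>.CASES and SELECT FROM <structure>.CASES forms referenced above, and the StructureColumn call reflects the DMX function of that name as understood here, so verify the exact syntax against the DMX reference before use.

```python
def model_cases_query(model, columns="*"):
    """DMX text that drills through from a model to its cases."""
    return f"SELECT {columns} FROM [{model}].CASES"


def structure_cases_query(structure, columns="*"):
    """DMX text that reads the cached cases of a mining structure."""
    return f"SELECT {columns} FROM MINING STRUCTURE [{structure}].CASES"


def structure_column_query(model, structure_column):
    """DMX text that returns a structure column while querying model cases."""
    return f"SELECT StructureColumn('{structure_column}') FROM [{model}].CASES"


# Hypothetical object names, used for illustration only.
print(model_cases_query("Customer Clusters"))
print(structure_cases_query("Customer Structure"))
print(structure_column_query("Customer Clusters", "Occupation"))
```

These queries return data only if the structure cache has been kept and drillthrough is permitted on the objects involved.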
After a mining model has been processed, it contains only the patterns that were derived from analysis, and mappings from the model results to the cached training data. You can browse or query the model results, called model content, or you can query the model and structure cases, if they have been cached.
The model content for each mining model depends on the algorithm that was used to create it. For example, if one model is a clustering model and another is a decision trees model, the model content is very different even though the models use exactly the same data. For more information, see Mining Model Content (Analysis Services - Data Mining).
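To make the point concrete, the following Python sketch runs two toy procedures over the same cases and prints very different "content". Neither function is one of the Microsoft algorithms; both are deliberately simplified stand-ins, and the data is invented for the example.

```python
# Hypothetical cases: (age, buyer) pairs shared by both toy models.
cases = [(25, "yes"), (30, "yes"), (45, "no"), (60, "no")]


def toy_cluster_content(rows):
    """Clustering-like content: mean age of two groups split at the overall mean.

    Assumes the ages are not all identical, so both groups are non-empty.
    """
    mean = sum(age for age, _ in rows) / len(rows)
    low = [age for age, _ in rows if age <= mean]
    high = [age for age, _ in rows if age > mean]
    return {"cluster_means": [sum(low) / len(low), sum(high) / len(high)]}


def toy_tree_content(rows):
    """Tree-like content: the single age threshold with the fewest errors."""
    best = None
    for threshold in sorted({age for age, _ in rows}):
        left = [label for age, label in rows if age <= threshold]
        right = [label for age, label in rows if age > threshold]
        # Errors if each side predicts its own majority label.
        errors = (min(left.count("yes"), left.count("no"))
                  + min(right.count("yes"), right.count("no")))
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return {"split": f"Age <= {best[0]}", "misclassified": best[1]}


print(toy_cluster_content(cases))  # {'cluster_means': [27.5, 52.5]}
print(toy_tree_content(cases))     # {'split': 'Age <= 30', 'misclassified': 0}
```

The first result describes groups of similar cases, while the second describes a rule for predicting an outcome, even though the input rows are identical.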