Algorithm Parameters (SQL Server Data Mining Add-ins)
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
When you perform data mining by using the Table Analysis Tools for Excel, you do not need to configure the data mining algorithm or its parameters; each tool analyzes your data and automatically selects the optimum parameters. However, if you want to modify a model, or create a mining model from scratch, the Data Mining Client for Excel offers several options for customization:

- Create a data mining model manually, by clicking Advanced and then clicking Add Model to Structure.
- Use any of the modeling wizards in the Data Mining Client, and click Parameters to control the behavior of the Microsoft data mining algorithms.
- Click Query to open the Query Model wizard, and then click Advanced to open the Data Mining Advanced Query Editor. In this editor, you can build models by using DMX templates; a sketch of such a statement follows this list.

You can also modify the behavior of mining models that have already been created, or filter their results, by setting parameters in the mining model viewer.
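In DMX, algorithm parameters are supplied in the USING clause of a CREATE MINING MODEL statement, which is the kind of statement the DMX templates in the Advanced Query Editor help you build. The following is a minimal sketch of that pattern; the model name, columns, and parameter values are purely illustrative and should be adapted to your own data.

```
CREATE MINING MODEL [BikeBuyerTree]      -- hypothetical model name
(
    [Customer Key]  LONG   KEY,
    [Age]           LONG   CONTINUOUS,
    [Yearly Income] DOUBLE CONTINUOUS,
    [Bike Buyer]    LONG   DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
(
    COMPLEXITY_PENALTY = 0.9,   -- higher values discourage splits
    MINIMUM_SUPPORT = 20,       -- require at least 20 leaf cases per split
    SPLIT_METHOD = 3            -- allow both binary and complete splits
)
```

Any parameter listed in the following table can be supplied in the same way for the algorithm it applies to.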
List of Algorithm Parameters
All Microsoft algorithms can be customized by setting parameters. Because the best parameter settings depend on the composition of your data, a full explanation of the effects of changing parameters is beyond the scope of this topic.
The following table lists the parameters, describes their functionality, and provides links to more technical information.
Parameter name | Used in | Description
---|---|---
AUTO_DETECT_PERIODICITY | Microsoft Time Series Algorithm | Specifies a numeric value between 0 and 1 that is used to detect periodicity. Setting this value closer to 1 favors the discovery of many near-periodic patterns and the automatic generation of periodicity hints. Dealing with many periodicity hints will likely lead to significantly longer model training times, but more accurate models. If the value is closer to 0, periodicity is detected only for strongly periodic data. The default is 0.6.
CLUSTER_COUNT | Microsoft Clustering Algorithm<br />Microsoft Sequence Clustering Algorithm | Specifies the approximate number of clusters to be built by the algorithm. If the approximate number of clusters cannot be built from the data, the algorithm builds as many clusters as possible. Setting CLUSTER_COUNT to 0 causes the algorithm to use heuristics to best determine the number of clusters to build. The default is 10.
CLUSTER_SEED | Microsoft Clustering Algorithm | Specifies the seed number that is used to randomly generate clusters for the initial stage of model building. The default is 0.
CLUSTERING_METHOD | Microsoft Clustering Algorithm | Specifies the clustering method for the algorithm to use. The following clustering methods are available: scalable EM (1), non-scalable EM (2), scalable K-Means (3), and non-scalable K-Means (4). The default is 1.
COMPLEXITY_PENALTY | Microsoft Decision Trees Algorithm<br />Microsoft Time Series Algorithm | Controls the growth of the decision tree. A low value increases the number of splits, and a high value decreases the number of splits. The default value is based on the number of attributes for a particular model: for 1 through 9 attributes, the default is 0.5; for 10 through 99 attributes, the default is 0.9; for 100 or more attributes, the default is 0.99. Note: In time series models, this parameter applies only to models that are built by using the ARTxp algorithm, or to mixed models.
FORCED_REGRESSOR | Microsoft Decision Trees Algorithm<br />Microsoft Linear Regression Algorithm | Forces the algorithm to use the indicated columns as regressors, regardless of the importance of the columns as calculated by the algorithm. Note: This parameter is only used for decision trees that predict a continuous attribute. By definition, a linear regression model is a special case of a decision tree that predicts continuous attributes. However, any decision tree model can contain a node that represents a linear regression formula.
FORECAST_METHOD | Microsoft Time Series Algorithm | Indicates whether predictions should be made by using the ARTxp algorithm, the ARIMA algorithm, or a combination of both. The default is MIXED.
HIDDEN_NODE_RATIO | Microsoft Neural Network Algorithm | Specifies the ratio of hidden neurons to input and output neurons. The following formula determines the initial number of neurons in the hidden layer: HIDDEN_NODE_RATIO * SQRT(Total input neurons * Total output neurons). The default value is 4.0.
HISTORIC_MODEL_COUNT | Microsoft Time Series Algorithm | Specifies the number of historic models that will be built. The default is 1.
HISTORICAL_MODEL_GAP | Microsoft Time Series Algorithm | Specifies the time lag between two consecutive historic models. For example, setting this value to g causes historic models to be built for data that is truncated by time slices at intervals of g, 2*g, 3*g, and so on. The default is 10.
HOLDOUT_PERCENTAGE | Microsoft Logistic Regression Algorithm<br />Microsoft Neural Network Algorithm | Specifies the percentage of cases within the training data used to calculate the holdout error, which is used as part of the stopping criteria while training the mining model. The default value is 30. Note: This parameter is different from the holdout percentage value that applies to a mining structure.
HOLDOUT_SEED | Microsoft Logistic Regression Algorithm<br />Microsoft Neural Network Algorithm | Specifies a number that is used to seed the pseudo-random generator when the algorithm randomly determines the holdout data. If this parameter is set to 0, the algorithm generates the seed based on the name of the mining model, to guarantee that the model content remains the same during reprocessing. The default value is 0. Note: This parameter is different from the holdout seed value that applies to a mining structure.
INSTABILITY_SENSITIVITY | Microsoft Time Series Algorithm | Controls the point at which prediction variance exceeds a certain threshold, after which the ARTxp algorithm suppresses predictions. The default value is 1. Note: This parameter applies only to mixed models or models that use the ARTxp algorithm.
MAXIMUM_INPUT_ATTRIBUTES | Microsoft Clustering Algorithm<br />Microsoft Decision Trees Algorithm<br />Microsoft Linear Regression Algorithm<br />Microsoft Naïve Bayes Algorithm<br />Microsoft Neural Network Algorithm<br />Microsoft Logistic Regression Algorithm | Defines the number of input attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection. The default is 255.
MAXIMUM_ITEMSET_COUNT | Microsoft Association Algorithm | Specifies the maximum number of itemsets to produce. If no number is specified, the algorithm generates all possible itemsets. The default is 200000.
MAXIMUM_ITEMSET_SIZE | Microsoft Association Algorithm | Specifies the maximum number of items that are allowed in an itemset. Setting this value to 0 specifies that there is no limit to the size of the itemset. The default is 3.
MAXIMUM_OUTPUT_ATTRIBUTES | Microsoft Decision Trees Algorithm<br />Microsoft Linear Regression Algorithm<br />Microsoft Logistic Regression Algorithm<br />Microsoft Naïve Bayes Algorithm<br />Microsoft Neural Network Algorithm | Defines the number of output attributes that the algorithm can handle before it invokes feature selection. Set this value to 0 to turn off feature selection. The default is 255.
MAXIMUM_SEQUENCE_STATES | Microsoft Sequence Clustering Algorithm | Specifies the maximum number of states that a sequence can have. Setting this value to a number greater than 100 may cause the algorithm to create a model that does not provide meaningful information. The default is 64.
MAXIMUM_SERIES_VALUE | Microsoft Time Series Algorithm | Specifies the maximum value to use for predictions. This parameter is used, together with MINIMUM_SERIES_VALUE, to constrain predictions to an expected range. For example, you can specify that the predicted sales quantity for any day should never exceed the number of products in inventory.
MAXIMUM_STATES | Microsoft Clustering Algorithm<br />Microsoft Neural Network Algorithm<br />Microsoft Sequence Clustering Algorithm | Specifies the maximum number of attribute states that the algorithm supports. If the number of states that an attribute has is larger than the maximum, the algorithm uses the attribute's most popular states and ignores the remaining states. The default is 100.
MAXIMUM_SUPPORT | Microsoft Association Algorithm | Specifies the maximum number of cases in which an itemset can have support. If this value is less than 1, the value represents a percentage of the total cases. If this value is greater than 1, the value represents the absolute number of cases that can contain the itemset. The default is 1.
MINIMUM_IMPORTANCE | Microsoft Association Algorithm | Specifies the importance threshold for association rules. Rules with importance less than this value are filtered out.
MINIMUM_ITEMSET_SIZE | Microsoft Association Algorithm | Specifies the minimum number of items that are allowed in an itemset. The default is 1.
MINIMUM_DEPENDENCY_PROBABILITY | Microsoft Naïve Bayes Algorithm | Specifies the minimum dependency probability between input and output attributes. This value is used to limit the size of the content that is generated by the algorithm. This property can be set from 0 to 1. Larger values reduce the number of attributes in the content of the model. The default is 0.5.
MINIMUM_PROBABILITY | Microsoft Association Algorithm | Specifies the minimum probability that a rule is true. For example, setting this value to 0.5 specifies that no rule with less than 50 percent probability is generated. The default is 0.4.
MINIMUM_SERIES_VALUE | Microsoft Time Series Algorithm | Specifies the lower constraint for any time series prediction. Predicted values will never be smaller than this constraint.
MINIMUM_SUPPORT | Microsoft Association Algorithm | Specifies the minimum number of cases that must contain the itemset before the algorithm generates a rule. Setting this value to less than 1 specifies the minimum number of cases as a percentage of the total cases. Setting this value to a whole number greater than 1 specifies the minimum number of cases as the absolute number of cases that must contain the itemset. The algorithm may increase the value of this parameter if memory is limited. The default is 0.03.
MINIMUM_SUPPORT | Microsoft Clustering Algorithm | Specifies the minimum number of cases in each cluster. The default is 1.
MINIMUM_SUPPORT | Microsoft Decision Trees Algorithm | Determines the minimum number of leaf cases that is required to generate a split in the decision tree. The default is 10.
MINIMUM_SUPPORT | Microsoft Sequence Clustering Algorithm | Specifies the minimum number of cases in each cluster. The default is 10.
MINIMUM_SUPPORT | Microsoft Time Series Algorithm | Specifies the minimum number of time slices that are required to generate a split in each time series tree. The default is 10.
MISSING_VALUE_SUBSTITUTION | Microsoft Time Series Algorithm | Specifies the method that is used to fill the gaps in historical data. By default, irregular gaps or ragged edges in data are not allowed. The following methods can be used to fill in irregular gaps or edges: use the previous value, use the mean value, or use a specific numeric constant.
MODELLING_CARDINALITY | Microsoft Clustering Algorithm | Specifies the number of sample models that are constructed during the clustering process. The default is 10.
PERIODICITY_HINT | Microsoft Time Series Algorithm | Provides a hint to the algorithm as to the periodicity of the data. For example, if sales vary by year, and the unit of measurement in the series is months, the periodicity is 12. This parameter takes the format of {n [, n]}, where n is any positive number. The n within the brackets [] is optional and can be repeated as frequently as needed. The default is {1}.
PREDICTION_SMOOTHING | Microsoft Time Series Algorithm | Controls the blend of the ARTxp and ARIMA time series algorithms. The specified value is only valid when the FORECAST_METHOD parameter is set to MIXED. Values must be between 0 and 1. If the value is 0, the model uses only ARTxp; if the value is 1, the model uses only ARIMA. Values closer to 0 weight the blend toward ARTxp, and values closer to 1 weight it toward ARIMA.
SAMPLE_SIZE | Microsoft Clustering Algorithm | Specifies the number of cases that the algorithm uses on each pass if the CLUSTERING_METHOD parameter is set to one of the scalable clustering methods. Setting the SAMPLE_SIZE parameter to 0 causes the whole dataset to be clustered in a single pass, which can cause memory and performance issues. The default is 50000.
SAMPLE_SIZE | Microsoft Logistic Regression Algorithm<br />Microsoft Neural Network Algorithm | Specifies the number of cases to be used to train the model. The algorithm uses either this number or the number of cases that are not included in the holdout set, as specified by the HOLDOUT_PERCENTAGE parameter, whichever value is smaller. In other words, if HOLDOUT_PERCENTAGE is set to 30, the algorithm uses either the value of this parameter, or a value that is equal to 70 percent of the total number of cases, whichever is smaller. The default is 10000.
SCORE_METHOD | Microsoft Decision Trees Algorithm | Determines the method that is used to calculate the split score. The following options are available: Entropy (1), Bayesian with K2 Prior (3), or Bayesian Dirichlet Equivalent (BDE) Prior (4). The default is 4.
SPLIT_METHOD | Microsoft Decision Trees Algorithm | Determines the method that is used to split the node. The following options are available: Binary (1), Complete (2), or Both (3). The default is 3.
STOPPING_TOLERANCE | Microsoft Clustering Algorithm | Specifies the value that is used to determine when convergence is reached and the algorithm is finished building the model. Convergence is reached when the overall change in cluster probabilities is less than the STOPPING_TOLERANCE parameter divided by the size of the model. The default is 10.
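To illustrate how several of these parameters work together, the following DMX sketch defines a hypothetical forecasting model; the column layout and parameter values are examples only, and the right settings depend on your data.

```
CREATE MINING MODEL [MonthlySalesForecast]   -- hypothetical model name
(
    [Reporting Month] DATE KEY TIME,
    [Region]          TEXT KEY,
    [Quantity]        LONG CONTINUOUS PREDICT
)
USING Microsoft_Time_Series
(
    FORECAST_METHOD = 'MIXED',            -- combine ARTxp and ARIMA
    PREDICTION_SMOOTHING = 0.7,           -- weight the blend toward ARIMA
    PERIODICITY_HINT = '{12}',            -- monthly data with a yearly sales cycle
    MISSING_VALUE_SUBSTITUTION = 'Mean',  -- fill gaps with the series mean
    MINIMUM_SERIES_VALUE = 0              -- never predict a negative quantity
)
```

Here PREDICTION_SMOOTHING has an effect only because FORECAST_METHOD is MIXED, and MINIMUM_SERIES_VALUE keeps the forecasts within a sensible range, as described in the table above.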
Comments
For additional detail about the algorithms, see SQL Server Books Online.