EnsembleRegressor Class

Description

Trains a regression ensemble model.

Constructor

EnsembleRegressor(sampling_type={'Name': 'BootstrapSelector', 'Settings': {'FeatureSelector': {'Name': 'AllFeatureSelector', 'Settings': {}}}}, num_models=None, sub_model_selector_type=None, output_combiner=None, normalize='Auto', caching='Auto', train_parallel=False, batch_size=-1, show_metrics=False, feature=None, label=None, **params)

Parameters

Name	Description
feature	see Columns.
label	see Columns.
sampling_type	Specifies how the training samples are created: `BootstrapSelector`: takes a bootstrap sample of the training set (sampling with replacement). This is the default method. `RandomPartitionSelector`: randomly partitions the training set into subsets. `AllSelector`: every model is trained using the whole training set. Each of these Subset Selectors has two options for selecting features: `AllFeatureSelector`: selects all the features. This is the default method. `RandomFeatureSelector`: selects a random subset of the features for each model.
num_models	Indicates the number of models to train, i.e. the number of subsets of the training set to sample. The default value is 50. If batches are used, this is the number of models per batch.
sub_model_selector_type	Selects the subset of models that the `output_combiner` uses and removes the least significant models. This process, also called pruning, is used to improve accuracy and reduce the model size. `RegressorAllSelector`: does not perform any pruning and selects all models in the ensemble to combine to create the output. This is the default submodel selector. `RegressorBestDiverseSelector`: combines models whose predictions are as diverse as possible. Currently, only disagreement diversity is supported. `RegressorBestPerformanceSelector`: combines only the models with the best performance according to the specified metric. The metric can be `"L1"`, `"L2"`, `"Rms"`, `"Loss"`, or `"RSquared"`.
output_combiner	Indicates how to combine the predictions of the different models into a single prediction. The output combiners available for regression include: `RegressorAverage`: computes the average of the scores produced by the trained models. `RegressorMedian`: computes the median of the scores produced by the trained models. `RegressorStacking`: computes the output by training a model on a training set where each instance is a vector containing the outputs of the different models on a training instance, together with the instance's label.
normalize	Specifies the type of automatic normalization used: `"Auto"`: if normalization is needed, it is performed automatically. This is the default choice. `"No"`: no normalization is performed. `"Yes"`: normalization is performed. `"Warn"`: if normalization is needed, a warning message is displayed, but normalization is not performed. Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures the distances between data points are proportional and enables various optimization methods such as gradient descent to converge much faster. If normalization is performed, a `MinMax` normalizer is used. It normalizes values in an interval [a, b] where `-1 <= a <= 0` and `0 <= b <= 1` and `b - a = 1`. This normalizer preserves sparsity by mapping zero to zero.
caching	Whether trainer should cache input training data.
train_parallel	All the base learners will run asynchronously if the value is true.
batch_size	Train the models iteratively on subsets of the training set of this size. When using this option, it is assumed that the training set is randomized enough so that every batch is a random sample of instances. The default value is -1, indicating using the whole training set. If the value is changed to an integer greater than 0, the number of trained models is the number of batches (the size of the training set divided by the batch size), times `num_models`.
show_metrics	Set to True to evaluate the metrics for each model and show them in a comparison table. The metrics are computed on the validation set if one is available, otherwise on the training set.
params	Additional arguments sent to compute engine.
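
Two of the parameter descriptions above involve small computations: the zero-preserving `MinMax` scaling described under `normalize`, and the model count described under `batch_size`. The sketch below illustrates both in plain Python. It is not the library's implementation. The function names are hypothetical, the scaling shown is just one mapping that satisfies the stated constraints (`-1 <= a <= 0`, `0 <= b <= 1`, `b - a = 1`, zero maps to zero), and rounding a partial batch up is an assumption, since the description does not say how a remainder is handled.

```python
import math


def min_max_scale_keep_zero(values):
    """Scale values into [a, b] with b - a = 1 while mapping 0 to 0.

    Hypothetical helper: extends the observed range to include zero and
    divides by its width, so no offset is applied and zeros stay zero.
    """
    low = min(min(values), 0.0)
    high = max(max(values), 0.0)
    span = high - low
    if span == 0:  # every value is zero, nothing to scale
        return list(values)
    return [v / span for v in values]


def total_models(train_size, num_models, batch_size=-1):
    """Number of trained models: number of batches times num_models.

    Assumption: a partial final batch counts as one batch (ceiling), and
    any non-positive batch_size means the whole training set is used.
    """
    if batch_size <= 0:
        return num_models
    num_batches = math.ceil(train_size / batch_size)
    return num_batches * num_models


scaled = min_max_scale_keep_zero([-2.0, 0.0, 6.0])
print(scaled)  # a = -0.25, b = 0.75, so b - a = 1

models_trained = total_models(train_size=1000, num_models=50, batch_size=250)
print(models_trained)  # 4 batches of 50 models each
```

Because the scaling uses no offset, sparse inputs stay sparse, which is the property the `normalize` description highlights.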

Examples


   ###############################################################################
   # EnsembleRegressor
   from nimbusml import Pipeline, FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   from nimbusml.ensemble import EnsembleRegressor
   from nimbusml.ensemble.feature_selector import RandomFeatureSelector
   from nimbusml.ensemble.output_combiner import RegressorMedian
   from nimbusml.ensemble.subset_selector import RandomPartitionSelector
   from nimbusml.ensemble.sub_model_selector import RegressorBestDiverseSelector

   # data input (as a FileDataStream)
   path = get_dataset('infert').as_filepath()
   data = FileDataStream.read_csv(path)
   print(data.head())
   #   age  case education  induced  parity  ... row_num  spontaneous  ...
   # 0   26     1    0-5yrs        1       6 ...       1            2  ...
   # 1   42     1    0-5yrs        1       1 ...       2            0  ...
   # 2   39     1    0-5yrs        2       6 ...       3            0  ...
   # 3   34     1    0-5yrs        2       4 ...       4            0  ...
   # 4   35     1   6-11yrs        1       3 ...       5            1  ...

   # define the training pipeline using default sampling and ensembling parameters
   pipeline_with_defaults = Pipeline([
       OneHotVectorizer(columns={'edu': 'education'}),
       EnsembleRegressor(feature=['induced', 'edu'], label='age', num_models=3)
   ])

   # train, predict, and evaluate
   metrics, predictions = pipeline_with_defaults.fit(data).test(data, output_scores=True)

   # print predictions
   print(predictions.head())
   #        Score
   # 0  26.046741
   # 1  26.046741
   # 2  29.225840
   # 3  29.225840
   # 4  33.849384

   # print evaluation metrics
   print(metrics)
   #    L1(avg)    L2(avg)  RMS(avg)  Loss-fn(avg)  R Squared
   # 0  4.69884  33.346123   5.77461     33.346124  -0.214011


   # define the training pipeline with specific sampling and ensembling options
   pipeline_with_options = Pipeline([
       OneHotVectorizer(columns={'edu': 'education'}),
       EnsembleRegressor(feature=['induced', 'edu'],
                         label='age',
                         num_models=3,
                         sampling_type=RandomPartitionSelector(
                             feature_selector=RandomFeatureSelector(
                                 features_selction_proportion=0.7)),
                         sub_model_selector_type=RegressorBestDiverseSelector(),
                         output_combiner=RegressorMedian())
   ])

   # train, predict, and evaluate
   metrics, predictions = pipeline_with_options.fit(data).test(data, output_scores=True)

   # print predictions
   print(predictions.head())
   #        Score
   # 0  37.122200
   # 1  37.122200
   # 2  41.296204
   # 3  41.296204
   # 4  33.591423

   # print evaluation metrics
   # note that the converged loss function values are worse than with defaults as
   # this is a small dataset that we partition into 3 chunks for each regressor,
   # which decreases model quality
   print(metrics)
   #     L1(avg)    L2(avg)  RMS(avg)  Loss-fn(avg)  R Squared
   # 0  5.481676  44.924838  6.702599     44.924838   -0.63555

Remarks

An Ensemble is a set of models, each trained on a sample of the training set. Training an ensemble instead of a single model can boost the accuracy of a given algorithm.

The quality of an Ensemble depends on two factors: accuracy and diversity. An Ensemble is analogous to a team: if the team members are both diverse and competent, the team can perform very well. Here a team member is a base learner and the team is the Ensemble. In the case of regression ensembles, the base learner is an OnlineGradientDescentRegressor.
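
To make the idea concrete, here is a minimal sketch of an ensemble in plain Python, assuming bootstrap sampling and either an average or a median combiner. It does not use nimbusml. The function names are hypothetical, and a one-dimensional least-squares line fit stands in for the actual base learner (OnlineGradientDescentRegressor) purely to keep the example short.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in base learner)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate sample: all x equal, fall back to a constant
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_ensemble(xs, ys, num_models, seed=0):
    """Train num_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(num_models):
        # Bootstrap sample: draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x, combiner=statistics.fmean):
    """Combine the per-model scores into a single prediction."""
    scores = [slope * x + intercept for slope, intercept in models]
    return combiner(scores)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free line, so every model agrees
models = train_ensemble(xs, ys, num_models=5)
avg_score = predict(models, 10.0)                                 # average combiner
median_score = predict(models, 10.0, combiner=statistics.median)  # median combiner
print(avg_score, median_score)
```

On noisy data the individual models would differ from one another, and that disagreement is where diversity and the choice of combiner start to matter.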

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name	Description
deep	Default value: False
