EnsembleClassifier Class
Description: Trains a multiclass ensemble model.
Inheritance
- nimbusml.internal.core.ensemble._ensembleclassifier.EnsembleClassifier -> EnsembleClassifier
- nimbusml.base_predictor.BasePredictor -> EnsembleClassifier
- sklearn.base.ClassifierMixin -> EnsembleClassifier
Constructor
EnsembleClassifier(sampling_type={'Name': 'BootstrapSelector', 'Settings': {'FeatureSelector': {'Name': 'AllFeatureSelector', 'Settings': {}}}}, num_models=None, sub_model_selector_type=None, output_combiner=None, normalize='Auto', caching='Auto', train_parallel=False, batch_size=-1, show_metrics=False, feature=None, label=None, **params)
Parameters
Name | Description |
---|---|
feature | See Columns. |
label | See Columns. |
sampling_type | Specifies how the training samples are created: Each of these Subset Selectors has two options for selecting features: |
num_models | Indicates the number of models to train, i.e. the number of subsets of the training set to sample. The default value is 50. If batches are used, this indicates the number of models per batch. |
sub_model_selector_type | Determines the efficient set of models the |
output_combiner | Indicates how to combine the predictions of the different models into a single prediction. There are five available output combiners for classification: outputs of the trained models, weighted by the specified metric. The metric can be |
normalize | Specifies the type of automatic normalization used: Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures the distances between data points are proportional and enables various optimization methods such as gradient descent to converge much faster. If normalization is performed, a |
caching | Whether the trainer should cache the input training data. |
train_parallel | If True, all the base learners run asynchronously. |
batch_size | Train the models iteratively on subsets of the training set of this size. When using this option, it is assumed that the training set is randomized enough so that every batch is a random sample of instances. The default value is -1, which means the whole training set is used. If the value is changed to an integer greater than 0, the number of trained models is the number of batches (the size of the training set divided by the batch size), times `num_models`. |
show_metrics | If True, metrics for each model are evaluated and shown in a comparison table. The validation set is used if available; otherwise the training set is used. |
params | Additional arguments sent to the compute engine. |
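The `num_models` and `batch_size` descriptions together imply a simple count of how many base models end up being trained. The following is a minimal sketch of that arithmetic only, not nimbusml code: `total_trained_models` is a hypothetical helper name, and rounding a partial final batch up to a full batch is an assumption, since the parameter text does not say how a remainder is handled.

```python
import math


def total_trained_models(train_size: int, batch_size: int, num_models: int) -> int:
    """Estimate how many base models get trained (hypothetical helper)."""
    if batch_size <= 0:
        # batch_size=-1 (the default) means the whole training set is used,
        # so only num_models models are trained. Treating 0 the same way
        # is an assumption made for this sketch.
        return num_models
    # Assumption: a partial final batch still counts as one batch.
    num_batches = math.ceil(train_size / batch_size)
    # num_models is the number of models per batch when batches are used.
    return num_batches * num_models


print(total_trained_models(train_size=1000, batch_size=-1, num_models=3))
print(total_trained_models(train_size=1000, batch_size=100, num_models=3))
```

With 1000 rows, a batch size of 100 and 3 models per batch, the sketch counts 10 batches and therefore 30 models, compared with 3 models when no batching is used.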
Examples
###############################################################################
# EnsembleClassifier
from nimbusml import Pipeline, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.categorical import OneHotVectorizer
from nimbusml.ensemble import EnsembleClassifier
from nimbusml.ensemble.feature_selector import RandomFeatureSelector
from nimbusml.ensemble.output_combiner import ClassifierVoting
from nimbusml.ensemble.subset_selector import RandomPartitionSelector
from nimbusml.ensemble.sub_model_selector import ClassifierBestDiverseSelector
# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(path)
print(data.head())
# age case education induced parity ... row_num spontaneous ...
# 0 26 1 0-5yrs 1 6 ... 1 2 ...
# 1 42 1 0-5yrs 1 1 ... 2 0 ...
# 2 39 1 0-5yrs 2 6 ... 3 0 ...
# 3 34 1 0-5yrs 2 4 ... 4 0 ...
# 4 35 1 6-11yrs 1 3 ... 5 1 ...
# define the training pipeline using default sampling and ensembling parameters
pipeline_with_defaults = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    EnsembleClassifier(feature=['age', 'edu', 'parity'],
                       label='induced',
                       num_models=3)
])
# train, predict, and evaluate
metrics, predictions = pipeline_with_defaults.fit(data).test(data, output_scores=True)
# print predictions
print(predictions.head())
# PredictedLabel Score.0 Score.1 Score.2
# 0 2 0.202721 0.186598 0.628115
# 1 0 0.716737 0.190289 0.092974
# 2 2 0.201026 0.185602 0.624761
# 3 0 0.423328 0.235074 0.365649
# 4 0 0.577509 0.220827 0.201664
# print evaluation metrics
print(metrics)
# Accuracy(micro-avg) Accuracy(macro-avg) Log-loss ... (class 0) ...
# 0 0.612903 0.417519 0.846467 ... 0.504007 ...
# (class 1) (class 2)
# 1.244033 1.439364
# define the training pipeline with specific sampling and ensembling options
pipeline_with_options = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    EnsembleClassifier(feature=['age', 'edu', 'parity'],
                       label='induced',
                       num_models=3,
                       sampling_type=RandomPartitionSelector(
                           feature_selector=RandomFeatureSelector(
                               features_selction_proportion=0.7)),
                       sub_model_selector_type=ClassifierBestDiverseSelector(),
                       output_combiner=ClassifierVoting())
])
# train, predict, and evaluate
metrics, predictions = pipeline_with_options.fit(data).test(data, output_scores=True)
# print predictions
print(predictions.head())
# PredictedLabel Score.0 Score.1 Score.2
# 0 2 0.0 0.0 1.0
# 1 0 1.0 0.0 0.0
# 2 2 0.0 0.0 1.0
# 3 0 1.0 0.0 0.0
# 4 0 1.0 0.0 0.0
# print evaluation metrics
# note that accuracy metrics are lower than with defaults as this is a small
# dataset that we partition into 3 chunks for each classifier, which decreases
# model quality.
print(metrics)
# Accuracy(micro-avg) Accuracy(macro-avg) Log-loss ... (class 0) ...
# 0 0.596774 0.38352 13.926926 ... 0.48306 ...
# (class 1) (class 2)
# 33.52293 29.871374
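In both printed prediction tables, `PredictedLabel` coincides with the index of the largest `Score.k` column. The sketch below checks this on the five rows copied from the first table. It is only an observation about the sample output shown here, not a statement about how nimbusml computes `PredictedLabel` internally, and `argmax` is a helper defined just for this illustration.

```python
# Rows copied from the first predictions table:
# (PredictedLabel, [Score.0, Score.1, Score.2])
rows = [
    (2, [0.202721, 0.186598, 0.628115]),
    (0, [0.716737, 0.190289, 0.092974]),
    (2, [0.201026, 0.185602, 0.624761]),
    (0, [0.423328, 0.235074, 0.365649]),
    (0, [0.577509, 0.220827, 0.201664]),
]


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=lambda k: scores[k])


# Compare the printed label with the position of the highest score in each row.
matches = [argmax(scores) == label for label, scores in rows]
print(matches)
```

The comparison works directly here because the labels in this dataset are the integers 0, 1 and 2, which line up with the score column indices.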
Remarks
An Ensemble is a set of models, each trained on a sample of the training set. Training an ensemble instead of a single model can boost the accuracy of a given algorithm.
The quality of an Ensemble depends on two factors: accuracy and diversity. An Ensemble is analogous to teamwork: if every team member is diverse and competent, the team can perform very well. Here a team member is a base learner and the team is the Ensemble. In the case of classification ensembles, the base learner is a LogisticRegressionClassifier.
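To make the idea of "a set of models, each trained on a sample of the training set" concrete, here is a minimal pure-Python sketch: each base learner is trained on a bootstrap sample and the predictions are combined by majority vote. This is not how nimbusml implements `EnsembleClassifier`. The base learner is a toy nearest-mean classifier standing in for LogisticRegressionClassifier, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_nearest_mean(rows):
    """Toy base learner: remember the mean feature value of each class."""
    sums, counts = Counter(), Counter()
    for x, y in rows:
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in counts}
    # Predict the class whose mean is closest to the input value.
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def majority_vote(votes):
    """Combine predictions by picking the most common label."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
# (feature value, class label) pairs for a tiny two-class problem.
data = [(0.1, 0), (0.3, 0), (0.2, 0), (0.9, 1), (1.1, 1), (1.0, 1)]

# Train five base learners, each on its own bootstrap sample.
models = [train_nearest_mean(bootstrap_sample(data, rng)) for _ in range(5)]

# Each model votes, and the ensemble returns the most common vote.
prediction = majority_vote([model(0.95) for model in models])
print(prediction)
```

Because every model sees a different resampling of the data, the individual learners can disagree, which is where the diversity mentioned above comes from; the vote then smooths over individual mistakes.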
Methods
Name | Description |
---|---|
decision_function | Returns score values. |
get_params | Get the parameters for this operator. |
predict_proba | Returns probabilities. |
decision_function
Returns score values
decision_function(X, **params)
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | Default value: False |
predict_proba
Returns probabilities
predict_proba(X, **params)