FastForestBinaryClassifier Class
Machine Learning Fast Forest
- Inheritance
  - nimbusml.internal.core.ensemble._fastforestbinaryclassifier.FastForestBinaryClassifier
  - nimbusml.base_predictor.BasePredictor
  - sklearn.base.ClassifierMixin
Constructor
```python
FastForestBinaryClassifier(number_of_trees=100, number_of_leaves=20, minimum_example_count_per_leaf=10, normalize='Auto', caching='Auto', maximum_output_magnitude_per_tree=100.0, number_of_quantile_samples=100, parallel_trainer=None, number_of_threads=None, random_state=123, feature_selection_seed=123, entropy_coefficient=0.0, histogram_pool_size=-1, disk_transpose=None, feature_flocks=True, categorical_split=False, maximum_categorical_group_count_per_node=64, maximum_categorical_split_point_count=64, minimum_example_fraction_for_categorical_split=0.001, minimum_examples_for_categorical_split=100, bias=0.0, bundling='None', maximum_bin_count_per_feature=255, sparsify_threshold=0.7, first_use_penalty=0.0, feature_reuse_penalty=0.0, gain_conf_level=0.0, softmax_temperature=0.0, execution_time=False, feature_fraction=0.7, bagging_size=1, bagging_example_fraction=0.7, feature_fraction_per_split=0.7, smoothing=0.0, allow_empty_trees=True, feature_compression_level=1, compress_ensemble=False, test_frequency=2147483647, feature=None, group_id=None, label=None, weight=None, **params)
```
Parameters
Name | Description |
---|---|
feature | See Columns. |
group_id | See Columns. |
label | See Columns. |
weight | See Columns. |
number_of_trees | Specifies the total number of decision trees to create in the ensemble. Creating more decision trees can potentially improve coverage, but increases training time. |
number_of_leaves | The maximum number of leaves (terminal nodes) that can be created in any tree. Higher values potentially increase the size of the tree and improve precision, but risk overfitting and longer training times. |
minimum_example_count_per_leaf | Minimum number of training instances required to form a leaf; that is, the minimal number of documents allowed in a leaf of a regression tree, out of the sub-sampled data. A 'split' means that features in each level of the tree (node) are randomly divided. |
normalize | Whether to normalize features automatically: 'Auto' (the default, letting the trainer decide), 'No', 'Yes', or 'Warn'. |
caching | Whether the trainer should cache the input training data. |
maximum_output_magnitude_per_tree | Upper bound on the absolute value of a single tree's output. |
number_of_quantile_samples | Number of labels to be sampled from each leaf to make the distribution. |
parallel_trainer | Allows choosing a parallel FastTree learning algorithm. |
number_of_threads | The number of threads to use. |
random_state | The seed of the random number generator. |
feature_selection_seed | The seed of the active feature selection. |
entropy_coefficient | The entropy (regularization) coefficient between 0 and 1. |
histogram_pool_size | The number of histograms in the pool (between 2 and numLeaves). |
disk_transpose | Whether to utilize the disk or the data's native transposition facilities (where applicable) when performing the transpose. |
feature_flocks | Whether to collectivize features during dataset preparation to speed up training. |
categorical_split | Whether to split based on multiple categorical feature values. |
maximum_categorical_group_count_per_node | Maximum categorical split groups to consider when splitting on a categorical feature. Split groups are a collection of split points. This is used to reduce overfitting when there are many categorical features. |
maximum_categorical_split_point_count | Maximum categorical split points to consider when splitting on a categorical feature. |
minimum_example_fraction_for_categorical_split | Minimum categorical example percentage in a bin to consider for a split. |
minimum_examples_for_categorical_split | Minimum categorical example count in a bin to consider for a split. |
bias | Bias for calculating the gradient for each feature bin for a categorical feature. |
bundling | Bundles low-population bins. Bundle.None (0): no bundling; Bundle.AggregateLowPopulation (1): bundle low-population bins; Bundle.Adjacent (2): bundle neighboring low-population bins. |
maximum_bin_count_per_feature | Maximum number of distinct values (bins) per feature. |
sparsify_threshold | Sparsity level needed to use a sparse feature representation. |
first_use_penalty | The feature first-use penalty coefficient. This is a form of regularization that incurs a penalty for using a new feature when creating the tree. Increase this value to create trees that don't use many features. |
feature_reuse_penalty | The feature re-use penalty (regularization) coefficient. |
gain_conf_level | Tree fitting gain confidence requirement (should be in the range [0,1)). |
softmax_temperature | The temperature of the randomized softmax distribution for choosing the feature. |
execution_time | Print the execution time breakdown to stdout. |
feature_fraction | The fraction of features (chosen randomly) to use on each iteration. |
bagging_size | Number of trees in each bag (0 to disable bagging). |
bagging_example_fraction | Percentage of training examples used in each bag. |
feature_fraction_per_split | The fraction of features (chosen randomly) to use on each split. |
smoothing | Smoothing parameter for tree regularization. |
allow_empty_trees | When a root split is impossible, allow training to proceed. |
feature_compression_level | The level of feature compression to use. |
compress_ensemble | Compress the tree ensemble. |
test_frequency | Calculate metric values for train/valid/test every k rounds. |
params | Additional arguments sent to the compute engine. |
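Before the full pipeline example in the next section, the sketch below constructs the classifier with a few of the commonly tuned tree-shape and sampling parameters. The values are illustrative choices, not recommended defaults; the column roles ('age', 'edu', 'induced', 'case') match the example that follows.

```python
from nimbusml.ensemble import FastForestBinaryClassifier

# Illustrative hyperparameter choices (not tuned recommendations):
ff = FastForestBinaryClassifier(
    number_of_trees=500,                # more trees: better coverage, slower training
    number_of_leaves=16,                # caps tree size to limit overfitting
    minimum_example_count_per_leaf=25,  # each leaf must cover at least 25 examples
    feature_fraction=0.7,               # random feature subset per iteration
    bagging_size=1,                     # one tree per bag (bagging enabled)
    bagging_example_fraction=0.7,       # each bag samples 70% of the data
    random_state=123,                   # reproducible runs
    feature=['age', 'edu', 'induced'],
    label='case')
```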
Examples
```python
###############################################################################
# FastForestBinaryClassifier
import numpy
from nimbusml import Pipeline, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.ensemble import FastForestBinaryClassifier
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(path, sep=',',
                               numeric_dtype=numpy.float32,
                               names={0: 'row_num', 5: 'case'})
print(data.head())
#     age  case education  induced  parity  pooled.stratum  row_num  ...
# 0  26.0   1.0    0-5yrs      1.0     6.0             3.0      1.0  ...
# 1  42.0   1.0    0-5yrs      1.0     1.0             1.0      2.0  ...
# 2  39.0   1.0    0-5yrs      2.0     6.0             4.0      3.0  ...
# 3  34.0   1.0    0-5yrs      2.0     4.0             2.0      4.0  ...
# 4  35.0   1.0   6-11yrs      1.0     3.0            32.0      5.0  ...

# define the training pipeline
pipeline = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    FastForestBinaryClassifier(feature=['age', 'edu', 'induced'],
                               label='case')
])

# train, predict, and evaluate
metrics, predictions = pipeline.fit(data).test(data, output_scores=True)

# print predictions
print(predictions.head())
#    PredictedLabel      Score
# 0             0.0 -26.985743
# 1             0.0 -26.562090
# 2             0.0 -24.832508
# 3             0.0 -23.799389
# 4             0.0 -19.612534

# print evaluation metrics
print(metrics)
#         AUC  Accuracy  Positive precision  Positive recall  ...
# 0  0.655714  0.665323                   0                0  ...
```
Remarks
Decision trees are non-parametric models that perform a sequence of simple tests on an input, mapping it to outputs found in the training dataset whose inputs were similar to the instance being processed. A decision is made at each node of the binary tree data structure based on a measure of similarity, mapping each instance recursively through the branches of the tree until the appropriate leaf node is reached and its output decision is returned.
Decision trees have several advantages:

- They are efficient in both computation and memory usage during training and prediction.
- They can represent non-linear decision boundaries.
- They perform integrated feature selection and classification.
- They are resilient in the presence of noisy features.
The fast forest classifier is a random forest and quantile regression forest implementation that uses the tree learner from FastTreesBinaryClassifier. The model consists of an ensemble of decision trees.
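The per-tree decision procedure described above can be sketched in a few lines of plain Python. This is a toy illustration of how an instance is routed through threshold tests to a leaf and how a forest combines per-tree outputs; it is not nimbusml's internal representation.

```python
# Toy decision tree: internal nodes test one feature against a threshold;
# leaves carry the output decision. Illustrative only.
def predict_tree(node, x):
    while 'output' not in node:  # descend until a leaf is reached
        go_left = x[node['feature']] <= node['threshold']
        node = node['left'] if go_left else node['right']
    return node['output']

# A forest averages the outputs of its trees over the ensemble.
def predict_forest(trees, x):
    return sum(predict_tree(t, x) for t in trees) / len(trees)

stump = {'feature': 'age', 'threshold': 30.0,
         'left': {'output': 0.0}, 'right': {'output': 1.0}}
print(predict_forest([stump], {'age': 26.0}))  # -> 0.0
```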
Reference
From Stumps to Trees to Forests
Methods
Name | Description |
---|---|
decision_function | Returns score values. |
get_params | Get the parameters for this operator. |
predict_proba | Returns probabilities. |
decision_function
Returns score values
decision_function(X, **params)
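A minimal usage sketch, assuming the fitted `pipeline` and the `data` stream from the Examples section above:

```python
# Raw (uncalibrated) per-row scores from the fitted pipeline; use
# predict_proba for calibrated class probabilities.
scores = pipeline.decision_function(data)
print(scores[:5])
```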
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | Default value: False |
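For example, assuming get_params follows the scikit-learn convention of returning a dict keyed by constructor argument name:

```python
from nimbusml.ensemble import FastForestBinaryClassifier

# Inspect the operator's configuration as a parameter dict.
ff = FastForestBinaryClassifier(number_of_trees=200)
print(ff.get_params()['number_of_trees'])  # expected: 200
```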
predict_proba
Returns probabilities
predict_proba(X, **params)
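A usage sketch, again assuming the fitted `pipeline` and the `data` stream from the Examples section:

```python
# Per-class membership probabilities, one column per class.
probs = pipeline.predict_proba(data)
print(probs[:5])
```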