FastForestBinaryClassifier Class
Machine Learning Fast Forest
- Inheritance
  - nimbusml.internal.core.ensemble._fastforestbinaryclassifier.FastForestBinaryClassifier
  - nimbusml.base_predictor.BasePredictor
  - sklearn.base.ClassifierMixin
Constructor
```python
FastForestBinaryClassifier(number_of_trees=100, number_of_leaves=20, minimum_example_count_per_leaf=10, normalize='Auto', caching='Auto', maximum_output_magnitude_per_tree=100.0, number_of_quantile_samples=100, parallel_trainer=None, number_of_threads=None, random_state=123, feature_selection_seed=123, entropy_coefficient=0.0, histogram_pool_size=-1, disk_transpose=None, feature_flocks=True, categorical_split=False, maximum_categorical_group_count_per_node=64, maximum_categorical_split_point_count=64, minimum_example_fraction_for_categorical_split=0.001, minimum_examples_for_categorical_split=100, bias=0.0, bundling='None', maximum_bin_count_per_feature=255, sparsify_threshold=0.7, first_use_penalty=0.0, feature_reuse_penalty=0.0, gain_conf_level=0.0, softmax_temperature=0.0, execution_time=False, feature_fraction=0.7, bagging_size=1, bagging_example_fraction=0.7, feature_fraction_per_split=0.7, smoothing=0.0, allow_empty_trees=True, feature_compression_level=1, compress_ensemble=False, test_frequency=2147483647, feature=None, group_id=None, label=None, weight=None, **params)
```
Parameters
Name | Description |
---|---|
feature | See Columns. |
group_id | See Columns. |
label | See Columns. |
weight | See Columns. |
number_of_trees | Specifies the total number of decision trees to create in the ensemble. Creating more decision trees can potentially improve coverage, but increases training time. |
number_of_leaves | The maximum number of leaves (terminal nodes) that can be created in any tree. Higher values potentially increase the size of the tree and improve precision, but risk overfitting and longer training times. |
minimum_example_count_per_leaf | Minimum number of training instances required to form a leaf; that is, the minimal number of documents allowed in a leaf of a regression tree, out of the sub-sampled data. A 'split' means that features in each level of the tree (node) are randomly divided. |
normalize | Whether to normalize features automatically: 'Auto' (the default, letting the trainer decide), 'No', 'Yes', or 'Warn'. |
caching | Whether the trainer should cache the input training data. |
maximum_output_magnitude_per_tree | Upper bound on the absolute value of a single tree's output. |
number_of_quantile_samples | Number of labels to be sampled from each leaf to make the distribution. |
parallel_trainer | Allows choosing a parallel FastTree learning algorithm. |
number_of_threads | The number of threads to use. |
random_state | The seed of the random number generator. |
feature_selection_seed | The seed of the active feature selection. |
entropy_coefficient | The entropy (regularization) coefficient between 0 and 1. |
histogram_pool_size | The number of histograms in the pool (between 2 and numLeaves). |
disk_transpose | Whether to utilize the disk or the data's native transposition facilities (where applicable) when performing the transpose. |
feature_flocks | Whether to collectivize features during dataset preparation to speed up training. |
categorical_split | Whether to split based on multiple categorical feature values. |
maximum_categorical_group_count_per_node | Maximum categorical split groups to consider when splitting on a categorical feature. Split groups are a collection of split points. This is used to reduce overfitting when there are many categorical features. |
maximum_categorical_split_point_count | Maximum categorical split points to consider when splitting on a categorical feature. |
minimum_example_fraction_for_categorical_split | Minimum categorical example percentage in a bin to consider for a split. |
minimum_examples_for_categorical_split | Minimum categorical example count in a bin to consider for a split. |
bias | Bias for calculating the gradient for each feature bin for a categorical feature. |
bundling | Bundles low-population bins. Bundle.None (0): no bundling; Bundle.AggregateLowPopulation (1): bundle low-population bins; Bundle.Adjacent (2): bundle neighboring low-population bins. |
maximum_bin_count_per_feature | Maximum number of distinct values (bins) per feature. |
sparsify_threshold | Sparsity level needed to use a sparse feature representation. |
first_use_penalty | The feature first-use penalty coefficient. This is a form of regularization that incurs a penalty for using a new feature when creating the tree. Increase this value to create trees that don't use many features. |
feature_reuse_penalty | The feature re-use penalty (regularization) coefficient. |
gain_conf_level | Tree fitting gain confidence requirement (should be in the range [0,1)). |
softmax_temperature | The temperature of the randomized softmax distribution for choosing the feature. |
execution_time | Print the execution time breakdown to stdout. |
feature_fraction | The fraction of features (chosen randomly) to use on each iteration. |
bagging_size | Number of trees in each bag (0 to disable bagging). |
bagging_example_fraction | Percentage of training examples used in each bag. |
feature_fraction_per_split | The fraction of features (chosen randomly) to use on each split. |
smoothing | Smoothing parameter for tree regularization. |
allow_empty_trees | When a root split is impossible, allow training to proceed. |
feature_compression_level | The level of feature compression to use. |
compress_ensemble | Compress the tree ensemble. |
test_frequency | Calculate metric values for train/valid/test every k rounds. |
params | Additional arguments sent to the compute engine. |
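Before the full pipeline example in the next section, the sketch below constructs the classifier with a few of the commonly tuned tree-shape and sampling parameters. The values are illustrative choices, not recommended defaults; the column roles ('age', 'edu', 'induced', 'case') match the example that follows.

```python
from nimbusml.ensemble import FastForestBinaryClassifier

# Illustrative hyperparameter choices (not tuned recommendations):
ff = FastForestBinaryClassifier(
    number_of_trees=500,                # more trees: better coverage, slower training
    number_of_leaves=16,                # caps tree size to limit overfitting
    minimum_example_count_per_leaf=25,  # each leaf must cover at least 25 examples
    feature_fraction=0.7,               # random feature subset per iteration
    bagging_size=1,                     # one tree per bag (bagging enabled)
    bagging_example_fraction=0.7,       # each bag samples 70% of the data
    random_state=123,                   # reproducible runs
    feature=['age', 'edu', 'induced'],
    label='case')
```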
Examples
```python
###############################################################################
# FastForestBinaryClassifier
import numpy
from nimbusml import Pipeline, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.ensemble import FastForestBinaryClassifier
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(path, sep=',',
                               numeric_dtype=numpy.float32,
                               names={0: 'row_num', 5: 'case'})
print(data.head())
#     age  case education  induced  parity  pooled.stratum  row_num  ...
# 0  26.0   1.0    0-5yrs      1.0     6.0             3.0      1.0  ...
# 1  42.0   1.0    0-5yrs      1.0     1.0             1.0      2.0  ...
# 2  39.0   1.0    0-5yrs      2.0     6.0             4.0      3.0  ...
# 3  34.0   1.0    0-5yrs      2.0     4.0             2.0      4.0  ...
# 4  35.0   1.0   6-11yrs      1.0     3.0            32.0      5.0  ...

# define the training pipeline
pipeline = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    FastForestBinaryClassifier(feature=['age', 'edu', 'induced'],
                               label='case')
])

# train, predict, and evaluate
metrics, predictions = pipeline.fit(data).test(data, output_scores=True)

# print predictions
print(predictions.head())
#    PredictedLabel      Score
# 0             0.0 -26.985743
# 1             0.0 -26.562090
# 2             0.0 -24.832508
# 3             0.0 -23.799389
# 4             0.0 -19.612534

# print evaluation metrics
print(metrics)
#         AUC  Accuracy  Positive precision  Positive recall  ...
# 0  0.655714  0.665323                   0                0  ...
```
Remarks
Decision trees are non-parametric models that perform a sequence of simple tests on an input, mapping it to outputs found in the training dataset whose inputs were similar to the instance being processed. A decision is made at each node of the binary tree data structure based on a measure of similarity, mapping each instance recursively through the branches of the tree until the appropriate leaf node is reached and its output decision is returned.
Decision trees have several advantages:

- They are efficient in both computation and memory usage during training and prediction.
- They can represent non-linear decision boundaries.
- They perform integrated feature selection and classification.
- They are resilient in the presence of noisy features.
The fast forest classifier is a random forest and quantile regression forest implementation that uses the tree learner from FastTreesBinaryClassifier. The model consists of an ensemble of decision trees.
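The per-tree decision procedure described above can be sketched in a few lines of plain Python. This is a toy illustration of how an instance is routed through threshold tests to a leaf and how a forest combines per-tree outputs; it is not nimbusml's internal representation.

```python
# Toy decision tree: internal nodes test one feature against a threshold;
# leaves carry the output decision. Illustrative only.
def predict_tree(node, x):
    while 'output' not in node:  # descend until a leaf is reached
        go_left = x[node['feature']] <= node['threshold']
        node = node['left'] if go_left else node['right']
    return node['output']

# A forest averages the outputs of its trees over the ensemble.
def predict_forest(trees, x):
    return sum(predict_tree(t, x) for t in trees) / len(trees)

stump = {'feature': 'age', 'threshold': 30.0,
         'left': {'output': 0.0}, 'right': {'output': 1.0}}
print(predict_forest([stump], {'age': 26.0}))  # -> 0.0
```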
Reference
From Stumps to Trees to Forests
Methods
Name | Description |
---|---|
decision_function | Returns score values. |
get_params | Get the parameters for this operator. |
predict_proba | Returns probabilities. |
decision_function
Returns score values
decision_function(X, **params)
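A minimal usage sketch, assuming the fitted `pipeline` and the `data` stream from the Examples section above:

```python
# Raw (uncalibrated) per-row scores from the fitted pipeline; use
# predict_proba for calibrated class probabilities.
scores = pipeline.decision_function(data)
print(scores[:5])
```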
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | Default value: False |
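For example, assuming get_params follows the scikit-learn convention of returning a dict keyed by constructor argument name:

```python
from nimbusml.ensemble import FastForestBinaryClassifier

# Inspect the operator's configuration as a parameter dict.
ff = FastForestBinaryClassifier(number_of_trees=200)
print(ff.get_params()['number_of_trees'])  # expected: 200
```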
predict_proba
Returns probabilities
predict_proba(X, **params)
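A usage sketch, again assuming the fitted `pipeline` and the `data` stream from the Examples section:

```python
# Per-class membership probabilities, one column per class.
probs = pipeline.predict_proba(data)
print(probs[:5])
```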