Pipeline Class
Implementation of a pipeline.
Inheritance
- builtins.object
  - Pipeline
Constructor
Pipeline(steps=None, model=None, random_state=None)
Parameters
Name | Description |
---|---|
steps | the list of operators, or (name, operator) tuples, that are chained in the appropriate order. |
model | the path to the model file (".zip"), if loading a model directly from file (such as a trained model from ML.NET). |
random_state | the integer used as the random seed. |
Remarks
The Pipeline class assembles a pipeline of transforms, followed optionally by a trainer. The transforms need to implement fit() and transform() methods. The final trainer only needs to implement the fit() method.
The Pipeline class only accepts trainers and transforms implemented in this package.
The data sources for the methods may be a list, numpy.array, scipy.sparse_csr, pandas.DataFrame or a FileDataStream.
By default, the first transform will take all columns as input (i.e. will transform all columns), unless specific columns are requested (see Columns for how to specify columns to transform). The output column of the first transform is passed as the input column into the second transform for processing by default, unless the second transform requests a different column to operate on.
The final trainer (if one exists) can select which columns to use for feature, labels, weights etc. See Roles for more details on how to select these.
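For illustration, a minimal sketch of constructing a pipeline with (name, operator) tuples and a fixed seed; the operator choices here are arbitrary:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler

# Steps can be plain operators or (name, operator) tuples;
# random_state fixes the seed for reproducibility.
pipe = Pipeline(steps=[('scaler', MeanVarianceScaler()),
                       ('learner', FastLinearRegressor())],
                random_state=42)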
Methods
Name | Description |
---|---|
append | Extends the pipeline with a new transform/learner at the end. Note that a fitted pipeline cannot be modified. |
clone | Clones the pipeline and returns it in a non-trained state if the trained model was stored in a file on disk. |
combine_models | Combine the models of multiple pipelines, transforms and/or predictors into a single model. The models are combined in the order they are seen. |
decision_function | Apply transforms and generate decision values. |
fit | Fit the pipeline. |
fit_transform | If a pipeline only has transforms, returns the transformed data as a pandas DataFrame. |
get_feature_contributions | Calculates observation-level feature contributions. Returns a dataframe with raw data, predictions, and feature contributions for each prediction. Feature contributions are not supported for transforms, so make sure that the last step in the pipeline is a model. See below for the list of supported models. |
get_fit_info | Returns information about the pipeline. |
get_output_columns | Returns the output list of columns for the fitted model. |
get_params | Returns pipeline parameters. |
insert | Inserts a transform/learner into the pipeline. |
load_model | Load model from file. The model can be generated from ML.NET in .zip format. For more details, please refer to load/save model. |
permutation_feature_importance | Determine the global importance of features in a trained model by permuting the values of each feature across the dataset and measuring the change in the evaluation metric. See below for details and the supported metrics. |
predict | Predict based on the input data. |
predict_proba | Apply transforms and predict probabilities. |
save_model | Save model to file. For more details, please refer to load/save model. |
score | Return performance metrics for the corresponding problem. |
set_params | Set parameters on the pipeline. |
summary | Return summary for the fitted model. |
test | Return both predictions and performance metrics. For more details please refer to Metrics. |
transform | Apply transforms. |
append
Extends the pipeline with a new transform/learner at the end.
Note that a fitted pipeline cannot be modified.
Example: pipe.append(FastLinearRegressor()).
Example: pipe.append(("learner", FastLinearRegressor())).
append(step)
Parameters
Name | Description |
---|---|
step (Required) | the transform/learner to append |
clone
Clones the pipeline and returns it in a non-trained state if the trained model was stored in a file on disk.
You can clone the trained pipeline by running: Pipeline(**pipe.get_params()).
clone()
combine_models
Combine the models of multiple pipelines, transforms and/or predictors into a single model. The models are combined in the order they are seen.
combine_models(*items, **params)
Parameters
Name | Description |
---|---|
items (Required) | the fitted pipelines, transforms and/or predictors which contain the models to join. |
contains_predictor | Set to True if the last item contains or is a predictor. Set to False if items contains only transforms. Default value: True |
Returns
Type | Description |
---|---|
A new Pipeline backed by a model that is the combination of all the models passed in through items. |
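For illustration, a hedged sketch of joining a fitted transform pipeline with a fitted predictor pipeline; it assumes combine_models can be invoked on the Pipeline class as the signature above suggests:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

# Fit the transform and the predictor as separate pipelines.
transform_pipe = Pipeline([MeanVarianceScaler()])
X_scaled = transform_pipe.fit_transform(X)

predictor_pipe = Pipeline([FastLinearRegressor()])
predictor_pipe.fit(X_scaled, y)

# The last item is a predictor, so contains_predictor keeps
# its default of True.
combined = Pipeline.combine_models(transform_pipe, predictor_pipe)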
decision_function
Apply transforms and generate decision values.
decision_function(X, verbose=0, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
Returns
Type | Description |
---|---|
array, shape=(n_samples,) if n_classes == 2, else (n_samples, n_classes) |
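For illustration, a minimal sketch for a binary problem; the classifier choice is arbitrary:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearBinaryClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
y = np.array([1, 0, 0, 1])

pipe = Pipeline([FastLinearBinaryClassifier()])
pipe.fit(X, y)

# With n_classes == 2 this yields one margin value per sample.
margins = pipe.decision_function(X)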
fit
Fit the pipeline.
fit(X, y=None, verbose=1, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
fit_transform
If a pipeline only has transforms, returns the transformed data as a pandas DataFrame.
fit_transform(X, y=None, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
as_binary_data_stream | If True, output the transformed data as a binary data stream instead of a pandas DataFrame. Default value: False |
params | Additional arguments. |
Returns
Type | Description |
---|---|
A pandas DataFrame if no other output format is specified. |
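For illustration, a minimal sketch of a transforms-only pipeline:

import numpy as np
from nimbusml import Pipeline
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])

# No trainer in the pipeline, so fit_transform returns the
# scaled columns as a pandas DataFrame.
pipe = Pipeline([MeanVarianceScaler()])
transformed = pipe.fit_transform(X)
print(transformed)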
get_feature_contributions
Calculates observation-level feature contributions. Returns a dataframe with raw data, predictions, and feature contributions for each prediction. Feature contributions are not supported for transforms, so make sure that the last step in a pipeline is a model. Feature contributions are supported for the following models:
Regression:
OrdinaryLeastSquaresRegressor
FastLinearRegressor
OnlineGradientDescentRegressor
PoissonRegressionRegressor
GamRegressor
LightGbmRegressor
FastTreesRegressor
FastForestRegressor
FastTreesTweedieRegressor
Binary Classification:
AveragedPerceptronBinaryClassifier
LinearSvmBinaryClassifier
LogisticRegressionBinaryClassifier
FastLinearBinaryClassifier
SgdBinaryClassifier
SymSgdBinaryClassifier
GamBinaryClassifier
FastForestBinaryClassifier
FastTreesBinaryClassifier
LightGbmBinaryClassifier
Ranking:
LightGbmRanker
get_feature_contributions(X, top=10, bottom=10, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
top | the number of positive contributions with highest magnitude to report. Default value: 10 |
bottom | the number of negative contributions with highest magnitude to report. Default value: 10 |
Returns
Type | Description |
---|---|
dataframe containing the raw data, predicted label, score, probabilities, and feature contributions. |
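For illustration, a minimal sketch with a supported model (FastLinearRegressor) as the last step:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7], [4.0, 1.2]])
y = np.array([2.0, 3.0, 1.5, 2.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)

# Report at most 2 positive and 2 negative contributions per row.
contributions = pipe.get_feature_contributions(X, top=2, bottom=2)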
get_fit_info
Returns information about the pipeline.
Example: pipe.get_fit_info(X, Y).
get_fit_info(X, y=None, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
Returns
Type | Description |
---|---|
tuple (list of dictionaries, list of entrypoints); the two lists do not necessarily have the same length |
Remarks
The first list describes the operators the user defined. It is a list of dictionaries with keys operator, name, inputs, outputs, type, and current_schema; the last of these is the schema after the transform or learner is applied.
The second list is what nimbusml uses internally. The number of entrypoints may differ from the number of operators. This information is mostly used by contributors.
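For illustration, a minimal sketch of inspecting the per-step information, using the keys listed above:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([MeanVarianceScaler(), FastLinearRegressor()])
info, entrypoints = pipe.get_fit_info(X, y)

# Each dictionary describes one user-defined operator, including
# the schema after that step is applied.
for step in info:
    print(step['name'], step.get('current_schema'))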
get_output_columns
Returns the output list of columns for the fitted model.
get_output_columns(verbose=0, **params)
Parameters
Name | Description |
---|---|
verbose | Default value: 0 |
get_params
Returns pipeline parameters.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | boolean, optional. If True, will return the parameters for this pipeline and contained subobjects that are estimators. Default value: False |
insert
Inserts a transform/learner into the pipeline.
Example: pipe.insert(1, FastLinearRegressor()).
Example: pipe.insert(1, ("learner", FastLinearRegressor())).
insert(pos, step)
Parameters
Name | Description |
---|---|
pos (Required) | the position at which to insert, as an integer |
step (Required) | the transform/learner to insert |
load_model
Load model from file. The model can be generated from ML.NET in .zip format. For more details, please refer to load/save model
load_model(src)
Parameters
Name | Description |
---|---|
src (Required) | the source filename from which to load the model |
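For illustration, a hedged round-trip sketch; the file name is illustrative:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)
pipe.save_model('model.zip')

# Load into a fresh pipeline; note that evaltype must then be
# given explicitly when calling score() or test().
loaded = Pipeline()
loaded.load_model('model.zip')
predictions = loaded.predict(X)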
permutation_feature_importance
Permutation feature importance (PFI) is a technique to determine the global importance of features in a trained machine learning model. PFI is a simple yet powerful technique motivated by Breiman in section 10 of his Random Forests paper (Machine Learning, 2001). The advantage of the PFI method is that it is model agnostic - it works with any model that can be evaluated - and it can use any dataset, not just the training set, to compute feature importance metrics.
PFI works by taking a labeled dataset, choosing a feature, and permuting the values for that feature across all the examples, so that each example now has a random value for the feature and the original values for all other features. The evaluation metric (e.g. NDCG) is then calculated for this modified dataset, and the change in the evaluation metric from the original dataset is computed. The larger the change in the evaluation metric, the more important the feature is to the model, i.e. the most important features are those that the model is most sensitive to. PFI works by performing this permutation analysis across all the features of a model, one after another.
Note that for increasing metrics (e.g. AUC, accuracy, R-Squared, NDCG), the most important features will be those with the highest negative mean change in the metric. Conversely, for decreasing metrics (e.g. Mean Squared Error, Log loss), the most important features will be those with the highest positive mean change in the metric.
PFI is supported for binary classifiers, classifiers, regressors, and rankers.
The mean change and the standard error of the mean are evaluated for the following metrics:
Binary Classification:
Area under ROC curve (AUC)
Accuracy
Positive precision
Positive recall
Negative precision
Negative recall
F1 score
Area under Precision-Recall curve (AUPRC)
Multiclass classification:
Macro accuracy
Micro accuracy
Log loss
Log loss reduction
Top k accuracy
Per-class log loss
Regression:
Mean absolute error (MAE)
Mean squared error (MSE)
Root mean squared error (RMSE)
Loss function
R-Squared
Ranking:
Discounted cumulative gains (DCG) @1, @2, and @3
Normalized discounted cumulative gains (NDCG) @1, @2, and @3
Reference
Breiman, L. Random Forests. Machine Learning (2001) 45: 5.
permutation_feature_importance(X, number_of_examples=None, permutation_count=1, filter_zero_weight_features=False, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
number_of_examples | limit the number of examples to evaluate on. Default value: None |
permutation_count | the number of permutations to perform. Default value: 1 |
Returns
Type | Description |
---|---|
dataframe containing the mean change in evaluation metrics and the standard error of the mean for each feature. Features with the largest change in a metric are the most important to the model with respect to that metric. |
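For illustration, a minimal sketch for a binary classifier; it assumes the label column is supplied through roles so it is available at evaluation time, and averages over several permutations to reduce the variance of the estimates:

import numpy as np
import pandas as pd
from nimbusml import Pipeline, Role
from nimbusml.linear_model import FastLinearBinaryClassifier

rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.rand(100), 'x2': rng.rand(100)})
df['y'] = (df['x1'] > 0.5).astype(np.float64)

# Assign roles so the label column travels with the data.
pipe = Pipeline([FastLinearBinaryClassifier() << {
    Role.Label: 'y', Role.Feature: ['x1', 'x2']}])
pipe.fit(df)

# Features with the largest mean change in AUC are the most
# important to this binary classifier.
pfi = pipe.permutation_feature_importance(df, permutation_count=5)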
predict
Predict based on the input data.
predict(X, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
predict_proba
Apply transforms and predict probabilities.
predict_proba(X, verbose=0, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
Returns
Type | Description |
---|---|
array, shape = [n_samples, n_classes] |
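For illustration, a minimal sketch for a binary classifier:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import LogisticRegressionBinaryClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
y = np.array([1, 0, 0, 1])

pipe = Pipeline([LogisticRegressionBinaryClassifier()])
pipe.fit(X, y)

# One column per class; each row sums to 1.
proba = pipe.predict_proba(X)   # shape (4, 2)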
save_model
Save model to file. For more details, please refer to load/save model
save_model(dst)
Parameters
Name | Description |
---|---|
dst
Required
|
filename to be saved with |
score
Return performance metrics for the corresponding problem.
score(X, y, evaltype='auto', group_id=None, weight=None, verbose=0, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y (Required) | {array-like [n_samples]} |
evaltype | the evaluation type for the problem; can be {'binary', 'multiclass', 'regression', 'cluster', 'anomaly', 'ranking'}. The default is 'auto'. If the model is loaded using the load_model() method, evaltype cannot be 'auto' and must be explicitly specified. |
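For illustration, a minimal sketch; for a model loaded from file, pass evaltype explicitly:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)

# 'auto' infers the problem type from the trained pipeline.
metric = pipe.score(X, y)
# For a pipeline restored via load_model(), specify it instead:
# metric = loaded.score(X, y, evaltype='regression')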
set_params
Set parameters on the pipeline.
set_params(**params)
summary
Return summary for fitted model.
summary(verbose=0, **params)
Examples
###############################################################################
# Pipeline
import numpy as np
import pandas as pd
from nimbusml import Pipeline, FileDataStream
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler
X = np.array([[1, 2.0], [2, 4], [3, 0.7]])
Y = np.array([2, 3, 1.5])
df = pd.DataFrame(dict(y=Y, x1=X[:, 0], x2=X[:, 1]))
pipe = Pipeline([
MeanVarianceScaler(),
FastLinearRegressor()
])
# fit with numpy arrays
pipe.fit(X, Y)
# Fit with FileDataStream
df.to_csv('data.csv', index=False)
ds = FileDataStream.read_csv('data.csv', sep=',')
pipe = Pipeline([
MeanVarianceScaler(),
FastLinearRegressor()
])
pipe.fit(ds, 'y')
print(pipe.summary())
# Bias Weights.x1 Weights.x2
# 0 1.032946 0.111758 1.210791
test
Return both predictions and performance metrics. For more details please refer to Metrics.
test(X, y=None, evaltype='auto', group_id=None, weight=None, verbose=0, output_scores=False, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
evaltype | the evaluation type for the problem; can be {'binary', 'multiclass', 'regression', 'cluster', 'anomaly', 'ranking'}. The default is 'auto'. If the model is loaded using the load_model() method, evaltype cannot be 'auto' and must be explicitly specified. |
group_id | the column name of the group_id for ranking problems |
weight | the column name of the weight column for each sample |
output_scores | if set to True, returns the raw scores; otherwise the second element of the returned tuple is None. Default value: False |
Returns
Type | Description |
---|---|
tuple (dataframe of evaluation metrics, dataframe of scores). If scores are required, set output_scores=True. |
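For illustration, a minimal sketch returning both metrics and raw scores:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)

# output_scores=True fills the second element of the tuple;
# otherwise it is None.
metrics, scores = pipe.test(X, y, output_scores=True)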
transform
Apply transforms.
transform(X, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
as_binary_data_stream | If True, output the transformed data as a binary data stream instead of a pandas DataFrame. Default value: False |
params | Additional arguments. |
Returns
Type | Description |
---|---|
A pandas DataFrame if no other output format is specified. |
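For illustration, a minimal sketch applying a fitted transform to new data:

import numpy as np
from nimbusml import Pipeline
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X_train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
X_new = np.array([[1.5, 1.0], [2.5, 3.0]])

pipe = Pipeline([MeanVarianceScaler()])
pipe.fit(X_train)

# Applies the scaling learned on X_train; returns a pandas
# DataFrame by default.
scaled = pipe.transform(X_new)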