Pipeline Class
Implementation of a pipeline.
Inheritance
- builtins.object
  - Pipeline
Constructor
Pipeline(steps=None, model=None, random_state=None)
Parameters
Name | Description |
---|---|
steps | the list of operators, or (name, operator) tuples, that are chained in the appropriate order. |
model | the path to the model file (".zip"), if loading a model directly from file (such as a trained model from ML.NET). |
random_state | the integer used as the random seed. |
Remarks
The Pipeline class assembles a pipeline of transforms, followed optionally by a trainer. The transforms need to implement fit() and transform() methods. The final trainer only needs to implement the fit() method.
The Pipeline class only accepts trainers and transforms implemented in this package.
The data sources for the methods may be a list, numpy.array, scipy.sparse_csr, pandas.DataFrame or a FileDataStream.
By default, the first transform will take all columns as input (i.e. will transform all columns), unless specific columns are requested (see Columns for how to specify columns to transform). The output column of the first transform is passed as the input column into the second transform for processing by default, unless the second transform requests a different column to operate on.
The final trainer (if one exists) can select which columns to use for feature, labels, weights etc. See Roles for more details on how to select these.
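For illustration, a minimal sketch of constructing a pipeline with (name, operator) tuples and a fixed seed; the operator choices here are arbitrary:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler

# Steps can be plain operators or (name, operator) tuples;
# random_state fixes the seed for reproducibility.
pipe = Pipeline(steps=[('scaler', MeanVarianceScaler()),
                       ('learner', FastLinearRegressor())],
                random_state=42)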
Methods
Name | Description |
---|---|
append | Extends the pipeline with a new transform/learner at the end. Note that a fitted pipeline cannot be modified. |
clone | Clones the pipeline and returns it in a non-trained state if the trained model was stored in a file on disk. |
combine_models | Combine the models of multiple pipelines, transforms and/or predictors into a single model. The models are combined in the order they are seen. |
decision_function | Apply transforms and generate decision values. |
fit | Fit the pipeline. |
fit_transform | If a pipeline only has transforms, returns the transformed data as a pandas DataFrame. |
get_feature_contributions | Calculates observation-level feature contributions. Returns a dataframe with raw data, predictions, and feature contributions for each prediction. Feature contributions are not supported for transforms, so make sure that the last step in the pipeline is a model. See below for the list of supported models. |
get_fit_info | Returns information about the pipeline. |
get_output_columns | Returns the output list of columns for the fitted model. |
get_params | Returns pipeline parameters. |
insert | Inserts a transform/learner into the pipeline. |
load_model | Load model from file. The model can be generated from ML.NET in .zip format. For more details, please refer to load/save model. |
permutation_feature_importance | Determine the global importance of features in a trained model by permuting the values of each feature across the dataset and measuring the change in the evaluation metric. See below for details and the supported metrics. |
predict | Predict based on the input data. |
predict_proba | Apply transforms and predict probabilities. |
save_model | Save model to file. For more details, please refer to load/save model. |
score | Return performance metrics for the corresponding problem. |
set_params | Set parameters on the pipeline. |
summary | Return summary for the fitted model. |
test | Return both predictions and performance metrics. For more details please refer to Metrics. |
transform | Apply transforms. |
append
Extends the pipeline with a new transform/learner at the end.
Note that a fitted pipeline cannot be modified.
Example: pipe.append(FastLinearRegressor()).
Example: pipe.append(("learner", FastLinearRegressor())).
append(step)
Parameters
Name | Description |
---|---|
step (Required) | the transform/learner to append |
clone
Clones the pipeline and returns it in a non-trained state if the trained model was stored in a file on disk.
You can clone the trained pipeline by running: Pipeline(**pipe.get_params()).
clone()
combine_models
Combine the models of multiple pipelines, transforms and/or predictors into a single model. The models are combined in the order they are seen.
combine_models(*items, **params)
Parameters
Name | Description |
---|---|
items (Required) | the fitted pipelines, transforms and/or predictors which contain the models to join. |
contains_predictor | Set to True if the last item contains or is a predictor. Set to False if items contains only transforms. Default value: True |
Returns
Type | Description |
---|---|
A new Pipeline backed by a model that is the combination of all the models passed in through items. |
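For illustration, a hedged sketch of joining a fitted transform pipeline with a fitted predictor pipeline; it assumes combine_models can be invoked on the Pipeline class as the signature above suggests:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

# Fit the transform and the predictor as separate pipelines.
transform_pipe = Pipeline([MeanVarianceScaler()])
X_scaled = transform_pipe.fit_transform(X)

predictor_pipe = Pipeline([FastLinearRegressor()])
predictor_pipe.fit(X_scaled, y)

# The last item is a predictor, so contains_predictor keeps
# its default of True.
combined = Pipeline.combine_models(transform_pipe, predictor_pipe)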
decision_function
Apply transforms and generate decision values.
decision_function(X, verbose=0, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
Returns
Type | Description |
---|---|
array, shape=(n_samples,) if n_classes == 2, else (n_samples, n_classes) |
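For illustration, a minimal sketch for a binary problem; the classifier choice is arbitrary:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearBinaryClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
y = np.array([1, 0, 0, 1])

pipe = Pipeline([FastLinearBinaryClassifier()])
pipe.fit(X, y)

# With n_classes == 2 this yields one margin value per sample.
margins = pipe.decision_function(X)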
fit
Fit the pipeline.
fit(X, y=None, verbose=1, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
fit_transform
If a pipeline only has transforms, returns the transformed data as a pandas DataFrame.
fit_transform(X, y=None, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
as_binary_data_stream | If True, output the transformed data as a binary data stream instead of a pandas DataFrame. Default value: False |
params | Additional arguments. |
Returns
Type | Description |
---|---|
A pandas DataFrame if no other output format is specified. |
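For illustration, a minimal sketch of a transforms-only pipeline:

import numpy as np
from nimbusml import Pipeline
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])

# No trainer in the pipeline, so fit_transform returns the
# scaled columns as a pandas DataFrame.
pipe = Pipeline([MeanVarianceScaler()])
transformed = pipe.fit_transform(X)
print(transformed)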
get_feature_contributions
Calculates observation-level feature contributions. Returns a dataframe with raw data, predictions, and feature contributions for each prediction. Feature contributions are not supported for transforms, so make sure that the last step in a pipeline is a model. Feature contributions are supported for the following models:
Regression:
OrdinaryLeastSquaresRegressor
FastLinearRegressor
OnlineGradientDescentRegressor
PoissonRegressionRegressor
GamRegressor
LightGbmRegressor
FastTreesRegressor
FastForestRegressor
FastTreesTweedieRegressor
Binary Classification:
AveragedPerceptronBinaryClassifier
LinearSvmBinaryClassifier
LogisticRegressionBinaryClassifier
FastLinearBinaryClassifier
SgdBinaryClassifier
SymSgdBinaryClassifier
GamBinaryClassifier
FastForestBinaryClassifier
FastTreesBinaryClassifier
LightGbmBinaryClassifier
Ranking:
LightGbmRanker
get_feature_contributions(X, top=10, bottom=10, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
top | the number of positive contributions with highest magnitude to report. Default value: 10 |
bottom | the number of negative contributions with highest magnitude to report. Default value: 10 |
Returns
Type | Description |
---|---|
dataframe containing the raw data, predicted label, score, probabilities, and feature contributions. |
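For illustration, a minimal sketch with a supported model (FastLinearRegressor) as the last step:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7], [4.0, 1.2]])
y = np.array([2.0, 3.0, 1.5, 2.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)

# Report at most 2 positive and 2 negative contributions per row.
contributions = pipe.get_feature_contributions(X, top=2, bottom=2)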
get_fit_info
Returns information about the pipeline.
Example: pipe.get_fit_info(X, Y).
get_fit_info(X, y=None, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
Returns
Type | Description |
---|---|
tuple (list of dictionaries, list of entrypoints); the two lists do not necessarily have the same length |
Remarks
The first list describes the operators the user defined. It is a list of dictionaries with keys operator, name, inputs, outputs, type, and current_schema; the last of these is the schema after the transform or learner is applied.
The second list is what nimbusml uses internally. The number of entrypoints may differ from the number of operators. This information is mostly used by contributors.
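For illustration, a minimal sketch of inspecting the per-step information, using the keys listed above:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([MeanVarianceScaler(), FastLinearRegressor()])
info, entrypoints = pipe.get_fit_info(X, y)

# Each dictionary describes one user-defined operator, including
# the schema after that step is applied.
for step in info:
    print(step['name'], step.get('current_schema'))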
get_output_columns
Returns the output list of columns for the fitted model.
get_output_columns(verbose=0, **params)
Parameters
Name | Description |
---|---|
verbose | Default value: 0 |
get_params
Returns pipeline parameters.
get_params(deep=False)
Parameters
Name | Description |
---|---|
deep | boolean, optional. If True, will return the parameters for this pipeline and contained subobjects that are estimators. Default value: False |
insert
Inserts a transform/learner into the pipeline.
Example: pipe.insert(1, FastLinearRegressor()).
Example: pipe.insert(1, ("learner", FastLinearRegressor())).
insert(pos, step)
Parameters
Name | Description |
---|---|
pos (Required) | the position at which to insert, as an integer |
step (Required) | the transform/learner to insert |
load_model
Load model from file. The model can be generated from ML.NET in .zip format. For more details, please refer to load/save model
load_model(src)
Parameters
Name | Description |
---|---|
src (Required) | the source filename from which to load the model |
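For illustration, a hedged round-trip sketch; the file name is illustrative:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)
pipe.save_model('model.zip')

# Load into a fresh pipeline; note that evaltype must then be
# given explicitly when calling score() or test().
loaded = Pipeline()
loaded.load_model('model.zip')
predictions = loaded.predict(X)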
permutation_feature_importance
Permutation feature importance (PFI) is a technique to determine the global importance of features in a trained machine learning model. PFI is a simple yet powerful technique motivated by Breiman in section 10 of his Random Forests paper (Machine Learning, 2001). The advantage of the PFI method is that it is model agnostic - it works with any model that can be evaluated - and it can use any dataset, not just the training set, to compute feature importance metrics.
PFI works by taking a labeled dataset, choosing a feature, and permuting the values for that feature across all the examples, so that each example now has a random value for the feature and the original values for all other features. The evaluation metric (e.g. NDCG) is then calculated for this modified dataset, and the change in the evaluation metric from the original dataset is computed. The larger the change in the evaluation metric, the more important the feature is to the model, i.e. the most important features are those that the model is most sensitive to. PFI works by performing this permutation analysis across all the features of a model, one after another.
Note that for increasing metrics (e.g. AUC, accuracy, R-Squared, NDCG), the most important features will be those with the highest negative mean change in the metric. Conversely, for decreasing metrics (e.g. Mean Squared Error, Log loss), the most important features will be those with the highest positive mean change in the metric.
PFI is supported for binary classifiers, classifiers, regressors, and rankers.
The mean change and the standard error of the mean are evaluated for the following metrics:
Binary Classification:
Area under ROC curve (AUC)
Accuracy
Positive precision
Positive recall
Negative precision
Negative recall
F1 score
Area under Precision-Recall curve (AUPRC)
Multiclass classification:
Macro accuracy
Micro accuracy
Log loss
Log loss reduction
Top k accuracy
Per-class log loss
Regression:
Mean absolute error (MAE)
Mean squared error (MSE)
Root mean squared error (RMSE)
Loss function
R-Squared
Ranking:
Discounted cumulative gains (DCG) @1, @2, and @3
Normalized discounted cumulative gains (NDCG) @1, @2, and @3
Reference
Breiman, L. Random Forests. Machine Learning (2001) 45: 5.
permutation_feature_importance(X, number_of_examples=None, permutation_count=1, filter_zero_weight_features=False, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
number_of_examples | limit the number of examples to evaluate on. Default value: None |
permutation_count | the number of permutations to perform. Default value: 1 |
Returns
Type | Description |
---|---|
dataframe containing the mean change in evaluation metrics and the standard error of the mean for each feature. Features with the largest change in a metric are the most important to the model with respect to that metric. |
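For illustration, a minimal sketch for a binary classifier; it assumes the label column is supplied through roles so it is available at evaluation time, and averages over several permutations to reduce the variance of the estimates:

import numpy as np
import pandas as pd
from nimbusml import Pipeline, Role
from nimbusml.linear_model import FastLinearBinaryClassifier

rng = np.random.RandomState(0)
df = pd.DataFrame({'x1': rng.rand(100), 'x2': rng.rand(100)})
df['y'] = (df['x1'] > 0.5).astype(np.float64)

# Assign roles so the label column travels with the data.
pipe = Pipeline([FastLinearBinaryClassifier() << {
    Role.Label: 'y', Role.Feature: ['x1', 'x2']}])
pipe.fit(df)

# Features with the largest mean change in AUC are the most
# important to this binary classifier.
pfi = pipe.permutation_feature_importance(df, permutation_count=5)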
predict
Predict based on the input data.
predict(X, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
predict_proba
Apply transforms and predict probabilities.
predict_proba(X, verbose=0, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
Returns
Type | Description |
---|---|
array, shape = [n_samples, n_classes] |
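For illustration, a minimal sketch for a binary classifier:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import LogisticRegressionBinaryClassifier

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [0.5, 3.0]])
y = np.array([1, 0, 0, 1])

pipe = Pipeline([LogisticRegressionBinaryClassifier()])
pipe.fit(X, y)

# One column per class; each row sums to 1.
proba = pipe.predict_proba(X)   # shape (4, 2)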
save_model
Save model to file. For more details, please refer to load/save model
save_model(dst)
Parameters
Name | Description |
---|---|
dst
Required
|
filename to be saved with |
score
Return performance metrics for the corresponding problem.
score(X, y, evaltype='auto', group_id=None, weight=None, verbose=0, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y (Required) | {array-like [n_samples]} |
evaltype | the evaluation type for the problem; can be {'binary', 'multiclass', 'regression', 'cluster', 'anomaly', 'ranking'}. The default is 'auto'. If the model is loaded using the load_model() method, evaltype cannot be 'auto' and must be explicitly specified. |
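For illustration, a minimal sketch; for a model loaded from file, pass evaltype explicitly:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)

# 'auto' infers the problem type from the trained pipeline.
metric = pipe.score(X, y)
# For a pipeline restored via load_model(), specify it instead:
# metric = loaded.score(X, y, evaltype='regression')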
set_params
Set parameters on the pipeline.
set_params(**params)
summary
Return summary for fitted model.
summary(verbose=0, **params)
Examples
###############################################################################
# Pipeline
import numpy as np
import pandas as pd
from nimbusml import Pipeline, FileDataStream
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler
X = np.array([[1, 2.0], [2, 4], [3, 0.7]])
Y = np.array([2, 3, 1.5])
df = pd.DataFrame(dict(y=Y, x1=X[:, 0], x2=X[:, 1]))
pipe = Pipeline([
MeanVarianceScaler(),
FastLinearRegressor()
])
# fit with numpy arrays
pipe.fit(X, Y)
# Fit with FileDataStream
df.to_csv('data.csv', index=False)
ds = FileDataStream.read_csv('data.csv', sep=',')
pipe = Pipeline([
MeanVarianceScaler(),
FastLinearRegressor()
])
pipe.fit(ds, 'y')
print(pipe.summary())
# Bias Weights.x1 Weights.x2
# 0 1.032946 0.111758 1.210791
test
Return both predictions and performance metrics. For more details please refer to Metrics.
test(X, y=None, evaltype='auto', group_id=None, weight=None, verbose=0, output_scores=False, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
evaltype | the evaluation type for the problem; can be {'binary', 'multiclass', 'regression', 'cluster', 'anomaly', 'ranking'}. The default is 'auto'. If the model is loaded using the load_model() method, evaltype cannot be 'auto' and must be explicitly specified. |
group_id | the column name of the group_id for ranking problems |
weight | the column name of the weight column for each sample |
output_scores | if set to True, returns the raw scores; otherwise the second element of the returned tuple is None. Default value: False |
Returns
Type | Description |
---|---|
tuple (dataframe of evaluation metrics, dataframe of scores). If scores are required, set output_scores=True. |
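For illustration, a minimal sketch returning both metrics and raw scores:

import numpy as np
from nimbusml import Pipeline
from nimbusml.linear_model import FastLinearRegressor

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
y = np.array([2.0, 3.0, 1.5])

pipe = Pipeline([FastLinearRegressor()])
pipe.fit(X, y)

# output_scores=True fills the second element of the tuple;
# otherwise it is None.
metrics, scores = pipe.test(X, y, output_scores=True)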
transform
Apply transforms.
transform(X, verbose=0, as_binary_data_stream=False, **params)
Parameters
Name | Description |
---|---|
X (Required) | {array-like [n_samples, n_features], FileDataStream} |
y | {array-like [n_samples]} Default value: None |
as_binary_data_stream | If True, output the transformed data as a binary data stream instead of a pandas DataFrame. Default value: False |
params | Additional arguments. |
Returns
Type | Description |
---|---|
A pandas DataFrame if no other output format is specified. |
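For illustration, a minimal sketch applying a fitted transform to new data:

import numpy as np
from nimbusml import Pipeline
from nimbusml.preprocessing.normalization import MeanVarianceScaler

X_train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.7]])
X_new = np.array([[1.5, 1.0], [2.5, 3.0]])

pipe = Pipeline([MeanVarianceScaler()])
pipe.fit(X_train)

# Applies the scaling learned on X_train; returns a pandas
# DataFrame by default.
scaled = pipe.transform(X_new)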