Column Roles for Trainers
Columns play different roles in the context of trainers. NimbusML supports the following roles, as defined in nimbusml.Role:
Role.Label - the column representing the dependent variable.
Role.Feature - the column(s) representing the independent variable(s).
Role.Weight - the weights column.
Role.GroupId - the column containing grouping values for ranking.
The << operator tells the trainer which columns should play which role. When roles are assigned to a trainer in a pipeline, they take precedence over the position of arguments in the fit() method of the pipeline. Typically, fit(X, y) denotes that X holds the features and y holds the labels. However, if roles are set for a trainer, you can simply invoke fit(X), and the trainer will use the columns in X according to the defined roles. Note that X can be any valid data source, including nimbusml.FileDataStream, as long as the columns can be referenced by name.
The trainer will exclude columns with roles Role.Label, Role.Weight, Role.GroupId (if any are specified), and use all remaining columns of the input data as features. If Role.Feature is specified, only those columns will be used as features, and the remaining columns will be ignored.
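The column-selection rule described above can be sketched in plain Python. This is only an illustration of the rule as stated in the text, not NimbusML's internal code; the helper name resolve_feature_columns and its arguments are hypothetical.

```python
def resolve_feature_columns(columns, roles):
    """Return the columns a trainer would treat as features.

    columns: list of column names in the input data.
    roles:   dict mapping role names ('Label', 'Feature', 'Weight',
             'GroupId') to a column name or a list of column names.
    Hypothetical helper that mirrors the rule described in the text.
    """
    def as_list(value):
        # A role may hold a single column name or a list of names.
        return [value] if isinstance(value, str) else list(value)

    if 'Feature' in roles:
        # Explicit features: use only those and ignore everything else.
        return as_list(roles['Feature'])

    # Otherwise drop the label, weight and group columns and keep the rest.
    excluded = set()
    for role in ('Label', 'Weight', 'GroupId'):
        if role in roles:
            excluded.update(as_list(roles[role]))
    return [c for c in columns if c not in excluded]


columns = ['education', 'workclass', 'weights', 'y']
print(resolve_feature_columns(columns, {'Label': 'y', 'Weight': 'weights'}))
print(resolve_feature_columns(columns, {'Label': 'y', 'Feature': ['education']}))
```

In the first call no Feature role is given, so every column that is not a label or weight is kept. In the second call only the explicitly listed column is used.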
Roles are especially useful when the modeling data needs to be generated dynamically. The example below creates a column new_y from normalized values of the original y and assigns it as the target variable.
from nimbusml import Role
from nimbusml import Pipeline
from nimbusml.feature_extraction.categorical import OneHotVectorizer
from nimbusml.linear_model import FastLinearRegressor
from nimbusml.preprocessing.normalization import MeanVarianceScaler
import pandas
df = pandas.DataFrame(dict(education=['A', 'B', 'A', 'B', 'A'],
                           workclass=['X', 'X', 'Y', 'Y', 'Y'],
                           y=[1.1, 2.2, 1.24, 3.4, 3.4]))
pipe = Pipeline([
    MeanVarianceScaler() << {'new_y': 'y'},
    OneHotVectorizer() << ['workclass', 'education'],
    FastLinearRegressor() << {Role.Label: 'new_y', Role.Feature: ['workclass', 'education']}
    # Equivalent to << {'Label': 'new_y', 'Feature': ['workclass', 'education']}, no need to import Role class
])
pipe.fit(df)
scores = pipe.predict(df)
The roles can also be set explicitly through arguments to the trainer, instead of using the << operator, as in the example below.
df = pandas.DataFrame(dict(education=['A', 'B', 'A', 'B', 'A'],
                           workclass=['X', 'X', 'Y', 'Y', 'Y'],
                           y=[1.1, 2.2, 1.24, 3.4, 3.4]))
pipe = Pipeline([
    MeanVarianceScaler(columns={'new_y': 'y'}),  # renaming output column
    OneHotVectorizer(columns=['workclass', 'education']),  # keep the same name
    FastLinearRegressor(label='new_y', feature=['workclass', 'education'])
])
pipe.fit(df)
scores = pipe.predict(df)
Most learners can make use of observation weights, which assign an individual weight to each instance in the dataset. The weight is a non-negative real number indicating the importance of that instance relative to the others. The following example shows how to set weights without using the << operator.
from nimbusml.ensemble import FastTreesRegressor

df = pandas.DataFrame(dict(education=['A', 'B', 'A', 'B', 'A'],
                           workclass=['X', 'X', 'Y', 'Y', 'Y'],
                           weights=[1., 1., 1., 2., 1.],
                           y=[1.1, 2.2, 1.24, 3.4, 3.4]))
exp = Pipeline([
    MeanVarianceScaler(columns={'new_y': 'y'}),
    OneHotVectorizer(columns=['workclass', 'education']),
    FastTreesRegressor(feature=['workclass', 'education'], label='new_y', weight='weights')
])
exp.fit(df)
prediction = exp.predict(df)
The weight column can also be indicated to the learner by assigning it a role with the << operator, as follows.
exp = Pipeline([
    MeanVarianceScaler() << {'new_y': 'y'},
    OneHotVectorizer() << ['workclass', 'education'],
    FastTreesRegressor() << {Role.Feature: ['workclass', 'education'], Role.Label: 'new_y', Role.Weight: 'weights'}
    # Equivalent to << {'Feature': ['workclass', 'education'], 'Label': 'new_y', 'Weight': 'weights'}
])
exp.fit(df)
prediction = exp.predict(df)
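To make the effect of observation weights concrete, here is a minimal sketch of a weighted mean squared error in plain Python. It assumes the common convention in which each squared error is multiplied by its weight; it does not describe what FastTreesRegressor does internally, and weighted_mse is a hypothetical helper.

```python
def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error (hypothetical helper for illustration)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    # Each squared error counts in proportion to its weight.
    return sum(w * (t - p) ** 2
               for t, p, w in zip(y_true, y_pred, weights)) / total


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]  # only the last row is mispredicted

uniform = weighted_mse(y_true, y_pred, [1.0, 1.0, 1.0, 1.0])
upweighted = weighted_mse(y_true, y_pred, [1.0, 1.0, 1.0, 2.0])
print(uniform, upweighted)
```

Under this convention, doubling the weight of the mispredicted row has the same effect on the loss as including that row twice, so the error on that row matters more during training.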
Like weights, the group column can be set either with the << operator or with a trainer argument. Rankers need a GroupId to know which rows are ranked against each other. For example, a ranker for a search engine needs a dataset with one row per displayed result; the GroupId tells the learner which results belong to the same query, so that the candidate documents for a single query are grouped together. NimbusML therefore needs features, a target (the relevance label of each result), and a GroupId.
Below is an example of assigning GroupId to a trainer.
from nimbusml.ensemble import LightGbmRanker
from nimbusml.preprocessing import ToKey

df = pandas.DataFrame(dict(education=['A', 'B', 'A', 'B', 'A'],
                           workclass=['X', 'X', 'Y', 'Y', 'Y'],
                           group=[1, 1, 2, 2, 2],
                           y=[1.1, 2.2, 1.24, 3.4, 3.4]))
exp = Pipeline([
    OneHotVectorizer() << ['workclass', 'education'],
    ToKey() << 'group',
    LightGbmRanker(minimum_example_count_per_leaf=1) << {Role.Feature: ['workclass', 'education'], Role.Label: 'y', Role.GroupId: 'group'}
    # Equivalent to LightGbmRanker(minimum_example_count_per_leaf=1) << {'Feature': ['workclass', 'education'], 'Label': 'y', 'GroupId': 'group'}
    # Equivalent to LightGbmRanker(minimum_example_count_per_leaf=1, feature=['workclass', 'education'], label='y', group_id='group')
])
exp.fit(df)
prediction = exp.predict(df)
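The grouping role can be pictured as follows: rows that share a GroupId form one candidate set, and rows are ordered only within their own set. The sketch below groups rows in plain Python using the group and y values from the example above. group_rows is a hypothetical helper and is not part of NimbusML; the sort only illustrates what a per-group ordering looks like, not how LightGbmRanker is trained.

```python
from collections import defaultdict


def group_rows(group_ids, labels):
    """Collect (row_index, label) pairs per group id (hypothetical helper)."""
    groups = defaultdict(list)
    for row, (gid, label) in enumerate(zip(group_ids, labels)):
        groups[gid].append((row, label))
    return dict(groups)


group = [1, 1, 2, 2, 2]
y = [1.1, 2.2, 1.24, 3.4, 3.4]

for gid, rows in group_rows(group, y).items():
    # Within each group, order rows from highest to lowest label.
    ranked = sorted(rows, key=lambda item: item[1], reverse=True)
    print(gid, [row for row, _ in ranked])
```

Rows 0 and 1 are never compared with rows 2 to 4, because they belong to different groups.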