Share via


Column Operations for Transforms

How To Select Columns to Transform

nimbusml is compatible with the scikit-learn convention for column processing in fit(), transform() and fit_transform() methods of trainers and transforms. By default, all columns are transformed equally.

NimbusML additionally provides a syntax to transform only a subset of columns. This is a useful feature for many transforms, especially when the dataset containts columns of mixed types. For example, a dataset with both numeric features and free text features. Similarly for trainers, the concept of Column Roles for Trainers provides a mechanism to select which columns to use as labels and features.

Transform All Columns

By default, the OneHotVectorizer transform will process all columns, which in our example results in a the original column values being replaced by their one hot encodings. Note that the output of OneHotVectorizer are VectorDataViewType, so the output names below are the column names appended with the slot names, which in our example are data driven and generated dynamically from the input data.

Example
import pandas as pd
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data with columns education and workclass
X =  pd.DataFrame(data=dict( edu = ['bs', 'ms', 'phd', 'bs'],
                             wclass= ['food', 'finance','food', 'movie'] ))
xf = OneHotVectorizer()
print( xf.fit_transform(X))

Output:

   edu.bs  edu.ms  edu.phd  wclass.food  wclass.finance  wclass.movie
0     1.0     0.0      0.0          1.0             0.0           0.0
1     0.0     1.0      0.0          0.0             1.0           0.0
2     0.0     0.0      1.0          1.0             0.0           0.0
3     1.0     0.0      0.0          0.0             0.0           1.0

Use << Operator To Select Columns

What if we only want to encode one of the columns? We simply use the << operator to tell the transform to restrict operations to the columns of interest. The << operatator is syntactic sugar for setting the columns argument of the transform.

All transforms in NimbusML have an implicit columns parameter to tell which columns to process, and optionally how to name the output columns, if any. Refer to the reference sections for each transform to see what format is allowed for the columns argument.

Example
import pandas as pd
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data with columns education and workclass
X =  pd.DataFrame(data=dict( edu = ['bs', 'ms', 'phd', 'bs'],
                             wclass= ['food', 'finance','food', 'movie'] ))

# use the << operator to select only edu to encode
xf = OneHotVectorizer() << ['edu']
print(xf.fit_transform(X))

Output:

   edu.bs  edu.ms  edu.phd   wclass
0     1.0     0.0      0.0     food
1     0.0     1.0      0.0  finance
2     0.0     0.0      1.0     food
3     1.0     0.0      0.0    movie

<< and columns= are interchangeable

Let’s see an example of setting the columns argument explicity, to get the same results as using the << operator.

Example
import pandas as pd
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data with columns education and workclass
X =  pd.DataFrame(data=dict( edu = ['bs', 'ms', 'phd', 'bs'],
                             wclass= ['food', 'finance','food', 'movie'] ))

# use `columns=` to do the same thing as `<<`
xf = OneHotVectorizer(columns=['edu'])
print(xf.fit_transform(X))

Output:

   edu.bs  edu.ms  edu.phd   wclass
0     1.0     0.0      0.0     food
1     0.0     1.0      0.0  finance
2     0.0     0.0      1.0     food
3     1.0     0.0      0.0    movie

Renaming Output Columns of Transforms

Transformations are done in place, and therefore values in the original column will be replaced with the updated values. To retain the original input column values, we can specify an optional output column, with a different name than the input column, to store the transformed values.

Some columns may not allow renaming the output columns, so always refer to the reference sections for each transform to see what format is allowed for the columns argument.

In the example below, the original edu column values are preserved, while the encoded values are stored in the new column xyz, with slot name bs, ms and phd.

Example
import pandas as pd
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data with columns education and workclass
X =  pd.DataFrame(data=dict( edu = ['bs', 'ms', 'phd', 'bs'],
                             wclass= ['food', 'finance','food', 'movie'] ))

# let's retain the edu column, and create a
# new output column xyz for the encoded values
xf = OneHotVectorizer(columns={'xyz':'edu'})
print('\n', xf.fit_transform(X))

Output:

    edu   wclass  xyz.bs  xyz.ms  xyz.phd
0   bs     food     1.0     0.0      0.0
1   ms  finance     0.0     1.0      0.0
2  phd     food     0.0     0.0      1.0
3   bs    movie     1.0     0.0      0.0

Column Names in a Pipeline

Within a nimbusml.Pipeline, there can be many transforms, each one modifying column values, creating new columns and potentially deleting columns. The output of each transform affects the data values and schema for the next transform in the pipeline.

In the example below, the original column values of edu are no longer available because they are replaced with the encoded values. However the original values of wclass are still available, because the encoded values are store in A.

Example
import pandas as pd
from nimbusml import Pipeline
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# data with columns education and workclass
X =  pd.DataFrame(data=dict( edu = ['bs', 'ms', 'phd', 'bs'],
                             wclass= ['food', 'finance','food', 'movie'] ))

pipe = Pipeline([
    OneHotVectorizer() << ['edu'],
    OneHotVectorizer() << {'A':'wclass'}
])
print(pipe.fit_transform(X))

Output:

   edu.bs  edu.ms  edu.phd   wclass  A.food  A.finance  A.movie
0     1.0     0.0      0.0     food     1.0        0.0      0.0
1     0.0     1.0      0.0  finance     0.0        1.0      0.0
2     0.0     0.0      1.0     food     1.0        0.0      0.0
3     1.0     0.0      0.0    movie     0.0        0.0      1.0