Handler Class

Replace NaN values in a column with imputed values.

Constructor

Handler(replace_with='DefaultValue', impute_by_slot=True, concat=True, columns=None, **params)

Parameters

Name Description
columns

A dictionary of key-value pairs, where each key is an output column name and each value is the corresponding input column name.

  • Multiple key-value pairs are allowed.

  • Input column type: numeric.

  • Output column type: Vector Type.

  • If the output column names are the same as the input column names, simply specify columns as a list of strings.

The << operator can be used to set this value (see Column Operator).

For example:

  • Handler(columns={'out1':'input1', 'out2':'input2'})

  • Handler() << {'out1':'input1', 'out2':'input2'}

For more details see Columns.
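
The two accepted shapes of columns can be pictured as a single mapping from output name to input name. The sketch below is plain Python rather than nimbusml code; normalize_columns is a hypothetical helper introduced only for illustration, and it relies on the statement above that a list of strings means each output column keeps the name of its input column.

```python
def normalize_columns(columns):
    """Return an {output_name: input_name} dict (hypothetical helper)."""
    if isinstance(columns, dict):
        # Already in output -> input form; return a copy.
        return dict(columns)
    # A list of strings: each output name equals its input name.
    return {name: name for name in columns}


print(normalize_columns({'out1': 'input1', 'out2': 'input2'}))
print(normalize_columns(['input1', 'input2']))
```

Under this reading, the list form is just shorthand for a dictionary whose keys and values coincide.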

replace_with

The method to use to replace NaN values. The following choices are available.

  • Def: Replace with default value of that type, usually 0. If no replace method is specified, this is the default strategy.

  • Mean: Replace NaN values with the mean of the values in that column.
  • Min: Replace with minimum value in the column.
  • Max: Replace with maximum value in the column.
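
As a rough illustration of the four strategies, the following plain-Python sketch applies each one to a single column. It is not the nimbusml implementation: impute is a hypothetical helper, and two details are assumptions made for the sketch, namely that the statistics are computed over the non-missing values only and that a column with no observed values falls back to 0.

```python
import math


def impute(values, replace_with='Def'):
    """Replace NaN entries in a list of floats (hypothetical helper)."""
    if replace_with not in ('Def', 'Mean', 'Min', 'Max'):
        raise ValueError('unknown replace_with: %r' % (replace_with,))
    # Assumption: statistics use the non-missing values only.
    present = [v for v in values if not math.isnan(v)]
    if replace_with == 'Def' or not present:
        # Default value for a numeric type; also the assumed fallback
        # when the column has no observed values at all.
        fill = 0.0
    elif replace_with == 'Mean':
        fill = sum(present) / len(present)
    elif replace_with == 'Min':
        fill = min(present)
    else:  # 'Max'
        fill = max(present)
    return [fill if math.isnan(v) else v for v in values]


petal_length = [float('nan'), 2.5, 2.6, 2.4]
for mode in ('Def', 'Mean', 'Min', 'Max'):
    print(mode, impute(petal_length, mode))
```

Only the missing entry changes in each case; observed values are passed through untouched.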
impute_by_slot

Whether to impute values by slot.
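
One reading of this option, offered as an assumption rather than something stated on this page: for a vector column, imputing by slot computes a separate replacement value for each position (slot) of the vector, while disabling it computes one value for the whole column. The plain-Python sketch below contrasts the two using the mean; mean_fill is a hypothetical helper, not a nimbusml function.

```python
import math


def mean_fill(rows, by_slot=True):
    """Fill NaNs in equal-length vectors with a mean (hypothetical helper)."""
    width = len(rows[0])

    def mean(xs):
        present = [x for x in xs if not math.isnan(x)]
        return sum(present) / len(present) if present else 0.0

    if by_slot:
        # One replacement value per slot (vector position).
        fills = [mean([row[i] for row in rows]) for i in range(width)]
    else:
        # One replacement value shared by every slot.
        overall = mean([x for row in rows for x in row])
        fills = [overall] * width
    return [[fills[i] if math.isnan(x) else x for i, x in enumerate(row)]
            for row in rows]


nan = float('nan')
rows = [[1.0, 10.0], [nan, 30.0], [3.0, nan]]
print(mean_fill(rows, by_slot=True))   # [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(mean_fill(rows, by_slot=False))  # [[1.0, 10.0], [11.0, 30.0], [3.0, 11.0]]
```

The difference matters when slots have very different scales, as in the second call, where the shared mean of 11.0 fits neither slot well.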

concat

Whether or not to concatenate an indicator vector column to the value column.

params

Additional arguments sent to compute engine.

Examples


   ###############################################################################
   # Handler
   import numpy as np
   import pandas as pd
   from nimbusml import FileDataStream
   from nimbusml.preprocessing.missing_values import Handler

   with_nans = pd.DataFrame(
       data=dict(
           Sepal_Length=[2.5, np.nan, 2.1, 1.0],
           Sepal_Width=[.75, .9, .8, .76],
           Petal_Length=[np.nan, 2.5, 2.6, 2.4],
           Petal_Width=[.8, .7, .9, 0.7],
           Species=["setosa", "virginica", "", 'versicolor']))

   # write NaNs to a file to show how this transform works
   tmpfile = 'tmpfile_with_nans.csv'
   with_nans.to_csv(tmpfile, index=False)

   data = FileDataStream.read_csv(tmpfile, sep=',', numeric_dtype=np.float32)

   # transform usage
   xf = Handler(columns={'PL': 'Petal_Length'})

   # fit and transform
   features = xf.fit_transform(data)

   # print features
   print(features.head())

   #   PL.IsMissing.Petal_Length  PL.Petal_Length  Petal_Length  Petal_Width  ...
   # 0                        1.0              0.0           NaN          0.8  ...
   # 1                        0.0              2.5           2.5          0.7  ...
   # 2                        0.0              2.6           2.6          0.9  ...
   # 3                        0.0              2.4           2.4          0.7  ...

Remarks

Handler is a combination of Filter and Indicator. It creates two columns: one containing the imputed values, as specified by the replace_with argument, and a second containing indicator values that mark which row entries were imputed. It works on columns of numeric type.
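
The two-column result can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour described above, not the library's code; handle_missing is a hypothetical helper, and the fill value of 0.0 corresponds to the default replacement strategy.

```python
import math


def handle_missing(values, fill=0.0):
    """Return (indicator, imputed) lists for one numeric column (hypothetical)."""
    # 1.0 marks an entry that was missing and has been imputed.
    indicator = [1.0 if math.isnan(v) else 0.0 for v in values]
    imputed = [fill if math.isnan(v) else v for v in values]
    return indicator, imputed


petal_length = [float('nan'), 2.5, 2.6, 2.4]
is_missing, imputed = handle_missing(petal_length)
print(is_missing)  # [1.0, 0.0, 0.0, 0.0]
print(imputed)     # [0.0, 2.5, 2.6, 2.4]
```

These two lists line up with the PL.IsMissing.Petal_Length and PL.Petal_Length columns in the example output above.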

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name Description
deep
Default value: False