Data transformations

2022-09-16

Data transformations are used to:

Prepare data for model training.
Apply an imported model in TensorFlow or ONNX format.
Post-process data after it has been passed through a model.

The transformations in this guide return classes that implement the IEstimator interface. Data transformations can be chained together. Each transformation both expects and produces data of specific types and formats, which are specified in the linked reference documentation.
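The chaining idea can be illustrated with a small, language-neutral sketch. It is written in plain Python and is not the ML.NET API: the `Tokenize`, `IndexTokens`, and `Chain` classes are hypothetical names invented for this example. It shows two steps where the output type of the first (lists of strings) matches the input type the second expects.

```python
class Tokenize:
    """Hypothetical stateless step: expects str rows, produces list[str] rows."""

    def fit(self, rows):
        return self  # nothing to learn from the data

    def transform(self, rows):
        return [row.lower().split() for row in rows]


class IndexTokens:
    """Hypothetical trainable step: expects list[str] rows, produces list[int] rows."""

    def fit(self, rows):
        self.index = {}
        for tokens in rows:
            for token in tokens:
                # Assign 1, 2, 3, ... in first-seen order.
                self.index.setdefault(token, len(self.index) + 1)
        return self

    def transform(self, rows):
        # 0 marks tokens that were not seen during fit (a choice made for this sketch).
        return [[self.index.get(token, 0) for token in tokens] for tokens in rows]


class Chain:
    """Feeds the output of each step into the next one."""

    def __init__(self, *steps):
        self.steps = steps

    def fit(self, rows):
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows


chain = Chain(Tokenize(), IndexTokens()).fit(["Red green", "green blue"])
encoded = chain.transform(["blue red", "green pink"])
print(encoded)
```

Swapping the two steps would fail, because `IndexTokens` expects token lists rather than raw strings. That is the kind of type and format constraint the reference documentation spells out for each transformation.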

Some data transformations require training data to calculate their parameters. For example, the NormalizeMeanVariance transformer calculates the mean and variance of the training data during the Fit() operation, and uses those parameters in the Transform() operation.
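A minimal sketch of this fit-then-transform pattern, in plain Python rather than the ML.NET API. The class name `MeanVarianceNormalizer` is hypothetical. The sketch divides by the standard deviation (the square root of the stored variance), which is the common z-score convention and an assumption here; the exact scaling that ML.NET applies is described in its reference documentation.

```python
from math import sqrt
from statistics import fmean, pvariance


class MeanVarianceNormalizer:
    """Hypothetical stand-in for a trainable transformation."""

    def fit(self, training_values):
        # Parameters are learned from the training data only.
        self.mean = fmean(training_values)
        self.variance = pvariance(training_values)
        return self

    def transform(self, values):
        # The stored parameters are reused; nothing is recomputed from `values`.
        # Assumes the training data was not constant (variance > 0).
        std = sqrt(self.variance)
        return [(value - self.mean) / std for value in values]


normalizer = MeanVarianceNormalizer().fit([2.0, 4.0, 6.0, 8.0])
scaled = normalizer.transform([5.0, 9.0])
print(normalizer.mean, normalizer.variance, scaled)
```

The values passed to `transform` do not have to be the training values. New data is scaled with the mean and variance captured during `fit`.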

Other data transformations don't require training data. For example, the ConvertToGrayscale transformation can perform the Transform() operation without having seen any training data during the Fit() operation.

Column mapping and grouping

Transform	Definition	ONNX Exportable
Concatenate	Concatenate one or more input columns into a new output column	Yes
CopyColumns	Copy and rename one or more input columns	Yes
DropColumns	Drop one or more input columns	Yes
SelectColumns	Select one or more columns to keep from the input data	Yes

Normalization and scaling

Transform	Definition	ONNX Exportable
NormalizeMeanVariance	Subtract the mean (of the training data) and divide by the variance (of the training data)	Yes
NormalizeLogMeanVariance	Normalize based on the logarithm of the training data	Yes
NormalizeLpNorm	Scale input vectors by their lp-norm, where p is 1, 2 or infinity. Defaults to the l2 (Euclidean distance) norm	Yes
NormalizeGlobalContrast	Scale each value in a row by subtracting the mean of the row data, dividing by either the standard deviation or the l2-norm (of the row data), and multiplying by a configurable scale factor (default 2)	Yes
NormalizeBinning	Assign the input value to a bin index and divide by the number of bins to produce a float value between 0 and 1. The bin boundaries are calculated to evenly distribute the training data across bins	Yes
NormalizeSupervisedBinning	Assign the input value to a bin based on its correlation with the label column	Yes
NormalizeMinMax	Scale the input by the difference between the minimum and maximum values in the training data	Yes
NormalizeRobustScaling	Scale each value using statistics that are robust to outliers: center the data around 0 and scale it according to the quantile range	Yes
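
To make one row of the table above concrete, the following plain-Python sketch shows the arithmetic behind scaling a vector by its lp-norm for p = 1, 2, or infinity. The function name `lp_normalize` and the handling of all-zero vectors are assumptions made for this example, not ML.NET behavior.

```python
import math


def lp_normalize(vector, p=2):
    """Scale `vector` so that its lp-norm is 1 (p may be 1, 2, or math.inf)."""
    if p == math.inf:
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1 / p)
    if norm == 0:
        # Assumption for this sketch: leave an all-zero vector unchanged.
        return list(vector)
    return [x / norm for x in vector]


row = [3.0, 4.0]
l2_scaled = lp_normalize(row)             # l2 norm of the row is 5
l1_scaled = lp_normalize(row, p=1)        # l1 norm of the row is 7
linf_scaled = lp_normalize(row, p=math.inf)  # infinity norm of the row is 4
print(l2_scaled, l1_scaled, linf_scaled)
```

Unlike the mean and variance normalizers, this calculation uses only the values in the row being transformed.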

Conversions between data types

Transform	Definition	ONNX Exportable
ConvertType	Convert the type of an input column to a new type	Yes
MapValue	Map values to keys (categories) based on the supplied dictionary of mappings	No
MapValueToKey	Map values to keys (categories) by creating the mapping from the input data	Yes
MapKeyToValue	Convert keys back to their original values	Yes
MapKeyToVector	Convert keys back to vectors of original values	Yes
MapKeyToBinaryVector	Convert keys back to a binary vector of original values	No
Hash	Hash the value in the input column	Yes

Text transformations

Transform	Definition	ONNX Exportable
FeaturizeText	Transform a text column into a float array of normalized ngram and char-gram counts	No
TokenizeIntoWords	Split one or more text columns into individual words	Yes
TokenizeIntoCharactersAsKeys	Split one or more text columns into individual characters	Yes
NormalizeText	Change case, remove diacritical marks, punctuation marks, and numbers	Yes
ProduceNgrams	Transform text column into a bag of counts of ngrams (sequences of consecutive words)	Yes
ProduceWordBags	Transform text column into a vector of ngram counts	Yes
ProduceHashedNgrams	Transform text column into a vector of hashed ngram counts	No
ProduceHashedWordBags	Transform text column into a bag of hashed ngram counts	Yes
RemoveDefaultStopWords	Remove default stop words for the specified language from input columns	Yes
RemoveStopWords	Remove specified stop words from input columns	Yes
LatentDirichletAllocation	Transform a document (represented as a vector of floats) into a vector of floats over a set of topics	Yes
ApplyWordEmbedding	Convert vectors of text tokens into sentence vectors using a pretrained model	Yes

Image transformations

Transform	Definition	ONNX Exportable
ConvertToGrayscale	Convert an image to grayscale	No
ConvertToImage	Convert a vector of pixels to ImageDataViewType	No
ExtractPixels	Convert pixels from input image into a vector of numbers	No
LoadImages	Load images from a folder into memory	No
LoadRawImageBytes	Load images as raw bytes into a new column	No
ResizeImages	Resize images	No
DnnFeaturizeImage	Apply a pretrained deep neural network (DNN) model to transform an input image into a feature vector	No

Categorical data transformations

Transform	Definition	ONNX Exportable
OneHotEncoding	Convert one or more text columns into one-hot encoded vectors	Yes
OneHotHashEncoding	Convert one or more text columns into hash-based one-hot encoded vectors	No
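
The idea behind one-hot encoding can be sketched in a few lines of plain Python. This is an illustration only, not the ML.NET implementation: the helper names `fit_categories` and `one_hot`, and the choice to encode unseen values as all zeros, are assumptions of this sketch.

```python
def fit_categories(training_values):
    """Collect the distinct categories in first-seen order."""
    return list(dict.fromkeys(training_values))


def one_hot(value, categories):
    """Return a vector with 1.0 in the slot of `value` and 0.0 elsewhere."""
    # A value not seen during fitting yields an all-zero vector in this sketch.
    return [1.0 if value == category else 0.0 for category in categories]


categories = fit_categories(["red", "green", "red", "blue"])
encoded_green = one_hot("green", categories)
encoded_unseen = one_hot("pink", categories)
print(categories, encoded_green, encoded_unseen)
```

The vector length equals the number of distinct categories found in the training data, which is why this kind of transformation needs a fitting step.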

Time series data transformations

Transform	Definition	ONNX Exportable
DetectAnomalyBySrCnn	Detect anomalies in the input time series data using the Spectral Residual (SR) algorithm	No
DetectChangePointBySsa	Detect change points in time series data using singular spectrum analysis (SSA)	No
DetectIidChangePoint	Detect change points in independent and identically distributed (IID) time series data using adaptive kernel density estimations and martingale scores	No
ForecastBySsa	Forecast time series data using singular spectrum analysis (SSA)	No
DetectSpikeBySsa	Detect spikes in time series data using singular spectrum analysis (SSA)	No
DetectIidSpike	Detect spikes in independent and identically distributed (IID) time series data using adaptive kernel density estimations and martingale scores	No
DetectEntireAnomalyBySrCnn	Detect anomalies for the entire input data using the SRCNN algorithm	No
DetectSeasonality	Detect seasonality using Fourier analysis	No
LocalizeRootCause	Localize the root cause from time series input using a decision tree algorithm	No
LocalizeRootCauses	Localize root causes from time series input	No

Missing values

Transform	Definition	ONNX Exportable
IndicateMissingValues	Create a new boolean output column, the value of which is true when the value in the input column is missing	Yes
ReplaceMissingValues	Create a new output column, the value of which is set to a default value if the value is missing from the input column, and the input value otherwise	Yes
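
The two rows above can be illustrated with a short plain-Python sketch. It assumes that a missing floating-point value is represented as NaN and that the default replacement is 0.0. Both are assumptions made for this example, and the function names are hypothetical rather than ML.NET APIs.

```python
import math


def indicate_missing(values):
    """Return a boolean column that is True where the input value is missing."""
    return [math.isnan(value) for value in values]


def replace_missing(values, default=0.0):
    """Return a column with missing values replaced by `default`."""
    return [default if math.isnan(value) else value for value in values]


column = [1.5, float("nan"), 3.0]
flags = indicate_missing(column)
filled = replace_missing(column)
print(flags, filled)
```

Both functions produce a new column and leave the input untouched, so the indicator column can be kept alongside the filled column as an extra feature.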

Feature selection

Transform	Definition	ONNX Exportable
SelectFeaturesBasedOnCount	Select features whose count of non-default values is greater than a threshold	Yes
SelectFeaturesBasedOnMutualInformation	Select the features on which the data in the label column is most dependent	Yes

Feature transformations

Transform	Definition	ONNX Exportable
ApproximatedKernelMap	Map each input vector onto a lower dimensional feature space, where inner products approximate a kernel function, so that the features can be used as inputs to the linear algorithms	No
ProjectToPrincipalComponents	Reduce the dimensions of the input feature vector by applying the Principal Component Analysis algorithm

Explainability transformations

Transform	Definition	ONNX Exportable
CalculateFeatureContribution	Calculate contribution scores for each element of a feature vector	No

Calibration transformations

Transform	Definition	ONNX Exportable
Platt(String, String, String)	Transform a binary classifier raw score into a class probability using logistic regression with parameters estimated using the training data	Yes
Platt(Double, Double, String)	Transform a binary classifier raw score into a class probability using logistic regression with fixed parameters	Yes
Naive	Transform a binary classifier raw score into a class probability by assigning scores to bins, and calculating the probability based on the distribution among the bins	Yes
Isotonic	Transform a binary classifier raw score into a class probability by assigning scores to bins, where the position of boundaries and the size of bins are estimated using the training data	No

Deep learning transformations

Transform	Definition	ONNX Exportable
ApplyOnnxModel	Transform the input data with an imported ONNX model	No
LoadTensorFlowModel	Transform the input data with an imported TensorFlow model	No

Custom transformations

Transform	Definition	ONNX Exportable
FilterByCustomPredicate	Drop rows where a specified predicate returns true	No
FilterByStatefulCustomPredicate	Drop rows where a specified predicate returns true, but allow for a specified state	No
CustomMapping	Transform existing columns into new ones with a user-defined mapping	No
Expression	Apply an expression to transform columns into new ones	No
