Data transformations

Data transformations are used to:

  • Prepare data for model training.
  • Apply an imported model in TensorFlow or ONNX format.
  • Post-process data after it has been passed through a model.

The transformations in this guide return classes that implement the IEstimator interface. Data transformations can be chained together. Each transformation both expects and produces data of specific types and formats, which are specified in the linked reference documentation.
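
The chaining idea can be sketched in a few lines of Python. This is a conceptual illustration only, not ML.NET code: `Chain`, `CopyColumn`, and `SubtractMean` are hypothetical names, and rows are modeled as plain dictionaries. In ML.NET itself, estimators are chained with `Append`.

```python
class CopyColumn:
    """Hypothetical estimator that needs no training data."""

    def __init__(self, dst, src):
        self.dst, self.src = dst, src

    def fit(self, rows):
        # Nothing to learn; return the transformation directly.
        return lambda data: [{**r, self.dst: r[self.src]} for r in data]


class SubtractMean:
    """Hypothetical estimator that learns one parameter (the mean)."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        mean = sum(r[self.column] for r in rows) / len(rows)
        return lambda data: [{**r, self.column: r[self.column] - mean} for r in data]


class Chain:
    """Hypothetical chained estimator: each step is fitted on the previous step's output."""

    def __init__(self, *estimators):
        self.estimators = estimators

    def fit(self, rows):
        fitted = []
        for estimator in self.estimators:
            transform = estimator.fit(rows)
            rows = transform(rows)  # the next step sees the transformed data
            fitted.append(transform)

        def apply_all(data):
            for transform in fitted:
                data = transform(data)
            return data

        return apply_all


train = [{"x": 1.0}, {"x": 3.0}]
model = Chain(CopyColumn("x_centered", "x"), SubtractMean("x_centered")).fit(train)
result = model([{"x": 5.0}])
```

Each step declares the column it reads and the column it writes, which is why the order of the chain matters: `SubtractMean` can only be fitted once `CopyColumn` has produced `x_centered`.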

Some data transformations require training data to calculate their parameters. For example, the NormalizeMeanVariance transformer calculates the mean and variance of the training data during the Fit() operation and uses those parameters in the Transform() operation.

Other data transformations don't require training data. For example, the ConvertToGrayscale transformation can perform the Transform() operation without having seen any training data during the Fit() operation.
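
The difference between the two kinds of transformation can be sketched in Python. This is an illustration under stated assumptions, not ML.NET's implementation: the class and function names are hypothetical, the normalizer scales by the standard deviation (the textbook convention), and the grayscale weights are the common luma weights rather than values taken from the library.

```python
import math


class MeanVarianceNormalizer:
    """Hypothetical trained transformation: parameters are learned in fit()."""

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        # Population variance of the training data.
        self.variance = sum((v - self.mean) ** 2 for v in values) / n
        return self

    def transform(self, values):
        # Assumption: scale by the standard deviation (square root of the variance).
        std = math.sqrt(self.variance)
        return [(v - self.mean) / std for v in values]


def to_grayscale(pixel):
    """Stateless transformation: nothing to learn, so no fit step is needed."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b  # assumed luma weights


normalizer = MeanVarianceNormalizer().fit([2.0, 4.0, 6.0])
normalized = normalizer.transform([4.0, 8.0])  # uses the stored mean and variance
gray = to_grayscale((255, 255, 255))           # works without any training data
```

Calling `transform` on the normalizer before `fit` would fail because the parameters do not exist yet, whereas `to_grayscale` can be applied to any pixel at any time.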

Column mapping and grouping

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| Concatenate | Concatenate one or more input columns into a new output column | Yes |
| CopyColumns | Copy and rename one or more input columns | Yes |
| DropColumns | Drop one or more input columns | Yes |
| SelectColumns | Select one or more columns to keep from the input data | Yes |

Normalization and scaling

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| NormalizeMeanVariance | Subtract the mean (of the training data) and divide by the variance (of the training data) | Yes |
| NormalizeLogMeanVariance | Normalize based on the logarithm of the training data | Yes |
| NormalizeLpNorm | Scale input vectors by their lp-norm, where p is 1, 2, or infinity. Defaults to the l2 (Euclidean distance) norm | Yes |
| NormalizeGlobalContrast | Scale each value in a row by subtracting the mean of the row data, dividing by either the standard deviation or the l2-norm of the row data, and multiplying by a configurable scale factor (default 2) | Yes |
| NormalizeBinning | Assign the input value to a bin index and divide by the number of bins to produce a float value between 0 and 1. The bin boundaries are calculated to evenly distribute the training data across bins | Yes |
| NormalizeSupervisedBinning | Assign the input value to a bin based on its correlation with the label column | Yes |
| NormalizeMinMax | Scale the input by the difference between the minimum and maximum values in the training data | Yes |
| NormalizeRobustScaling | Scale each value using statistics that are robust to outliers: center the data around 0 and scale it according to the quantile range | Yes |
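
As a worked example of one row of this table, the following Python sketch scales a vector by its lp-norm for p equal to 1, 2, or infinity. The function name `lp_normalize` is hypothetical, and leaving an all-zero vector unchanged is an assumption made here, not documented library behavior.

```python
import math


def lp_normalize(vector, p=2):
    """Scale a vector by its lp-norm; p may be 1, 2, or math.inf."""
    if p == math.inf:
        norm = max(abs(v) for v in vector)
    else:
        norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # assumption: an all-zero vector is left unchanged
    return [v / norm for v in vector]


l2 = lp_normalize([3.0, 4.0])              # norm is 5
l1 = lp_normalize([3.0, 4.0], p=1)         # norm is 7
linf = lp_normalize([3.0, 4.0], p=math.inf)  # norm is 4
```

After l2 normalization the vector has Euclidean length 1, after l1 normalization its absolute values sum to 1, and after infinity-norm normalization its largest absolute value is 1.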

Conversions between data types

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ConvertType | Convert the type of an input column to a new type | Yes |
| MapValue | Map values to keys (categories) based on the supplied dictionary of mappings | No |
| MapValueToKey | Map values to keys (categories) by creating the mapping from the input data | Yes |
| MapKeyToValue | Convert keys back to their original values | Yes |
| MapKeyToVector | Convert keys back to vectors of original values | Yes |
| MapKeyToBinaryVector | Convert keys back to a binary vector of original values | No |
| Hash | Hash the value in the input column | Yes |
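
The round trip between values and keys can be sketched as follows. The function names are hypothetical, and the numbering convention (keys start at 1, with 0 standing for a value that was not seen during fitting) is an assumption of this sketch; check the reference documentation for the library's exact key semantics.

```python
def fit_value_to_key(values):
    """Build a value -> key mapping from the data, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # assumption: keys start at 1
    return mapping


def map_value_to_key(values, mapping):
    # Assumption: 0 marks a value that was not present when the mapping was built.
    return [mapping.get(value, 0) for value in values]


def map_key_to_value(keys, mapping):
    inverse = {key: value for value, key in mapping.items()}
    return [inverse.get(key) for key in keys]


colors = ["red", "green", "red", "blue"]
mapping = fit_value_to_key(colors)
keys = map_value_to_key(colors, mapping)
unseen = map_value_to_key(["green", "pink"], mapping)
restored = map_key_to_value(keys, mapping)
```

Because the mapping is created from the input data, it must be built on the training set and reused unchanged on new data; otherwise the same category could receive different keys.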

Text transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| FeaturizeText | Transform a text column into a float array of normalized ngram and char-gram counts | No |
| TokenizeIntoWords | Split one or more text columns into individual words | Yes |
| TokenizeIntoCharactersAsKeys | Split one or more text columns into individual characters | Yes |
| NormalizeText | Change case, remove diacritical marks, punctuation marks, and numbers | Yes |
| ProduceNgrams | Transform text column into a bag of counts of ngrams (sequences of consecutive words) | Yes |
| ProduceWordBags | Transform text column into a vector of ngram counts (a bag of ngrams) | Yes |
| ProduceHashedNgrams | Transform text column into a vector of hashed ngram counts | No |
| ProduceHashedWordBags | Transform text column into a bag of hashed ngram counts | Yes |
| RemoveDefaultStopWords | Remove default stop words for the specified language from input columns | Yes |
| RemoveStopWords | Remove specified stop words from input columns | Yes |
| LatentDirichletAllocation | Transform a document (represented as a vector of floats) into a vector of floats over a set of topics | Yes |
| ApplyWordEmbedding | Convert vectors of text tokens into sentence vectors using a pre-trained model | Yes |
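
To make "a bag of counts of ngrams" concrete, here is a small Python sketch that tokenizes text into words and counts sequences of consecutive words. The function names are hypothetical, splitting on whitespace is an assumption, and the sketch counts ngrams of one fixed length only, which may differ from the library's defaults.

```python
from collections import Counter


def tokenize_into_words(text):
    """Split text on whitespace (assumed separator for this sketch)."""
    return text.split()


def count_ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    runs = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(run) for run in runs)


tokens = tokenize_into_words("the cat saw the cat")
bigrams = count_ngrams(tokens, n=2)
unigrams = count_ngrams(tokens, n=1)
```

A text shorter than `n` tokens yields an empty counter, since it contains no complete ngram.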

Image transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ConvertToGrayscale | Convert an image to grayscale | No |
| ConvertToImage | Convert a vector of pixels to ImageDataViewType | No |
| ExtractPixels | Convert pixels from input image into a vector of numbers | No |
| LoadImages | Load images from a folder into memory | No |
| LoadRawImageBytes | Load images as raw bytes into a new column | No |
| ResizeImages | Resize images | No |
| DnnFeaturizeImage | Apply a pre-trained deep neural network (DNN) model to transform an input image into a feature vector | No |

Categorical data transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| OneHotEncoding | Convert one or more text columns into one-hot encoded vectors | Yes |
| OneHotHashEncoding | Convert one or more text columns into hash-based one-hot encoded vectors | No |
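
One-hot encoding can be sketched in Python as follows. The function names are hypothetical, and encoding an unseen category as an all-zero vector is an assumption of this sketch.

```python
def fit_categories(values):
    """Collect the distinct categories in order of first appearance."""
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)
    return categories


def one_hot(value, categories):
    """Return a vector with a single 1.0 in the slot of the matching category."""
    return [1.0 if value == category else 0.0 for category in categories]


categories = fit_categories(["cat", "dog", "cat", "bird"])
encoded_dog = one_hot("dog", categories)
encoded_unknown = one_hot("fish", categories)  # assumption: unseen -> all zeros
```

The output vector has one slot per category seen during fitting, so its length grows with the number of distinct values; the hash-based variant instead fixes the length in advance by hashing categories into slots.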

Time series data transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| DetectAnomalyBySrCnn | Detect anomalies in the input time series data using the Spectral Residual (SR) algorithm | No |
| DetectChangePointBySsa | Detect change points in time series data using singular spectrum analysis (SSA) | No |
| DetectIidChangePoint | Detect change points in independent and identically distributed (IID) time series data using adaptive kernel density estimations and martingale scores | No |
| ForecastBySsa | Forecast time series data using singular spectrum analysis (SSA) | No |
| DetectSpikeBySsa | Detect spikes in time series data using singular spectrum analysis (SSA) | No |
| DetectIidSpike | Detect spikes in independent and identically distributed (IID) time series data using adaptive kernel density estimations and martingale scores | No |
| DetectEntireAnomalyBySrCnn | Detect anomalies for the entire input data using the SRCNN algorithm | No |
| DetectSeasonality | Detect seasonality using Fourier analysis | No |
| LocalizeRootCause | Localize the root cause from time series input using a decision tree algorithm | No |
| LocalizeRootCauses | Localize root causes from time series input | No |

Missing values

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| IndicateMissingValues | Create a new boolean output column, the value of which is true when the value in the input column is missing | Yes |
| ReplaceMissingValues | Create a new output column, the value of which is set to a default value if the value is missing from the input column, and the input value otherwise | Yes |
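
Both missing-value transformations can be sketched in a few lines of Python. The function names are hypothetical, and this sketch assumes that a missing floating-point value is represented as NaN and that the default replacement is 0.

```python
import math


def indicate_missing(values):
    """Return True for every value that is missing (assumed to be NaN)."""
    return [math.isnan(v) for v in values]


def replace_missing(values, replacement=0.0):
    """Return the replacement where the value is missing, the input value otherwise."""
    return [replacement if math.isnan(v) else v for v in values]


readings = [1.0, float("nan"), 3.0]
missing_flags = indicate_missing(readings)
filled = replace_missing(readings)
```

Using both together keeps the information that a value was missing (in the boolean column) even after the gap has been filled.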

Feature selection

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| SelectFeaturesBasedOnCount | Select features whose non-default values are greater than a threshold | Yes |
| SelectFeaturesBasedOnMutualInformation | Select the features on which the data in the label column is most dependent | Yes |

Feature transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ApproximatedKernelMap | Map each input vector onto a lower dimensional feature space, where inner products approximate a kernel function, so that the features can be used as inputs to the linear algorithms | No |
| ProjectToPrincipalComponents | Reduce the dimensions of the input feature vector by applying the Principal Component Analysis algorithm | |

Explainability transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| CalculateFeatureContribution | Calculate contribution scores for each element of a feature vector | No |

Calibration transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| Platt(String, String, String) | Transforms a binary classifier raw score into a class probability using logistic regression with parameters estimated using the training data | Yes |
| Platt(Double, Double, String) | Transforms a binary classifier raw score into a class probability using logistic regression with fixed parameters | Yes |
| Naive | Transforms a binary classifier raw score into a class probability by assigning scores to bins, and calculating the probability based on the distribution among the bins | Yes |
| Isotonic | Transforms a binary classifier raw score into a class probability by assigning scores to bins, where the position of boundaries and the size of bins are estimated using the training data | No |
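
Platt calibration with fixed parameters amounts to passing the raw score through a logistic function. The Python sketch below is illustrative: the function name `platt_probability` is hypothetical, and the sign convention (probability = 1 / (1 + exp(slope * score + offset))) is an assumption that should be checked against the reference documentation before relying on it.

```python
import math


def platt_probability(score, slope, offset):
    """Map a raw classifier score to a probability with a fixed logistic curve.

    Assumed convention: p = 1 / (1 + exp(slope * score + offset)).
    """
    return 1.0 / (1.0 + math.exp(slope * score + offset))


# With a negative slope, larger scores give larger probabilities.
p_zero = platt_probability(0.0, slope=-1.0, offset=0.0)
p_high = platt_probability(2.0, slope=-1.0, offset=0.0)
p_low = platt_probability(-2.0, slope=-1.0, offset=0.0)
```

The trained variant fits `slope` and `offset` to the training data with logistic regression instead of taking them as arguments; the mapping applied afterwards has the same form.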

Deep learning transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ApplyOnnxModel | Transform the input data with an imported ONNX model | No |
| LoadTensorFlowModel | Transform the input data with an imported TensorFlow model | No |

Custom transformations

| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| FilterByCustomPredicate | Drops rows where a specified predicate returns true | No |
| FilterByStatefulCustomPredicate | Drops rows where a specified predicate returns true, but allows for a specified state | No |
| CustomMapping | Transform existing columns onto new ones with a user-defined mapping | No |
| Expression | Apply an expression to transform columns into new ones | No |
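
Note the direction of the filter: rows for which the predicate returns true are dropped, not kept. A minimal Python sketch of that behavior and of a user-defined column mapping follows; the function names are hypothetical and rows are modeled as dictionaries.

```python
def filter_by_custom_predicate(rows, predicate):
    """Drop every row for which the predicate returns True."""
    return [row for row in rows if not predicate(row)]


def custom_mapping(rows, mapping):
    """Add the columns produced by a user-defined function to each row."""
    return [{**row, **mapping(row)} for row in rows]


orders = [{"price": 10.0, "qty": 2}, {"price": -1.0, "qty": 5}]

# The predicate describes the rows to remove: here, rows with a negative price.
kept = filter_by_custom_predicate(orders, lambda r: r["price"] < 0)
with_total = custom_mapping(kept, lambda r: {"total": r["price"] * r["qty"]})
```

Writing the predicate as "what to remove" rather than "what to keep" is the easiest mistake to make with this kind of filter, so it helps to name the predicate after the rows it rejects.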