Replace Discrete Values

2019-05-06

Important

Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.

Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.

See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
Learn more about Azure Machine Learning.

ML Studio (classic) documentation is being retired and may not be updated in the future.

Replaces discrete values from one column with numeric values based on another column

Category: Statistical Functions

Note

Applies to: Machine Learning Studio (classic) only

Similar drag-and-drop modules are available in Azure Machine Learning designer.

Module overview

This article describes how to use the Replace Discrete Values module in Machine Learning Studio (classic) to generate a probability score that can be used to represent a discrete value. This score can be useful for understanding the information value of the discrete values.

How it works:

You select a column that contains the discrete (or categorical) value, and then select another column to use for reference.

Depending on whether the second column is categorical or non-categorical, the module computes one of the following values:

The conditional probability for the second column given the values in the first column.
The mean and standard deviation for each group of values in the first column.

The module outputs both a dataset with the scores, and a function that you can save and apply to other datasets.
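
As a rough illustration of the idea of a reusable transform, the following Python sketch learns a lookup table from one dataset and applies it to another. It is not the module's implementation: the function names `fit_replacement` and `apply_replacement` are hypothetical, only the numeric case (mean and population standard deviation) is shown, and the handling of unseen values is an assumption.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def fit_replacement(discrete, numeric):
    """Learn a lookup table: discrete value -> (mean, population std dev)."""
    groups = defaultdict(list)
    for key, value in zip(discrete, numeric):
        groups[key].append(value)
    return {key: (fmean(vals), pstdev(vals)) for key, vals in groups.items()}


def apply_replacement(lookup, discrete):
    """Replace each discrete value with the scores learned for it.

    Values that were not seen while fitting map to (None, None). That choice
    is an assumption of this sketch, not documented module behavior.
    """
    return [lookup.get(key, (None, None)) for key in discrete]


# Fit on one dataset, then reuse the learned table on another.
lookup = fit_replacement(["a", "a", "b"], [1.0, 3.0, 10.0])
scored = apply_replacement(lookup, ["b", "a", "c"])
print(scored)
```

Keeping the learned table separate from the data it was computed on is what makes it possible to score a second dataset with the statistics of the first.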

How to configure Replace Discrete Values

Tip

We recommend working with only one pair of columns at a time. The module does not raise an error if you select multiple columns to analyze. However, in practice, if you choose multiple columns, they are matched by an internal heuristic, not by order of selection.

Therefore, we recommend that you select a single pair of columns each time, one for Discrete columns and one for Replacement columns.

If you need to generate scores for multiple columns, use separate instances of Replace Discrete Values.

Add the Replace Discrete Values module to your experiment. You can find this module in the Statistical Functions group in the experiment items list in Machine Learning Studio (classic).
Connect a dataset containing at least one column of categorical data.
Discrete columns: Click Launch column selector to choose a column that contains discrete (or categorical) values.

Any discrete columns that you select must be categorical. If you get an error, use the Edit Metadata module to change the column type.
Replacement columns: Click Launch column selector to choose the column that contains the values to use in computing a replacement score.

If you select multiple columns for Discrete columns, you must choose an equal number of replacement columns.
Run the experiment.

Note

You cannot choose which statistical function to apply. The module calculates an appropriate measure, based on the data type of the column selected for Replacement columns.

Results

The module computes one of the following values for each pair of columns:

If the second column contains categorical values, the module computes the conditional probability of the second column, given the values in the first column.

For example, assume you chose occupation from the Census dataset as the discrete column and gender as the replacement column. The output of the module would be:

P(gender | occupation)
If the second column contains non-categorical values that can be converted to numbers (such as numeric or Boolean values not marked as categorical), the module outputs the mean and standard deviation for each group of values in the first column.

For example, assume you use occupation as the Discrete column and the other column is the numeric column hours-per-week. The module would output these new values:

Mean(hours-per-week | occupation)

Std-Dev(hours-per-week | occupation)

In addition to the probability scores, the module also outputs a transformed dataset. In this dataset, the column selected for Replacement columns is replaced with a column containing the computed scores.

Tip

The columns in the source dataset are not actually changed or deleted by the operation; the score columns are new ones generated by the module and output instead of the source data.

To view the source values together with the probability scores, use the Add Columns module.

Examples

The usage of Replace Discrete Values can be illustrated by some simple examples.

Example 1 - Replace a categorical value with a probability score

The following table contains a categorical column X, and a column Y with True/False values that are treated as categorical values. When you use Replace Discrete Values, it calculates a conditional probability score for the probability of Y given X, as shown in the third column.

X	Y	P(Y\|X)
Blue	0	`P(Y=0\|X=Blue) = 0.5`
Blue	1	`P(Y=1\|X=Blue) = 0.5`
Green	0	`P(Y=0\|X=Green) = 2/3`
Green	0	`P(Y=0\|X=Green) = 2/3`
Green	1	`P(Y=1\|X=Green) = 1/3`
Red	0	`P(Y=0\|X=Red) = .75`
Red	0	`P(Y=0\|X=Red) = .75`
Red	1	`P(Y=1\|X=Red) = .25`
Red	0	`P(Y=0\|X=Red) = .75`
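
The scores in this table can be reproduced by counting rows. The following Python sketch assumes that the score for a row is the number of rows with the same (X, Y) pair divided by the number of rows with the same X. The function name `conditional_probabilities` is hypothetical, and exact fractions are used so that the results match the table.

```python
from collections import Counter
from fractions import Fraction

# Rows of (X, Y) copied from the table above.
rows = [
    ("Blue", 0), ("Blue", 1),
    ("Green", 0), ("Green", 0), ("Green", 1),
    ("Red", 0), ("Red", 0), ("Red", 1), ("Red", 0),
]


def conditional_probabilities(pairs):
    """Return P(Y=y | X=x) for every row as an exact fraction."""
    x_counts = Counter(x for x, _ in pairs)  # rows per X value
    xy_counts = Counter(pairs)               # rows per (X, Y) combination
    return [Fraction(xy_counts[(x, y)], x_counts[x]) for x, y in pairs]


scores = conditional_probabilities(rows)
for (x, y), p in zip(rows, scores):
    print(f"P(Y={y}|X={x}) = {p}")
```

For each value of X, the scores of the distinct Y values add up to 1 as long as no values are missing.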

Example 2 - Calculate mean and standard deviation based on a noncategorical column

When the second column is numerical, Replace Discrete Values calculates the mean and standard deviation instead of a conditional probability score.

The following example is based on the Auto Prices sample dataset, simplified as follows:

A small subset of columns was selected.
Only the top 30 rows were extracted, by using the Head option of the Partition and Sample module.
The Replace Discrete Values module was used to compute the mean and standard deviation for vehicle curb weight, given the categorical column, num-of-doors.

The following table illustrates the results:

Body	Num-of-doors	Curb-weight	Mean(curb-weight\|num-of-doors)	Std-Dev(curb-weight\|num-of-doors)
std	two	2548	2429.785714	507.45699
std	four	2337	2625.6	493.409877
std	two	2507	2429.785714	507.45699
turbo	four	3086	2625.6	493.409877
std	four	1989	2625.6	493.409877
turbo		2191
std	four	2535	2625.6	493.409877

You can verify the mean for each group of values by using the AVERAGEIF function in Excel.
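
The same check can be sketched in Python with the standard library. This example uses only the rows displayed in the table (leaving out the row with a missing num-of-doors value), so its output differs from the table, which was computed over all 30 rows. It uses the population standard deviation, as described in the technical notes.

```python
from collections import defaultdict
from statistics import fmean, pstdev

# (num-of-doors, curb-weight) pairs for the rows shown in the table above.
# The row with a missing num-of-doors value is left out of this sketch.
rows = [
    ("two", 2548), ("four", 2337), ("two", 2507),
    ("four", 3086), ("four", 1989), ("four", 2535),
]

# Collect the curb weights that belong to each num-of-doors value.
groups = defaultdict(list)
for doors, weight in rows:
    groups[doors].append(weight)

# One (mean, population standard deviation) pair per group.
stats = {doors: (fmean(w), pstdev(w)) for doors, w in groups.items()}

# Every row receives the statistics of its own group.
for doors, weight in rows:
    mean, std = stats[doors]
    print(f"{doors}\t{weight}\t{mean:.2f}\t{std:.2f}")
```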

Example 3 - Handling missing values

This example demonstrates how missing values (nulls) propagate to the results when conditional probability scores are calculated.

If the discrete value column or the calculation lookup column contains any missing values, the missing values are propagated to the new column.
If the discrete value column contains only missing values, the module cannot process the column and an error message appears.

X	Y	P(Y\|X)
1	True	`P(Y=true\|X=1) = 1/2`
1	False	`P(Y=false\|X=1) = 1/2`
2	True	`P(Y=true\|X=2) = 1/3`
2	False	`P(Y=false\|X=2) = 1/3`
2	Null	`P(Y=null\|X=2) = null`
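
The table above can be reproduced with a small Python sketch. It assumes, based on the values shown, that a row with a missing Y still counts toward the number of rows for its X value, while its own score is left as missing (`None` here). The function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Rows of (X, Y) copied from the table above; None stands for a missing value.
rows = [(1, True), (1, False), (2, True), (2, False), (2, None)]


def conditional_probabilities(pairs):
    """Return P(Y=y | X=x) per row, propagating missing values as None."""
    # Assumption: the denominator counts every row of the class,
    # including rows where Y is missing.
    x_counts = Counter(x for x, _ in pairs)
    xy_counts = Counter(pairs)
    scores = []
    for x, y in pairs:
        if x is None or y is None:
            scores.append(None)  # missing input gives a missing score
        else:
            scores.append(Fraction(xy_counts[(x, y)], x_counts[x]))
    return scores


scores = conditional_probabilities(rows)
print(scores)

# The non-missing scores of the distinct Y values for X = 2 sum to 2/3.
total_for_2 = scores[2] + scores[3]
print(total_for_2)
```

Under this assumption the probabilities for X = 2 do not sum to 1, because one third of the rows in that class has no Y value.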

Technical notes

You must ensure that any discrete columns you want to replace are categorical, or the module will return an error. To do this, use the Edit Metadata module.
If the second column contains Boolean values, the True-False values are processed as numeric with FALSE and TRUE equivalent to 0 and 1 respectively.
The formula for the standard deviation column calculates the population standard deviation. Therefore, N is used in the denominator instead of (N - 1).
If the second column contains noncategorical data (numeric or Boolean values), the module computes the mean and standard deviation of Y for the given value of X.

That is, for each row in the dataset indexed by i:

Mean(Y|X)_i = Mean(Y | X = X_i)

StdDev(Y|X)_i = StdDev(Y | X = X_i)
If the second column contains categorical data or values that are neither numeric nor Boolean, the module computes the conditional probability of Y for the given value of X.
Any Boolean values in the second column are processed as numeric data with FALSE and TRUE equivalent to 0 and 1 respectively.
If there is a class in the discrete column, such that a row with a missing value is present in the second column, the sum of conditional probabilities within the class is less than one.
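
To make the notes on Boolean values and the standard deviation concrete, the following Python sketch converts a made-up Boolean column to 0 and 1 and compares the population standard deviation (N in the denominator) with the sample standard deviation (N - 1 in the denominator). The data is illustrative only.

```python
from statistics import fmean, pstdev, stdev

# A made-up Boolean column for a single group of the discrete column.
y_bool = [True, False, True, True]

# FALSE and TRUE are treated as 0 and 1 respectively.
y_num = [int(v) for v in y_bool]

mean = fmean(y_num)              # share of TRUE values in the group
population_std = pstdev(y_num)   # divides by N, as the module is described to do
sample_std = stdev(y_num)        # divides by N - 1, shown only for comparison

print(mean, population_std, sample_std)
```

The population value is always the smaller of the two, and the gap is most visible for small groups such as this one.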

Expected inputs

Name	Type	Description
Dataset	Data Table	Input dataset

Module parameters

Name	Range	Type	Default	Description
Discrete columns	Any	ColumnSelection		Selects the columns that contain discrete values
Replacement columns	Any	ColumnSelection		Selects the columns that contain the data to use in place of the discrete values

Outputs

Name	Type	Description
Supplemented dataset	Data Table	Dataset with replaced data
Transform function	ITransform interface	Definition of the transform function, which can be applied to other datasets

Exceptions

Exception	Description
Error 0001	Exception occurs if one or more specified columns of the data set couldn't be found.
Error 0003	Exception occurs if one or more of inputs are null or empty.
Error 0020	Exception occurs if the number of columns in some of the datasets passed to the module is too small.
Error 0021	Exception occurs if the number of rows in some of the datasets passed to the module is too small.
Error 0017	Exception occurs if one or more specified columns have a type that is unsupported by the current module.
Error 0026	Exception occurs when columns with the same name are not allowed.
Error 0022	Exception occurs if the number of selected columns in the input dataset does not equal the expected number.

For a list of errors specific to Studio (classic) modules, see Machine Learning Error codes.

For a list of API exceptions, see Machine Learning REST API Error Codes.
