CategoricalCatalog.OneHotEncoding Method

Definition

Overloads

OneHotEncoding(TransformsCatalog+CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

Create a OneHotEncodingEstimator, which converts one or more input text columns specified in columns into as many columns of one-hot encoded vectors.

OneHotEncoding(TransformsCatalog+CategoricalTransforms, String, String, OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

Create a OneHotEncodingEstimator, which converts the input column specified by inputColumnName into a column of one-hot encoded vectors named outputColumnName.

OneHotEncoding(TransformsCatalog+CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

Create a OneHotEncodingEstimator, which converts one or more input text columns specified in columns into as many columns of one-hot encoded vectors.

public static Microsoft.ML.Transforms.OneHotEncodingEstimator OneHotEncoding (this Microsoft.ML.TransformsCatalog.CategoricalTransforms catalog, Microsoft.ML.InputOutputColumnPair[] columns, Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind outputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, int maximumNumberOfKeys = 1000000, Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality keyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Microsoft.ML.IDataView keyData = default);
static member OneHotEncoding : Microsoft.ML.TransformsCatalog.CategoricalTransforms * Microsoft.ML.InputOutputColumnPair[] * Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind * int * Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality * Microsoft.ML.IDataView -> Microsoft.ML.Transforms.OneHotEncodingEstimator
<Extension()>
Public Function OneHotEncoding (catalog As TransformsCatalog.CategoricalTransforms, columns As InputOutputColumnPair(), Optional outputKind As OneHotEncodingEstimator.OutputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, Optional maximumNumberOfKeys As Integer = 1000000, Optional keyOrdinality As ValueToKeyMappingEstimator.KeyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Optional keyData As IDataView = Nothing) As OneHotEncodingEstimator

Parameters

catalog
TransformsCatalog.CategoricalTransforms

The transform catalog.

columns
InputOutputColumnPair[]

The pairs of input and output columns. The output columns' data type will be a vector of Single if outputKind is Bag, Indicator, and Binary. If outputKind is Key, the output columns' data type will be a key in the case of scalar input column or a vector of keys in the case of a vector input column.

outputKind
OneHotEncodingEstimator.OutputKind

Output kind: Bag (multi-set vector), Ind (indicator vector), Key (index), or Binary encoded indicator vector.

maximumNumberOfKeys
Int32

Maximum number of terms to keep per column when auto-training.

keyOrdinality
ValueToKeyMappingEstimator.KeyOrdinality

How items should be ordered when vectorized. If ByOccurrence choosen they will be in the order encountered. If ByValue, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').

keyData
IDataView

Specifies an ordering for the encoding. If specified, this should be a single column data view, and the key-values will be taken from that column. If unspecified, the ordering will be determined from the input data upon fitting.

Returns

Examples

using System;
using Microsoft.ML;

namespace Samples.Dynamic.Transforms.Categorical
{
    public static class OneHotEncodingMultiColumn
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new[]
            {
                new DataPoint {Education = "0-5yrs", ZipCode = "98005"},
                new DataPoint {Education = "0-5yrs", ZipCode = "98052"},
                new DataPoint {Education = "6-11yrs", ZipCode = "98005"},
                new DataPoint {Education = "6-11yrs", ZipCode = "98052"},
                new DataPoint {Education = "11-15yrs", ZipCode = "98005"}
            };

            // Convert training data to IDataView.
            IDataView data = mlContext.Data.LoadFromEnumerable(samples);

            // Multi column example: A pipeline for one hot encoding two columns
            // 'Education' and 'ZipCode'.
            var multiColumnKeyPipeline =
                mlContext.Transforms.Categorical.OneHotEncoding(
                    new[]
                    {
                        new InputOutputColumnPair("Education"),
                        new InputOutputColumnPair("ZipCode")
                    });

            // Fit and Transform data.
            IDataView transformedData =
                multiColumnKeyPipeline.Fit(data).Transform(data);

            var convertedData =
                mlContext.Data.CreateEnumerable<TransformedData>(transformedData,
                    true);

            Console.WriteLine(
                "One Hot Encoding of two columns 'Education' and 'ZipCode'.");

            // One Hot Encoding of two columns 'Education' and 'ZipCode'.

            foreach (TransformedData item in convertedData)
                Console.WriteLine("{0}\t\t\t{1}", string.Join(" ", item.Education),
                    string.Join(" ", item.ZipCode));

            // 1 0 0                   1 0
            // 1 0 0                   0 1
            // 0 1 0                   1 0
            // 0 1 0                   0 1
            // 0 0 1                   1 0
        }

        private class DataPoint
        {
            public string Education { get; set; }

            public string ZipCode { get; set; }
        }

        private class TransformedData
        {
            public float[] Education { get; set; }

            public float[] ZipCode { get; set; }
        }
    }
}

Remarks

If multiple columns are passed to the estimator, all of the columns will be processed in a single pass over the data. Therefore, it is more efficient to specify one estimator with many columns than it is to specify many estimators each with a single column.

Applies to

OneHotEncoding(TransformsCatalog+CategoricalTransforms, String, String, OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

Create a OneHotEncodingEstimator, which converts the input column specified by inputColumnName into a column of one-hot encoded vectors named outputColumnName.

public static Microsoft.ML.Transforms.OneHotEncodingEstimator OneHotEncoding (this Microsoft.ML.TransformsCatalog.CategoricalTransforms catalog, string outputColumnName, string inputColumnName = default, Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind outputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, int maximumNumberOfKeys = 1000000, Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality keyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Microsoft.ML.IDataView keyData = default);
static member OneHotEncoding : Microsoft.ML.TransformsCatalog.CategoricalTransforms * string * string * Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind * int * Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality * Microsoft.ML.IDataView -> Microsoft.ML.Transforms.OneHotEncodingEstimator
<Extension()>
Public Function OneHotEncoding (catalog As TransformsCatalog.CategoricalTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional outputKind As OneHotEncodingEstimator.OutputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, Optional maximumNumberOfKeys As Integer = 1000000, Optional keyOrdinality As ValueToKeyMappingEstimator.KeyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Optional keyData As IDataView = Nothing) As OneHotEncodingEstimator

Parameters

catalog
TransformsCatalog.CategoricalTransforms

The transform catalog.

outputColumnName
String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a vector of Single if outputKind is Bag, Indicator, and Binary. If outputKind is Key, this column's data type will be a key in the case of a scalar input column or a vector of keys in the case of a vector input column.

inputColumnName
String

Name of column to convert to one-hot vectors. If set to null, the value of the outputColumnName will be used as source. This column's data type can be scalar or vector of numeric, text, boolean, DateTime or DateTimeOffset,

outputKind
OneHotEncodingEstimator.OutputKind

Output kind: Bag (multi-set vector), Indicator (indicator vector), Key (index), or Binary encoded indicator vector.

maximumNumberOfKeys
Int32

Maximum number of terms to keep per column when auto-training.

keyOrdinality
ValueToKeyMappingEstimator.KeyOrdinality

How items should be ordered when vectorized. If ByOccurrence choosen they will be in the order encountered. If ByValue, items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').

keyData
IDataView

Specifies an ordering for the encoding. If specified, this should be a single column data view, and the key-values will be taken from that column. If unspecified, the ordering will be determined from the input data upon fitting.

Returns

Examples

using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

namespace Samples.Dynamic.Transforms.Categorical
{
    public static class OneHotEncoding
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new[]
            {
                new DataPoint {Education = "0-5yrs"},
                new DataPoint {Education = "0-5yrs"},
                new DataPoint {Education = "6-11yrs"},
                new DataPoint {Education = "6-11yrs"},
                new DataPoint {Education = "11-15yrs"}
            };

            // Convert training data to IDataView.
            IDataView data = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for one hot encoding the Education column.
            var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(
                "EducationOneHotEncoded", "Education");

            // Fit and transform the data.
            IDataView oneHotEncodedData = pipeline.Fit(data).Transform(data);

            PrintDataColumn(oneHotEncodedData, "EducationOneHotEncoded");

            // We have 3 slots because there are three categories in the
            // 'Education' column.

            // 1 0 0
            // 1 0 0
            // 0 1 0
            // 0 1 0
            // 0 0 1

            // A pipeline for one hot encoding the Education column (using keying).
            var keyPipeline = mlContext.Transforms.Categorical.OneHotEncoding(
                "EducationOneHotEncoded", "Education",
                OneHotEncodingEstimator.OutputKind.Key);

            // Fit and Transform data.
            oneHotEncodedData = keyPipeline.Fit(data).Transform(data);

            var keyEncodedColumn =
                oneHotEncodedData.GetColumn<uint>("EducationOneHotEncoded");

            Console.WriteLine(
                "One Hot Encoding of single column 'Education', with key type " +
                "output.");

            // One Hot Encoding of single column 'Education', with key type output.

            foreach (uint element in keyEncodedColumn)
                Console.WriteLine(element);

            // 1
            // 1
            // 2
            // 2
            // 3
        }

        private static void PrintDataColumn(IDataView transformedData,
            string columnName)
        {
            var countSelectColumn = transformedData.GetColumn<float[]>(
                transformedData.Schema[columnName]);

            foreach (var row in countSelectColumn)
            {
                for (var i = 0; i < row.Length; i++)
                    Console.Write($"{row[i]}\t");

                Console.WriteLine();
            }
        }

        private class DataPoint
        {
            public string Education { get; set; }
        }
    }
}

Applies to