CategoricalCatalog.OneHotEncoding 方法

定义

重载

OneHotEncoding(TransformsCatalog+CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

创建一个 OneHotEncodingEstimator,它将指定的 columns 一个或多个输入文本列转换为一个热编码矢量的多个列。

OneHotEncoding(TransformsCatalog+CategoricalTransforms, String, String, OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

创建一个 OneHotEncodingEstimator,该列将指定的 inputColumnName 输入列转换为名为 outputColumnName一个热编码矢量的列。

OneHotEncoding(TransformsCatalog+CategoricalTransforms, InputOutputColumnPair[], OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

创建一个 OneHotEncodingEstimator,它将指定的 columns 一个或多个输入文本列转换为一个热编码矢量的多个列。

public static Microsoft.ML.Transforms.OneHotEncodingEstimator OneHotEncoding (this Microsoft.ML.TransformsCatalog.CategoricalTransforms catalog, Microsoft.ML.InputOutputColumnPair[] columns, Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind outputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, int maximumNumberOfKeys = 1000000, Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality keyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Microsoft.ML.IDataView keyData = default);
static member OneHotEncoding : Microsoft.ML.TransformsCatalog.CategoricalTransforms * Microsoft.ML.InputOutputColumnPair[] * Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind * int * Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality * Microsoft.ML.IDataView -> Microsoft.ML.Transforms.OneHotEncodingEstimator
<Extension()>
Public Function OneHotEncoding (catalog As TransformsCatalog.CategoricalTransforms, columns As InputOutputColumnPair(), Optional outputKind As OneHotEncodingEstimator.OutputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, Optional maximumNumberOfKeys As Integer = 1000000, Optional keyOrdinality As ValueToKeyMappingEstimator.KeyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Optional keyData As IDataView = Nothing) As OneHotEncodingEstimator

参数

catalog
TransformsCatalog.CategoricalTransforms

转换目录。

columns
InputOutputColumnPair[]

输入和输出列对。 输出列的数据类型将是 if outputKind 的向量SingleIndicator并且BinaryBagKey如果是outputKind,则输出列的数据类型将是标量输入列的键,或者在矢量输入列的情况下为键的向量。

outputKind
OneHotEncodingEstimator.OutputKind

输出类型:包 (多集矢量) 、Ind (指示器向量) 、键 (索引) 或二进制编码指示器向量。

maximumNumberOfKeys
Int32

自动训练时要保留每列的最大术语数。

keyOrdinality
ValueToKeyMappingEstimator.KeyOrdinality

向量化时应如何对项进行排序。 如果选择 ByOccurrence ,它们将按遇到的顺序排列。 如果 ByValue按其默认比较对项进行排序,例如,文本排序将区分大小写 (,例如,“A”,然后是“Z”,然后是“a”) 。

keyData
IDataView

指定编码的排序。 如果指定,这应该是单个列数据视图,键值将从该列获取。 如果未指定,则根据拟合时从输入数据确定排序。

返回

示例

using System;
using Microsoft.ML;

namespace Samples.Dynamic.Transforms.Categorical
{
    public static class OneHotEncodingMultiColumn
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new[]
            {
                new DataPoint {Education = "0-5yrs", ZipCode = "98005"},
                new DataPoint {Education = "0-5yrs", ZipCode = "98052"},
                new DataPoint {Education = "6-11yrs", ZipCode = "98005"},
                new DataPoint {Education = "6-11yrs", ZipCode = "98052"},
                new DataPoint {Education = "11-15yrs", ZipCode = "98005"}
            };

            // Convert training data to IDataView.
            IDataView data = mlContext.Data.LoadFromEnumerable(samples);

            // Multi column example: A pipeline for one hot encoding two columns
            // 'Education' and 'ZipCode'.
            var multiColumnKeyPipeline =
                mlContext.Transforms.Categorical.OneHotEncoding(
                    new[]
                    {
                        new InputOutputColumnPair("Education"),
                        new InputOutputColumnPair("ZipCode")
                    });

            // Fit and Transform data.
            IDataView transformedData =
                multiColumnKeyPipeline.Fit(data).Transform(data);

            var convertedData =
                mlContext.Data.CreateEnumerable<TransformedData>(transformedData,
                    true);

            Console.WriteLine(
                "One Hot Encoding of two columns 'Education' and 'ZipCode'.");

            // One Hot Encoding of two columns 'Education' and 'ZipCode'.

            foreach (TransformedData item in convertedData)
                Console.WriteLine("{0}\t\t\t{1}", string.Join(" ", item.Education),
                    string.Join(" ", item.ZipCode));

            // 1 0 0                   1 0
            // 1 0 0                   0 1
            // 0 1 0                   1 0
            // 0 1 0                   0 1
            // 0 0 1                   1 0
        }

        private class DataPoint
        {
            public string Education { get; set; }

            public string ZipCode { get; set; }
        }

        private class TransformedData
        {
            public float[] Education { get; set; }

            public float[] ZipCode { get; set; }
        }
    }
}

注解

如果将多个列传递给估算器,则所有列都将在一次传递数据中进行处理。 因此,使用多个列指定一个估算器比指定一个包含单个列的估算器更有效。

适用于

OneHotEncoding(TransformsCatalog+CategoricalTransforms, String, String, OneHotEncodingEstimator+OutputKind, Int32, ValueToKeyMappingEstimator+KeyOrdinality, IDataView)

创建一个 OneHotEncodingEstimator,该列将指定的 inputColumnName 输入列转换为名为 outputColumnName一个热编码矢量的列。

public static Microsoft.ML.Transforms.OneHotEncodingEstimator OneHotEncoding (this Microsoft.ML.TransformsCatalog.CategoricalTransforms catalog, string outputColumnName, string inputColumnName = default, Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind outputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, int maximumNumberOfKeys = 1000000, Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality keyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Microsoft.ML.IDataView keyData = default);
static member OneHotEncoding : Microsoft.ML.TransformsCatalog.CategoricalTransforms * string * string * Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind * int * Microsoft.ML.Transforms.ValueToKeyMappingEstimator.KeyOrdinality * Microsoft.ML.IDataView -> Microsoft.ML.Transforms.OneHotEncodingEstimator
<Extension()>
Public Function OneHotEncoding (catalog As TransformsCatalog.CategoricalTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional outputKind As OneHotEncodingEstimator.OutputKind = Microsoft.ML.Transforms.OneHotEncodingEstimator+OutputKind.Indicator, Optional maximumNumberOfKeys As Integer = 1000000, Optional keyOrdinality As ValueToKeyMappingEstimator.KeyOrdinality = Microsoft.ML.Transforms.ValueToKeyMappingEstimator+KeyOrdinality.ByOccurrence, Optional keyData As IDataView = Nothing) As OneHotEncodingEstimator

参数

catalog
TransformsCatalog.CategoricalTransforms

转换目录。

outputColumnName
String

由转换 inputColumnName生成的列的名称。 此列的数据类型将是 if IndicatoroutputKindBag的向量Single,并且。Binary Key如果是outputKind,则此列的数据类型将是标量输入列的键,或者当矢量输入列时为键的向量。

inputColumnName
String

要转换为单热向量的列的名称。 If set to null, the value of the outputColumnName will be used as source. 此列的数据类型可以是数值、文本、布尔DateTime值或DateTimeOffset

outputKind
OneHotEncodingEstimator.OutputKind

输出类型:包 (多集矢量) 、指示器 (指示器向量) 、键 (索引) 或二进制编码指示器向量。

maximumNumberOfKeys
Int32

自动训练时要保留每列的最大术语数。

keyOrdinality
ValueToKeyMappingEstimator.KeyOrdinality

向量化时应如何对项进行排序。 如果选择 ByOccurrence ,它们将按遇到的顺序排列。 如果 ByValue按其默认比较对项进行排序,例如,文本排序将区分大小写 (,例如,“A”,然后是“Z”,然后是“a”) 。

keyData
IDataView

指定编码的排序。 如果指定,这应该是单个列数据视图,键值将从该列获取。 如果未指定,则根据拟合时从输入数据确定排序。

返回

示例

using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

namespace Samples.Dynamic.Transforms.Categorical
{
    public static class OneHotEncoding
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new[]
            {
                new DataPoint {Education = "0-5yrs"},
                new DataPoint {Education = "0-5yrs"},
                new DataPoint {Education = "6-11yrs"},
                new DataPoint {Education = "6-11yrs"},
                new DataPoint {Education = "11-15yrs"}
            };

            // Convert training data to IDataView.
            IDataView data = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for one hot encoding the Education column.
            var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(
                "EducationOneHotEncoded", "Education");

            // Fit and transform the data.
            IDataView oneHotEncodedData = pipeline.Fit(data).Transform(data);

            PrintDataColumn(oneHotEncodedData, "EducationOneHotEncoded");

            // We have 3 slots because there are three categories in the
            // 'Education' column.

            // 1 0 0
            // 1 0 0
            // 0 1 0
            // 0 1 0
            // 0 0 1

            // A pipeline for one hot encoding the Education column (using keying).
            var keyPipeline = mlContext.Transforms.Categorical.OneHotEncoding(
                "EducationOneHotEncoded", "Education",
                OneHotEncodingEstimator.OutputKind.Key);

            // Fit and Transform data.
            oneHotEncodedData = keyPipeline.Fit(data).Transform(data);

            var keyEncodedColumn =
                oneHotEncodedData.GetColumn<uint>("EducationOneHotEncoded");

            Console.WriteLine(
                "One Hot Encoding of single column 'Education', with key type " +
                "output.");

            // One Hot Encoding of single column 'Education', with key type output.

            foreach (uint element in keyEncodedColumn)
                Console.WriteLine(element);

            // 1
            // 1
            // 2
            // 2
            // 3
        }

        private static void PrintDataColumn(IDataView transformedData,
            string columnName)
        {
            var countSelectColumn = transformedData.GetColumn<float[]>(
                transformedData.Schema[columnName]);

            foreach (var row in countSelectColumn)
            {
                for (var i = 0; i < row.Length; i++)
                    Console.Write($"{row[i]}\t");

                Console.WriteLine();
            }
        }

        private class DataPoint
        {
            public string Education { get; set; }
        }
    }
}

适用于