TextCatalog.FeaturizeText Method

Definition

Namespace:: Microsoft.ML

Assembly:: Microsoft.ML.Transforms.dll

Package:: Microsoft.ML v4.0.1

Package:: Microsoft.ML v1.0.0

Package:: Microsoft.ML v1.1.0

Package:: Microsoft.ML v1.2.0

Package:: Microsoft.ML v1.3.1

Package:: Microsoft.ML v1.4.0

Package:: Microsoft.ML v1.5.5

Package:: Microsoft.ML v1.6.0

Package:: Microsoft.ML v1.7.0

Package:: Microsoft.ML v2.0.1

Package:: Microsoft.ML v3.0.1

Package:: Microsoft.ML v5.0.0-preview.1.25125.4

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Overloads

FeaturizeText(TransformsCatalog+TextTransforms, String, String)	Create a TextFeaturizingEstimator, which transforms a text column into a featurized vector of Single that represents normalized counts of n-grams and char-grams.
FeaturizeText(TransformsCatalog+TextTransforms, String, TextFeaturizingEstimator+Options, String[])	Create a TextFeaturizingEstimator, which transforms a text column into featurized vector of Single that represents normalized counts of n-grams and char-grams.

FeaturizeText(TransformsCatalog+TextTransforms, String, String)

Source:: TextCatalog.cs

Source:: TextCatalog.cs

Source:: TextCatalog.cs

Create a TextFeaturizingEstimator, which transforms a text column into a featurized vector of Single that represents normalized counts of n-grams and char-grams.

public static Microsoft.ML.Transforms.Text.TextFeaturizingEstimator FeaturizeText(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default);

static member FeaturizeText : Microsoft.ML.TransformsCatalog.TextTransforms * string * string -> Microsoft.ML.Transforms.Text.TextFeaturizingEstimator

<Extension()>
Public Function FeaturizeText (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing) As TextFeaturizingEstimator

Parameters

catalog: TransformsCatalog.TextTransforms

The text-related transform's catalog.

outputColumnName: String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a vector of Single.

inputColumnName: String

Name of the column to transform. If set to null, the value of the outputColumnName will be used as source. This estimator operates over text data.

Returns

TextFeaturizingEstimator

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class FeaturizeText
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new List<TextData>()
            {
                new TextData(){ Text = "ML.NET's FeaturizeText API uses a " +
                    "composition of several basic transforms to convert text " +
                    "into numeric features." },

                new TextData(){ Text = "This API can be used as a featurizer to " +
                    "perform text classification." },

                new TextData(){ Text = "There are a number of approaches to text " +
                    "classification." },

                new TextData(){ Text = "One of the simplest and most common " +
                    "approaches is called “Bag of Words”." },

                new TextData(){ Text = "Text classification can be used for a " +
                    "wide variety of tasks" },

                new TextData(){ Text = "such as sentiment analysis, topic " +
                    "detection, intent identification etc." },
            };

            // Convert training data to IDataView.
            var dataview = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for converting text into numeric features.
            // The following call to 'FeaturizeText' instantiates 
            // 'TextFeaturizingEstimator' with default parameters.
            // The default settings for the TextFeaturizingEstimator are
            //      * StopWordsRemover: None
            //      * CaseMode: Lowercase
            //      * OutputTokensColumnName: None
            //      * KeepDiacritics: false, KeepPunctuations: true, KeepNumbers:
            //          true
            //      * WordFeatureExtractor: NgramLength = 1
            //      * CharFeatureExtractor: NgramLength = 3, UseAllLengths = false
            // The length of the output feature vector depends on these settings.
            var textPipeline = mlContext.Transforms.Text.FeaturizeText("Features",
                "Text");

            // Fit to data.
            var textTransformer = textPipeline.Fit(dataview);

            // Create the prediction engine to get the features extracted from the
            // text.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Convert the text into numeric features.
            var prediction = predictionEngine.Predict(samples[0]);

            // Print the length of the feature vector.
            Console.WriteLine($"Number of Features: {prediction.Features.Length}");

            // Print the first 10 feature values.
            Console.Write("Features: ");
            for (int i = 0; i < 10; i++)
                Console.Write($"{prediction.Features[i]:F4}  ");

            //  Expected output:
            //   Number of Features: 332
            //   Features: 0.0857  0.0857  0.0857  0.0857  0.0857  0.0857  0.0857  0.0857  0.0857  0.1715 ...
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public float[] Features { get; set; }
        }
    }
}

Applies to

FeaturizeText(TransformsCatalog+TextTransforms, String, TextFeaturizingEstimator+Options, String[])

Source:: TextCatalog.cs

Source:: TextCatalog.cs

Source:: TextCatalog.cs

Create a TextFeaturizingEstimator, which transforms a text column into featurized vector of Single that represents normalized counts of n-grams and char-grams.

public static Microsoft.ML.Transforms.Text.TextFeaturizingEstimator FeaturizeText(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, Microsoft.ML.Transforms.Text.TextFeaturizingEstimator.Options options, params string[] inputColumnNames);

static member FeaturizeText : Microsoft.ML.TransformsCatalog.TextTransforms * string * Microsoft.ML.Transforms.Text.TextFeaturizingEstimator.Options * string[] -> Microsoft.ML.Transforms.Text.TextFeaturizingEstimator

<Extension()>
Public Function FeaturizeText (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, options As TextFeaturizingEstimator.Options, ParamArray inputColumnNames As String()) As TextFeaturizingEstimator

Parameters

catalog: TransformsCatalog.TextTransforms

The text-related transform's catalog.

outputColumnName: String

Name of the column resulting from the transformation of inputColumnNames. This column's data type will be a vector of Single.

options: TextFeaturizingEstimator.Options

Advanced options to the algorithm.

inputColumnNames: String[]

Name of the columns to transform. If set to null, the value of the outputColumnName will be used as source. This estimator operates over text data, and it can transform several columns at once, yielding one vector of Single as the resulting features for all columns.

Returns

TextFeaturizingEstimator

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

namespace Samples.Dynamic
{
    public static class FeaturizeTextWithOptions
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new List<TextData>()
            {
                new TextData(){ Text = "ML.NET's FeaturizeText API uses a " +
                "composition of several basic transforms to convert text into " +
                "numeric features." },

                new TextData(){ Text = "This API can be used as a featurizer to " +
                "perform text classification." },

                new TextData(){ Text = "There are a number of approaches to text " +
                "classification." },

                new TextData(){ Text = "One of the simplest and most common " +
                "approaches is called “Bag of Words”." },

                new TextData(){ Text = "Text classification can be used for a " +
                "wide variety of tasks" },

                new TextData(){ Text = "such as sentiment analysis, topic " +
                "detection, intent identification etc." },
            };

            // Convert training data to IDataView.
            var dataview = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for converting text into numeric features.
            // The following call to 'FeaturizeText' instantiates
            // 'TextFeaturizingEstimator' with given parameters. The length of the
            // output feature vector depends on these settings.
            var options = new TextFeaturizingEstimator.Options()
            {
                // Also output tokenized words
                OutputTokensColumnName = "OutputTokens",
                CaseMode = TextNormalizingEstimator.CaseMode.Lower,
                // Use ML.NET's built-in stop word remover
                StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options()
                {
                    Language = TextFeaturizingEstimator.Language.English
                },

                WordFeatureExtractor = new WordBagEstimator.Options()
                {
                    NgramLength
                    = 2,
                    UseAllLengths = true
                },

                CharFeatureExtractor = new WordBagEstimator.Options()
                {
                    NgramLength
                    = 3,
                    UseAllLengths = false
                },
            };
            var textPipeline = mlContext.Transforms.Text.FeaturizeText("Features",
                options, "Text");

            // Fit to data.
            var textTransformer = textPipeline.Fit(dataview);

            // Create the prediction engine to get the features extracted from the
            // text.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Convert the text into numeric features.
            var prediction = predictionEngine.Predict(samples[0]);

            // Print the length of the feature vector.
            Console.WriteLine($"Number of Features: {prediction.Features.Length}");

            // Print feature values and tokens.
            Console.Write("Features: ");
            for (int i = 0; i < 10; i++)
                Console.Write($"{prediction.Features[i]:F4}  ");

            Console.WriteLine("\nTokens: " + string.Join(",", prediction
                .OutputTokens));

            //  Expected output:
            //   Number of Features: 282
            //   Features: 0.0941  0.0941  0.0941  0.0941  0.0941  0.0941  0.0941  0.0941  0.0941  0.1881 ...
            //   Tokens: ml.net's,featurizetext,api,uses,composition,basic,transforms,convert,text,numeric,features.
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public float[] Features { get; set; }
            public string[] OutputTokens { get; set; }
        }
    }
}

Remarks

This transform can operate over several columns.

Applies to