TextCatalog.TokenizeIntoCharactersAsKeys Method

Reference

Definition

Namespace:: Microsoft.ML

Assembly:: Microsoft.ML.Transforms.dll

Package:: Microsoft.ML v3.0.1

Package:: Microsoft.ML v1.0.0

Package:: Microsoft.ML v1.1.0

Package:: Microsoft.ML v1.2.0

Package:: Microsoft.ML v1.3.1

Package:: Microsoft.ML v1.4.0

Package:: Microsoft.ML v1.5.5

Package:: Microsoft.ML v1.6.0

Package:: Microsoft.ML v1.7.0

Package:: Microsoft.ML v2.0.0

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Create a TokenizingByCharactersEstimator, which tokenizes by splitting text into sequences of characters using a sliding window.

public static Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator TokenizeIntoCharactersAsKeys (this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, bool useMarkerCharacters = true);

static member TokenizeIntoCharactersAsKeys : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * bool -> Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator

<Extension()>
Public Function TokenizeIntoCharactersAsKeys (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional useMarkerCharacters As Boolean = true) As TokenizingByCharactersEstimator

Parameters

catalog: TransformsCatalog.TextTransforms

The text-related transform's catalog.

outputColumnName: String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a variable-sized vector of keys.

inputColumnName: String

Name of the column to transform. If set to null, the value of the outputColumnName will be used as source. This estimator operates over text data type.

useMarkerCharacters: Boolean

To be able to distinguish the tokens, for example for debugging purposes, you can choose to prepend a marker character, 0x02, to the beginning, and append another marker character, 0x03, to the end of the output vector of characters.

Returns

TokenizingByCharactersEstimator

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class TokenizeIntoCharactersAsKeys
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. The
            // 'TokenizeIntoCharactersAsKeys' does not require training data as
            // the estimator ('TokenizingByCharactersEstimator') created by
            // 'TokenizeIntoCharactersAsKeys' API is not a trainable estimator.
            // The empty list is only needed to pass input schema to the pipeline.
            var emptySamples = new List<TextData>();

            // Convert sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for converting text into vector of characters.
            // The 'TokenizeIntoCharactersAsKeys' produces result as key type.
            // 'MapKeyToValue' is need to map keys back to their original values.
            var textPipeline = mlContext.Transforms.Text
                .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                    useMarkerCharacters: false)
                .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                    "CharTokens"));

            // Fit to data.
            var textTransformer = textPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the character vector from the
            // input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Call the prediction API to convert the text into characters.
            var data = new TextData()
            {
                Text = "ML.NET's " +
                "TokenizeIntoCharactersAsKeys API splits text/string into " +
                "characters."
            };

            var prediction = predictionEngine.Predict(data);

            // Print the length of the character vector.
            Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");

            // Print the character vector.
            Console.WriteLine("\nCharacter Tokens: " + string.Join(",", prediction
                .CharTokens));

            //  Expected output:
            //   Number of tokens: 77
            //   Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
            //                     s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
            //
            // <?>: is a unicode control character used instead of spaces ('\u2400').
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public string[] CharTokens { get; set; }
        }
    }
}

Applies to