TextCatalog.TokenizeIntoCharactersAsKeys Method

Reference

Definition

Namespace: Microsoft.ML

Assembly: Microsoft.ML.Transforms.dll

Package: Microsoft.ML v4.0.1 (also listed: v1.0.0, v1.1.0, v1.2.0, v1.3.1, v1.4.0, v1.5.5, v1.6.0, v1.7.0, v2.0.1, v3.0.1, v5.0.0-preview.1.25125.4)

Source: TextCatalog.cs

Important

Some information relates to prerelease products that may be substantially modified before release. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Creates a TokenizingByCharactersEstimator, which tokenizes by splitting text into sequences of characters using a sliding window.

public static Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator TokenizeIntoCharactersAsKeys(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, bool useMarkerCharacters = true);

static member TokenizeIntoCharactersAsKeys : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * bool -> Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator

<Extension()>
Public Function TokenizeIntoCharactersAsKeys (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional useMarkerCharacters As Boolean = true) As TokenizingByCharactersEstimator

Parameters

catalog: TransformsCatalog.TextTransforms

The catalog of text-related transforms.

outputColumnName: String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a variable-sized vector of keys.

inputColumnName: String

Name of the column to transform. If set to null, the value of outputColumnName is used as the source. This estimator operates on the text data type.

useMarkerCharacters: Boolean

To make the tokens easier to distinguish, for example when debugging, you can choose to prepend a marker character, 0x02, to the beginning and append another marker character, 0x03, to the end of the output vector of characters.
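The effect of useMarkerCharacters on the shape of the output can be sketched in plain C#. This is only an illustration under stated assumptions, not the ML.NET implementation: the class MarkerCharacterSketch and the helper SplitIntoCharacters are hypothetical names, the sketch returns raw characters rather than keys, and it does not model edge cases such as empty input.

```csharp
using System;
using System.Collections.Generic;

public static class MarkerCharacterSketch
{
    // Marker values taken from the useMarkerCharacters description.
    private const char StartMarker = '\u0002';
    private const char EndMarker = '\u0003';

    // Hypothetical helper, not part of ML.NET: emits one token per character
    // and optionally wraps the sequence in the two marker characters.
    public static List<char> SplitIntoCharacters(string text,
        bool useMarkerCharacters)
    {
        var tokens = new List<char>();

        if (useMarkerCharacters)
            tokens.Add(StartMarker);

        tokens.AddRange(text);

        if (useMarkerCharacters)
            tokens.Add(EndMarker);

        return tokens;
    }

    public static void Main()
    {
        var withMarkers = SplitIntoCharacters("ML", useMarkerCharacters: true);
        var withoutMarkers = SplitIntoCharacters("ML",
            useMarkerCharacters: false);

        Console.WriteLine(withMarkers.Count);    // prints 4
        Console.WriteLine(withoutMarkers.Count); // prints 2
    }
}
```

In this sketch, enabling the markers adds two tokens to the sequence, one at each end, which makes the boundaries of each input string visible when inspecting the output.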

Returns

TokenizingByCharactersEstimator

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class TokenizeIntoCharactersAsKeys
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. The
            // 'TokenizeIntoCharactersAsKeys' does not require training data as
            // the estimator ('TokenizingByCharactersEstimator') created by
            // 'TokenizeIntoCharactersAsKeys' API is not a trainable estimator.
            // The empty list is only needed to pass input schema to the pipeline.
            var emptySamples = new List<TextData>();

            // Convert sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for converting text into a vector of characters.
            // 'TokenizeIntoCharactersAsKeys' produces its result as a key type.
            // 'MapKeyToValue' is needed to map keys back to their original values.
            var textPipeline = mlContext.Transforms.Text
                .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                    useMarkerCharacters: false)
                .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                    "CharTokens"));

            // Fit to data.
            var textTransformer = textPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the character vector from the
            // input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Call the prediction API to convert the text into characters.
            var data = new TextData()
            {
                Text = "ML.NET's " +
                "TokenizeIntoCharactersAsKeys API splits text/string into " +
                "characters."
            };

            var prediction = predictionEngine.Predict(data);

            // Print the length of the character vector.
            Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");

            // Print the character vector.
            Console.WriteLine("\nCharacter Tokens: " + string.Join(",", prediction
                .CharTokens));

            //  Expected output:
            //   Number of tokens: 77
            //   Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
            //                     s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
            //
            // <?>: a Unicode control character used instead of spaces ('\u2400').
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public string[] CharTokens { get; set; }
        }
    }
}

Applies to