TextCatalog.ProduceNgrams Método
Definición
Importante
Parte de la información hace referencia a la versión preliminar del producto, que puede haberse modificado sustancialmente antes de lanzar la versión definitiva. Microsoft no otorga ninguna garantía, explícita o implícita, con respecto a la información proporcionada aquí.
Crea un objeto NgramExtractingEstimator que genera un vector de recuentos de n-gramas (secuencias de palabras consecutivas) encontrados en el texto de entrada.
public static Microsoft.ML.Transforms.Text.NgramExtractingEstimator ProduceNgrams (this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, int ngramLength = 2, int skipLength = 0, bool useAllLengths = true, int maximumNgramsCount = 10000000, Microsoft.ML.Transforms.Text.NgramExtractingEstimator.WeightingCriteria weighting = Microsoft.ML.Transforms.Text.NgramExtractingEstimator+WeightingCriteria.Tf);
static member ProduceNgrams : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * int * int * bool * int * Microsoft.ML.Transforms.Text.NgramExtractingEstimator.WeightingCriteria -> Microsoft.ML.Transforms.Text.NgramExtractingEstimator
<Extension()>
Public Function ProduceNgrams (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional ngramLength As Integer = 2, Optional skipLength As Integer = 0, Optional useAllLengths As Boolean = true, Optional maximumNgramsCount As Integer = 10000000, Optional weighting As NgramExtractingEstimator.WeightingCriteria = Microsoft.ML.Transforms.Text.NgramExtractingEstimator+WeightingCriteria.Tf) As NgramExtractingEstimator
Parámetros
- catalog
- TransformsCatalog.TextTransforms
Catálogo de la transformación relacionada con el texto.
- outputColumnName
- String
Nombre de la columna resultante de la transformación de inputColumnName
.
El tipo de datos de esta columna será un vector de Single.
- inputColumnName
- String
Nombre de la columna que se va a transformar. Si se establece null
en , el valor de outputColumnName
se usará como origen.
Este estimador opera sobre vectores de tipo de datos de claves.
- ngramLength
- Int32
Longitud del ngrama.
- skipLength
- Int32
Número de tokens que se omitirán entre cada n-grama. De forma predeterminada, no se omite ningún token.
- useAllLengths
- Boolean
Si se deben incluir todas las longitudes de n-gramas hasta ngramLength
o solo ngramLength
.
- maximumNgramsCount
- Int32
Número máximo de n-gramas que se van a almacenar en el diccionario.
Medida estadística utilizada para evaluar lo importante que es una palabra o n-grama para un documento en un corpus.
Cuando maximumNgramsCount
es menor que el número total de n-gramas encontrados, esta medida se usa para determinar qué n-gramas mantener.
Devoluciones
Ejemplos
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;
namespace Samples.Dynamic
{
public static class ProduceNgrams
{
public static void Example()
{
// Create a new ML context, for ML.NET operations. It can be used for
// exception tracking and logging, as well as the source of randomness.
var mlContext = new MLContext();
// Create a small dataset as an IEnumerable.
var samples = new List<TextData>()
{
new TextData(){ Text = "This is an example to compute n-grams." },
new TextData(){ Text = "N-gram is a sequence of 'N' consecutive " +
"words/tokens." },
new TextData(){ Text = "ML.NET's ProduceNgrams API produces " +
"vector of n-grams." },
new TextData(){ Text = "Each position in the vector corresponds " +
"to a particular n-gram." },
new TextData(){ Text = "The value at each position corresponds " +
"to," },
new TextData(){ Text = "the number of times n-gram occurred in " +
"the data (Tf), or" },
new TextData(){ Text = "the inverse of the number of documents " +
"that contain the n-gram (Idf)," },
new TextData(){ Text = "or compute both and multiply together " +
"(Tf-Idf)." },
};
// Convert training data to IDataView.
var dataview = mlContext.Data.LoadFromEnumerable(samples);
// A pipeline for converting text into numeric n-gram features.
// The following call to 'ProduceNgrams' requires the tokenized
// text /string as input. This is achieved by calling
// 'TokenizeIntoWords' first followed by 'ProduceNgrams'. Please note
// that the length of the output feature vector depends on the n-gram
// settings.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens",
"Text")
// 'ProduceNgrams' takes key type as input. Converting the tokens
// into key type using 'MapValueToKey'.
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
.Append(mlContext.Transforms.Text.ProduceNgrams("NgramFeatures",
"Tokens",
ngramLength: 3,
useAllLengths: false,
weighting: NgramExtractingEstimator.WeightingCriteria.Tf));
// Fit to data.
var textTransformer = textPipeline.Fit(dataview);
var transformedDataView = textTransformer.Transform(dataview);
// Create the prediction engine to get the n-gram features extracted
// from the text.
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
TransformedTextData>(textTransformer);
// Convert the text into numeric features.
var prediction = predictionEngine.Predict(samples[0]);
// Print the length of the feature vector.
Console.WriteLine("Number of Features: " + prediction.NgramFeatures
.Length);
// Preview of the produced n-grams.
// Get the slot names from the column's metadata.
// The slot names for a vector column corresponds to the names
// associated with each position in the vector.
VBuffer<ReadOnlyMemory<char>> slotNames = default;
transformedDataView.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
var NgramFeaturesColumn = transformedDataView.GetColumn<VBuffer<
float>>(transformedDataView.Schema["NgramFeatures"]);
var slots = slotNames.GetValues();
Console.Write("N-grams: ");
foreach (var featureRow in NgramFeaturesColumn)
{
foreach (var item in featureRow.Items())
Console.Write($"{slots[item.Key]} ");
Console.WriteLine();
}
// Print the first 10 feature values.
Console.Write("Features: ");
for (int i = 0; i < 10; i++)
Console.Write($"{prediction.NgramFeatures[i]:F4} ");
// Expected output:
// Number of Features: 52
// N-grams: This|is|an is|an|example an|example|to example|to|compute to|compute|n-grams. N-gram|is|a is|a|sequence a|sequence|of sequence|of|'N' of|'N'|consecutive ...
// Features: 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 ...
}
private class TextData
{
public string Text { get; set; }
}
private class TransformedTextData : TextData
{
public float[] NgramFeatures { get; set; }
}
}
}