TextCatalog.NormalizeText Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Creates a TextNormalizingEstimator, which normalizes incoming text in inputColumnName by optionally changing case and removing diacritical marks, punctuation marks, and numbers, and outputs the new text as outputColumnName.
public static Microsoft.ML.Transforms.Text.TextNormalizingEstimator NormalizeText (this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode caseMode = Microsoft.ML.Transforms.Text.TextNormalizingEstimator+CaseMode.Lower, bool keepDiacritics = false, bool keepPunctuations = true, bool keepNumbers = true);
static member NormalizeText : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * Microsoft.ML.Transforms.Text.TextNormalizingEstimator.CaseMode * bool * bool * bool -> Microsoft.ML.Transforms.Text.TextNormalizingEstimator
<Extension()>
Public Function NormalizeText (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional caseMode As TextNormalizingEstimator.CaseMode = Microsoft.ML.Transforms.Text.TextNormalizingEstimator+CaseMode.Lower, Optional keepDiacritics As Boolean = false, Optional keepPunctuations As Boolean = true, Optional keepNumbers As Boolean = true) As TextNormalizingEstimator
Parameters
- catalog
- TransformsCatalog.TextTransforms
The text-related transform's catalog.
- outputColumnName
- String
Name of the column resulting from the transformation of inputColumnName.
This column's data type is a scalar or a vector of text, depending on the input column data type.
- inputColumnName
- String
Name of the column to transform. If set to null, the value of outputColumnName will be used as the source.
This estimator operates on text or vector-of-text data types.
- caseMode
- TextNormalizingEstimator.CaseMode
How to change the case of the text, using the rules of the invariant culture.
- keepDiacritics
- Boolean
Whether to keep diacritical marks or remove them.
- keepPunctuations
- Boolean
Whether to keep punctuation marks or remove them.
- keepNumbers
- Boolean
Whether to keep numbers or remove them.
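The parameter descriptions above imply that inputColumnName can be omitted, in which case the column named by outputColumnName is read and then replaced by its normalized form. The following is a minimal sketch of that usage, not an official sample: the Review class, the NormalizeInPlaceSketch class, and the sample strings are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class NormalizeInPlaceSketch
{
    // Hypothetical row type, used only for this illustration.
    private class Review
    {
        public string Text { get; set; }
    }

    public static List<string> Run()
    {
        var mlContext = new MLContext();

        // Hypothetical sample rows.
        var rows = new List<Review>
        {
            new Review { Text = "Café No. 5 is GREAT!" },
            new Review { Text = "Über-cool, 10/10." }
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);

        // Only outputColumnName is passed, so inputColumnName stays null and
        // the "Text" column serves as both source and destination.
        var estimator = mlContext.Transforms.Text.NormalizeText(
            "Text",
            caseMode: TextNormalizingEstimator.CaseMode.Upper,
            keepDiacritics: true,
            keepPunctuations: false,
            keepNumbers: true);

        // The estimator is not trainable, so Fit only captures the schema.
        IDataView transformed = estimator.Fit(dataView).Transform(dataView);

        return mlContext.Data
            .CreateEnumerable<Review>(transformed, reuseRowObject: false)
            .Select(row => row.Text)
            .ToList();
    }

    public static void Main()
    {
        foreach (var text in Run())
        {
            Console.WriteLine(text);
        }
    }
}
```

Passing the optional arguments by name keeps the call readable and avoids depending on their position in the signature.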
Returns
- TextNormalizingEstimator
Examples
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

namespace Samples.Dynamic
{
    public static class NormalizeText
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. The 'NormalizeText' API does not
            // require training data as the estimator ('TextNormalizingEstimator')
            // created by 'NormalizeText' API is not a trainable estimator. The
            // empty list is only needed to pass input schema to the pipeline.
            var emptySamples = new List<TextData>();

            // Convert sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for normalizing text.
            var normTextPipeline = mlContext.Transforms.Text.NormalizeText(
                "NormalizedText", "Text", TextNormalizingEstimator.CaseMode.Lower,
                keepDiacritics: false,
                keepPunctuations: false,
                keepNumbers: false);

            // Fit to data.
            var normTextTransformer = normTextPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the normalized text from the
            // input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(normTextTransformer);

            // Call the prediction API.
            var data = new TextData()
            {
                Text = "ML.NET's NormalizeText API " +
                    "changes the case of the TEXT and removes/keeps diâcrîtîcs, " +
                    "punctuations, and/or numbers (123)."
            };

            var prediction = predictionEngine.Predict(data);

            // Print the normalized text.
            Console.WriteLine($"Normalized Text: {prediction.NormalizedText}");

            // Expected output:
            //  Normalized Text: mlnets normalizetext api changes the case of the text and removeskeeps diacritics punctuations andor numbers
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public string NormalizedText { get; set; }
        }
    }
}
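The parameter descriptions state that the estimator also operates on a vector of text, in which case the output column is a vector of text as well. The sketch below illustrates that case under stated assumptions: TokenRow, NormalizedTokenRow, NormalizeVectorSketch, and the sample tokens are hypothetical names introduced for illustration, and the input is assumed to be already split into tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

public static class NormalizeVectorSketch
{
    // Hypothetical input row: a string[] property is loaded as a vector of text.
    private class TokenRow
    {
        public string[] Tokens { get; set; }
    }

    // Hypothetical output row: adds the normalized vector column.
    private class NormalizedTokenRow : TokenRow
    {
        public string[] NormalizedTokens { get; set; }
    }

    public static List<string[]> Run()
    {
        var mlContext = new MLContext();

        // Hypothetical pre-tokenized sample.
        var rows = new List<TokenRow>
        {
            new TokenRow { Tokens = new[] { "Hello,", "WORLD!", "naïve", "42" } }
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(rows);

        // Defaults apply for caseMode, keepDiacritics, and keepPunctuations;
        // only keepNumbers is overridden here.
        var estimator = mlContext.Transforms.Text.NormalizeText(
            "NormalizedTokens", "Tokens", keepNumbers: false);

        IDataView transformed = estimator.Fit(dataView).Transform(dataView);

        return mlContext.Data
            .CreateEnumerable<NormalizedTokenRow>(transformed, reuseRowObject: false)
            .Select(row => row.NormalizedTokens)
            .ToList();
    }

    public static void Main()
    {
        foreach (var tokens in Run())
        {
            Console.WriteLine(string.Join(" | ", tokens));
        }
    }
}
```

Each element of the vector is normalized with the same settings, so the options chosen for a scalar text column carry over unchanged.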