TextCatalog.TokenizeIntoCharactersAsKeys Method
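Important
Some information relates to prerelease product that may be substantially modified before it's released. Microsoft makes no warranties, express or implied, with respect to the information provided here.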
Creates a TokenizingByCharactersEstimator, which tokenizes by splitting text into sequences of characters using a sliding window.
C#
public static Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator TokenizeIntoCharactersAsKeys(this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, bool useMarkerCharacters = true);
static member TokenizeIntoCharactersAsKeys : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * bool -> Microsoft.ML.Transforms.Text.TokenizingByCharactersEstimator
<Extension()>
Public Function TokenizeIntoCharactersAsKeys (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional useMarkerCharacters As Boolean = true) As TokenizingByCharactersEstimator
- catalog
- TransformsCatalog.TextTransforms
The text-related transform's catalog.
- outputColumnName
- String
Name of the column resulting from the transformation of inputColumnName.
This column's data type will be a variable-sized vector of keys.
- inputColumnName
- String
Name of the column to transform. If set to null, the value of outputColumnName will be used as the source.
This estimator operates over the text data type.
- useMarkerCharacters
- Boolean
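To make the tokens distinguishable, for example when debugging, you can choose to prepend a marker character, 0x02, to the beginning and append another marker character, 0x03, to the end of the output vector of characters.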
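The effect of useMarkerCharacters can be pictured with a small, library-free sketch. The helper name WrapWithMarkers is hypothetical and is not part of ML.NET; it only mimics the behavior described above (0x02 prepended, 0x03 appended) on a plain character array, and makes no claim about how the estimator is implemented internally or about the key values it emits.

```csharp
using System;

public static class MarkerCharacterSketch
{
    // Marker values taken from the useMarkerCharacters description.
    private const char StartMarker = '\u0002';
    private const char EndMarker = '\u0003';

    // Hypothetical helper (not an ML.NET API): splits text into characters
    // and optionally wraps the sequence in start/end marker characters.
    public static char[] WrapWithMarkers(string text, bool useMarkerCharacters)
    {
        if (!useMarkerCharacters)
            return text.ToCharArray();

        var result = new char[text.Length + 2];
        result[0] = StartMarker;
        text.CopyTo(0, result, 1, text.Length);
        result[result.Length - 1] = EndMarker;
        return result;
    }

    public static void Main()
    {
        char[] plain = WrapWithMarkers("ML", useMarkerCharacters: false);
        char[] marked = WrapWithMarkers("ML", useMarkerCharacters: true);

        Console.WriteLine(plain.Length);   // 2
        Console.WriteLine(marked.Length);  // 4

        // Print code points in hex so the non-printable markers are visible.
        Console.WriteLine(string.Join(",",
            Array.ConvertAll(marked, c => ((int)c).ToString("X2"))));
        // 02,4D,4C,03
    }
}
```

With the markers enabled, a consumer of the character vector can tell where one input string starts and ends, which is the debugging aid the parameter description refers to.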
C#
using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    public static class TokenizeIntoCharactersAsKeys
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create an empty list as the dataset. The
            // 'TokenizeIntoCharactersAsKeys' does not require training data as
            // the estimator ('TokenizingByCharactersEstimator') created by
            // 'TokenizeIntoCharactersAsKeys' API is not a trainable estimator.
            // The empty list is only needed to pass input schema to the pipeline.
            var emptySamples = new List<TextData>();

            // Convert sample list to an empty IDataView.
            var emptyDataView = mlContext.Data.LoadFromEnumerable(emptySamples);

            // A pipeline for converting text into a vector of characters.
            // The 'TokenizeIntoCharactersAsKeys' produces its result as key type.
            // 'MapKeyToValue' is needed to map keys back to their original values.
            var textPipeline = mlContext.Transforms.Text
                .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                    useMarkerCharacters: false)
                .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                    "CharTokens"));

            // Fit to data.
            var textTransformer = textPipeline.Fit(emptyDataView);

            // Create the prediction engine to get the character vector from the
            // input text/string.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Call the prediction API to convert the text into characters.
            var data = new TextData()
            {
                Text = "ML.NET's " +
                    "TokenizeIntoCharactersAsKeys API splits text/string into " +
                    "characters."
            };

            var prediction = predictionEngine.Predict(data);

            // Print the length of the character vector.
            Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");

            // Print the character vector.
            Console.WriteLine("\nCharacter Tokens: " + string.Join(",", prediction
                .CharTokens));

            // Expected output:
            //   Number of tokens: 77
            //   Character Tokens: M,L,.,N,E,T,',s,<?>,T,o,k,e,n,i,z,e,I,n,t,o,C,h,a,r,a,c,t,e,r,s,A,s,K,e,y,s,<?>,A,P,I,<?>,
            //   s,p,l,i,t,s,<?>,t,e,x,t,/,s,t,r,i,n,g,<?>,i,n,t,o,<?>,c,h,a,r,a,c,t,e,r,s,.
            //
            // <?>: is a unicode control character used instead of spaces ('\u2400').
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public string[] CharTokens { get; set; }
        }
    }
}
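As a complement to the sample above, which passes useMarkerCharacters: false, the following is a minimal sketch of the same pipeline with the marker characters enabled. It assumes the Microsoft.ML NuGet package is referenced; the class name MarkerCharactersExample and the short input string are illustrative choices, not part of the original sample. No expected output is shown, because the exact textual form that MapKeyToValue produces for the two marker characters is not stated on this page.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

public static class MarkerCharactersExample
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // The estimator is not trainable, so an empty data view is enough
        // to supply the input schema.
        var emptyDataView = mlContext.Data.LoadFromEnumerable(
            new List<TextData>());

        // Same pipeline shape as the sample above, but with the marker
        // characters (0x02 at the start, 0x03 at the end) switched on.
        var pipeline = mlContext.Transforms.Text
            .TokenizeIntoCharactersAsKeys("CharTokens", "Text",
                useMarkerCharacters: true)
            .Append(mlContext.Transforms.Conversion.MapKeyToValue(
                "CharTokens"));

        var transformer = pipeline.Fit(emptyDataView);

        var engine = mlContext.Model
            .CreatePredictionEngine<TextData, TransformedTextData>(transformer);

        var prediction = engine.Predict(new TextData { Text = "ML.NET" });

        // The count includes any marker tokens added by the transform.
        Console.WriteLine($"Number of tokens: {prediction.CharTokens.Length}");
        Console.WriteLine(string.Join(",", prediction.CharTokens));
    }

    public class TextData
    {
        public string Text { get; set; }
    }

    public class TransformedTextData : TextData
    {
        public string[] CharTokens { get; set; }
    }
}
```

Comparing the printed token count with the length of the input string is a quick way to confirm whether the markers were added.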
Product | Versions |
---|---|
ML.NET | 1.0.0, 1.1.0, 1.2.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 2.0.0, 3.0.0, 4.0.0, Preview |