What's new in ML.NET

2025-05-16

Note

This article is a work in progress.

You can find all of the release notes for the ML.NET API in the dotnet/machinelearning repo.

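New deep-learning tasks

ML.NET 3.0 added support for the following deep-learning tasks:

Object detection (backed by TorchSharp)
Named entity recognition (NER)
Question answering (QA)

These trainers are included in the Microsoft.ML.TorchSharp package. For more information, see Announcing ML.NET 3.0.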

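Automated machine learning

In ML.NET 3.0, the AutoML sweeper was updated to support the sentence similarity, question answering, and object detection tasks. For more information about AutoML, see How to use the ML.NET Automated Machine Learning (AutoML) API.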

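Additional tokenizer support

Tokenization is a fundamental step in preprocessing natural language text for AI models. A tokenizer breaks a string of text into smaller, more manageable parts, usually called tokens. When you use services such as Azure OpenAI, a tokenizer helps you understand cost and manage context. When you work with self-hosted or local models, tokens are the inputs provided to those models. For more information about tokenization in the Microsoft.ML.Tokenizers library, see Announcing ML.NET 2.0.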
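To make the "cost and context" point concrete, here is a minimal sketch of the arithmetic a token count feeds into. The context window size, the reserved completion budget, the price, and both helper functions are hypothetical values and names chosen for illustration; they are not taken from any service or from Microsoft.ML.Tokenizers. In practice the token count would come from a tokenizer call such as CountTokens.

```csharp
using System;
using System.Globalization;

// Hypothetical limits and pricing, for illustration only.
const int contextWindowTokens = 8192;          // assumed model context window
const int reservedForCompletion = 1024;        // assumed budget kept free for the reply
const decimal pricePer1KInputTokens = 0.03m;   // assumed price, not a real quote

// Hypothetical helper: does the prompt still leave room for the reply?
static bool FitsInContext(int promptTokens, int contextWindow, int reserved) =>
    promptTokens + reserved <= contextWindow;

// Hypothetical helper: input cost scales linearly with the token count.
static decimal EstimateInputCost(int promptTokens, decimal pricePer1K) =>
    promptTokens / 1000m * pricePer1K;

int promptTokens = 6000; // stand-in for a value returned by a tokenizer

Console.WriteLine(FitsInContext(promptTokens, contextWindowTokens, reservedForCompletion));
// True (6000 + 1024 <= 8192)
Console.WriteLine(EstimateInputCost(promptTokens, pricePer1KInputTokens)
    .ToString(CultureInfo.InvariantCulture));
// 0.18
```

Both checks depend only on the token count, which is why an accurate, model-specific tokenizer matters more than a character or word count.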

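The Microsoft.ML.Tokenizers package provides an open-source, cross-platform tokenization library. In ML.NET 4.0, the library has been enhanced in the following ways:

Refined APIs and existing functionality.
Added Tiktoken support.
Added tokenizer support for the Llama model.
Added the CodeGen tokenizer, which is compatible with models such as codegen-350M-mono and phi-2.
Added EncodeToIds overloads that accept Span<char> instances and let you customize normalization and pretokenization.
Worked closely with the DeepDev TokenizerLib and SharpToken communities to cover the scenarios those libraries address. If you're using DeepDev or SharpToken, we recommend migrating to Microsoft.ML.Tokenizers. For more information, see the migration guide.

The following example shows how to use the Tiktoken text tokenizer.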

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string text = "Hello, World!";

// Encode to IDs.
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {9906, 11, 4435, 0}

// Decode IDs to text.
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!

// Get token count.
int idsCount = tokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 4

// Full encoding.
IReadOnlyList<EncodedToken> result = tokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {9906, 11, 4435, 0}

// Encode up to number of tokens limit.
int index1 = tokenizer.GetIndexByTokenCount(
    text,
    maxTokenCount: 1,
    out string? processedText1,
    out int tokenCount1
    ); // Encode up to one token.
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");
// index1 = 5

int index2 = tokenizer.GetIndexByTokenCountFromEnd(
    text,
    maxTokenCount: 1,
    out string? processedText2,
    out int tokenCount2
    ); // Encode from end up to one token.
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 12
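The token-limit calls in the example above can be combined into a small helper that trims a prompt to a token budget. The following is a minimal sketch, not an official pattern: TrimToTokenBudget is a hypothetical helper name, and the code assumes the Microsoft.ML.Tokenizers package (plus whichever data package the gpt-4 encoding requires in your version) is referenced by the project.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Assumption: the tokenizer data for "gpt-4" is available to the project.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

// Hypothetical helper: keep only the leading part of the text that fits in maxTokens.
static string TrimToTokenBudget(Tokenizer tokenizer, string text, int maxTokens)
{
    // The returned index marks where the text must be cut to stay within the budget.
    int index = tokenizer.GetIndexByTokenCount(
        text,
        maxTokens,
        out string? normalizedText,
        out int tokenCount);

    // The index refers to the normalized text when the tokenizer normalizes its input.
    string source = normalizedText ?? text;
    return source.Substring(0, index);
}

string trimmed = TrimToTokenBudget(tokenizer, "Hello, World!", 1);
Console.WriteLine(trimmed);
// Given the index reported in the example above (5), this is expected to print "Hello".
```

Cutting at the returned index rather than at a fixed character count avoids splitting the text in the middle of a token.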

The following example shows how to use the Llama text tokenizer.

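using System.IO;
using System.Net.Http;
using Microsoft.ML.Tokenizers;

// Create the Tokenizer.
// The model file is hosted remotely, so download it over HTTP;
// File.OpenRead accepts only local file paths, not URLs.
string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-llamaTokenizer/resolve/main/llamaTokenizer.model";
using HttpClient httpClient = new();
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);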

string text = "Hello, World!";

// Encode to IDs.
IReadOnlyList<int> encodedIds = llamaTokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {1, 15043, 29892, 2787, 29991}

// Decode IDs to text.
string? decodedText = llamaTokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!

// Get token count.
int idsCount = llamaTokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 5

// Full encoding.
IReadOnlyList<EncodedToken> result = llamaTokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'<s>', '▁Hello', ',', '▁World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {1, 15043, 29892, 2787, 29991}

// Encode up to two tokens.
int index1 = llamaTokenizer.GetIndexByTokenCount(text, maxTokenCount: 2, out string? processedText1, out int tokenCount1);
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 2
Console.WriteLine($"index1 = {index1}");
// index1 = 6

// Encode from end up to one token.
int index2 = llamaTokenizer.GetIndexByTokenCountFromEnd(text, maxTokenCount: 1, out string? processedText2, out int tokenCount2);
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 13

The following example shows how to use the CodeGen tokenizer.

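using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using Microsoft.ML.Tokenizers;

// The vocabulary and merges files are hosted remotely, so download them over HTTP;
// File.OpenRead accepts only local file paths, not URLs.
string phi2VocabPath = "https://huggingface.co/microsoft/phi-2/resolve/main/vocab.json?download=true";
string phi2MergePath = "https://huggingface.co/microsoft/phi-2/resolve/main/merges.txt?download=true";
using HttpClient httpClient = new();
using Stream vocabStream = await httpClient.GetStreamAsync(phi2VocabPath);
using Stream mergesStream = await httpClient.GetStreamAsync(phi2MergePath);

Tokenizer phi2Tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);
IReadOnlyList<int> ids = phi2Tokenizer.EncodeToIds("Hello, World");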

The following example shows how to use the tokenizer with Span<char> and how to disable normalization or pretokenization in the encoding calls.

ReadOnlySpan<char> textSpan = "Hello World".AsSpan();

// Bypass normalization.
IReadOnlyList<int> ids = llamaTokenizer.EncodeToIds(textSpan, considerNormalization: false);

// Bypass pretokenization.
ids = llamaTokenizer.EncodeToIds(textSpan, considerPreTokenization: false);

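Byte-level support in BPE tokenizer

BpeTokenizer now supports byte-level encoding, which enables compatibility with models such as DeepSeek. This enhancement processes the vocabulary as UTF-8 bytes. In addition, the new BpeOptions type simplifies tokenizer configuration.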

// `vocabs` is assumed to be defined elsewhere and to hold the tokenizer's vocabulary.
BpeOptions bpeOptions = new BpeOptions(vocabs);
BpeTokenizer tokenizer = BpeTokenizer.Create(bpeOptions);
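As background for "byte-level", the sketch below shows what it means to view text as UTF-8 bytes. It uses only the .NET base class library and does not reproduce BpeTokenizer's internals. ToUtf8Bytes is a hypothetical helper introduced here for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper: view a string as the UTF-8 byte sequence a byte-level scheme starts from.
static byte[] ToUtf8Bytes(string text) => Encoding.UTF8.GetBytes(text);

string text = "h\u00E9llo"; // "héllo", with a precomposed é (U+00E9)
byte[] bytes = ToUtf8Bytes(text);

Console.WriteLine($"chars = {text.Length}, bytes = {bytes.Length}");
// chars = 5, bytes = 6
Console.WriteLine(BitConverter.ToString(bytes));
// 68-C3-A9-6C-6C-6F

// Decoding the bytes restores the original string, so no information is lost.
Console.WriteLine(Encoding.UTF8.GetString(bytes) == text);
// True
```

Because every string maps to a sequence drawn from only 256 possible byte values, a byte-level vocabulary can represent any input without an unknown-token fallback.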

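Deterministic option for LightGBM trainer

LightGBM trainers now expose options for deterministic training, so that training on the same data with the same random seed produces consistent results. These options include deterministic, force_row_wise, and force_col_wise.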

// `ML` is assumed to be an existing MLContext instance.
LightGbmBinaryTrainer trainer = ML.BinaryClassification.Trainers.LightGbm(new LightGbmBinaryTrainer.Options
{
    Deterministic = true,
    ForceRowWise = true
});

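Model Builder (Visual Studio extension)

Model Builder has been updated to use the ML.NET 3.0 release. Model Builder version 17.18.0 added question answering (QA) and named entity recognition (NER) scenarios.

You can find all of the Model Builder release notes in the dotnet/machinelearning-modelbuilder repo.

See also

Blog post: Announcing ML.NET 3.0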
