What's new in ML.NET

Note

This article is a work in progress.

You can find all of the release notes for the ML.NET API in the dotnet/machinelearning repo.

New deep-learning tasks

ML.NET 3.0 added support for the following deep-learning tasks:

Object detection (backed by TorchSharp)
Named entity recognition (NER)
Question answering (QA)

These trainers are included in the Microsoft.ML.TorchSharp package. For more information, see Announcing ML.NET 3.0.

AutoML

In ML.NET 3.0, the AutoML sweeper was updated to support the sentence similarity, question answering, and object detection tasks. For more information about AutoML, see How to use the ML.NET Automated Machine Learning (AutoML) API.

Additional tokenizer support

Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as tokens. When using services like Azure OpenAI, you can use tokenizers to get a better understanding of cost and manage context. When working with self-hosted or local models, tokens are the inputs provided to those models. For more information about tokenization in the Microsoft.ML.Tokenizers library, see Announcing ML.NET 2.0.

The Microsoft.ML.Tokenizers package provides an open-source, cross-platform tokenization library. In ML.NET 4.0, the library has been enhanced in the following ways:

Refined APIs and existing functionality.
Added Tiktoken support.
Added tokenizer support for the Llama model.
Added the CodeGen tokenizer, which is compatible with models such as codegen-350M-mono and phi-2.
Added EncodeToIds overloads that accept Span<char> instances and let you customize normalization and pretokenization.
Worked closely with the DeepDev TokenizerLib and SharpToken communities to cover scenarios covered by those libraries. If you're using DeepDev or SharpToken, we recommend migrating to Microsoft.ML.Tokenizers. For more details, see the migration guide.

The following examples show how to use the Tiktoken text tokenizer.

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string text = "Hello, World!";

// Encode to IDs.
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {9906, 11, 4435, 0}

// Decode IDs to text.
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!

// Get token count.
int idsCount = tokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 4

// Full encoding.
IReadOnlyList<EncodedToken> result = tokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {9906, 11, 4435, 0}

// Encode up to number of tokens limit.
int index1 = tokenizer.GetIndexByTokenCount(
    text,
    maxTokenCount: 1,
    out string? processedText1,
    out int tokenCount1
    ); // Encode up to one token.
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");
// index1 = 5

int index2 = tokenizer.GetIndexByTokenCountFromEnd(
    text,
    maxTokenCount: 1,
    out string? processedText2,
    out int tokenCount2
    ); // Encode from end up to one token.
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 12

The following examples show how to use the Llama text tokenizer.

// Create the Tokenizer.
string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-llamaTokenizer/resolve/main/llamaTokenizer.model";
using Stream remoteStream = File.OpenRead(modelUrl);
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);

string text = "Hello, World!";

// Encode to IDs.
IReadOnlyList<int> encodedIds = llamaTokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {1, 15043, 29892, 2787, 29991}

// Decode IDs to text.
string? decodedText = llamaTokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!

// Get token count.
int idsCount = llamaTokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 5

// Full encoding.
IReadOnlyList<EncodedToken> result = llamaTokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'<s>', '▁Hello', ',', '▁World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {1, 15043, 29892, 2787, 29991}

// Encode up 2 tokens.
int index1 = llamaTokenizer.GetIndexByTokenCount(text, maxTokenCount: 2, out string? processedText1, out int tokenCount1);
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 2
Console.WriteLine($"index1 = {index1}");
// index1 = 6

// Encode from end up to one token.
int index2 = llamaTokenizer.GetIndexByTokenCountFromEnd(text, maxTokenCount: 1, out string? processedText2, out int tokenCount2);
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 13

The following examples show how to use the CodeGen tokenizer.

string phi2VocabPath = "https://huggingface.co/microsoft/phi-2/resolve/main/vocab.json?download=true";
string phi2MergePath = "https://huggingface.co/microsoft/phi-2/resolve/main/merges.txt?download=true";
using Stream vocabStream = File.OpenRead(phi2VocabPath);
using Stream mergesStream = File.OpenRead(phi2MergePath);

Tokenizer phi2Tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);
IReadOnlyList<int> ids = phi2Tokenizer.EncodeToIds("Hello, World");

The following example demonstrates how to use the tokenizer with Span<char> and how to disable normalization or pretokenization on the encoding calls.

ReadOnlySpan<char> textSpan = "Hello World".AsSpan();

// Bypass normalization.
IReadOnlyList<int> ids = llamaTokenizer.EncodeToIds(textSpan, considerNormalization: false);

// Bypass pretokenization.
ids = llamaTokenizer.EncodeToIds(textSpan, considerPreTokenization: false);

Byte-level support in BPE tokenizer

The BpeTokenizer now supports byte-level encoding, enabling compatibility with models like DeepSeek. This enhancement processes vocabulary as UTF-8 bytes. In addition, the new BpeOptions type simplifies tokenizer configuration.

BpeOptions bpeOptions = new BpeOptions(vocabs);
BpeTokenizer tokenizer = BpeTokenizer.Create(bpeOptions);

Deterministic option for LightGBM trainer

LightGBM trainers now expose options for deterministic training, ensuring consistent results with the same data and random seed. These options include deterministic, force_row_wise, and force_col_wise.

LightGbmBinaryTrainer trainer = ML.BinaryClassification.Trainers.LightGbm(new LightGbmBinaryTrainer.Options
{
    Deterministic = true,
    ForceRowWise = true
});

Model Builder (Visual Studio extension)

Model Builder has been updated to consume the ML.NET 3.0 release. Model Builder version 17.18.0 added question answering (QA) and named entity recognition (NER) scenarios.

You can find all of the Model Builder release notes in the dotnet/machinelearning-modelbuilder repo.

Feedback

Was this page helpful?

Last updated on 2025-05-16