Events
Build intelligent apps
March 17, 9 PM - March 21, 10 AM
Join the meetup series to build scalable AI solutions based on real-world use cases with fellow developers and experts.
Register now
Notes
This article is a work in progress.
You can find all of the release notes for the ML.NET API in the dotnet/machinelearning repo.
ML.NET 3.0 added support for the following deep-learning tasks:

- Object detection
- Named entity recognition (NER)
- Question answering (QA)

These trainers are included in the Microsoft.ML.TorchSharp package. For more information, see Announcing ML.NET 3.0.
In ML.NET 3.0, the AutoML sweeper was updated to support the sentence similarity, question answering, and object detection tasks. For more information about AutoML, see How to use the ML.NET Automated Machine Learning (AutoML) API.
Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as tokens. When using services like Azure OpenAI, you can use tokenizers to get a better understanding of cost and manage context. When working with self-hosted or local models, tokens are the inputs provided to those models. For more information about tokenization in the Microsoft.ML.Tokenizers library, see Announcing ML.NET 2.0.
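As a minimal sketch of the cost-management point above, the following snippet counts a prompt's tokens with the Tiktoken tokenizer and multiplies by a per-token rate. The price constant is a hypothetical value for illustration, not a real Azure OpenAI rate.

```csharp
using Microsoft.ML.Tokenizers;

// Create a Tiktoken tokenizer for the target model.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string prompt = "Hello, World!";

// Count the tokens the prompt will consume before sending it to the service.
int tokenCount = tokenizer.CountTokens(prompt);

// Hypothetical price per 1,000 input tokens (an assumption for illustration only).
const decimal pricePer1KTokens = 0.03m;
decimal estimatedCost = tokenCount / 1000m * pricePer1KTokens;

Console.WriteLine($"{tokenCount} tokens, estimated cost: {estimatedCost}");
```

Counting tokens up front this way lets you budget requests without calling the service.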
The Microsoft.ML.Tokenizers package provides an open-source, cross-platform tokenization library. In ML.NET 4.0, the library has been enhanced in the following ways:
- Tiktoken support.
- Llama model support.
- CodeGen tokenizer, which is compatible with models such as codegen-350M-mono and phi-2.
- EncodeToIds overloads that accept Span<char> instances and let you customize normalization and pretokenization.
- Engagement with the TokenizerLib and SharpToken communities to cover the scenarios those libraries support. If you're using DeepDev or SharpToken, we recommend migrating to Microsoft.ML.Tokenizers. For more details, see the migration guide.

The following examples show how to use the Tiktoken text tokenizer.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string text = "Hello, World!";
// Encode to IDs.
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {9906, 11, 4435, 0}
// Decode IDs to text.
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!
// Get token count.
int idsCount = tokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 4
// Full encoding.
IReadOnlyList<EncodedToken> result = tokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {9906, 11, 4435, 0}
// Encode up to a maximum number of tokens.
int index1 = tokenizer.GetIndexByTokenCount(
text,
maxTokenCount: 1,
out string? processedText1,
out int tokenCount1
); // Encode up to one token.
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");
// index1 = 5
int index2 = tokenizer.GetIndexByTokenCountFromEnd(
text,
maxTokenCount: 1,
out string? processedText2,
out int tokenCount2
); // Encode from end up to one token.
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 12
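A common use for GetIndexByTokenCount is trimming input to a token budget: the returned index is a cut point in the processed text, so the prefix that fits the budget can be taken with a substring. A minimal sketch using the same gpt-4 tokenizer (the three-token budget is an arbitrary choice for illustration):

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string text = "Hello, World!";

// Find where the text must be cut so that at most 3 tokens remain.
int index = tokenizer.GetIndexByTokenCount(
    text,
    maxTokenCount: 3,
    out string? processedText,
    out int tokenCount);

// Keep only the prefix that fits within the token budget.
string truncated = text.Substring(0, index);
Console.WriteLine($"truncated = {truncated}");
```

Because the tokens here are 'Hello', ',', ' World', and '!', the three-token prefix ends just before the exclamation mark.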
The following examples show how to use the Llama
text tokenizer.
// Create the Tokenizer. File.OpenRead can't open a URL, so download the
// remote model file with HttpClient instead.
string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-llamaTokenizer/resolve/main/llamaTokenizer.model";
using HttpClient httpClient = new();
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string text = "Hello, World!";
// Encode to IDs.
IReadOnlyList<int> encodedIds = llamaTokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {1, 15043, 29892, 2787, 29991}
// Decode IDs to text.
string? decodedText = llamaTokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!
// Get token count.
int idsCount = llamaTokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 5
// Full encoding.
IReadOnlyList<EncodedToken> result = llamaTokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'<s>', '▁Hello', ',', '▁World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {1, 15043, 29892, 2787, 29991}
// Encode up to two tokens.
int index1 = llamaTokenizer.GetIndexByTokenCount(text, maxTokenCount: 2, out string? processedText1, out int tokenCount1);
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 2
Console.WriteLine($"index1 = {index1}");
// index1 = 6
// Encode from end up to one token.
int index2 = llamaTokenizer.GetIndexByTokenCountFromEnd(text, maxTokenCount: 1, out string? processedText2, out int tokenCount2);
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 13
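GetIndexByTokenCountFromEnd is useful when the most recent text matters most, such as keeping the tail of a chat transcript within a token budget. A minimal sketch using the Tiktoken tokenizer (the transcript string and the eight-token budget are illustrative assumptions):

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string transcript = "User: Hello! Assistant: Hi, how can I help? User: What's the weather?";

// Find the cut point such that at most 8 tokens remain at the end.
int index = tokenizer.GetIndexByTokenCountFromEnd(
    transcript,
    maxTokenCount: 8,
    out string? processedText,
    out int tokenCount);

// The index refers to the processed text when normalization applied; fall
// back to the original text otherwise.
string processed = processedText ?? transcript;

// Keep only the most recent part of the conversation.
string tail = processed.Substring(index);
Console.WriteLine($"kept {tokenCount} tokens: {tail}");
```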
The following examples show how to use the CodeGen
tokenizer.
// These paths are URLs, so download them with HttpClient rather than File.OpenRead.
string phi2VocabPath = "https://huggingface.co/microsoft/phi-2/resolve/main/vocab.json?download=true";
string phi2MergePath = "https://huggingface.co/microsoft/phi-2/resolve/main/merges.txt?download=true";
using HttpClient httpClient = new();
using Stream vocabStream = await httpClient.GetStreamAsync(phi2VocabPath);
using Stream mergesStream = await httpClient.GetStreamAsync(phi2MergePath);
Tokenizer phi2Tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);
IReadOnlyList<int> ids = phi2Tokenizer.EncodeToIds("Hello, World");
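Because CodeGenTokenizer exposes the same base Tokenizer surface as the other tokenizers, members such as EncodeToIds and Decode work the same way. A minimal self-contained sketch that round-trips a string (assumes network access to download the phi-2 vocabulary files):

```csharp
using Microsoft.ML.Tokenizers;

// Download the phi-2 vocabulary and merges files (network access assumed).
using HttpClient httpClient = new();
using Stream vocabStream = await httpClient.GetStreamAsync(
    "https://huggingface.co/microsoft/phi-2/resolve/main/vocab.json?download=true");
using Stream mergesStream = await httpClient.GetStreamAsync(
    "https://huggingface.co/microsoft/phi-2/resolve/main/merges.txt?download=true");

Tokenizer phi2Tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);

string text = "Hello, World";

// Round-trip: encode to IDs, then decode back to the original text.
IReadOnlyList<int> ids = phi2Tokenizer.EncodeToIds(text);
string? roundTripped = phi2Tokenizer.Decode(ids);
Console.WriteLine($"decoded = {roundTripped}");
```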
The following example demonstrates how to use the tokenizer with Span<char>
and how to disable normalization or pretokenization on the encoding calls.
ReadOnlySpan<char> textSpan = "Hello World".AsSpan();
// Bypass normalization.
IReadOnlyList<int> ids = llamaTokenizer.EncodeToIds(textSpan, considerNormalization: false);
// Bypass pretokenization.
ids = llamaTokenizer.EncodeToIds(textSpan, considerPreTokenization: false);
Model Builder has been updated to consume the ML.NET 3.0 release. Model Builder version 17.18.0 added question answering (QA) and named entity recognition (NER) scenarios.
You can find all of the Model Builder release notes in the dotnet/machinelearning-modelbuilder repo.