Events
Build intelligent apps
March 17, 9 PM - March 21, 10 AM
Join the meetup series to build scalable AI solutions based on real-world use cases with fellow developers and experts.
Register now
Notes
This article is a work in progress.
You can find all of the release notes for the ML.NET API in the dotnet/machinelearning repo.
ML.NET 3.0 added support for the following deep-learning tasks:

- Object detection
- Named entity recognition (NER)
- Question answering (QA)

These trainers are included in the Microsoft.ML.TorchSharp package. For more information, see Announcing ML.NET 3.0.
In ML.NET 3.0, the AutoML sweeper was updated to support the sentence similarity, question answering, and object detection tasks. For more information about AutoML, see How to use the ML.NET Automated Machine Learning (AutoML) API.
Tokenization is a fundamental component in the preprocessing of natural language text for AI models. Tokenizers are responsible for breaking down a string of text into smaller, more manageable parts, often referred to as tokens. When using services like Azure OpenAI, you can use tokenizers to get a better understanding of cost and manage context. When working with self-hosted or local models, tokens are the inputs provided to those models. For more information about tokenization in the Microsoft.ML.Tokenizers library, see Announcing ML.NET 2.0.
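As a minimal sketch of the cost-management point above, the following snippet counts a prompt's tokens with the Tiktoken tokenizer and multiplies by a per-token rate. The price constant is a hypothetical value for illustration, not a real Azure OpenAI rate.

```csharp
using Microsoft.ML.Tokenizers;

// Create a Tiktoken tokenizer for the target model.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string prompt = "Hello, World!";

// Count the tokens the prompt will consume before sending it to the service.
int tokenCount = tokenizer.CountTokens(prompt);

// Hypothetical price per 1,000 input tokens (an assumption for illustration only).
const decimal pricePer1KTokens = 0.03m;
decimal estimatedCost = tokenCount / 1000m * pricePer1KTokens;

Console.WriteLine($"{tokenCount} tokens, estimated cost: {estimatedCost}");
```

Counting tokens up front this way lets you budget requests without calling the service.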
The Microsoft.ML.Tokenizers package provides an open-source, cross-platform tokenization library. In ML.NET 4.0, the library has been enhanced in the following ways:
- Tiktoken support.
- Llama model support.
- CodeGen tokenizer, which is compatible with models such as codegen-350M-mono and phi-2.
- EncodeToIds overloads that accept Span<char> instances and let you customize normalization and pretokenization.
- Engagement with the TokenizerLib and SharpToken communities to cover the scenarios those libraries support. If you're using DeepDev or SharpToken, we recommend migrating to Microsoft.ML.Tokenizers. For more details, see the migration guide.

The following examples show how to use the Tiktoken text tokenizer.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string text = "Hello, World!";
// Encode to IDs.
IReadOnlyList<int> encodedIds = tokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {9906, 11, 4435, 0}
// Decode IDs to text.
string? decodedText = tokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!
// Get token count.
int idsCount = tokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 4
// Full encoding.
IReadOnlyList<EncodedToken> result = tokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'Hello', ',', ' World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {9906, 11, 4435, 0}
// Encode up to a maximum number of tokens.
int index1 = tokenizer.GetIndexByTokenCount(
text,
maxTokenCount: 1,
out string? processedText1,
out int tokenCount1
); // Encode up to one token.
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 1
Console.WriteLine($"index1 = {index1}");
// index1 = 5
int index2 = tokenizer.GetIndexByTokenCountFromEnd(
text,
maxTokenCount: 1,
out string? processedText2,
out int tokenCount2
); // Encode from end up to one token.
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 12
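A common use for GetIndexByTokenCount is trimming input to a token budget: the returned index is a cut point in the processed text, so the prefix that fits the budget can be taken with a substring. A minimal sketch using the same gpt-4 tokenizer (the three-token budget is an arbitrary choice for illustration):

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string text = "Hello, World!";

// Find where the text must be cut so that at most 3 tokens remain.
int index = tokenizer.GetIndexByTokenCount(
    text,
    maxTokenCount: 3,
    out string? processedText,
    out int tokenCount);

// Keep only the prefix that fits within the token budget.
string truncated = text.Substring(0, index);
Console.WriteLine($"truncated = {truncated}");
```

Because the tokens here are 'Hello', ',', ' World', and '!', the three-token prefix ends just before the exclamation mark.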
The following examples show how to use the Llama
text tokenizer.
// Create the Tokenizer. File.OpenRead can't open a URL, so download the
// remote model file with HttpClient instead.
string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-llamaTokenizer/resolve/main/llamaTokenizer.model";
using HttpClient httpClient = new();
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string text = "Hello, World!";
// Encode to IDs.
IReadOnlyList<int> encodedIds = llamaTokenizer.EncodeToIds(text);
Console.WriteLine($"encodedIds = {{{string.Join(", ", encodedIds)}}}");
// encodedIds = {1, 15043, 29892, 2787, 29991}
// Decode IDs to text.
string? decodedText = llamaTokenizer.Decode(encodedIds);
Console.WriteLine($"decodedText = {decodedText}");
// decodedText = Hello, World!
// Get token count.
int idsCount = llamaTokenizer.CountTokens(text);
Console.WriteLine($"idsCount = {idsCount}");
// idsCount = 5
// Full encoding.
IReadOnlyList<EncodedToken> result = llamaTokenizer.EncodeToTokens(text, out string? normalizedString);
Console.WriteLine($"result.Tokens = {{'{string.Join("', '", result.Select(t => t.Value))}'}}");
// result.Tokens = {'<s>', '▁Hello', ',', '▁World', '!'}
Console.WriteLine($"result.Ids = {{{string.Join(", ", result.Select(t => t.Id))}}}");
// result.Ids = {1, 15043, 29892, 2787, 29991}
// Encode up to two tokens.
int index1 = llamaTokenizer.GetIndexByTokenCount(text, maxTokenCount: 2, out string? processedText1, out int tokenCount1);
Console.WriteLine($"tokenCount1 = {tokenCount1}");
// tokenCount1 = 2
Console.WriteLine($"index1 = {index1}");
// index1 = 6
// Encode from end up to one token.
int index2 = llamaTokenizer.GetIndexByTokenCountFromEnd(text, maxTokenCount: 1, out string? processedText2, out int tokenCount2);
Console.WriteLine($"tokenCount2 = {tokenCount2}");
// tokenCount2 = 1
Console.WriteLine($"index2 = {index2}");
// index2 = 13
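GetIndexByTokenCountFromEnd is useful when the most recent text matters most, such as keeping the tail of a chat transcript within a token budget. A minimal sketch using the Tiktoken tokenizer (the transcript string and the eight-token budget are illustrative assumptions):

```csharp
using Microsoft.ML.Tokenizers;

Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string transcript = "User: Hello! Assistant: Hi, how can I help? User: What's the weather?";

// Find the cut point such that at most 8 tokens remain at the end.
int index = tokenizer.GetIndexByTokenCountFromEnd(
    transcript,
    maxTokenCount: 8,
    out string? processedText,
    out int tokenCount);

// The index refers to the processed text when normalization applied; fall
// back to the original text otherwise.
string processed = processedText ?? transcript;

// Keep only the most recent part of the conversation.
string tail = processed.Substring(index);
Console.WriteLine($"kept {tokenCount} tokens: {tail}");
```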
The following examples show how to use the CodeGen
tokenizer.
// These paths are URLs, so download them with HttpClient rather than File.OpenRead.
string phi2VocabPath = "https://huggingface.co/microsoft/phi-2/resolve/main/vocab.json?download=true";
string phi2MergePath = "https://huggingface.co/microsoft/phi-2/resolve/main/merges.txt?download=true";
using HttpClient httpClient = new();
using Stream vocabStream = await httpClient.GetStreamAsync(phi2VocabPath);
using Stream mergesStream = await httpClient.GetStreamAsync(phi2MergePath);
Tokenizer phi2Tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);
IReadOnlyList<int> ids = phi2Tokenizer.EncodeToIds("Hello, World");
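Because CodeGenTokenizer exposes the same base Tokenizer surface as the other tokenizers, members such as EncodeToIds and Decode work the same way. A minimal self-contained sketch that round-trips a string (assumes network access to download the phi-2 vocabulary files):

```csharp
using Microsoft.ML.Tokenizers;

// Download the phi-2 vocabulary and merges files (network access assumed).
using HttpClient httpClient = new();
using Stream vocabStream = await httpClient.GetStreamAsync(
    "https://huggingface.co/microsoft/phi-2/resolve/main/vocab.json?download=true");
using Stream mergesStream = await httpClient.GetStreamAsync(
    "https://huggingface.co/microsoft/phi-2/resolve/main/merges.txt?download=true");

Tokenizer phi2Tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);

string text = "Hello, World";

// Round-trip: encode to IDs, then decode back to the original text.
IReadOnlyList<int> ids = phi2Tokenizer.EncodeToIds(text);
string? roundTripped = phi2Tokenizer.Decode(ids);
Console.WriteLine($"decoded = {roundTripped}");
```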
The following example demonstrates how to use the tokenizer with Span<char>
and how to disable normalization or pretokenization on the encoding calls.
ReadOnlySpan<char> textSpan = "Hello World".AsSpan();
// Bypass normalization.
IReadOnlyList<int> ids = llamaTokenizer.EncodeToIds(textSpan, considerNormalization: false);
// Bypass pretokenization.
ids = llamaTokenizer.EncodeToIds(textSpan, considerPreTokenization: false);
Model Builder has been updated to consume the ML.NET 3.0 release. Model Builder version 17.18.0 added question answering (QA) and named entity recognition (NER) scenarios.
You can find all of the Model Builder release notes in the dotnet/machinelearning-modelbuilder repo.