Microsoft.ML.Tokenizers Namespace

Classes

BertOptions

Options for the BERT tokenizer.

BertTokenizer

Tokenizer for the BERT model.

BitVector
Bpe

Represents the Byte Pair Encoding (BPE) model.
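
The listing above only names the class; as a rough illustration of the BPE algorithm itself (in Python, and not the Microsoft.ML.Tokenizers API), encoding applies the learned merges to a word in priority order:

```python
# Minimal sketch of BPE encoding (illustrative only, not the
# Microsoft.ML.Tokenizers API): repeatedly merge the adjacent symbol
# pair with the best learned merge priority.
def bpe_encode(word, merges):
    """merges maps a symbol pair to its priority (lower = earlier)."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair by its merge priority.
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, best_i = min(pairs)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1}  # hypothetical merge table
print(bpe_encode("lower", merges))  # ['low', 'e', 'r']
```

The merge table here is invented for illustration; a real vocabulary is learned by a trainer from a corpus.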

BpeDecoder

Decodes original BPE output by joining all the tokens and then replacing the suffix used to mark end-of-word with whitespace.

BpeTokenizer

Represents the Byte Pair Encoding (BPE) model.

BpeTrainer

The BPE trainer responsible for training the BPE model.
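
One BPE training step can be sketched as follows (illustrative Python, not the library API): count adjacent symbol pairs across the corpus, then merge the most frequent pair everywhere. The toy word frequencies are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a symbol tuple to its corpus frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0]

def apply_merge(words, pair):
    """Replace every occurrence of the pair with its merged symbol."""
    merged_symbol = pair[0] + pair[1]
    out = {}
    for symbols, freq in words.items():
        new, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new.append(merged_symbol)
                i += 2
            else:
                new.append(symbols[i])
                i += 1
        out[tuple(new)] = freq
    return out

words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 3}
pair = most_frequent_pair(words)   # ('l', 'o'), seen 10 times
words = apply_merge(words, pair)
```

A full trainer repeats this loop, recording each chosen pair, until the target vocabulary size is reached.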

CodeGenTokenizer

Represents the Byte Pair Encoding (BPE) model. Implements the CodeGen tokenizer described at https://huggingface.co/docs/transformers/main/en/model_doc/codegen#overview

DawgBuilder
EnglishRoberta

Represents the Byte Pair Encoding (BPE) model.

EnglishRobertaTokenizer

Represents the Byte Pair Encoding (BPE) model.

LlamaTokenizer

LlamaTokenizer is a SentencePieceTokenizer, implemented based on https://github.com/google/sentencepiece.

LowerCaseNormalizer

Normalizes the string to lowercase before the tokenizer processes it.

Model

Represents a model used during tokenization (such as BPE, WordPiece, or Unigram).

Normalizer

Normalizes the string before the tokenizer processes it.

Phi2Tokenizer

Represents the Byte Pair Encoding (BPE) model. Implements the Phi-2 tokenizer described at https://huggingface.co/microsoft/phi-2

PreTokenizer

Base class for all pre-tokenizer classes. The PreTokenizer performs the pre-segmentation step.

RegexPreTokenizer

The pre-tokenizer for the Tiktoken tokenizer.

RobertaPreTokenizer

The pre-tokenizer for the English RoBERTa tokenizer.

SentencePieceNormalizer

Normalizes the string according to SentencePiece normalization.

SentencePieceTokenizer

SentencePieceTokenizer is a tokenizer that splits the input into tokens using the SentencePiece BPE model.

Split

A Split contains the underlying split substring and its offsets, which are relative to the original string. It also contains any Token associated with the current split.

TiktokenTokenizer

Represents the rapid Byte Pair Encoding (BPE) tokenizer.

Token

Represents a token produced by the tokenization process, containing the token substring, the ID associated with it, and the offset mapping to the original string.

Tokenizer

Provides an abstraction for tokenizers, enabling the encoding of text into tokens and the decoding of token IDs back into text.

TokenizerDecoder

A decoder is responsible for merging the given list of tokens into a string.

TokenizerResult

Represents the output of a Tokenizer.

Trainer

A trainer is responsible for training a model. It is fed lines/sentences and then trains the given Model.

UpperCaseNormalizer

Normalizes the string to uppercase before the tokenizer processes it.

WhiteSpace

The pre-tokenizer that splits the text at word boundaries. A word is a run of alphabetic, numeric, and underscore characters.

WordPieceOptions

Options for the WordPiece tokenizer.

WordPieceTokenizer

Represents the WordPiece tokenizer.
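
The WordPiece algorithm (as opposed to this class's API) is greedy longest-match-first: take the longest vocabulary entry from the current position, marking continuation pieces with a `##` prefix. A rough Python sketch, with an invented toy vocabulary:

```python
# Illustrative sketch of WordPiece encoding (not the
# Microsoft.ML.Tokenizers API): greedy longest-match against the
# vocabulary, with "##" marking word-internal continuation pieces.
def wordpiece_encode(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and retry
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}  # hypothetical vocabulary
print(wordpiece_encode("unaffable", vocab))  # ['un', '##aff', '##able']
```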

Structs

AddedToken

Represents a token added by the user on top of the existing Model vocabulary. An AddedToken can be configured to specify the behavior it should have in various situations, such as:

  • Whether it should only match single words
  • Whether to include any whitespace on its left or right

EncodedToken

Represents a token produced by the tokenization process, containing the token substring, the ID associated with it, and the offset mapping to the original string.

EncodeResults<T>

The result of encoding a text.

EncodeSettings

The settings used to encode a text.

NormalizedString

Contains the normalized string and the mapping to the original string.

Progress

Enums

ProgressState

Represents the state of the reported progress.

Delegates