Microsoft.ML.Tokenizers Namespace

Classes

Bpe

Represent the Byte Pair Encoding model.

BpeDecoder

Allows decoding Original BPE by joining all the tokens and then replacing the suffix used to identify end-of-words by white spaces

BpeTrainer

The Bpe trainer responsible to train the Bpe model.

EnglishRoberta

Represent the Byte Pair Encoding model.

LowerCaseNormalizer

Normalize the string to lowercase form before processing it with the tokenizer.

Model

Represents a model used during Tokenization (like BPE or Word Piece or Unigram).

Normalizer

Normalize the string before processing it with the tokenizer.

PreTokenizer

Base class for all pre-tokenizers classes. The PreTokenizer is in charge of doing the pre-segmentation step.

RobertaPreTokenizer

The pre-tokenizer for Roberta English tokenizer.

Split

This Split contains the underlying split token as well as its offsets in the original string. These offsets are in the original referential. It also contains any Token associated to the current split.

Token

Represent the token produced from the tokenization process containing the token substring, the id associated to the token substring, and the offset mapping to the original string.

Tokenizer

A Tokenizer works as a pipeline. It processes some raw text as input and outputs a TokenizerResult object.

TokenizerDecoder

A Decoder has the responsibility to merge the given list of tokens in a string.

TokenizerResult

The Encoding represents the output of a Tokenizer.

Trainer

A Trainer has the responsibility to train a model. We feed it with lines/sentences and then it can train the given Model.

UpperCaseNormalizer

Normalize the string to uppercase form before processing it with the tokenizer.

WhiteSpace

The pre-tokenizer which split the text at the word boundary. The word is a set of alphabet, numeric, and underscore characters.

Structs

AddedToken

Represent a token added by the user on top of the existing Model vocabulary. AddedToken can be configured to specify the behavior they should have in various situations like:

  • Whether they should only match single words
  • Whether to include any WhiteSpace on its left or right
NormalizedString

Contains the normalized string and the mapping to the original string.

Progress

Enums

ProgressState

Represent the state of the reported progress.

Delegates

ReportProgress