Microsoft.ML.Tokenizers Namespace
Important: Some information relates to prerelease product that may be substantially modified before it's released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Classes

| Class | Description |
| --- | --- |
| BertOptions | Options for the Bert tokenizer. |
| BertTokenizer | Tokenizer for the Bert model. |
| Bit | |
| Bpe | Represents the Byte Pair Encoding model. |
| BpeDecoder | Allows decoding the original BPE by joining all the tokens and then replacing the suffix used to identify end-of-word tokens with white spaces. |
| BpeTokenizer | Represents the Byte Pair Encoding model. |
| BpeTrainer | The BPE trainer, responsible for training the Bpe model. |
| CodeGenTokenizer | Represents the Byte Pair Encoding model. Implements the CodeGen tokenizer described at https://huggingface.co/docs/transformers/main/en/model_doc/codegen#overview. |
| DawgBuilder | |
| EnglishRoberta | Represents the Byte Pair Encoding model. |
| EnglishRobertaTokenizer | Represents the Byte Pair Encoding model. |
| LlamaTokenizer | LlamaTokenizer is a SentencePieceTokenizer, implemented based on https://github.com/google/sentencepiece. |
| LowerCaseNormalizer | Normalizes the string to lowercase form before processing it with the tokenizer. |
| Model | Represents a model used during tokenization (like BPE, WordPiece, or Unigram). |
| Normalizer | Normalizes the string before processing it with the tokenizer. |
| Phi2Tokenizer | Represents the Byte Pair Encoding model. Implements the Phi2 tokenizer described at https://huggingface.co/microsoft/phi-2. |
| PreTokenizer | Base class for all pre-tokenizer classes. The PreTokenizer is in charge of the pre-segmentation step. |
| RegexPreTokenizer | The pre-tokenizer for the Tiktoken tokenizer. |
| RobertaPreTokenizer | The pre-tokenizer for the Roberta English tokenizer. |
| SentencePieceNormalizer | Normalizes the string according to SentencePiece normalization. |
| SentencePieceBpe | SentencePieceBpe is a tokenizer that splits the input into tokens using the SentencePiece Bpe model. |
| Split | This Split contains the underlying split token as well as its offsets in the original string. These offsets are in the normalized referential. |
| TiktokenTokenizer | Represents the rapid Byte Pair Encoding tokenizer. |
| Token | Represents the token produced from the tokenization process, containing the token substring, the ID associated with the token substring, and the offset mapping to the original string. |
| Tokenizer | Provides an abstraction for tokenizers, enabling the encoding of text into tokens and the decoding of token IDs back into text. |
| TokenizerDecoder | A Decoder has the responsibility to merge the given list of tokens into a string. |
| TokenizerResult | The Encoding represents the output of a Tokenizer. |
| Trainer | A Trainer has the responsibility to train a model. |
| UpperCaseNormalizer | Normalizes the string to uppercase form before processing it with the tokenizer. |
| WhiteSpace | The pre-tokenizer which splits the text at word boundaries, where a word is a set of alphabet, numeric, and underscore characters. |
| WordPieceOptions | Options for the WordPiece tokenizer. |
| WordPieceTokenizer | Represents the WordPiece tokenizer. |
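To show how these classes fit together, here is a minimal sketch of encoding and decoding text through the shared Tokenizer surface, assuming the Tiktoken tokenizer is exposed as TiktokenTokenizer with a CreateForModel factory (the stable-release shape; earlier previews named the class Tiktoken, and exact names may differ in prerelease builds). The model name "gpt-4" is only an illustrative input.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

class TokenizerExample
{
    static void Main()
    {
        // Create a Tiktoken-style tokenizer for a known model name.
        // (Factory name per the stable release; prerelease builds may differ.)
        TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

        string text = "Hello, tokenizers!";

        // Encode the text into token IDs.
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
        Console.WriteLine(string.Join(", ", ids));

        // Count tokens without building the full encoding.
        Console.WriteLine(tokenizer.CountTokens(text));

        // Decode the token IDs back into text.
        Console.WriteLine(tokenizer.Decode(ids));
    }
}
```

Because the concrete tokenizers (BertTokenizer, LlamaTokenizer, WordPieceTokenizer, and so on) all derive from Tokenizer, the same EncodeToIds, CountTokens, and Decode calls apply once an instance is created.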
Structs

| Struct | Description |
| --- | --- |
| AddedToken | Represents a token added by the user on top of the existing Model vocabulary. AddedToken can be configured to specify the behavior it should have in various situations, such as whether it should match only single words and whether to include any whitespace on its left or right. |
| EncodedToken | Represents the token produced from the tokenization process, containing the token substring, the ID associated with the token substring, and the offset mapping to the original string. |
| EncodeResults&lt;T&gt; | The result of encoding a text. |
| EncodeSettings | The settings used to encode a text. |
| NormalizedString | Contains the normalized string and the mapping to the original string. |
| Progress | Represents the state of the reported progress. |
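The structs above mostly describe encoding output. As a rough illustration of how EncodedToken values surface, the following sketch assumes the Tokenizer.EncodeToTokens overload that returns a list of EncodedToken values along with an out parameter for the normalized text; exact signatures may differ across prerelease versions.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

class EncodedTokenExample
{
    static void Main()
    {
        // Any concrete tokenizer works here; Tiktoken is used for brevity.
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

        string text = "Offsets map tokens back to the input.";

        // Each EncodedToken carries the token ID, the token substring,
        // and the offset locating it in the (possibly normalized) text.
        IReadOnlyList<EncodedToken> tokens =
            tokenizer.EncodeToTokens(text, out string? normalizedText);

        Console.WriteLine($"normalized: {normalizedText ?? text}");
        foreach (EncodedToken token in tokens)
        {
            // Offset is a System.Range locating the token in the source text.
            Console.WriteLine($"{token.Id,6}  '{token.Value}'  {token.Offset}");
        }
    }
}
```

The offsets make it possible to highlight or slice the original input per token, which is what the "offset mapping to the original string" in the EncodedToken description refers to.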