BertTokenizer Class
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Tokenizer for BERT models.
public sealed class BertTokenizer : Microsoft.ML.Tokenizers.WordPieceTokenizer
type BertTokenizer = class
inherit WordPieceTokenizer
Public NotInheritable Class BertTokenizer
Inherits WordPieceTokenizer
- Inheritance
Remarks
The BertTokenizer is based on the WordPieceTokenizer and is used to tokenize text for BERT models. Its implementation follows the original BERT tokenizer implementation in the Hugging Face Transformers library: https://huggingface.co/transformers/v3.0.2/model_doc/bert.html?highlight=berttokenizerfast#berttokenizer
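As a rough illustration of the WordPiece scheme that BertTokenizer builds on, the sketch below splits a single word greedily into the longest matching vocabulary pieces. It is a self-contained approximation, not the library's implementation: the `WordPiece` function, the toy vocabulary, and the default values shown (`##` prefix, `[UNK]` token, 100-character limit) are assumptions chosen for the example.

```csharp
using System;
using System.Collections.Generic;

// Greedy longest-match-first split of one word into WordPiece tokens.
// All names and defaults here are illustrative assumptions, not the library's API.
static List<string> WordPiece(
    string word,
    ISet<string> vocab,
    string continuingSubwordPrefix = "##",
    string unknownToken = "[UNK]",
    int maxInputCharsPerWord = 100)
{
    // Words longer than the limit map straight to the unknown token.
    if (word.Length > maxInputCharsPerWord)
        return new List<string> { unknownToken };

    var pieces = new List<string>();
    int start = 0;
    while (start < word.Length)
    {
        int end = word.Length;
        string piece = "";
        bool found = false;

        // Shrink the candidate from the right until it is found in the vocabulary.
        while (end > start)
        {
            string candidate = word.Substring(start, end - start);
            if (start > 0)
                candidate = continuingSubwordPrefix + candidate; // not the first piece of the word
            if (vocab.Contains(candidate))
            {
                piece = candidate;
                found = true;
                break;
            }
            end--;
        }

        // If no piece matches at this position, the whole word becomes the unknown token.
        if (!found)
            return new List<string> { unknownToken };

        pieces.Add(piece);
        start = end;
    }
    return pieces;
}

var vocab = new HashSet<string> { "un", "##aff", "##able", "play", "##ing" };
Console.WriteLine(string.Join(" ", WordPiece("unaffable", vocab))); // un ##aff ##able
Console.WriteLine(string.Join(" ", WordPiece("playing", vocab)));   // play ##ing
Console.WriteLine(string.Join(" ", WordPiece("xyz", vocab)));       // [UNK]
```

The three optional parameters mirror the roles of the ContinuingSubwordPrefix, UnknownToken, and MaxInputCharsPerWord properties listed below.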
Properties
ApplyBasicTokenization |
Gets a value indicating whether the tokenizer should apply basic tokenization, such as cleaning the text, normalizing it, and lowercasing it. |
ClassificationToken |
Gets the classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. |
ClassificationTokenId |
Gets the classifier token ID. |
ContinuingSubwordPrefix |
Gets the prefix to use for sub-words that are not the first part of a word. (Inherited from WordPieceTokenizer) |
IndividuallyTokenizeCjk |
Gets a value indicating whether the tokenizer should split the CJK characters into tokens. |
LowerCaseBeforeTokenization |
Gets a value indicating whether the tokenizer should lowercase the input text. |
MaskingToken |
Gets the mask token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. |
MaskingTokenId |
Gets the mask token ID. |
MaxInputCharsPerWord |
Gets the maximum number of characters allowed in a single word. (Inherited from WordPieceTokenizer) |
Normalizer |
Gets the Normalizer in use by the Tokenizer. (Inherited from WordPieceTokenizer) |
PaddingToken |
Gets the token used for padding, for example when batching sequences of different lengths. |
PaddingTokenId |
Gets the padding token ID. |
PreTokenizer |
Gets the PreTokenizer used by the Tokenizer. (Inherited from WordPieceTokenizer) |
RemoveNonSpacingMarks |
Gets a value indicating whether to remove non-spacing marks. |
SeparatorToken |
Gets the separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. |
SeparatorTokenId |
Gets the separator token ID. |
SpecialTokens |
Gets the special tokens and their corresponding ids. (Inherited from WordPieceTokenizer) |
SplitOnSpecialTokens |
Gets a value indicating whether the tokenizer should split on the special tokens or treat special tokens as normal text. |
UnknownToken |
Gets the unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. (Inherited from WordPieceTokenizer) |
UnknownTokenId |
Gets the unknown token ID. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. (Inherited from WordPieceTokenizer) |
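The flags ApplyBasicTokenization, LowerCaseBeforeTokenization, RemoveNonSpacingMarks, and IndividuallyTokenizeCjk describe text preparation that happens before WordPiece splitting. The sketch below shows one plausible reading of those steps. The `BasicTokenize` function and its parameters are hypothetical, punctuation splitting is omitted, and only the main CJK Unified Ideographs block is checked, so the library's actual behavior may differ.

```csharp
using System;
using System.Globalization;
using System.Text;

// Simplified basic tokenization; the function name and parameters are hypothetical.
static string[] BasicTokenize(string text, bool lowerCase, bool removeNonSpacingMarks, bool splitCjk)
{
    if (lowerCase)
        text = text.ToLowerInvariant();

    // Canonical decomposition turns an accented letter into a base letter
    // followed by a combining mark (for example U+0301).
    if (removeNonSpacingMarks)
        text = text.Normalize(NormalizationForm.FormD);

    var sb = new StringBuilder();
    foreach (char c in text)
    {
        if (removeNonSpacingMarks &&
            CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            continue; // drop the combining mark

        // Assumption: only the main CJK Unified Ideographs block is treated as CJK.
        if (splitCjk && c >= '\u4E00' && c <= '\u9FFF')
            sb.Append(' ').Append(c).Append(' '); // isolate the character as its own token
        else
            sb.Append(c);
    }

    // Split on spaces only; punctuation splitting is left out of this sketch.
    return sb.ToString().Split(' ', StringSplitOptions.RemoveEmptyEntries);
}

Console.WriteLine(string.Join("|", BasicTokenize("Caf\u00E9 Ol\u00E9", true, true, false))); // cafe|ole
Console.WriteLine(string.Join("|", BasicTokenize("ab\u4E2D\u6587cd", false, false, true)));  // ab|中|文|cd
```

Each resulting piece would then be handed to the WordPiece step to be split into vocabulary tokens.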
Methods
BuildInputsWithSpecialTokens(IEnumerable<Int32>, IEnumerable<Int32>) |
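Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:
- single sequence: classification token, sequence tokens, separator token
- pair of sequences: classification token, first sequence tokens, separator token, second sequence tokens, separator token |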
BuildInputsWithSpecialTokens(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>) |
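Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:
- single sequence: classification token, sequence tokens, separator token
- pair of sequences: classification token, first sequence tokens, separator token, second sequence tokens, separator token |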
CountTokens(ReadOnlySpan<Char>, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer) |
CountTokens(String, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer) |
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings) |
Get the number of tokens that the input text will be encoded to. (Inherited from WordPieceTokenizer) |
Create(Stream, BertOptions) |
Create a new instance of the BertTokenizer class. |
Create(String, BertOptions) |
Create a new instance of the BertTokenizer class. |
CreateAsync(Stream, BertOptions, CancellationToken) |
Create a new instance of the BertTokenizer class asynchronously. |
CreateAsync(String, BertOptions, CancellationToken) |
Create a new instance of the BertTokenizer class asynchronously. |
CreateTokenTypeIdsFromSequences(IEnumerable<Int32>, IEnumerable<Int32>) |
Create a mask from the two sequences passed, for use in a sequence-pair classification task. A BERT sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence     | second sequence |
If the second sequence is not provided, only the first portion of the mask (0s) is returned. |
CreateTokenTypeIdsFromSequences(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>) | |
Decode(IEnumerable<Int32>, Boolean) |
Decode the given IDs back to a String. (Inherited from WordPieceTokenizer) |
Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32) |
Decode the given IDs back to text and store the result in the destination buffer. |
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32) |
Decode the given IDs back to text and store the result in the destination buffer. |
Decode(IEnumerable<Int32>) |
Decode the given IDs back to a String. (Inherited from WordPieceTokenizer) |
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(ReadOnlySpan<Char>, Int32, Boolean, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(String, Boolean, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(String, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(String, Int32, Boolean, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids. |
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings) |
Encodes input text to token Ids. (Inherited from WordPieceTokenizer) |
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer) |
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings) |
Encodes input text to a list of EncodedTokens. (Inherited from WordPieceTokenizer) |
EncodeToTokens(String, String, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer) |
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from WordPieceTokenizer) |
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
GetSpecialTokensMask(IEnumerable<Int32>, IEnumerable<Int32>, Boolean) |
Retrieve the sequence tokens mask from a list of IDs. |
GetSpecialTokensMask(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>, Boolean) |
Retrieve the sequence tokens mask from a list of IDs. |
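To make the outputs of BuildInputsWithSpecialTokens, CreateTokenTypeIdsFromSequences, and GetSpecialTokensMask easier to picture, the sketch below reproduces their layouts with plain lists. It does not call the library. The helper names and the IDs 101 and 102 are hypothetical stand-ins for ClassificationTokenId and SeparatorTokenId, the inputs are assumed not to contain special tokens yet, and the mask convention (1 for a special token, 0 for a sequence token) is assumed from the Hugging Face implementation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical IDs; real values come from ClassificationTokenId and SeparatorTokenId.
const int Cls = 101;
const int Sep = 102;

// Layout: Cls, first, Sep, and for a pair additionally: second, Sep.
List<int> BuildInputs(IReadOnlyList<int> first, IReadOnlyList<int>? second = null)
{
    var ids = new List<int> { Cls };
    ids.AddRange(first);
    ids.Add(Sep);
    if (second != null)
    {
        ids.AddRange(second);
        ids.Add(Sep);
    }
    return ids;
}

// 0 for every position of the first segment (including Cls and its Sep),
// 1 for every position of the second segment (including its Sep).
List<int> TokenTypeIds(IReadOnlyList<int> first, IReadOnlyList<int>? second = null)
{
    var types = Enumerable.Repeat(0, first.Count + 2).ToList();
    if (second != null)
        types.AddRange(Enumerable.Repeat(1, second.Count + 1));
    return types;
}

// 1 marks a special token position, 0 marks an ordinary sequence token.
List<int> SpecialTokensMask(IReadOnlyList<int> first, IReadOnlyList<int>? second = null)
{
    var mask = new List<int> { 1 };
    mask.AddRange(Enumerable.Repeat(0, first.Count));
    mask.Add(1);
    if (second != null)
    {
        mask.AddRange(Enumerable.Repeat(0, second.Count));
        mask.Add(1);
    }
    return mask;
}

var a = new[] { 7, 8, 9 };
var b = new[] { 20, 21 };
Console.WriteLine(string.Join(" ", BuildInputs(a, b)));       // 101 7 8 9 102 20 21 102
Console.WriteLine(string.Join(" ", TokenTypeIds(a, b)));      // 0 0 0 0 0 1 1 1
Console.WriteLine(string.Join(" ", SpecialTokensMask(a, b))); // 1 0 0 0 1 0 0 1
```

All three lists have the same length, so they can be passed to a model side by side as input IDs, segment IDs, and a special-token mask.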