BertTokenizer Class

Definition

Namespace:: Microsoft.ML.Tokenizers

Assembly:: Microsoft.ML.Tokenizers.dll

Package:: Microsoft.ML.Tokenizers v1.0.1

Package:: Microsoft.ML.Tokenizers v0.22.0

Package:: Microsoft.ML.Tokenizers v2.0.0-preview.1.25125.4

Source:: BertTokenizer.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Tokenizer for BERT models.

public sealed class BertTokenizer : Microsoft.ML.Tokenizers.WordPieceTokenizer

type BertTokenizer = class
    inherit WordPieceTokenizer

Public NotInheritable Class BertTokenizer
Inherits WordPieceTokenizer

Inheritance: Object → Tokenizer → WordPieceTokenizer → BertTokenizer

Remarks

The BertTokenizer is based on the WordPieceTokenizer and is used to tokenize text for BERT models. Its implementation follows the original BERT tokenizer implementation in the Hugging Face Transformers library: https://huggingface.co/transformers/v3.0.2/model_doc/bert.html?highlight=berttokenizerfast#berttokenizer
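
To make the WordPiece idea concrete, the following is a minimal sketch of the greedy longest-match-first split commonly described for WordPiece. It does not call Microsoft.ML.Tokenizers and is not taken from the library's source: the toy vocabulary, the helper name `SplitWord`, and its default arguments are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy vocabulary; a real BERT vocabulary is loaded from a vocab file.
var vocab = new HashSet<string> { "[UNK]", "un", "##aff", "##able", "play", "##ing" };

// Splits one word into pieces by repeatedly taking the longest vocabulary match.
// Pieces after the first carry the continuing-subword prefix ("##" by convention).
List<string> SplitWord(string word, string prefix = "##", string unknown = "[UNK]", int maxChars = 100)
{
    // Overly long words are mapped straight to the unknown token.
    if (word.Length > maxChars) return new List<string> { unknown };

    var pieces = new List<string>();
    int start = 0;
    while (start < word.Length)
    {
        int end = word.Length;
        string piece = "";
        // Shrink the candidate from the right until it is found in the vocabulary.
        while (end > start)
        {
            piece = (start > 0 ? prefix : "") + word.Substring(start, end - start);
            if (vocab.Contains(piece)) break;
            end--;
        }
        // Nothing matched at this position: the whole word becomes the unknown token.
        if (end == start) return new List<string> { unknown };
        pieces.Add(piece);
        start = end;
    }
    return pieces;
}

Console.WriteLine(string.Join(" ", SplitWord("unaffable"))); // un ##aff ##able
Console.WriteLine(string.Join(" ", SplitWord("playing")));   // play ##ing
Console.WriteLine(string.Join(" ", SplitWord("xyz")));       // [UNK]
```

In this sketch the `prefix`, `unknown`, and `maxChars` parameters play the roles that the ContinuingSubwordPrefix, UnknownToken, and MaxInputCharsPerWord properties describe; the library's actual behavior may differ in details.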

Properties

ApplyBasicTokenization	Gets a value indicating whether the tokenizer should apply basic tokenization, such as cleaning the text, normalizing it, and lowercasing it.
ClassificationToken	Gets the classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
ClassificationTokenId	Gets the ID of the classifier token.
ContinuingSubwordPrefix	Gets the prefix to use for sub-words that are not the first part of a word. (Inherited from WordPieceTokenizer)
IndividuallyTokenizeCjk	Gets a value indicating whether the tokenizer should split CJK characters into individual tokens.
LowerCaseBeforeTokenization	Gets a value indicating whether the tokenizer should lowercase the input text.
MaskingToken	Gets the mask token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
MaskingTokenId	Gets the ID of the mask token.
MaxInputCharsPerWord	Gets the maximum number of characters allowed in a single word. (Inherited from WordPieceTokenizer)
Normalizer	Gets the Normalizer in use by the Tokenizer. (Inherited from WordPieceTokenizer)
PaddingToken	Gets the token used for padding, for example when batching sequences of different lengths.
PaddingTokenId	Gets the ID of the padding token.
PreTokenizer	Gets the PreTokenizer used by the Tokenizer. (Inherited from WordPieceTokenizer)
RemoveNonSpacingMarks	Gets a value indicating whether to remove non-spacing marks.
SeparatorToken	Gets the separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
SeparatorTokenId	Gets the ID of the separator token.
SpecialTokens	Gets the special tokens and their corresponding ids. (Inherited from WordPieceTokenizer)
SplitOnSpecialTokens	Gets a value indicating whether the tokenizer should split on the special tokens or treat special tokens as normal text.
UnknownToken	Gets the unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. (Inherited from WordPieceTokenizer)
UnknownTokenId	Gets the unknown token ID. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. (Inherited from WordPieceTokenizer)

Methods

BuildInputsWithSpecialTokens(IEnumerable<Int32>, IEnumerable<Int32>)	Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format: single sequence: `[CLS] tokenIds [SEP]`; pair of sequences: `[CLS] tokenIds [SEP] additionalTokenIds [SEP]`.
BuildInputsWithSpecialTokens(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>)	Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format: single sequence: `[CLS] tokenIds [SEP]`; pair of sequences: `[CLS] tokenIds [SEP] additionalTokenIds [SEP]`.
CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Get the number of tokens that the input text will be encoded to. (Inherited from WordPieceTokenizer)
Create(Stream, BertOptions)	Create a new instance of the BertTokenizer class.
Create(String, BertOptions)	Create a new instance of the BertTokenizer class.
CreateAsync(Stream, BertOptions, CancellationToken)	Create a new instance of the BertTokenizer class asynchronously.
CreateAsync(String, BertOptions, CancellationToken)	Create a new instance of the BertTokenizer class asynchronously.
CreateTokenTypeIdsFromSequences(IEnumerable<Int32>, IEnumerable<Int32>)	Creates a mask from the two sequences passed, for use in a sequence-pair classification task. A BERT sequence pair mask consists of 0s for the first sequence followed by 1s for the second sequence, for example `0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1`. If `additionalTokenIds` is null, this method returns only the first portion of the type IDs (0s).
CreateTokenTypeIdsFromSequences(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>)
Decode(IEnumerable<Int32>, Boolean)	Decodes the given IDs back to a String. (Inherited from WordPieceTokenizer)
Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)	Decode the given ids back to text and store the result in the `destination` span. (Inherited from WordPieceTokenizer)
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)	Decode the given ids back to text and store the result in the `destination` span. (Inherited from WordPieceTokenizer)
Decode(IEnumerable<Int32>)	Decodes the given IDs back to a String. (Inherited from WordPieceTokenizer)
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(ReadOnlySpan<Char>, Int32, Boolean, String, Int32, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(String, Boolean, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(String, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(String, Int32, Boolean, String, Int32, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to token Ids. (Inherited from WordPieceTokenizer)
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to a list of EncodedTokens. (Inherited from WordPieceTokenizer)
EncodeToTokens(String, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from WordPieceTokenizer)
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetSpecialTokensMask(IEnumerable<Int32>, IEnumerable<Int32>, Boolean)	Retrieves the sequence tokens mask from a list of IDs.
GetSpecialTokensMask(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>, Boolean)	Retrieves the sequence tokens mask from a list of IDs.
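
The layouts described for BuildInputsWithSpecialTokens, CreateTokenTypeIdsFromSequences, and GetSpecialTokensMask can be sketched in plain C# as follows. This sketch does not call the library. The helper names and the IDs 101 and 102 for `[CLS]` and `[SEP]` are hypothetical, and the mask convention (1 for a special token, 0 for a regular token) is assumed from the Hugging Face convention rather than confirmed by the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical special-token IDs; real values come from the loaded vocabulary.
const int Cls = 101, Sep = 102;

// [CLS] first [SEP]  or  [CLS] first [SEP] second [SEP]
List<int> BuildInputs(int[] first, int[]? second = null)
{
    var result = new List<int> { Cls };
    result.AddRange(first);
    result.Add(Sep);
    if (second != null) { result.AddRange(second); result.Add(Sep); }
    return result;
}

// 0 for [CLS] + first + [SEP]; 1 for second + [SEP].
List<int> TokenTypeIds(int[] first, int[]? second = null)
{
    var result = Enumerable.Repeat(0, first.Length + 2).ToList();
    if (second != null) result.AddRange(Enumerable.Repeat(1, second.Length + 1));
    return result;
}

// Assumed convention: 1 marks a special token, 0 marks a regular token.
List<int> SpecialTokensMask(int[] first, int[]? second = null)
{
    var result = new List<int> { 1 };
    result.AddRange(Enumerable.Repeat(0, first.Length));
    result.Add(1);
    if (second != null)
    {
        result.AddRange(Enumerable.Repeat(0, second.Length));
        result.Add(1);
    }
    return result;
}

int[] question = { 7, 8, 9 };
int[] context = { 20, 21 };

Console.WriteLine(string.Join(" ", BuildInputs(question, context)));       // 101 7 8 9 102 20 21 102
Console.WriteLine(string.Join(" ", TokenTypeIds(question, context)));      // 0 0 0 0 0 1 1 1
Console.WriteLine(string.Join(" ", SpecialTokensMask(question, context))); // 1 0 0 0 1 0 0 1
Console.WriteLine(string.Join(" ", TokenTypeIds(question)));               // 0 0 0 0 0
```

All three lists have the same length for a given input, so they can be passed to a model side by side as input IDs, segment IDs, and a special-token mask.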

Applies to