BertTokenizer Class

Definition

Tokenizer for BERT models.

public sealed class BertTokenizer : Microsoft.ML.Tokenizers.WordPieceTokenizer
type BertTokenizer = class
    inherit WordPieceTokenizer
Public NotInheritable Class BertTokenizer
Inherits WordPieceTokenizer
Inheritance
Object → Tokenizer → WordPieceTokenizer → BertTokenizer

Remarks

The BertTokenizer is based on the WordPieceTokenizer and is used to tokenize text for BERT models. Its implementation follows the original BERT tokenizer in the Hugging Face Transformers library: https://huggingface.co/transformers/v3.0.2/model_doc/bert.html?highlight=berttokenizerfast#berttokenizer

Properties

ApplyBasicTokenization

Gets a value indicating whether the tokenizer should apply basic tokenization, such as cleaning the text, normalizing it, and lowercasing it.

ClassificationToken

Gets the classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

ClassificationTokenId

Gets the classifier token ID.

ContinuingSubwordPrefix

Gets the prefix to use for sub-words that are not the first part of a word.

(Inherited from WordPieceTokenizer)
IndividuallyTokenizeCjk

Gets a value indicating whether the tokenizer should split the CJK characters into tokens.

LowerCaseBeforeTokenization

Gets a value indicating whether the tokenizer should lowercase the input text.

MaskingToken

Gets the token used for masking values. It is used when training the model with masked language modeling, and it is the token the model tries to predict.

MaskingTokenId

Gets the mask token ID.

MaxInputCharsPerWord

Gets the maximum number of characters allowed in a single word.

(Inherited from WordPieceTokenizer)
Normalizer

Gets the Normalizer in use by the Tokenizer.

(Inherited from WordPieceTokenizer)
PaddingToken

Gets the token used for padding, for example when batching sequences of different lengths.

PaddingTokenId

Gets the padding token ID.

PreTokenizer

Gets the PreTokenizer used by the Tokenizer.

(Inherited from WordPieceTokenizer)
RemoveNonSpacingMarks

Gets a value indicating whether to remove non-spacing marks.

SeparatorToken

Gets the separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

SeparatorTokenId

Gets the separator token ID.

SpecialTokens

Gets the special tokens and their corresponding ids.

(Inherited from WordPieceTokenizer)
SplitOnSpecialTokens

Gets a value indicating whether the tokenizer should split on the special tokens or treat special tokens as normal text.

UnknownToken

Gets the unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

(Inherited from WordPieceTokenizer)
UnknownTokenId

Gets the unknown token ID. A token that is not in the vocabulary cannot be converted to an ID and is mapped to this ID instead.

(Inherited from WordPieceTokenizer)

Methods

BuildInputsWithSpecialTokens(IEnumerable<Int32>, IEnumerable<Int32>)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:

- single sequence: [CLS] tokenIds [SEP]
- pair of sequences: [CLS] tokenIds [SEP] additionalTokenIds [SEP]

BuildInputsWithSpecialTokens(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:

- single sequence: [CLS] tokenIds [SEP]
- pair of sequences: [CLS] tokenIds [SEP] additionalTokenIds [SEP]
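
The [CLS] / [SEP] layout described for these overloads can be illustrated without the library. The following is a minimal C# sketch, not the BertTokenizer implementation. The helper BuildInputs is hypothetical, and the IDs 101 and 102 for [CLS] and [SEP] are assumptions borrowed from the commonly used bert-base-uncased vocabulary; a real tokenizer reports its own values through ClassificationTokenId and SeparatorTokenId.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class BertInputLayoutSketch
{
    // Assumed IDs for illustration only; read the real values from
    // ClassificationTokenId and SeparatorTokenId on an actual tokenizer.
    private const int ClsId = 101;
    private const int SepId = 102;

    // Hypothetical helper that mirrors the documented layout:
    //   single sequence:   [CLS] tokenIds [SEP]
    //   pair of sequences: [CLS] tokenIds [SEP] additionalTokenIds [SEP]
    public static List<int> BuildInputs(
        IEnumerable<int> tokenIds,
        IEnumerable<int>? additionalTokenIds = null)
    {
        var result = new List<int> { ClsId };
        result.AddRange(tokenIds);
        result.Add(SepId);

        if (additionalTokenIds != null)
        {
            result.AddRange(additionalTokenIds);
            result.Add(SepId);
        }

        return result;
    }

    public static void Main()
    {
        // Single sequence: prints "101 7592 2088 102"
        Console.WriteLine(string.Join(" ", BuildInputs(new[] { 7592, 2088 })));

        // Pair of sequences: prints "101 7592 2088 102 2129 2024 102"
        Console.WriteLine(string.Join(" ",
            BuildInputs(new[] { 7592, 2088 }, new[] { 2129, 2024 })));
    }
}
```

The token IDs passed in are arbitrary placeholders; only the positions of the special tokens matter for the illustration.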

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

(Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

(Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Get the number of tokens that the input text will be encoded to.

(Inherited from WordPieceTokenizer)
Create(Stream, BertOptions)

Create a new instance of the BertTokenizer class.

Create(String, BertOptions)

Create a new instance of the BertTokenizer class.

CreateAsync(Stream, BertOptions, CancellationToken)

Create a new instance of the BertTokenizer class asynchronously.

CreateAsync(String, BertOptions, CancellationToken)

Create a new instance of the BertTokenizer class asynchronously.

CreateTokenTypeIdsFromSequences(IEnumerable<Int32>, IEnumerable<Int32>)

Create a mask from the two sequences passed, to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If additionalTokenIds is null, this method returns only the first portion of the type IDs (0s).
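
The mask format above can be reproduced with a few lines of plain C#. This is a hedged sketch rather than the library's code: the helper CreateTokenTypeIds is hypothetical, and it assumes the Hugging Face convention that the first segment covers [CLS], the first sequence, and its [SEP], while the second segment covers the second sequence and the final [SEP].

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class TokenTypeIdsSketch
{
    // Hypothetical helper. Assumes the special tokens are counted in the mask:
    //   first segment  = [CLS] + tokenIds + [SEP]      -> 0s
    //   second segment = additionalTokenIds + [SEP]    -> 1s
    public static List<int> CreateTokenTypeIds(
        IEnumerable<int> tokenIds,
        IEnumerable<int>? additionalTokenIds = null)
    {
        // +2 accounts for [CLS] and the first [SEP].
        var typeIds = Enumerable.Repeat(0, tokenIds.Count() + 2).ToList();

        if (additionalTokenIds != null)
        {
            // +1 accounts for the trailing [SEP].
            typeIds.AddRange(Enumerable.Repeat(1, additionalTokenIds.Count() + 1));
        }

        return typeIds;
    }

    public static void Main()
    {
        // Single sequence of 2 tokens: prints "0 0 0 0"
        Console.WriteLine(string.Join(" ", CreateTokenTypeIds(new[] { 7592, 2088 })));

        // Pair with 2 and 3 tokens: prints "0 0 0 0 1 1 1 1"
        Console.WriteLine(string.Join(" ",
            CreateTokenTypeIds(new[] { 7592, 2088 }, new[] { 2129, 2024, 2017 })));
    }
}
```

Under these assumptions the mask has the same length as the sequence built with special tokens, so the two can be fed to a model side by side.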

CreateTokenTypeIdsFromSequences(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>)
Decode(IEnumerable<Int32>, Boolean)

Decode the given ids back to a String.

(Inherited from WordPieceTokenizer)
Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)

Decode the given ids back to text and store the result in the destination span.

(Inherited from WordPieceTokenizer)
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)

Decode the given ids back to text and store the result in the destination span.

(Inherited from WordPieceTokenizer)
Decode(IEnumerable<Int32>)

Decode the given ids back to a String.

(Inherited from WordPieceTokenizer)
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(ReadOnlySpan<Char>, Int32, Boolean, String, Int32, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(String, Boolean, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(String, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(String, Int32, Boolean, String, Int32, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to token Ids.

(Inherited from WordPieceTokenizer)
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

(Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to a list of EncodedTokens.

(Inherited from WordPieceTokenizer)
EncodeToTokens(String, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

(Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from WordPieceTokenizer)
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity, measured from the end of the text, without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity, measured from the end of the text, without surpassing the token limit.

(Inherited from Tokenizer)
GetSpecialTokensMask(IEnumerable<Int32>, IEnumerable<Int32>, Boolean)

Retrieve the sequence tokens mask from a list of IDs.

GetSpecialTokensMask(IEnumerable<Int32>, Span<Int32>, Int32, IEnumerable<Int32>, Boolean)

Retrieve the sequence tokens mask from a list of IDs.
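
As a rough illustration of what such a mask can look like, the sketch below marks each position of the final sequence as either a special token or a sequence token. It is not the library's implementation. The helper SpecialTokensMask is hypothetical, it covers only input that does not yet contain special tokens (the case where the Boolean argument would be false), and the convention of 1 for a special token and 0 for a sequence token is an assumption taken from the Hugging Face tokenizers.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class SpecialTokensMaskSketch
{
    // Hypothetical helper for input WITHOUT special tokens already added.
    // Assumed convention: 1 = special token ([CLS] or [SEP]), 0 = sequence token.
    public static List<int> SpecialTokensMask(
        IEnumerable<int> tokenIds,
        IEnumerable<int>? additionalTokenIds = null)
    {
        var mask = new List<int> { 1 };                         // [CLS]
        mask.AddRange(Enumerable.Repeat(0, tokenIds.Count()));  // first sequence
        mask.Add(1);                                            // [SEP]

        if (additionalTokenIds != null)
        {
            mask.AddRange(Enumerable.Repeat(0, additionalTokenIds.Count()));
            mask.Add(1);                                        // trailing [SEP]
        }

        return mask;
    }

    public static void Main()
    {
        // Single sequence of 2 tokens: prints "1 0 0 1"
        Console.WriteLine(string.Join(" ", SpecialTokensMask(new[] { 7592, 2088 })));

        // Pair with 2 and 2 tokens: prints "1 0 0 1 0 0 1"
        Console.WriteLine(string.Join(" ",
            SpecialTokensMask(new[] { 7592, 2088 }, new[] { 2129, 2024 })));
    }
}
```

A mask like this is typically used to skip [CLS] and [SEP] positions when post-processing per-token model outputs.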
