CodeGenTokenizer.Create(Stream, Stream, Boolean, Boolean, Boolean) Method

Definition

Namespace: Microsoft.ML.Tokenizers

Assembly: Microsoft.ML.Tokenizers.dll

Package: Microsoft.ML.Tokenizers v1.0.1

Package: Microsoft.ML.Tokenizers v0.22.0

Package: Microsoft.ML.Tokenizers v2.0.0-preview.1.25125.4

Source: CodeGenTokenizer.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Creates a CodeGen tokenizer from the given vocabulary and merges streams.

public static Microsoft.ML.Tokenizers.CodeGenTokenizer Create(System.IO.Stream vocabStream, System.IO.Stream mergesStream, bool addPrefixSpace = false, bool addBeginOfSentence = false, bool addEndOfSentence = false);

static member Create : System.IO.Stream * System.IO.Stream * bool * bool * bool -> Microsoft.ML.Tokenizers.CodeGenTokenizer

Public Shared Function Create (vocabStream As Stream, mergesStream As Stream, Optional addPrefixSpace As Boolean = false, Optional addBeginOfSentence As Boolean = false, Optional addEndOfSentence As Boolean = false) As CodeGenTokenizer

Parameters

vocabStream: Stream

The stream containing the vocab file.

mergesStream: Stream

The stream containing the merges file.

addPrefixSpace: Boolean

Indicates whether to add a space before the token. The default is false.

addBeginOfSentence: Boolean

Indicates whether to emit the beginning-of-sentence token during encoding. The default is false.

addEndOfSentence: Boolean

Indicates whether to emit the end-of-sentence token during encoding. The default is false.

Returns

CodeGenTokenizer

The CodeGen tokenizer object.

Remarks

The tokenizer is created according to the configuration specified in https://huggingface.co/Salesforce/codegen-350M-mono/raw/main/tokenizer.json. It is important to provide vocab and merges files that match the ones used to train the model. Both files can be downloaded from the following links:

https://huggingface.co/Salesforce/codegen-350M-mono/resolve/main/vocab.json?download=true

https://huggingface.co/Salesforce/codegen-350M-mono/resolve/main/merges.txt?download=true

When creating the tokenizer, ensure that the vocabulary stream is sourced from a trusted provider.
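Example

The following is a minimal sketch of how this overload might be called, not an official sample. It assumes the vocab and merges files have already been downloaded and saved locally; the file names vocab.json and merges.txt and the input string are hypothetical. It also assumes the EncodeToIds and Decode members inherited from Tokenizer, as available in the v1.0.x package; other package versions may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// Assumption: both files were downloaded beforehand from a trusted source
// and saved in the working directory under these hypothetical names.
using Stream vocabStream = File.OpenRead("vocab.json");
using Stream mergesStream = File.OpenRead("merges.txt");

// All three flags are optional and default to false; they are spelled out
// here only to make the configuration explicit.
CodeGenTokenizer tokenizer = CodeGenTokenizer.Create(
    vocabStream,
    mergesStream,
    addPrefixSpace: false,
    addBeginOfSentence: false,
    addEndOfSentence: false);

// Encode a short snippet to token IDs, then decode the IDs back to text.
IReadOnlyList<int> ids = tokenizer.EncodeToIds("def add(a, b): return a + b");
Console.WriteLine(string.Join(", ", ids));
Console.WriteLine(tokenizer.Decode(ids));
```

Passing true for addBeginOfSentence or addEndOfSentence would make the tokenizer emit the corresponding special token during encoding, as described in the Parameters section.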
