CodeGenTokenizer.Create(Stream, Stream, Boolean, Boolean, Boolean) Method

Definition

Creates a CodeGen tokenizer from the given vocab and merges streams.

public static Microsoft.ML.Tokenizers.CodeGenTokenizer Create(System.IO.Stream vocabStream, System.IO.Stream mergesStream, bool addPrefixSpace = false, bool addBeginOfSentence = false, bool addEndOfSentence = false);
static member Create : System.IO.Stream * System.IO.Stream * bool * bool * bool -> Microsoft.ML.Tokenizers.CodeGenTokenizer
Public Shared Function Create (vocabStream As Stream, mergesStream As Stream, Optional addPrefixSpace As Boolean = false, Optional addBeginOfSentence As Boolean = false, Optional addEndOfSentence As Boolean = false) As CodeGenTokenizer

Parameters

vocabStream
Stream

The stream containing the vocab file.

mergesStream
Stream

The stream containing the merges file.

addPrefixSpace
Boolean

Indicates whether to add a space before the token.

addBeginOfSentence
Boolean

Indicates whether to emit the beginning-of-sentence token during encoding.

addEndOfSentence
Boolean

Indicates whether to emit the end-of-sentence token during encoding.

Returns

The CodeGen tokenizer object.

Remarks

The tokenizer is created according to the configuration specified in https://huggingface.co/Salesforce/codegen-350M-mono/raw/main/tokenizer.json. It is important to provide vocab and merges files similar to the ones used to train the model. The vocab and merges files can be downloaded from the following links:

https://huggingface.co/Salesforce/codegen-350M-mono/resolve/main/vocab.json?download=true
https://huggingface.co/Salesforce/codegen-350M-mono/resolve/main/merges.txt?download=true

When creating the tokenizer, ensure that the vocabulary stream is sourced from a trusted provider.
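Example

The following is a minimal sketch of how this method might be called, not an official sample. It assumes that vocab.json and merges.txt have already been downloaded from a trusted source into the working directory (the local file names are hypothetical), and that the EncodeToIds and Decode methods of the base Tokenizer class are available in the installed version of Microsoft.ML.Tokenizers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// Hypothetical local paths: assumes both files were downloaded beforehand
// from a trusted source and saved next to the executable.
using Stream vocabStream = File.OpenRead("vocab.json");
using Stream mergesStream = File.OpenRead("merges.txt");

// The three optional Boolean parameters keep their defaults (all false),
// so no prefix space and no begin/end-of-sentence tokens are added.
Tokenizer tokenizer = CodeGenTokenizer.Create(vocabStream, mergesStream);

// Assumption: EncodeToIds and Decode come from the base Tokenizer class.
IReadOnlyList<int> ids = tokenizer.EncodeToIds("def add(a, b): return a + b");
string decoded = tokenizer.Decode(ids);

Console.WriteLine($"Token count: {ids.Count}");
Console.WriteLine(decoded);
```

To emit the special tokens, pass the named arguments explicitly, for example addBeginOfSentence: true.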

Applies to