CodeGenTokenizer.Create(Stream, Stream, Boolean, Boolean, Boolean) Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Creates a CodeGen tokenizer from the given vocabulary and merges streams.
public static Microsoft.ML.Tokenizers.CodeGenTokenizer Create(System.IO.Stream vocabStream, System.IO.Stream mergesStream, bool addPrefixSpace = false, bool addBeginOfSentence = false, bool addEndOfSentence = false);
static member Create : System.IO.Stream * System.IO.Stream * bool * bool * bool -> Microsoft.ML.Tokenizers.CodeGenTokenizer
Public Shared Function Create (vocabStream As Stream, mergesStream As Stream, Optional addPrefixSpace As Boolean = false, Optional addBeginOfSentence As Boolean = false, Optional addEndOfSentence As Boolean = false) As CodeGenTokenizer
Parameters
- vocabStream
- Stream
The stream containing the vocab file.
- mergesStream
- Stream
The stream containing the merges file.
- addPrefixSpace
- Boolean
Indicates whether to add a leading space to the text before encoding it.
- addBeginOfSentence
- Boolean
Indicates whether to emit the beginning-of-sentence token during encoding.
- addEndOfSentence
- Boolean
Indicates whether to emit the end-of-sentence token during encoding.
Returns
The CodeGen tokenizer object.
Remarks
The tokenizer is created according to the configuration specified in https://huggingface.co/Salesforce/codegen-350M-mono/raw/main/tokenizer.json.
It is important to provide vocab and merges files that match the ones used to train the model. The files can be downloaded from the following links:
- https://huggingface.co/Salesforce/codegen-350M-mono/resolve/main/vocab.json?download=true
- https://huggingface.co/Salesforce/codegen-350M-mono/resolve/main/merges.txt?download=true
When creating the tokenizer, ensure that the vocabulary stream is sourced from a trusted provider.
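The following is a minimal usage sketch, not an official sample. It assumes that vocab.json and merges.txt have already been downloaded from the links in the Remarks and saved under the hypothetical local file names shown. It also assumes that the EncodeToIds and Decode members inherited from the package's Tokenizer base class are available in the installed version. Because this API is prerelease, those names may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// Hypothetical local paths: both files are assumed to have been downloaded
// beforehand from a trusted source (see Remarks).
using Stream vocabStream = File.OpenRead("vocab.json");
using Stream mergesStream = File.OpenRead("merges.txt");

// All three flags are passed explicitly here for illustration;
// they default to false when omitted.
CodeGenTokenizer tokenizer = CodeGenTokenizer.Create(
    vocabStream,
    mergesStream,
    addPrefixSpace: false,
    addBeginOfSentence: false,
    addEndOfSentence: false);

// Assumed members from the Tokenizer base class: encode text to token IDs,
// then decode the IDs back to text.
IReadOnlyList<int> ids = tokenizer.EncodeToIds("def add(a, b): return a + b");
Console.WriteLine($"Token count: {ids.Count}");
Console.WriteLine(tokenizer.Decode(ids));
```

Opening the files with using declarations disposes both streams when the program ends. The sketch does not dispose them before the call to Create returns.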