Training and testing datasets

In a Custom Speech project, you can upload datasets for training, qualitative inspection, and quantitative measurement. This article covers the types of training and testing data that you can use for Custom Speech.

Text and audio that you use to test and train a custom model should include samples from a diverse set of speakers and scenarios that you want your model to recognize. Consider these factors when you're gathering data for custom model testing and training:

  • Include text and audio data to cover the kinds of verbal statements that your users will make when they're interacting with your model. For example, a model that raises and lowers the temperature needs training on statements that people might make to request such changes.
  • Include all speech variances that you want your model to recognize. Many factors can vary speech, including accents, dialects, language-mixing, age, gender, voice pitch, stress level, and time of day.
  • Include samples from different environments, for example, indoor, outdoor, and road noise, where your model will be used.
  • Record audio with hardware devices that the production system will use. If your model must identify speech recorded on devices of varying quality, the audio data that you provide to train your model must also represent these diverse scenarios.
  • Keep the dataset diverse and representative of your project requirements. You can add more data to your model later.
  • Only include data that your model needs to transcribe. Including data that isn't within your custom model's recognition requirements can harm recognition quality overall.

Data types

The following table lists accepted data types, when each data type should be used, and the recommended quantity. Not every data type is required to create a model. Data requirements will vary depending on whether you're creating a test or training a model.

Data type Used for testing Recommended quantity Used for training Recommended quantity
Audio only Yes (visual inspection) 5+ audio files No Not applicable
Audio + human-labeled transcripts Yes (evaluation of accuracy) 0.5-5 hours of audio Yes 1-20 hours of audio
Plain text No Not applicable Yes 1-200 MB of related text
Structured text (public preview) No Not applicable Yes Up to 10 classes with up to 4,000 items and up to 50,000 training sentences
Pronunciation No Not applicable Yes 1 KB to 1 MB of pronunciation text

Training with plain text or structured text usually finishes within a few minutes.


Start with plain-text data or structured-text data. This data will improve the recognition of special terms and phrases. Training with text is much faster than training with audio (minutes versus days).

Start with small sets of sample data that match the language, acoustics, and hardware where your model will be used. Small datasets of representative data can expose problems before you invest in gathering larger datasets for training. For sample Custom Speech data, see this GitHub repository.

If you will train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. See footnotes in the regions table for more information. In regions with dedicated hardware for Custom Speech training, the Speech service will use up to 20 hours of your audio training data, and can process about 10 hours of data per day. In other regions, the Speech service uses up to 8 hours of your audio data, and can process about 1 hour of data per day. After the model is trained, you can copy the model to another region as needed with the CopyModelToSubscriptionToSubscription REST API.

Consider datasets by scenario

A model that's trained on a subset of scenarios can perform well in only those scenarios. Carefully choose data that represents the full scope of scenarios that you need your custom model to recognize. The following table shows datasets to consider for some speech recognition scenarios:

Scenario Plain text data and structured text data Audio + human-labeled transcripts New words with pronunciation
Call center Marketing documents, website, product reviews related to call center activity Call center calls transcribed by humans Terms that have ambiguous pronunciations (see the Xbox example in the preceding section)
Voice assistant Lists of sentences that use various combinations of commands and entities Recorded voices speaking commands into device, transcribed into text Names (movies, songs, products) that have unique pronunciations
Dictation Written input, such as instant messages or emails Similar to preceding examples Similar to preceding examples
Video closed captioning TV show scripts, movies, marketing content, video summaries Exact transcripts of videos Similar to preceding examples

To help determine which dataset to use to address your problems, refer to the following table:

Use case Data type
Improve recognition accuracy on industry-specific vocabulary and grammar, such as medical terminology or IT jargon. Plain text or structured text data
Define the phonetic and displayed form of a word or term that has nonstandard pronunciation, such as product names or acronyms. Pronunciation data or phonetic pronunciation in structured text
Improve recognition accuracy on speaking styles, accents, or specific background noises. Audio + human-labeled transcripts

Audio + human-labeled transcript data for training or testing

You can use audio + human-labeled transcript data for both training and testing purposes. You must provide human-labeled transcriptions (word by word) for comparison:

  • To improve the acoustic aspects like slight accents, speaking styles, and background noises.
  • To measure the accuracy of Microsoft's speech-to-text accuracy when it's processing your audio files.

For a list of base models that support training with audio data, see Language support. Even if a base model does support training with audio data, the service might use only part of the audio. And it will still use all the transcripts.


If a base model doesn't support customization with audio data, only the transcription text will be used for training. If you switch to a base model that supports customization with audio data, the training time may increase from several hours to several days. The change in training time would be most noticeable when you switch to a base model in a region without dedicated hardware for training. If the audio data is not required, you should remove it to decrease the training time.

Audio with human-labeled transcripts offers the greatest accuracy improvements if the audio comes from the target use case. Samples must cover the full scope of speech. For example, a call center for a retail store would get the most calls about swimwear and sunglasses during summer months. Ensure that your sample includes the full scope of speech that you want to detect.

Consider these details:

  • Training with audio will bring the most benefits if the audio is also hard to understand for humans. In most cases, you should start training by using only related text.
  • If you use one of the most heavily used languages, such as US English, it's unlikely that you would need to train with audio data. For such languages, the base models already offer very good recognition results in most scenarios, so it's probably enough to train with related text.
  • Custom Speech can capture word context only to reduce substitution errors, not insertion or deletion errors.
  • Avoid samples that include transcription errors, but do include a diversity of audio quality.
  • Avoid sentences that are unrelated to your problem domain. Unrelated sentences can harm your model.
  • When the transcript quality varies, you can duplicate exceptionally good sentences, such as excellent transcriptions that include key phrases, to increase their weight.
  • The Speech service automatically uses the transcripts to improve the recognition of domain-specific words and phrases, as though they were added as related text.
  • It can take several days for a training operation to finish. To improve the speed of training, be sure to create your Speech service subscription in a region that has dedicated hardware for training.

A large training dataset is required to improve recognition. Generally, we recommend that you provide word-by-word transcriptions for 1 to 20 hours of audio. However, even as little as 30 minutes can help improve recognition results. Although creating human-labeled transcription can take time, improvements in recognition will only be as good as the data that you provide. You should upload only high-quality transcripts.

Audio files can have silence at the beginning and end of the recording. If possible, include at least a half-second of silence before and after speech in each sample file. Although audio with low recording volume or disruptive background noise is not helpful, it shouldn't limit or degrade your custom model. Always consider upgrading your microphones and signal processing hardware before gathering audio samples.

Custom Speech projects require audio files with these properties:

Property Value
File format RIFF (WAV)
Sample rate 8,000 Hz or 16,000 Hz
Channels 1 (mono)
Maximum length per audio 2 hours (testing) / 60 s (training)
Sample format PCM, 16-bit
Archive format .zip
Maximum zip size 2 GB or 10,000 files

Plain-text data for training

You can add plain text sentences of related text to improve the recognition of domain-specific words and phrases. Related text sentences can reduce substitution errors related to misrecognition of common words and domain-specific words by showing them in context. Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.

Provide domain-related sentences in a single text file. Use text data that's close to the expected spoken utterances. Utterances don't need to be complete or grammatically correct, but they must accurately reflect the spoken input that you expect the model to recognize. When possible, try to have one sentence or keyword controlled on a separate line. To increase the weight of a term such as product names, add several sentences that include the term. But don't copy too much - it could affect the overall recognition rate.


Avoid related text sentences that include noise such as unrecognizable characters or words.

Use this table to ensure that your plain text dataset file is formatted correctly:

Property Value
Text encoding UTF-8 BOM
Number of utterances per line 1
Maximum file size 200 MB

You must also adhere to the following restrictions:

  • Avoid repeating characters, words, or groups of words more than three times, as in "aaaa," "yeah yeah yeah yeah," or "that's it that's it that's it that's it." The Speech service might drop lines with too many repetitions.
  • Don't use special characters or UTF-8 characters above U+00A1.
  • URIs will be rejected.
  • For some languages such as Japanese or Korean, importing large amounts of text data can take a long time or can time out. Consider dividing the dataset into multiple text files with up to 20,000 lines in each.

Structured-text data for training


Structured-text data for training is in public preview.

Use structured text data when your data follows a particular pattern in particular utterances that differ only by words or phrases from a list. To simplify the creation of training data and to enable better modeling inside the Custom Language model, you can use a structured text in Markdown format to define lists of items and phonetic pronunciation of words. You can then reference these lists inside your training utterances.

Expected utterances often follow a certain pattern. One common pattern is that utterances differ only by words or phrases from a list. Examples of this pattern could be:

  • "I have a question about product," where product is a list of possible products.
  • "Make that object color," where object is a list of geometric shapes and color is a list of colors.

For a list of supported base models and locales for training with structured text, see Language support. You must use the latest base model for these locales. For locales that don't support training with structured text, the service will take any training sentences that don't reference any classes as part of training with plain-text data.

The structured-text file should have an .md extension. The maximum file size is 200 MB, and the text encoding must be UTF-8 BOM. The syntax of the Markdown is the same as that from the Language Understanding models, in particular list entities and example utterances. For more information about the complete Markdown syntax, see the Language Understanding Markdown.

Here are key details about the supported Markdown format:

Property Description Limits
@list A list of items that can be referenced in an example sentence. Maximum of 20 lists. Maximum of 35,000 items per list.
speech:phoneticlexicon A list of phonetic pronunciations according to the Universal Phone Set. Pronunciation is adjusted for each instance where the word appears in a list or training sentence. For example, if you have a word that sounds like "cat" and you want to adjust the pronunciation to "k ae t", you would add - cat/k ae t to the speech:phoneticlexicon list. Maximum of 15,000 entries. Maximum of 2 pronunciations per word.
#ExampleSentences A pound symbol (#) delimits a section of example sentences. The section heading can only contain letters, digits, and underscores. Example sentences should reflect the range of speech that your model should expect. A training sentence can refer to items under a @list by using surrounding left and right curly braces ({@list name}). You can refer to multiple lists in the same training sentence, or none at all. Maximum file size of 200MB.
// Comments follow a double slash (//). Not applicable

Here's an example structured text file:

// This is a comment because it follows a double slash (`//`).

// Here are three separate lists of items that can be referenced in an example sentence. You can have up to 10 of these.
@ list food =
- pizza
- burger
- ice cream
- soda

@ list pet =
- cat
- dog
- fish

@ list sports =
- soccer
- tennis
- cricket
- basketball
- baseball
- football

// List of phonetic pronunciations
@ speech:phoneticlexicon
- cat/k ae t
- fish/f ih sh

// Here are two sections of training sentences. 
- you can include sentences without a class reference
- what {@pet} do you have
- I like eating {@food} and playing {@sports}
- my {@pet} likes {@food}

- you can include more sentences without a class reference
- or more sentences that have a class reference like {@pet} 

Pronunciation data for training

Specialized or made up words might have unique pronunciations. These words can be recognized if they can be broken down into smaller words to pronounce them. For example, to recognize "Xbox", pronounce it as "X box". This approach won't increase overall accuracy, but can improve recognition of this and other keywords.

You can provide a custom pronunciation file to improve recognition. Don't use custom pronunciation files to alter the pronunciation of common words. For a list of languages that support custom pronunciation, see language support.


You can either use a pronunciation data file on its own, or you can add pronunciation within a structured text data file. The Speech service doesn't support training a model where you select both of those datasets as input.

The spoken form is the phonetic sequence spelled out. It can be composed of letters, words, syllables, or a combination of all three. This table includes some examples:

Recognized displayed form Spoken form
3CPO three c p o
CNTK c n t k
IEEE i triple e

You provide pronunciations in a single text file. Include the spoken utterance and a custom pronunciation for each. Each row in the file should begin with the recognized form, then a tab character, and then the space-delimited phonetic sequence.

3CPO    three c p o
CNTK    c n t k
IEEE    i triple e

Refer to the following table to ensure that your pronunciation dataset files are valid and correctly formatted.

Property Value
Text encoding UTF-8 BOM (ANSI is also supported for English)
Number of pronunciations per line 1
Maximum file size 1 MB (1 KB for free tier)

Audio data for testing

Audio data is optimal for testing the accuracy of Microsoft's baseline speech-to-text model or a custom model. Keep in mind that audio data is used to inspect the accuracy of speech with regard to a specific model's performance. If you want to quantify the accuracy of a model, use audio + human-labeled transcripts.

Custom Speech projects require audio files with these properties:

Property Value
File format RIFF (WAV)
Sample rate 8,000 Hz or 16,000 Hz
Channels 1 (mono)
Maximum length per audio 2 hours
Sample format PCM, 16-bit
Archive format .zip
Maximum archive size 2 GB or 10,000 files


When you're uploading training and testing data, the .zip file size can't exceed 2 GB. If you require more data for training, divide it into several .zip files and upload them separately. Later, you can choose to train from multiple datasets. However, you can test from only a single dataset.

Use SoX to verify audio properties or convert existing audio to the appropriate formats. Here are some example SoX commands:

Activity SoX command
Check the audio file format. sox --i <filename>
Convert the audio file to single channel, 16-bit, 16 KHz. sox <input> -b 16 -e signed-integer -c 1 -r 16k -t wav <output>.wav

Audio data for training

Not all base models support training with audio data. For a list of base models that support training with audio data, see Language support.

Even if a base model supports training with audio data, the service might use only part of the audio. In regions with dedicated hardware available for training audio data, the Speech service will use up to 20 hours of your audio training data. In other regions, the Speech service uses up to 8 hours of your audio data.

Next steps