Q: Where do I start if I want to use a base model?

First, get an API key and region in the Azure portal . If you want to make REST calls to a predeployed base model, see the REST APIs documentation. If you want to use WebSockets, download the Speech SDK .

Q: When a new version of a base model is available, is my deployment automatically updated?

Deployments are not automatically updated. If you adapted and deployed a model, the existing deployment remains as is. You can decommission the deployed model, readapt it by using the newer version of the base model, and redeploy it for better accuracy. Both base models and custom models are retired after some time (see Model and endpoint lifecycle ).

Question 1

What is the difference between a base model and a custom speech to text model?

Accepted Answer

A baseline speech to text model is trained with Microsoft-owned data and is already deployed in the cloud. You can create and use a custom model to better fit an environment that has specific ambient noise or language. Factory floors, cars, or noisy streets would require an adapted acoustic model. Topics such as biology, physics, radiology, product names, and custom acronyms would require an adapted language model. If you want to train a custom model, you should start with related text to improve the recognition of special terms and phrases.

Question 2

Where do I start if I want to use a base model?

Accepted Answer

First, get an API key and region in the Azure portal. If you want to make REST calls to a predeployed base model, see the REST APIs documentation. If you want to use WebSockets, download the Speech SDK.

Question 3

Do I always need to build a custom speech model?

Accepted Answer

No. If your application uses generic, day-to-day language, you don't need to customize a model. If your application is used in an environment where there's little or no background noise, you don't need to customize a model.

You can deploy baseline and customized models in the portal and then run accuracy tests against them. You can use this feature to measure the accuracy of a base model versus a custom model.

Question 4

How do I know when the processing for my dataset or model is complete?

Accepted Answer

Currently, the only way to know is to view the status of the model or dataset in the table. When the processing is complete, the status is Succeeded.

Question 5

Can I create more than one model?

Accepted Answer

There's no limit to the number of models you can have in your collection.

Question 6

I realized that I made a mistake. How do I cancel a data import or model creation that's in progress?

Accepted Answer

Currently, you can't roll back an acoustic or language adaptation process. You can delete imported data and models when they're in a terminal state.

Question 7

I get several results for each phrase with the detailed output format. Which one should I use?

Accepted Answer

Always take the first result, even if another result ("N-Best") might have a higher confidence value. Speech service considers the first result to be the best. The result can also be an empty string if no speech was recognized.

The other results are likely worse and might not have full capitalization and punctuation applied. These results are most useful in special scenarios, such as giving users the option to pick corrections from a list or handling incorrectly recognized commands.

Question 8

Why are there multiple base models?

Accepted Answer

You can choose from more than one base model in Speech service. Each model name contains the date when it was added. When you start training a custom model, use the most recent model to get the best accuracy. Older base models are still available for some time after a new model is made available. You can continue using the model that you worked with until it's retired (see Model and endpoint lifecycle). We still recommend that you switch to the latest base model for better accuracy.

Question 9

Can I update my existing model (model stacking)?

Accepted Answer

You can't update an existing model. As a solution, combine the old dataset with the new dataset and readapt.

The old dataset and the new dataset must be combined in a single .zip file (for acoustic data) or in a .txt file (for language data). When the adaptation is finished, redeploy the new, updated model to obtain a new endpoint.

Question 10

When a new version of a base model is available, is my deployment automatically updated?

Accepted Answer

Deployments are not automatically updated.

If you adapted and deployed a model, the existing deployment remains as is. You can decommission the deployed model, readapt it by using the newer version of the base model, and redeploy it for better accuracy.

Both base models and custom models are retired after some time (see Model and endpoint lifecycle).

Question 11

Can I download my model and run it locally?

Accepted Answer

You can run a custom model locally in a Docker container.

Question 12

Can I copy or move my datasets, models, and deployments to another region or subscription?

Accepted Answer

You can use the Models_Copy REST API to copy a custom model to another region or subscription. Datasets and deployments can't be copied. You can import a dataset again in another subscription and create endpoints there by using the model copies.

Question 13

Are my requests logged?

Accepted Answer

By default, requests aren't logged (neither audio nor transcription). If necessary, you can select the Log content from this endpoint option when you create a custom endpoint. You can also enable audio logging in the Speech SDK on a per-request basis, without having to create a custom endpoint. In both cases, audio and recognition results of requests will be stored in secure storage. Subscriptions that use Microsoft-owned storage are available for 30 days.

You can export the logged files on the deployment page in Speech Studio if you use a custom endpoint with Log content from this endpoint enabled. If audio logging is enabled via the SDK, call the API to access the files. You can also use API to delete the logs anytime.

Question 14

Are my requests throttled?

Accepted Answer

For information, see Speech service quotas and limits.

Question 15

How am I charged for dual channel audio?

Accepted Answer

If you submit each channel separately in their own file, you're charged for the audio duration of each file. If you submit a single file with the channels multiplexed together, you're charged for the duration of the single file. For more information about pricing, see the Azure AI services pricing page.

Important

If you have further privacy concerns that prevent you from using the custom speech service, contact one of the support channels.

Increasing concurrency

For information, see Speech service quotas and limits.

Question 16

What is the limit to the size of a dataset, and why is it the limit?

Accepted Answer

The limit is because of the restriction on the size of files for HTTP upload. For the actual limit, see Speech service quotas and limits. You can split your data into multiple datasets and select all of them to train the model.

Question 17

Can I zip (compress) my text files so that I can upload a larger text file?

Accepted Answer

No. Currently, only uncompressed text files are allowed.

Question 18

The data report says there were failed utterances. What is the issue?

Accepted Answer

A failure to upload 100 percent of the utterances in a file isn't a problem. If most of the utterances in an acoustic or language dataset (for example, more than 95 percent) are successfully imported, the dataset can be usable. However, we still recommend that you try to understand why the utterances failed and then fix the problem. Most common problems, such as formatting errors, are easy to fix.

Question 19

How much acoustic data do I need?

Accepted Answer

We recommend starting with from 30 minutes to 1 hour of acoustic data.

Question 20

What data should I collect?

Accepted Answer

Collect data that's as close to the application scenario and use case as possible. The data collection should match the target application and users in terms of device or devices, environments, and types of speakers. In general, you should collect data from as broad a range of speakers as possible.

Question 21

How should I collect acoustic data?

Accepted Answer

You can create a standalone data collection application or use off-the-shelf audio recording software. You can also create a version of your application that logs the audio data and then uses the data.

Question 22

Do I need to transcribe adaptation data myself?

Accepted Answer

Yes. You can transcribe it yourself or use a professional transcription service. Some users prefer professional transcribers, and others use crowdsourcing or transcribe the data themselves.

Question 23

How long does it take to train a custom model with audio data?

Accepted Answer

Training a model with audio data can be a lengthy process. Depending on the amount of data, it can take several days to create a custom model. If it can't be finished within one week, the service might abort the training operation and report the model as failed.

In general, Speech service processes approximately 10 hours of audio data per day in regions that have dedicated hardware. Training with text only is faster and ordinarily finishes within minutes.

Use one of the regions where dedicated hardware is available for training. The Speech service uses up to 100 hours of audio for training in these regions.

Question 24

What is word error rate (WER), and how is it computed?

Accepted Answer

WER is the evaluation metric for speech recognition. WER is calculated as the total number of errors (insertions, deletions, and substitutions), divided by the total number of words in the reference transcription. For more information, see Test model quantitatively.

Question 25

How do I determine whether the results of an accuracy test are good?

Accepted Answer

The results show a comparison between the base model and the model you customized. To make customization worthwhile, you should aim to beat the base model.

Question 26

How do I determine the WER of a base model so I can see whether it improved?

Accepted Answer

The offline test results show the baseline accuracy of the custom model and the improvement over baseline.

Question 27

How much text data do I need to upload?

Accepted Answer

It depends on how different the vocabulary and phrases used in your application are from the starting language models. For all new words, it's useful to provide as many examples as possible of the usage of those words. For common phrases that are used in your application, including phrases in the language data, providing many examples is useful because it tells the system to listen for these terms also. It's common to have at least 100 and, ordinarily, several hundred or more utterances in the language dataset. Also, if some types of queries are expected to be more common than others, you can insert multiple copies of the common queries in the dataset.

Question 28

Can I simply upload a list of words?

Accepted Answer

Uploading a list of words adds them to the vocabulary, but it doesn't teach the system how the words are ordinarily used. By providing full or partial utterances (sentences or phrases of things that users are likely to say), the language model can learn the new words and how they're used. The custom language model is good not only for adding new words to the system, but also for adjusting the likelihood of known words for your application. Providing full utterances helps the system learn better.

Share via

General

What is the difference between a base model and a custom speech to text model?

Where do I start if I want to use a base model?

Do I always need to build a custom speech model?

How do I know when the processing for my dataset or model is complete?

Can I create more than one model?

I realized that I made a mistake. How do I cancel a data import or model creation that's in progress?

I get several results for each phrase with the detailed output format. Which one should I use?

Why are there multiple base models?

Can I update my existing model (model stacking)?

When a new version of a base model is available, is my deployment automatically updated?

Can I download my model and run it locally?

Can I copy or move my datasets, models, and deployments to another region or subscription?

Are my requests logged?

Are my requests throttled?

How am I charged for dual channel audio?

Increasing concurrency

Importing data

What is the limit to the size of a dataset, and why is it the limit?

Can I zip (compress) my text files so that I can upload a larger text file?

The data report says there were failed utterances. What is the issue?

Creating an acoustic model

How much acoustic data do I need?

What data should I collect?

How should I collect acoustic data?

Do I need to transcribe adaptation data myself?

How long does it take to train a custom model with audio data?

Accuracy testing

What is word error rate (WER), and how is it computed?

How do I determine whether the results of an accuracy test are good?

How do I determine the WER of a base model so I can see whether it improved?

Creating a language model

How much text data do I need to upload?

Can I simply upload a list of words?

Next steps

Share via

Speech to text FAQ

General

What is the difference between a base model and a custom speech to text model?

Where do I start if I want to use a base model?

Do I always need to build a custom speech model?

How do I know when the processing for my dataset or model is complete?

Can I create more than one model?

I realized that I made a mistake. How do I cancel a data import or model creation that's in progress?

I get several results for each phrase with the detailed output format. Which one should I use?

Why are there multiple base models?

Can I update my existing model (model stacking)?

When a new version of a base model is available, is my deployment automatically updated?

Can I download my model and run it locally?

Can I copy or move my datasets, models, and deployments to another region or subscription?

Are my requests logged?

Are my requests throttled?

How am I charged for dual channel audio?

Increasing concurrency

Importing data

What is the limit to the size of a dataset, and why is it the limit?

Can I zip (compress) my text files so that I can upload a larger text file?

The data report says there were failed utterances. What is the issue?

Creating an acoustic model

How much acoustic data do I need?

What data should I collect?

How should I collect acoustic data?

Do I need to transcribe adaptation data myself?

How long does it take to train a custom model with audio data?

Accuracy testing

What is word error rate (WER), and how is it computed?

How do I determine whether the results of an accuracy test are good?

How do I determine the WER of a base model so I can see whether it improved?

Creating a language model

How much text data do I need to upload?

Can I simply upload a list of words?

Next steps

Feedback

Additional resources