What is speech to text?

In this overview, you learn about the benefits and capabilities of the speech to text feature of the Speech service, which is part of Azure AI services. Speech to text supports three ways of converting audio into text: real-time transcription, fast transcription, and batch transcription.

Note

To compare the pricing of real-time and batch transcription, see Speech service pricing.

For a full list of available speech to text languages, see Language and voice support.

Real-time speech to text

With real-time speech to text, the audio is transcribed as speech is recognized from a microphone or file. Use real-time speech to text for applications that need to transcribe audio in real time, such as:

Real-time speech to text is available via the Speech SDK and the Speech CLI.
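As a minimal sketch of real-time recognition from a file with the Speech SDK for Python (the azure-cognitiveservices-speech package), the following may help. The environment variable names SPEECH_KEY and SPEECH_REGION and the helper names are choices made for this sketch, not something this article prescribes. recognize_once_async returns after a single utterance, so longer audio would call for continuous recognition instead.

```python
import os


def load_speech_settings(env=os.environ):
    """Read the resource key and region from environment variables.

    The variable names are an assumption of this sketch.
    """
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise ValueError("Set SPEECH_KEY and SPEECH_REGION before running.")
    return key, region


def transcribe_once(wav_path, language="en-US"):
    """Recognize a single utterance from a WAV file and return its text."""
    # Imported here so the module still loads where the SDK is not installed.
    import azure.cognitiveservices.speech as speechsdk

    key, region = load_speech_settings()
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = language
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    # Blocks until the first utterance has been recognized (or recognition fails).
    result = recognizer.recognize_once_async().get()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    if result.reason == speechsdk.ResultReason.NoMatch:
        return ""  # Audio was received, but no speech could be recognized.
    raise RuntimeError(
        f"Recognition canceled: {result.cancellation_details.reason}"
    )
```

Calling transcribe_once("meeting.wav") with a hypothetical file name would return the recognized text of the first utterance in that file.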

Fast transcription (Preview)

The fast transcription API transcribes audio files synchronously and returns results much faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, such as:

  • Quick audio or video transcription, subtitling, and editing
  • Video dubbing

Note

The fast transcription API is available only via the speech to text REST API version 2024-05-15-preview and later.

To get started with fast transcription, see Use the fast transcription API (preview).
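The following is a rough sketch of how a synchronous fast transcription request might be assembled with only the Python standard library. The endpoint path, the multipart field names (audio and definition), and the locales property reflect my understanding of the 2024-05-15-preview API and are assumptions to verify against the REST reference. The function names are hypothetical.

```python
import json
import urllib.request
import uuid


def build_fast_transcription_request(region, key, audio_bytes, locales,
                                     api_version="2024-05-15-preview"):
    """Return (url, headers, body) for a multipart/form-data request."""
    # Assumed endpoint shape for the preview API.
    url = (f"https://{region}.api.cognitive.microsoft.com"
           f"/speechtotext/transcriptions:transcribe?api-version={api_version}")
    definition = json.dumps({"locales": list(locales)})

    boundary = uuid.uuid4().hex
    parts = [
        (b'Content-Disposition: form-data; name="audio"; filename="audio.wav"\r\n'
         b"Content-Type: application/octet-stream\r\n\r\n" + audio_bytes),
        (b'Content-Disposition: form-data; name="definition"\r\n'
         b"Content-Type: application/json\r\n\r\n" + definition.encode("utf-8")),
    ]
    # Each part is preceded by the boundary; the body ends with a closing boundary.
    body = b"".join(
        b"--" + boundary.encode() + b"\r\n" + part + b"\r\n" for part in parts
    )
    body += b"--" + boundary.encode() + b"--\r\n"

    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
    }
    return url, headers, body


def send(url, headers, body):
    """POST the request and return the parsed JSON response."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Because the call is synchronous, the transcript would arrive in the response to this single request, with no job to poll.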

Batch transcription API

Batch transcription is used to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results. Use batch transcription for applications that need to transcribe audio in bulk, such as:

  • Transcriptions, captions, or subtitles for prerecorded audio
  • Contact center post-call analytics
  • Diarization
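The asynchronous flow described above (submit SAS URIs, then wait for results) could be sketched as follows. The v3.2 transcriptions path and the contentUrls, locale, and displayName fields reflect my understanding of the speech to text REST API and should be checked against the reference. The function names, polling interval, and poll limit are hypothetical choices for this sketch.

```python
import json
import time


def build_batch_request(region, key, content_urls, locale="en-US",
                        display_name="Bulk transcription", api_version="v3.2"):
    """Return (url, headers, body) for creating a batch transcription job."""
    url = (f"https://{region}.api.cognitive.microsoft.com"
           f"/speechtotext/{api_version}/transcriptions")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    payload = {
        "contentUrls": list(content_urls),  # SAS URIs of the audio files
        "locale": locale,
        "displayName": display_name,
    }
    return url, headers, json.dumps(payload).encode("utf-8")


def wait_for_transcription(get_status, poll_seconds=30.0, max_polls=120,
                           sleep=time.sleep):
    """Call get_status() repeatedly until the job reaches a terminal state.

    get_status is any callable that returns the job's current status string,
    for example by issuing a GET request to the job URL returned on creation.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError("Transcription did not finish within the polling budget.")
```

Passing the status lookup in as a callable keeps the polling loop independent of any particular HTTP client and makes it easy to exercise without network access.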

Batch transcription is available via:

Custom speech

With custom speech, you can evaluate and improve the accuracy of speech recognition for your applications and products. A custom speech model can be used for real-time speech to text, speech translation, and batch transcription.

Tip

A hosted deployment endpoint isn't required to use custom speech with the Batch transcription API. You can conserve resources if the custom speech model is only used for batch transcription. For more information, see Speech service pricing.

Out of the box, speech recognition utilizes a Universal Language Model as a base model that is trained with Microsoft-owned data and reflects commonly used spoken language. The base model is pretrained with dialects and phonetics representing various common domains. When you make a speech recognition request, the most recent base model for each supported language is used by default. The base model works well in most speech recognition scenarios.

A custom model can be used to augment the base model. To improve recognition of vocabulary that is specific to the application's domain, provide text data to train the model. To improve recognition under the application's specific audio conditions, provide audio data with reference transcriptions. For more information, see custom speech and Speech to text REST API.
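As an illustration of using a custom model for batch transcription without a hosted deployment endpoint, a batch request body might reference the model directly. This is a sketch only: the model.self field and the models URL shape reflect my understanding of the v3.x REST API and should be confirmed against the reference, and the function name and model ID are hypothetical.

```python
def with_custom_model(payload, region, model_id, api_version="v3.2"):
    """Return a copy of a batch transcription payload that names a custom model."""
    updated = dict(payload)  # Leave the caller's payload unchanged.
    updated["model"] = {
        # Assumed URL shape for a custom model resource.
        "self": (f"https://{region}.api.cognitive.microsoft.com"
                 f"/speechtotext/{api_version}/models/{model_id}")
    }
    return updated


base_payload = {
    "contentUrls": ["https://example.com/audio.wav"],  # placeholder URI
    "locale": "en-US",
    "displayName": "Custom model batch job",
}
custom_payload = with_custom_model(base_payload, "eastus", "YourModelId")
```

When no model is specified, the most recent base model for the locale would be used by default, as described above.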

Customization options vary by language or locale. To verify support, see Language and voice support for the Speech service.

Responsible AI

An AI system includes not only the technology, but also the people who use it, the people who are affected by it, and the environment in which it's deployed. Read the transparency notes to learn about responsible AI use and deployment in your systems.

Next steps