What is speech-to-text?

In this overview, you learn about the benefits and capabilities of the speech-to-text feature of the Speech service, which is part of Azure Cognitive Services.

Speech-to-text, also known as speech recognition, enables real-time or offline transcription of audio streams into text. For a full list of available speech-to-text languages, see Language and voice support for the Speech service.


Microsoft uses the same recognition technology for Windows and Office products.

Get started

To get started, try the speech-to-text quickstart. Speech-to-text is available via the Speech SDK, the REST API, and the Speech CLI.
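As a rough illustration of the REST API option, the following sketch builds (but does not send) a recognition request for a short WAV file, using only the Python standard library. The endpoint path, header names, and response fields reflect the Speech-to-text REST API for short audio as commonly documented; verify them against the current reference before relying on them. The region, the key, and the helper names build_recognition_request and extract_text are hypothetical placeholders introduced for this example.

```python
import json
import urllib.parse
import urllib.request


def build_recognition_request(region, key, audio_bytes, language="en-US"):
    """Build a POST request for the short-audio recognition endpoint.

    Assumption: the audio is 16 kHz, 16-bit mono PCM in a WAV container.
    """
    query = urllib.parse.urlencode({"language": language})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        # Speech resource key (placeholder value supplied by the caller).
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(
        url, data=audio_bytes, headers=headers, method="POST"
    )


def extract_text(response_body):
    """Return the recognized text, or None if recognition did not succeed."""
    result = json.loads(response_body)
    if result.get("RecognitionStatus") == "Success":
        return result.get("DisplayText")
    return None
```

To send the request, pass it to urllib.request.urlopen and hand the response body to extract_text. For anything beyond short clips, the Speech SDK is likely the better fit because it handles streaming and continuous recognition.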

In-depth samples are available in the Azure-Samples/cognitive-services-speech-sdk repository on GitHub. There are samples for C# (including UWP, Unity, and Xamarin), C++, Java, JavaScript (including browser and Node.js), Objective-C, Python, and Swift. Code samples for Go are available in the Microsoft/cognitive-services-speech-sdk-go repository on GitHub.

Batch transcription

Batch transcription is a set of Speech-to-text REST API operations that enable you to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results. For more information on how to use the batch transcription API, see How to use batch transcription and Batch transcription samples (REST).
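The request that starts a batch transcription can be sketched as follows. This is a minimal example under stated assumptions, not the definitive API shape: the v3.1 path, the contentUrls, locale, and displayName body fields, and the key header are taken from the Speech-to-text REST API as commonly documented and may differ in your API version. The helper name build_batch_transcription_request and all argument values are hypothetical. The sketch only builds the request; it does not send it.

```python
import json
import urllib.request


def build_batch_transcription_request(
    region,
    key,
    sas_uris,
    locale="en-US",
    display_name="Example batch transcription",
    api_version="v3.1",  # assumption: adjust to the version you target
):
    """Build a POST request that asks the service to transcribe stored audio."""
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/{api_version}/transcriptions"
    )
    body = {
        # Each entry is a SAS URI pointing at one audio file in storage.
        "contentUrls": list(sas_uris),
        "locale": locale,
        "displayName": display_name,
    }
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Because the operation is asynchronous, the response to this request is expected to describe a transcription job rather than contain the text itself. A client would then poll the job until it finishes and download the result files.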

Custom Speech

The Azure speech-to-text service analyzes audio in real time or in batch to transcribe the spoken word into text. Out of the box, speech-to-text uses a Universal Language Model as a base model that is trained with Microsoft-owned data and reflects commonly used spoken language. This base model is pre-trained with dialects and phonetics representing a variety of common domains. The base model works well in most scenarios.

The base model may not be sufficient if the audio contains ambient noise or includes a lot of industry-specific or domain-specific jargon. In these cases, it makes sense to build a custom speech model by training with additional data from that specific domain. You can create and train custom acoustic, language, and pronunciation models. For more information, see Custom Speech and Speech-to-text REST API.

Customization options vary by language or locale. To verify support, see Language and voice support for the Speech service.

Next steps