What is the Speech service?
The Speech service provides speech-to-text and text-to-speech capabilities with an Azure Speech resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and use speaker recognition during conversations.
Create custom voices, add specific words to your base vocabulary, or build your own models. Run Speech anywhere, in the cloud or at the edge in containers. It's easy to speech-enable your applications, tools, and devices with the Speech CLI, Speech SDK, Speech Studio, or REST APIs.
Speech is available for many languages, regions, and price points.
Speech scenarios
Common scenarios for speech include:
- Captioning: Learn how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios.
- Audio Content Creation: Use neural voices to make interactions with chatbots and voice assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems.
- Call Center: Transcribe calls in real-time or process a batch of calls, redact personally identifying information, and extract insights such as sentiment to help with your call center use case.
- Language learning: Provide pronunciation assessment feedback to language learners, support real-time transcription for remote learning conversations, and read aloud teaching materials with neural voices.
- Voice assistants: Create natural, humanlike conversational interfaces for your applications and experiences. The voice assistant feature provides fast, reliable interaction between a device and an assistant implementation.
Microsoft uses Speech for many scenarios, such as captioning in Teams, dictation in Office 365, and Read Aloud in the Edge browser.
Speech capabilities
Speech feature summaries are provided below with links for more information.
Speech-to-text
Use speech-to-text to transcribe audio into text, either in real time or asynchronously.
Tip
You can try speech-to-text in Speech Studio without signing up or writing any code.
Convert audio to text from a range of sources, including microphones, audio files, and blob storage. Use speaker diarization to determine who said what and when. Get readable transcripts with automatic formatting and punctuation.
The base model may not be sufficient if the audio contains ambient noise or includes a lot of industry and domain-specific jargon. In these cases, you can create and train custom speech models with acoustic, language, and pronunciation data. Custom speech models are private and can offer a competitive advantage.
Text-to-speech
With text-to-speech, you can convert input text into humanlike synthesized speech. Use neural voices, which are humanlike voices powered by deep neural networks. Use the Speech Synthesis Markup Language (SSML) to fine-tune the pitch, pronunciation, speaking rate, volume, and more.
- Prebuilt neural voice: Highly natural out-of-the-box voices. Check the prebuilt neural voice samples in the Voice Gallery and determine the right voice for your business needs.
- Custom neural voice: Besides the prebuilt neural voices that come out of the box, you can also create a custom neural voice that is recognizable and unique to your brand or product. Custom neural voices are private and can offer a competitive advantage. Check the custom neural voice samples for examples.
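As a rough illustration of how SSML expresses the pitch, speaking rate, and volume adjustments mentioned above, the following Python sketch builds a small SSML document using only the standard library. The helper name `build_ssml` and its default values are hypothetical and chosen for illustration; the voice name `en-US-JennyNeural` is assumed to be available in your region, so check the Voice Gallery before relying on it.

```python
import xml.etree.ElementTree as ET

# Namespaces used by SSML documents.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US",
               rate="medium", pitch="medium", volume="medium"):
    """Return an SSML string that wraps `text` in voice and prosody elements.

    Hypothetical helper: the defaults are assumptions, not service defaults.
    """
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)

    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    prosody = ET.SubElement(
        voice_el,
        f"{{{SSML_NS}}}prosody",
        {"rate": rate, "pitch": pitch, "volume": volume},
    )
    # ElementTree escapes characters such as & and < when serializing.
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Welcome back. How can I help?", rate="slow", pitch="+5%"))
```

The resulting string can then be passed to whichever text-to-speech entry point you use (Speech SDK, Speech CLI, or REST API). Building the document with an XML library rather than string concatenation keeps the input text safely escaped.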
Speech translation
Speech translation enables real-time, multilingual translation of speech to your applications, tools, and devices. Use this feature for speech-to-speech and speech-to-text translation.
Language identification
Language identification is used to identify languages spoken in audio when compared against a list of supported languages. Use language identification by itself, with speech-to-text recognition, or with speech translation.
Speaker recognition
Speaker recognition provides algorithms that verify and identify speakers by their unique voice characteristics. Speaker recognition is used to answer the question, "Who is speaking?".
Pronunciation assessment
Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio. With pronunciation assessment, language learners can practice, get instant feedback, and improve their pronunciation so that they can speak and present with confidence.
Intent recognition
Use speech-to-text with conversational language understanding to derive user intents from transcribed speech and act on voice commands.
Delivery and presence
You can deploy Azure Cognitive Services Speech features in the cloud or on-premises.
With containers, you can bring the service closer to your data for compliance, security, or other operational reasons.
Speech service deployment in sovereign clouds is available for some government entities and their partners. For example, the Azure Government cloud is available to US government entities and their partners. Azure China cloud is available to organizations with a business presence in China. For more information, see sovereign clouds.
Use Speech in your application
The Speech Studio is a set of UI-based tools for building and integrating features from Azure Cognitive Services Speech service in your applications. You create projects in Speech Studio by using a no-code approach, and then reference those assets in your applications by using the Speech SDK, the Speech CLI, or the REST APIs.
The Speech CLI is a command-line tool for using Speech service without having to write any code. Most features in the Speech SDK are available in the Speech CLI, and some advanced features and customizations are simplified in the Speech CLI.
The Speech SDK exposes many of the Speech service capabilities you can use to develop speech-enabled applications. The Speech SDK is available in many programming languages and across platforms.
In some cases, you can't or shouldn't use the Speech SDK. In those cases, you can use REST APIs to access the Speech service. For example, use REST APIs for batch transcription and speaker recognition.
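To give a sense of what calling the service over REST involves, here is a minimal Python sketch that prepares (but does not send) an HTTP request using only the standard library. It targets the speech-to-text REST API for short audio, which is an assumption made for illustration and is not one of the APIs named above. The helper name `build_short_audio_request` is hypothetical, and the region, key, and audio format are placeholders; confirm the endpoint and headers against the current REST reference for your resource.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_short_audio_request(region, key, wav_bytes, language="en-US"):
    """Prepare a POST request for short-audio speech-to-text.

    Hypothetical helper. Assumes 16 kHz PCM WAV audio and key-based
    authentication with a Speech resource key.
    """
    query = urlencode({"language": language, "format": "detailed"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    # The request is only constructed here; nothing is sent over the network.
    return Request(url, data=wav_bytes, headers=headers, method="POST")


if __name__ == "__main__":
    # Placeholder values; replace with your own region, key, and audio.
    req = build_short_audio_request("westus", "YOUR_SPEECH_KEY", b"")
    print(req.get_method(), req.full_url)
```

To send the request, pass it to `urllib.request.urlopen` and parse the JSON response body. Keeping request construction separate from sending makes the URL and headers easy to inspect before any audio leaves your machine.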
Get started
We offer quickstarts in many popular programming languages. Each quickstart is designed to teach you basic design patterns and have you running code in less than 10 minutes. See the following list for the quickstart for each feature:
Code samples
Sample code for the Speech service is available on GitHub. These samples cover common scenarios like reading audio from a file or stream, continuous and single-shot recognition, and working with custom models. Use these links to view SDK and REST samples:
- Speech-to-text, text-to-speech, and speech translation samples (SDK)
- Batch transcription samples (REST)
- Text-to-speech samples (REST)
- Voice assistant samples (SDK)
Next steps