Captioning with speech to text

02/16/2024

In this guide, you learn how to create captions with speech to text. Captioning is the process of converting the audio content of a television broadcast, webcast, film, video, live event, or other production into text, and then displaying the text on a screen, monitor, or other visual display system.

Concepts include how to synchronize captions with your input audio, apply profanity filters, get partial results, apply customizations, and identify spoken languages for multilingual scenarios. This guide covers captioning for speech, but doesn't include speaker ID or sound effects such as bells ringing.

Here are some common captioning scenarios:

Online courses and instructional videos
Sporting events
Voice and video calls

The following are aspects to consider when using captioning:

Let your audience know that captions are generated by an automated service.
Center captions horizontally on the screen, in a large and prominent font.
Consider whether to use partial results, when to start displaying captions, and how many words to show at a time.
Learn about captioning protocols such as SMPTE-TT.
Consider output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). Most video players, such as VLC, can load these files and display the captions with your video.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Try the Azure AI Video Indexer as a demonstration of how you can get captions for videos that you upload.

Captioning can accompany real-time or prerecorded speech. Whether you're showing captions in real-time or with a recording, you can use the Speech SDK or Speech CLI to recognize speech and get transcriptions. You can also use the Batch transcription API for pre-recorded video.

Caption output format

The Speech service supports output formats such as SRT (SubRip Text) and WebVTT (Web Video Text Tracks). Most video players, such as VLC, can load these files and display the captions with your video.

Tip

The Speech service provides profanity filter options. You can specify whether to mask, remove, or show profanity.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff.

1
00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.

The WebVTT (Web Video Text Tracks) timespan output format is hh:mm:ss.fff.

WEBVTT

00:00:00.180 --> 00:00:03.230
Welcome to applied Mathematics course 201.
{
  "ResultId": "8e89437b4b9349088a933f8db4ccc263",
  "Duration": "00:00:03.0500000"
}

Input audio to the Speech service

For real-time captioning, use a microphone or audio input stream instead of file input. For examples of how to recognize speech from a microphone, see the Speech to text quickstart and How to recognize speech documentation. For more information about streaming, see How to use the audio input stream.

For captioning of a prerecording, send file input to the Speech service. For more information, see How to use compressed input audio.

Caption and speech synchronization

You want to synchronize captions with the audio track, whether it's in real-time or with a prerecording.

The Speech service returns the offset and duration of the recognized speech.

Offset: The offset into the audio stream being recognized, expressed as duration. Offset is measured in ticks, starting from 0 (zero) tick, associated with the first audio byte processed by the SDK. For example, the offset begins when you start recognition, since that's when the SDK starts processing the audio stream. One tick represents one hundred nanoseconds or one ten-millionth of a second.
Duration: Duration of the utterance that is being recognized. The duration in ticks doesn't include trailing or leading silence.

For more information, see Get speech recognition results.
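As a minimal sketch of how offset and duration map to caption timespans, the following Python converts tick values into the SRT and WebVTT formats shown earlier. The function names are illustrative and not part of the Speech SDK; the sketch assumes you already have the offset and duration as integer tick counts.

```python
TICKS_PER_MILLISECOND = 10_000  # one tick is 100 nanoseconds


def format_timestamp(ticks: int, separator: str) -> str:
    """Format a tick count as hh:mm:ss<separator>fff (fractions truncated to ms)."""
    total_ms = ticks // TICKS_PER_MILLISECOND
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def srt_cue(index: int, offset: int, duration: int, text: str) -> str:
    """Build one SRT cue; SRT separates seconds and milliseconds with a comma."""
    start = format_timestamp(offset, ",")
    end = format_timestamp(offset + duration, ",")
    return f"{index}\n{start} --> {end}\n{text}\n"


def vtt_cue(offset: int, duration: int, text: str) -> str:
    """Build one WebVTT cue; WebVTT separates seconds and milliseconds with a period."""
    start = format_timestamp(offset, ".")
    end = format_timestamp(offset + duration, ".")
    return f"{start} --> {end}\n{text}\n"


if __name__ == "__main__":
    # Offset of 180 ms and duration of 3.05 s, both expressed in ticks.
    print(srt_cue(1, 1_800_000, 30_500_000, "Welcome to applied Mathematics course 201."))
    print("WEBVTT\n")
    print(vtt_cue(1_800_000, 30_500_000, "Welcome to applied Mathematics course 201."))
```

The end of each cue is the offset plus the duration, so an offset of 1,800,000 ticks and a duration of 30,500,000 ticks give a cue that runs from 00:00:00,180 to 00:00:03,230 in SRT notation.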

Get partial results

Consider when to start displaying captions, and how many words to show at a time. Speech recognition results are subject to change while an utterance is still being recognized. Partial results are returned with each Recognizing event. As each word is processed, the Speech service re-evaluates an utterance in the new context and again returns the best result. The new result isn't guaranteed to be the same as the previous result. The complete and final transcription of an utterance is returned with the Recognized event.

Note

Punctuation of partial results is not available.
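One way to handle the two event types in a caption display is to let each partial result overwrite the previous one and to commit text only when the final result arrives. The following Python is a sketch of that idea only: the CaptionBuffer class and its method names are hypothetical, and wiring them to the SDK's Recognizing and Recognized events isn't shown.

```python
class CaptionBuffer:
    """Holds committed caption lines plus the latest, still-changing partial text."""

    def __init__(self) -> None:
        self.final_lines = []  # text from Recognized events; never changes again
        self.partial = ""      # text from the latest Recognizing event

    def on_recognizing(self, text: str) -> None:
        # A partial result replaces the previous partial; it might differ from it.
        self.partial = text

    def on_recognized(self, text: str) -> None:
        # The final result supersedes whatever partial text was on screen.
        self.final_lines.append(text)
        self.partial = ""

    def display_text(self) -> str:
        lines = list(self.final_lines)
        if self.partial:
            lines.append(self.partial)
        return "\n".join(lines)


if __name__ == "__main__":
    buffer = CaptionBuffer()
    buffer.on_recognizing("welcome to")
    buffer.on_recognizing("welcome to applied math")
    print(buffer.display_text())
    buffer.on_recognized("Welcome to applied Mathematics course 201.")
    print(buffer.display_text())
```

Keeping the partial text separate from the committed lines means a revised partial result never alters text that the viewer has already seen as final.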

For captioning of prerecorded speech or wherever latency isn't a concern, you could wait for the complete transcription of each utterance before displaying any words. Given the final offset and duration of each word in an utterance, you know when to show subsequent words at pace with the soundtrack.
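The following Python sketches one way to pace captions from word timings: it groups words into cues of a fixed maximum size and derives each cue's start and end from its first and last word. It assumes you already have a per-word offset and duration in ticks; the Word and Cue types, the group_words function, and the sample timings are hypothetical and aren't taken from the Speech SDK.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    offset: int    # ticks from the start of the audio
    duration: int  # ticks


class Cue(NamedTuple):
    start: int  # ticks
    end: int    # ticks
    text: str


def group_words(words: List[Word], max_words: int) -> List[Cue]:
    """Split words into consecutive cues of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = chunk[0].offset
        end = chunk[-1].offset + chunk[-1].duration
        cues.append(Cue(start, end, " ".join(w.text for w in chunk)))
    return cues


if __name__ == "__main__":
    # Made-up word timings for illustration only.
    words = [
        Word("Welcome", 1_800_000, 4_000_000),
        Word("to", 6_000_000, 1_000_000),
        Word("applied", 7_500_000, 5_000_000),
        Word("Mathematics", 13_000_000, 8_000_000),
        Word("course", 21_500_000, 4_000_000),
        Word("201.", 26_000_000, 6_300_000),
    ]
    for cue in group_words(words, max_words=3):
        print(cue)
```

A fixed word count is the simplest grouping rule. Splitting on a maximum cue duration or on line length would follow the same pattern.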

Real-time captioning presents tradeoffs with respect to latency versus accuracy. You could show the text from each Recognizing event as soon as possible. However, if you can accept some latency, you can improve the accuracy of the caption by displaying the text from the Recognized event. There's also some middle ground, which is referred to as "stable partial results".

You can request that the Speech service return fewer Recognizing events that are more accurate. This is done by setting the SpeechServiceResponse_StablePartialResultThreshold property to a value between 0 and 2147483647. The value that you set is the number of times a word has to be recognized before the Speech service returns a Recognizing event. For example, if you set the SpeechServiceResponse_StablePartialResultThreshold property value to 5, the Speech service affirms recognition of a word at least five times before returning the partial results to you with a Recognizing event.

C#

speechConfig.SetProperty(PropertyId.SpeechServiceResponse_StablePartialResultThreshold, 5);

C++

speechConfig->SetProperty(PropertyId::SpeechServiceResponse_StablePartialResultThreshold, 5);

Go

speechConfig.SetProperty(common.SpeechServiceResponseStablePartialResultThreshold, 5)

Java

speechConfig.setProperty(PropertyId.SpeechServiceResponse_StablePartialResultThreshold, 5);

JavaScript

speechConfig.setProperty(sdk.PropertyId.SpeechServiceResponse_StablePartialResultThreshold, 5);

Objective-C

[self.speechConfig setPropertyTo:5 byId:SPXSpeechServiceResponseStablePartialResultThreshold];

Swift

self.speechConfig!.setPropertyTo(5, by: SPXPropertyId.speechServiceResponseStablePartialResultThreshold)

Python

speech_config.set_property(property_id = speechsdk.PropertyId.SpeechServiceResponse_StablePartialResultThreshold, value = 5)

Speech CLI

spx recognize --file caption.this.mp4 --format any --property SpeechServiceResponse_StablePartialResultThreshold=5 --output vtt file - --output srt file -

Requesting more stable partial results reduces the "flickering" or changing text, but it can increase latency as you wait for higher confidence results.

Stable partial threshold example

In the following recognition sequence without setting a stable partial threshold, "math" is recognized as a word, but the final text is "mathematics". At another point, "course 2" is recognized, but the final text is "course 201".

RECOGNIZING: Text=welcome to
RECOGNIZING: Text=welcome to applied math
RECOGNIZING: Text=welcome to applied mathematics
RECOGNIZING: Text=welcome to applied mathematics course 2
RECOGNIZING: Text=welcome to applied mathematics course 201
RECOGNIZED: Text=Welcome to applied Mathematics course 201.

In the previous example, the transcriptions were additive and no text was retracted. But at other times you might find that the partial results were inaccurate. In either case, the unstable partial results can be perceived as "flickering" when displayed.

For this example, if the stable partial result threshold is set to 5, no words are altered or backtracked.

RECOGNIZING: Text=welcome to
RECOGNIZING: Text=welcome to applied
RECOGNIZING: Text=welcome to applied mathematics
RECOGNIZED: Text=Welcome to applied Mathematics course 201.
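To get a rough measure of how much a stream of partial results would flicker, you can check whether each partial result only extends the one before it. The following Python is a client-side sketch with hypothetical helper names. It compares text character by character and doesn't reproduce how the Speech service decides that a word is stable.

```python
from typing import List


def is_additive(previous: str, current: str) -> bool:
    """Return True if current only appends characters to previous."""
    return current.startswith(previous)


def count_retractions(partials: List[str]) -> int:
    """Count consecutive pairs where earlier text was changed rather than extended."""
    return sum(
        1
        for previous, current in zip(partials, partials[1:])
        if not is_additive(previous, current)
    )


if __name__ == "__main__":
    # The partial results from the sequence without a stable partial threshold.
    partials = [
        "welcome to",
        "welcome to applied math",
        "welcome to applied mathematics",
        "welcome to applied mathematics course 2",
        "welcome to applied mathematics course 201",
    ]
    print(count_retractions(partials))
```

With this character-level comparison, "math" growing into "mathematics" counts as additive. A word-level comparison would instead treat it as a changed word, which is closer to what a viewer perceives as flickering.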

Language identification

If the language in the audio could change, use continuous language identification. Language identification is used to identify languages spoken in audio when compared against a list of supported languages. You provide up to 10 candidate languages, at least one of which is expected in the audio. The Speech service returns the most likely language in the audio.

Customizations to improve accuracy

A phrase list is a list of words or phrases that you provide right before starting speech recognition. Adding a phrase to a phrase list increases its importance, thus making it more likely to be recognized.

Examples of phrases include:

Names
Geographical locations
Homonyms
Words or acronyms unique to your industry or organization

There are some situations where training a custom model is likely the best option to improve accuracy. For example, if you're captioning orthodontic lectures, you might want to train a custom model with the corresponding domain data.