Quickstart: Create captions with speech to text
Reference documentation | Package (NuGet) | Additional samples on GitHub
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the SDK installation guide for any additional requirements.
You must also install GStreamer for compressed input audio.
Set environment variables
You need to authenticate your application to access Azure AI services. This article shows you how to use environment variables to store your credentials. You can then access the environment variables from your code to authenticate your application. For production, use a more secure way to store and access your credentials.
Important
We recommend Microsoft Entra ID authentication with managed identities for Azure resources to avoid storing credentials with your applications that run in the cloud.
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
To set the environment variables for your Speech resource key and region, open a console window, and follow the instructions for your operating system and development environment.
- To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
- To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region
Note
If you only need to access the environment variables in the current console, you can set the environment variable with set instead of setx.
After you add the environment variables, you might need to restart any programs that need to read the environment variables, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.
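As a rough illustration of the lookup order this quickstart relies on (the --key and --region arguments take precedence over the SPEECH_KEY and SPEECH_REGION environment variables), here is a minimal Python sketch. The function name resolve_credentials and its parameters are hypothetical; they aren't part of the Speech SDK or the captioning sample.

```python
import os


def resolve_credentials(cli_key=None, cli_region=None, env=None):
    """Return (key, region), preferring command line values over environment variables.

    Hypothetical helper for illustration only; not part of the captioning sample.
    """
    env = os.environ if env is None else env
    key = cli_key or env.get("SPEECH_KEY")
    region = cli_region or env.get("SPEECH_REGION")

    # Report every missing value at once so the user can fix both in one pass.
    missing = [
        name
        for name, value in (("SPEECH_KEY", key), ("SPEECH_REGION", region))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return key, region
```

Calling resolve_credentials() with no arguments reads both values from the current process environment, which is why a console opened before running setx might not see them.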
Create captions from speech
Follow these steps to build and run the captioning quickstart code example.
- Copy the scenarios/csharp/dotnetcore/captioning/ sample files from GitHub. If you have Git installed, open a command prompt and run the git clone command to download the Speech SDK samples repository.
git clone https://github.com/Azure-Samples/cognitive-services-speech-sdk.git
- Open a command prompt and change to the project directory.
cd <your-local-path>/scenarios/csharp/dotnetcore/captioning/captioning/
- Build the project with the .NET CLI.
dotnet build
- Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
dotnet run --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
Important
Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.
Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described above. Otherwise use the --key and --region arguments.
Check results
When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.
1
00:00:00,170 --> 00:00:00,380
The
2
00:00:00,380 --> 00:00:01,770
The rainbow
3
00:00:01,770 --> 00:00:02,560
The rainbow has seven
4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors
5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red
6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange
7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow
8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green
9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.
When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:
1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,
2
00:00:05,540 --> 00:00:07,160
indigo and Violet.
The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
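To make the hh:mm:ss,fff layout concrete, the following Python sketch formats a millisecond offset as an SRT timestamp and assembles one caption entry. The function names are hypothetical and this isn't the code the captioning sample uses.

```python
def format_srt_timestamp(total_ms: int) -> str:
    """Format a non-negative millisecond offset as hh:mm:ss,fff."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def format_srt_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one SRT entry: sequence number, time range, then the caption text."""
    time_range = f"{format_srt_timestamp(start_ms)} --> {format_srt_timestamp(end_ms)}"
    return f"{index}\n{time_range}\n{text}\n"
```

For example, format_srt_timestamp(5540) returns 00:00:05,540, which matches the end of the first offline caption shown above.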
Usage and arguments
Usage: captioning --input <input file>
Connection options include:
- --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
- --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope
Important
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
Input options include:
- --input FILE: Input audio from file. The default input is the microphone.
- --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
Language options include:
- --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.
Recognition options include:
- --offline: Output offline results. Overrides --realTime. Default output mode is offline.
- --realTime: Output real-time results.
Real-time output includes Recognizing event results. The default offline output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
Accuracy options include:
- --phrases PHRASE1;PHRASE2: You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.
Output options include:
- --help: Show this help and stop
- --output FILE: Output captions to the specified file. This flag is required.
- --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
- --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
- --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
- --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
- --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
- --quiet: Suppress console output, except errors.
- --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
- --threshold NUMBER: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.
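The interaction between --maxLineLength and --lines can be pictured with a small Python sketch that wraps text greedily at word boundaries and then groups the lines into captions. This is an assumption-laden simplification: the captioning sample uses its own language-aware line breaking, so its output can differ. The function name split_into_captions is hypothetical.

```python
import textwrap


def split_into_captions(text, max_line_length=37, lines_per_caption=2):
    """Greedily wrap text, then group the wrapped lines into captions.

    Defaults mirror the documented defaults of --maxLineLength and --lines.
    """
    wrapped = textwrap.wrap(text, width=max_line_length)
    return [
        wrapped[i:i + lines_per_caption]
        for i in range(0, len(wrapped), lines_per_caption)
    ]


captions = split_into_captions(
    "The rainbow has seven colors, red, orange, yellow, green, blue, indigo and Violet."
)
```

With the defaults, the sentence is split into two captions. Greedy wrapping keeps "indigo" on the second line of the first caption, whereas the sample output shown earlier breaks after "blue," instead.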
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.
Reference documentation | Package (NuGet) | Additional samples on GitHub
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the SDK installation guide for any additional requirements.
You must also install GStreamer for compressed input audio.
Set environment variables
You need to authenticate your application to access Azure AI services. This article shows you how to use environment variables to store your credentials. You can then access the environment variables from your code to authenticate your application. For production, use a more secure way to store and access your credentials.
Important
We recommend Microsoft Entra ID authentication with managed identities for Azure resources to avoid storing credentials with your applications that run in the cloud.
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
To set the environment variables for your Speech resource key and region, open a console window, and follow the instructions for your operating system and development environment.
- To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
- To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region
Note
If you only need to access the environment variables in the current console, you can set the environment variable with set instead of setx.
After you add the environment variables, you might need to restart any programs that need to read the environment variables, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.
Create captions from speech
Follow these steps to build and run the captioning quickstart code example with Visual Studio Community 2022 on Windows.
- Download or copy the scenarios/cpp/windows/captioning/ sample files from GitHub into a local directory.
- Open the captioning.sln solution file in Visual Studio Community 2022.
- Install the Speech SDK in your project with the NuGet package manager.
Install-Package Microsoft.CognitiveServices.Speech
- Open Project > Properties > General. Set Configuration to All configurations. Set C++ Language Standard to ISO C++17 Standard (/std:c++17).
- Open Build > Configuration Manager.
  - On a 64-bit Windows installation, set Active solution platform to x64.
  - On a 32-bit Windows installation, set Active solution platform to x86.
- Open Project > Properties > Debugging. Enter your preferred command line arguments at Command Arguments. See usage and arguments for the available options. Here is an example:
--input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
Important
Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.
Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described above. Otherwise use the --key and --region arguments.
- Build and run the console application.
Check results
When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.
1
00:00:00,170 --> 00:00:00,380
The
2
00:00:00,380 --> 00:00:01,770
The rainbow
3
00:00:01,770 --> 00:00:02,560
The rainbow has seven
4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors
5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red
6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange
7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow
8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green
9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.
When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:
1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,
2
00:00:05,540 --> 00:00:07,160
indigo and Violet.
The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
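If you want to inspect a generated caption file programmatically, a small parser is often enough. The Python sketch below assumes a conventional SRT layout in which entries are separated by a blank line; the names parse_srt and to_ms are hypothetical and aren't part of the captioning sample.

```python
import re

# Matches an SRT timestamp such as 00:00:05,540.
STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp):
    """Convert an hh:mm:ss,fff timestamp to milliseconds."""
    match = STAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"Not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(content):
    """Parse SRT text into a list of dicts, one per caption entry."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # Skip malformed entries rather than guessing.
        start, end = lines[1].split("-->")
        cues.append({
            "index": int(lines[0]),
            "start_ms": to_ms(start),
            "end_ms": to_ms(end),
            "text": "\n".join(lines[2:]),
        })
    return cues
```

Each returned entry keeps the caption's line breaks in its text field, so a two-line caption stays a two-line string.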
Usage and arguments
Usage: captioning --input <input file>
Connection options include:
- --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
- --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope
Important
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
Input options include:
- --input FILE: Input audio from file. The default input is the microphone.
- --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
Language options include:
- --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.
Recognition options include:
- --offline: Output offline results. Overrides --realTime. Default output mode is offline.
- --realTime: Output real-time results.
Real-time output includes Recognizing event results. The default offline output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
Accuracy options include:
- --phrases PHRASE1;PHRASE2: You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.
Output options include:
- --help: Show this help and stop
- --output FILE: Output captions to the specified file. This flag is required.
- --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
- --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
- --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
- --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
- --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
- --quiet: Suppress console output, except errors.
- --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
- --threshold NUMBER: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.
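Several of the rules above (a required --output, --offline overriding --realTime, a default threshold of 3, and semicolon-separated phrases) can be sketched with Python's argparse module. This is an illustration only, not the parser the captioning samples use; build_parser and parse_options are hypothetical names, and only a subset of the options is declared.

```python
import argparse


def build_parser():
    """Declare a subset of the documented captioning options."""
    parser = argparse.ArgumentParser(prog="captioning")
    parser.add_argument("--input", help="Input audio file; the microphone is used if omitted")
    parser.add_argument("--output", required=True, help="Caption output file")
    parser.add_argument("--srt", action="store_true", help="Write SRT instead of WebVTT")
    parser.add_argument("--offline", action="store_true", help="Output offline results")
    parser.add_argument("--realTime", action="store_true", help="Output real-time results")
    parser.add_argument("--threshold", type=int, default=3)
    parser.add_argument("--delay", type=float, default=1000.0)
    # Phrases arrive as one semicolon-separated string.
    parser.add_argument(
        "--phrases",
        type=lambda value: [p for p in value.split(";") if p],
        default=[],
    )
    return parser


def parse_options(argv):
    """Parse argv and resolve the effective recognition mode."""
    args = build_parser().parse_args(argv)
    # --offline overrides --realTime, and offline is also the default mode.
    args.real_time = args.realTime and not args.offline
    return args
```

Passing the example arguments from this quickstart to parse_options yields real-time mode with a threshold of 5, a delay of 0, and three phrases.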
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.
Reference documentation | Package (Go) | Additional samples on GitHub
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
Check whether there are any platform-specific installation steps.
You must also install GStreamer for compressed input audio.
Create captions from speech
Follow these steps to build and run the captioning quickstart code example.
- Download or copy the scenarios/go/captioning/ sample files from GitHub into a local directory.
- Open a command prompt in the same directory as captioning.go.
- Run the following commands to create a go.mod file that links to the Speech SDK components hosted on GitHub:
go mod init captioning
go get github.com/Microsoft/cognitive-services-speech-sdk-go
- Build the Go module.
go build
- Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
go run captioning --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.
Important
Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.
Check results
The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:
00:00:00,180 --> 00:00:01,600
Welcome to
00:00:00,180 --> 00:00:01,820
Welcome to applied
00:00:00,180 --> 00:00:02,420
Welcome to applied mathematics
00:00:00,180 --> 00:00:02,930
Welcome to applied mathematics course
00:00:00,180 --> 00:00:03,100
Welcome to applied Mathematics course 2
00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.
The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
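In the console output above, every intermediate result shares the same start time and each later line extends the previous one. As a hedged sketch of how such a stream could be reduced to one result per utterance, the Python function below keeps only the last entry seen for each start timestamp. The function name final_results and the tuple layout are assumptions made for illustration; the captioning sample doesn't necessarily work this way.

```python
def final_results(intermediate):
    """Keep the last result reported for each start timestamp.

    `intermediate` is an iterable of (start, end, text) tuples in arrival order.
    """
    latest = {}
    for start, end, text in intermediate:
        # A later result with the same start time replaces the earlier one.
        latest[start] = (start, end, text)
    # Dictionaries preserve insertion order, so utterances stay in sequence.
    return list(latest.values())


console_results = [
    ("00:00:00,180", "00:00:01,600", "Welcome to"),
    ("00:00:00,180", "00:00:01,820", "Welcome to applied"),
    ("00:00:00,180", "00:00:02,420", "Welcome to applied mathematics"),
    ("00:00:00,180", "00:00:02,930", "Welcome to applied mathematics course"),
    ("00:00:00,180", "00:00:03,100", "Welcome to applied Mathematics course 2"),
    ("00:00:00,180", "00:00:03,230", "Welcome to applied Mathematics course 201."),
]
finals = final_results(console_results)
```

Here finals contains a single entry, the complete sentence ending at 00:00:03,230.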
Usage and arguments
Usage: go run captioning.go helper.go --key <key> --region <region> --input <input file>
Connection options include:
- --key: Your Speech resource key.
- --region REGION: Your Speech resource region. Examples: westus, northeurope
Input options include:
- --input FILE: Input audio from file. The default input is the microphone.
- --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
Language options include:
- --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.
Recognition options include:
- --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
Accuracy options include:
- --phrases PHRASE1;PHRASE2: You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.
Output options include:
- --help: Show this help and stop
- --output FILE: Output captions to the specified file. This flag is required.
- --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
- --quiet: Suppress console output, except errors.
- --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
- --threshold NUMBER: Set stable partial result threshold. The default value is 3. For more information, see Get partial results concepts.
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.
Reference documentation | Additional samples on GitHub
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
Before you can do anything, you need to install the Speech SDK. The sample in this quickstart works with the Microsoft Build of OpenJDK 17.
- Install Apache Maven. Then run mvn -v to confirm successful installation.
- Create a new pom.xml file in the root of your project, and copy the following into it:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.microsoft.cognitiveservices.speech.samples</groupId>
    <artifactId>quickstart-eclipse</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>com.microsoft.cognitiveservices.speech</groupId>
            <artifactId>client-sdk</artifactId>
            <version>1.40.0</version>
        </dependency>
    </dependencies>
</project>
- Install the Speech SDK and dependencies.
mvn clean dependency:copy-dependencies
- You must also install GStreamer for compressed input audio.
Set environment variables
You need to authenticate your application to access Azure AI services. This article shows you how to use environment variables to store your credentials. You can then access the environment variables from your code to authenticate your application. For production, use a more secure way to store and access your credentials.
Important
We recommend Microsoft Entra ID authentication with managed identities for Azure resources to avoid storing credentials with your applications that run in the cloud.
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
To set the environment variables for your Speech resource key and region, open a console window, and follow the instructions for your operating system and development environment.
- To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
- To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region
Note
If you only need to access the environment variables in the current console, you can set the environment variable with set instead of setx.
After you add the environment variables, you might need to restart any programs that need to read the environment variables, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.
Create captions from speech
Follow these steps to build and run the captioning quickstart code example.
- Copy the scenarios/java/jre/captioning/ sample files from GitHub into your project directory. The pom.xml file that you created in environment setup must also be in this directory.
- Open a command prompt and run this command to compile the project files.
javac Captioning.java -cp ".;target\dependency\*" -encoding UTF-8
- Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
java -cp ".;target\dependency\*" Captioning --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
Important
Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.
Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described above. Otherwise use the --key and --region arguments.
Check results
When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.
1
00:00:00,170 --> 00:00:00,380
The
2
00:00:00,380 --> 00:00:01,770
The rainbow
3
00:00:01,770 --> 00:00:02,560
The rainbow has seven
4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors
5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red
6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange
7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow
8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green
9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.
When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:
1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,
2
00:00:05,540 --> 00:00:07,160
indigo and Violet.
The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
Usage and arguments
Usage: java -cp ".;target\dependency\*" Captioning --input <input file>
Connection options include:
- --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
- --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope
Important
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
Input options include:
- --input FILE: Input audio from file. The default input is the microphone.
- --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
Language options include:
- --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.
Recognition options include:
- --offline: Output offline results. Overrides --realTime. Default output mode is offline.
- --realTime: Output real-time results.
Real-time output includes Recognizing event results. The default offline output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
Accuracy options include:
- --phrases PHRASE1;PHRASE2: You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.
Output options include:
- --help: Show this help and stop
- --output FILE: Output captions to the specified file. This flag is required.
- --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
- --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
- --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
- --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
- --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
- --quiet: Suppress console output, except errors.
- --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
- --threshold NUMBER: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.
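The --srt option switches the output away from the WebVTT default. The two formats are close relatives: WebVTT files begin with a WEBVTT header line and use a period rather than a comma before the milliseconds. The Python sketch below converts SRT text to WebVTT on that basis. It's a minimal illustration with a hypothetical function name, and it ignores WebVTT features such as styling and positioning.

```python
import re

# An SRT timestamp: hh:mm:ss,fff. Group 1 is hh:mm:ss, group 2 is fff.
SRT_STAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_webvtt(srt_text):
    """Convert SRT caption text to a basic WebVTT document."""
    body = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten; caption text is left untouched.
            line = SRT_STAMP.sub(r"\1.\2", line)
        body.append(line)
    return "WEBVTT\n\n" + "\n".join(body) + "\n"
```

The numeric sequence lines from the SRT input are kept, because WebVTT accepts them as optional cue identifiers.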
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.
Reference documentation | Package (npm) | Additional samples on GitHub | Library source code
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
Before you can do anything, you need to install the Speech SDK for JavaScript. If you just want the package name to install, run npm install microsoft-cognitiveservices-speech-sdk. For guided installation instructions, see the SDK installation guide.
Create captions from speech
Follow these steps to build and run the captioning quickstart code example.
- Copy the scenarios/javascript/node/captioning/ sample files from GitHub into your project directory.
- Open a command prompt in the same directory as Captioning.js.
- Install the Speech SDK for JavaScript:
npm install microsoft-cognitiveservices-speech-sdk
- Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
node captioning.js --key YourSubscriptionKey --region YourServiceRegion --input caption.this.wav --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.
Note
The Speech SDK for JavaScript doesn't support compressed input audio. You must use a WAV file as shown in the example.
Important
Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.
Check results
The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:
00:00:00,180 --> 00:00:01,600
Welcome to
00:00:00,180 --> 00:00:01,820
Welcome to applied
00:00:00,180 --> 00:00:02,420
Welcome to applied mathematics
00:00:00,180 --> 00:00:02,930
Welcome to applied mathematics course
00:00:00,180 --> 00:00:03,100
Welcome to applied Mathematics course 2
00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.
The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
Usage and arguments
Usage: node captioning.js --key <key> --region <region> --input <input file>
Connection options include:
--key
: Your Speech resource key.
--region REGION
: Your Speech resource region. Examples: westus, northeurope
Input options include:
--input FILE
: Input audio from file. The default input is the microphone.
--format FORMAT
: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
Language options include:
--languages LANG1,LANG2
: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.
Recognition options include:
--recognizing
: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
Accuracy options include:
--phrases PHRASE1;PHRASE2
: You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.
Output options include:
--help
: Show this help and stop.
--output FILE
: Output captions to the specified file. This flag is required.
--srt
: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
--quiet
: Suppress console output, except errors.
--profanity OPTION
: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
--threshold NUMBER
: Set stable partial result threshold. The default value is 3. For more information, see Get partial results concepts.
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.
Reference documentation | Package (download) | Additional samples on GitHub
The Speech SDK for Objective-C does support getting speech recognition results for captioning, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Objective-C reference and samples linked from the beginning of this article.
Reference documentation | Package (download) | Additional samples on GitHub
The Speech SDK for Swift does support getting speech recognition results for captioning, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Swift reference and samples linked from the beginning of this article.
Reference documentation | Package (PyPi) | Additional samples on GitHub
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
The Speech SDK for Python is available as a Python Package Index (PyPI) module. The Speech SDK for Python is compatible with Windows, Linux, and macOS.
- You must install the Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017, 2019, and 2022 for your platform. Installing this package for the first time might require a restart.
- On Linux, you must use the x64 target architecture.
- Install Python 3.10 or later. First check the SDK installation guide for any more requirements.
- You must also install GStreamer for compressed input audio.
Set environment variables
You need to authenticate your application to access Azure AI services. This article shows you how to use environment variables to store your credentials. You can then access the environment variables from your code to authenticate your application. For production, use a more secure way to store and access your credentials.
Important
We recommend Microsoft Entra ID authentication with managed identities for Azure resources to avoid storing credentials with your applications that run in the cloud.
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
To set the environment variables for your Speech resource key and region, open a console window, and follow the instructions for your operating system and development environment.
- To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
- To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region
Note
If you only need to access the environment variables in the current console, you can set the environment variable with set instead of setx.
After you add the environment variables, you might need to restart any programs that need to read the environment variables, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.
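As a rough illustration of how an application might read these variables, here is a minimal Python sketch. The function name load_speech_credentials is hypothetical and not part of the captioning sample; only the SPEECH_KEY and SPEECH_REGION variable names come from this article.

```python
import os


def load_speech_credentials(environ=None):
    """Return (key, region) read from SPEECH_KEY and SPEECH_REGION.

    Hypothetical helper for illustration; raises RuntimeError when a
    variable is missing or empty.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("SPEECH_KEY")
    region = environ.get("SPEECH_REGION")
    missing = [
        name
        for name, value in (("SPEECH_KEY", key), ("SPEECH_REGION", region))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return key, region


if __name__ == "__main__":
    try:
        _, region = load_speech_credentials()
        # Print only the region; the key should never be echoed or logged.
        print(f"Credentials found for region {region}")
    except RuntimeError as error:
        print(error)
```

Failing early with the name of the missing variable tends to be easier to diagnose than an authentication error returned later by the service.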
Create captions from speech
Follow these steps to build and run the captioning quickstart code example.
- Download or copy the scenarios/python/console/captioning/ sample files from GitHub into a local directory.
- Open a command prompt in the same directory as captioning.py.
- Run this command to install the Speech SDK:
pip install azure-cognitiveservices-speech
- Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
python captioning.py --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
Important
Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.
Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described above. Otherwise use the --key and --region arguments.
Check results
When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.
1
00:00:00,170 --> 00:00:00,380
The
2
00:00:00,380 --> 00:00:01,770
The rainbow
3
00:00:01,770 --> 00:00:02,560
The rainbow has seven
4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors
5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red
6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange
7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow
8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green
9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.
When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:
1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,
2
00:00:05,540 --> 00:00:07,160
indigo and Violet.
The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
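To check or post-process a generated caption file, the hh:mm:ss,fff timespans can be parsed back into numbers. The following is a minimal sketch, not part of the captioning sample. It assumes the standard SRT layout in which each caption is a sequence number, a timespan line, and one or more text lines, with captions separated by a blank line. The names Caption and parse_srt are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches an SRT timespan line: hh:mm:ss,fff --> hh:mm:ss,fff
TIMESPAN = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Caption:
    sequence: int
    start_ms: int
    end_ms: int
    text: str


def to_milliseconds(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four timestamp fields into a millisecond offset."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content: str) -> list[Caption]:
    """Parse SRT text into captions, skipping blocks that don't fit the layout."""
    captions = []
    # Assumption: captions are separated by at least one blank line.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = TIMESPAN.fullmatch(lines[1].strip())
        if match is None:
            continue
        fields = match.groups()
        captions.append(
            Caption(
                sequence=int(lines[0]),
                start_ms=to_milliseconds(*fields[:4]),
                end_ms=to_milliseconds(*fields[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return captions


sample = (
    "1\n00:00:00,170 --> 00:00:05,540\n"
    "The rainbow has seven colors, red,\norange, yellow, green, blue,\n\n"
    "2\n00:00:05,540 --> 00:00:07,160\nindigo and Violet.\n"
)
for caption in parse_srt(sample):
    print(caption.sequence, caption.end_ms - caption.start_ms, "ms")
```

Working in integer milliseconds avoids floating-point rounding when you compare or shift caption times.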
Usage and arguments
Usage: python captioning.py --input <input file>
Connection options include:
--key
: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
--region REGION
: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope
Important
If you use an API key, store it securely somewhere else, such as in Azure Key Vault. Don't include the API key directly in your code, and never post it publicly.
For more information about AI services security, see Authenticate requests to Azure AI services.
Input options include:
--input FILE
: Input audio from file. The default input is the microphone.
--format FORMAT
: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
Language options include:
--language LANG
: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.
Recognition options include:
--offline
: Output offline results. Overrides --realTime. Default output mode is offline.
--realTime
: Output real-time results.
Real-time output includes Recognizing event results. The default offline output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.
Accuracy options include:
--phrases PHRASE1;PHRASE2
: You can specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.
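The semicolon-separated value can be split into individual phrases before it is handed to the recognizer. The sketch below is an illustration, not the sample's own code. The helper names are hypothetical, and it assumes the phrase list object exposes an addPhrase method, as the PhraseListGrammar object returned by PhraseListGrammar.from_recognizer in the Speech SDK for Python is understood to do. A stand-in class is used so that the sketch runs without the SDK installed.

```python
def split_phrases(argument: str) -> list[str]:
    """Split a --phrases value such as "Contoso;Jessie;Rehaan" into phrases."""
    return [part.strip() for part in argument.split(";") if part.strip()]


def add_phrases(phrase_list_grammar, phrases) -> None:
    """Add each phrase to a phrase list object.

    Assumption: the object has an addPhrase(str) method. With the Speech SDK
    this would be the result of PhraseListGrammar.from_recognizer(recognizer).
    """
    for phrase in phrases:
        phrase_list_grammar.addPhrase(phrase)


class RecordingGrammar:
    """Stand-in for illustration only; it just records the phrases it is given."""

    def __init__(self):
        self.phrases = []

    def addPhrase(self, phrase: str) -> None:
        self.phrases.append(phrase)


grammar = RecordingGrammar()
add_phrases(grammar, split_phrases("Contoso;Jessie;Rehaan"))
print(grammar.phrases)
```

Dropping empty entries means a trailing semicolon in the argument doesn't add a blank phrase.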
Output options include:
--help
: Show this help and stop.
--output FILE
: Output captions to the specified file. This flag is required.
--srt
: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
--maxLineLength LENGTH
: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
--lines LINES
: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
--delay MILLISECONDS
: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
--remainTime MILLISECONDS
: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
--quiet
: Suppress console output, except errors.
--profanity OPTION
: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
--threshold NUMBER
: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.
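To make the effect of --maxLineLength and --lines concrete, here is a minimal sketch that breaks recognized text into captions. It uses plain greedy word wrapping from the Python standard library, which is an assumption made for illustration. The captioning sample uses its own line-breaking logic and can choose different break points, as the example output earlier shows. The function name break_into_captions is hypothetical.

```python
import textwrap


def break_into_captions(text: str, max_line_length: int = 37, lines: int = 2):
    """Wrap text to max_line_length and group the lines into captions.

    Returns a list of captions, each a list of at most `lines` strings.
    The minimums mirror the documented option limits.
    """
    if max_line_length < 20:
        raise ValueError("max_line_length must be at least 20")
    if lines < 1:
        raise ValueError("lines must be at least 1")
    wrapped = textwrap.wrap(text, width=max_line_length)
    return [wrapped[i:i + lines] for i in range(0, len(wrapped), lines)]


text = (
    "The rainbow has seven colors, red, orange, yellow, green, blue, "
    "indigo and Violet."
)
for number, caption in enumerate(break_into_captions(text), start=1):
    print(number)
    print("\n".join(caption))
```

Raising --lines or --maxLineLength produces fewer, larger captions; lowering them produces more captions that each stay on screen for a shorter span of speech.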
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.
In this quickstart, you run a console app to create captions with speech to text.
Tip
Try out the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.
Tip
Try out the Azure AI Speech Toolkit to easily build and run captioning samples on Visual Studio Code.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
Follow these steps and see the Speech CLI quickstart for other requirements for your platform.
Run the following .NET CLI command to install the Speech CLI:
dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI
Run the following commands to configure your Speech resource key and region. Replace SUBSCRIPTION-KEY with your Speech resource key and replace REGION with your Speech resource region.
spx config @key --set SUBSCRIPTION-KEY
spx config @region --set REGION
You must also install GStreamer for compressed input audio.
Create captions from speech
With the Speech CLI, you can output both SRT (SubRip Text) and WebVTT (Web Video Text Tracks) captions from any type of media that contains audio.
To recognize audio from a file and output both WebVTT (vtt) and SRT (srt) captions, follow these steps.
- Make sure that you have an input file named caption.this.mp4 in the path.
- Run the following command to output captions from the video file:
spx recognize --file caption.this.mp4 --format any --output vtt file - --output srt file - --output each file - @output.each.detailed --property SpeechServiceResponse_StablePartialResultThreshold=5 --profanity masked --phrases "Contoso;Jessie;Rehaan"
The SRT and WebVTT captions are output to the console as shown here:
1
00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.
WEBVTT
00:00:00.180 --> 00:00:03.230
Welcome to applied Mathematics course 201.
{
  "ResultId": "561a0ea00cc14bb09bd294357df3270f",
  "Duration": "00:00:03.0500000"
}
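In this output, the SRT and WebVTT timespans differ only in the separator before the milliseconds: SRT uses a comma (hh:mm:ss,fff) and WebVTT uses a period (hh:mm:ss.fff). The following minimal Python sketch formats a millisecond offset either way. The function names are hypothetical and aren't part of the Speech CLI.

```python
def format_timestamp(total_ms: int, srt: bool = True) -> str:
    """Format a millisecond offset as hh:mm:ss,fff (SRT) or hh:mm:ss.fff (WebVTT)."""
    if total_ms < 0:
        raise ValueError("offset must be non-negative")
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    separator = "," if srt else "."
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def format_timespan(start_ms: int, end_ms: int, srt: bool = True) -> str:
    """Format a caption timespan line for SRT or WebVTT."""
    return f"{format_timestamp(start_ms, srt)} --> {format_timestamp(end_ms, srt)}"


print(format_timespan(180, 3230))             # 00:00:00,180 --> 00:00:03,230
print(format_timespan(180, 3230, srt=False))  # 00:00:00.180 --> 00:00:03.230
```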
Usage and arguments
Here are details about the optional arguments from the previous command:
--file caption.this.mp4 --format any
: Input audio from file. The default input is the microphone. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
--output vtt file - and --output srt file -
: Outputs WebVTT and SRT captions to standard output. For more information about SRT and WebVTT caption file formats, see Caption output format. For more information about the --output argument, see Speech CLI output options.
@output.each.detailed
: Outputs event results with text, offset, and duration. For more information, see Get speech recognition results.
--property SpeechServiceResponse_StablePartialResultThreshold=5
: You can request that the Speech service return fewer Recognizing events that are more accurate. In this example, the Speech service must affirm recognition of a word at least five times before returning the partial results to you. For more information, see Get partial results concepts.
--profanity masked
: You can specify whether to mask, remove, or show profanity in recognition results. For more information, see Profanity filter concepts.
--phrases "Contoso;Jessie;Rehaan"
: You can specify a list of phrases to be recognized, such as Contoso, Jessie, and Rehaan. For more information, see Improve recognition with phrase list.
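If you call the Speech CLI from a script, it can help to assemble the arguments as a list rather than one long string, so that values such as the phrase list don't need manual quoting. The following Python sketch is one hedged way to do that. It uses only the arguments described above, the helper name build_spx_recognize_args is hypothetical, and it only builds and prints the command without running it.

```python
import shlex


def build_spx_recognize_args(input_file, threshold=5, profanity="masked", phrases=()):
    """Build an argument list for `spx recognize` that outputs VTT and SRT captions."""
    args = [
        "spx", "recognize",
        "--file", input_file,
        "--format", "any",
        "--output", "vtt", "file", "-",
        "--output", "srt", "file", "-",
        "--property",
        f"SpeechServiceResponse_StablePartialResultThreshold={threshold}",
        "--profanity", profanity,
    ]
    if phrases:
        # Phrases are passed as a single semicolon-separated value.
        args += ["--phrases", ";".join(phrases)]
    return args


command = build_spx_recognize_args(
    "caption.this.mp4", phrases=["Contoso", "Jessie", "Rehaan"]
)
# shlex.join renders the list as a shell-style command line for inspection.
print(shlex.join(command))
```

To run the command, you could pass the list to subprocess.run(command, check=True), which assumes that spx is installed and on your path.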
Clean up resources
You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.