Quickstart: Create captions with speech to text

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the SDK installation guide for any more requirements.

You must also install GStreamer for compressed input audio.

Set environment variables

Your application must be authenticated to access Azure AI services resources. For production, use a secure way of storing and accessing your credentials. For example, after you get a key for your Speech resource, write it to a new environment variable on the local machine that runs the application.

Tip

Don't include the key directly in your code, and never post it publicly. See Azure AI services security for more authentication options such as Azure Key Vault.

To set the environment variable for your Speech resource key, open a console window, and follow the instructions for your operating system and development environment.

  • To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
  • To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region

Note

If you only need to access the environment variables in the current console, you can set them with set instead of setx.

After you add the environment variables, you might need to restart any programs that need to read the environment variable, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.
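The samples read these variables at startup and fail with an authentication error if they're missing. A minimal sketch of that lookup in Python (the function name is illustrative, not part of the samples):

```python
import os

def get_speech_credentials(env=os.environ):
    """Read the Speech resource key and region from the environment.

    Raises a clear error up front instead of failing later with an
    opaque authentication failure. The variable names match the setx
    commands shown above.
    """
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError(
            "Set the SPEECH_KEY and SPEECH_REGION environment variables "
            "before running the app."
        )
    return key, region
```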

Create captions from speech

Follow these steps to build and run the captioning quickstart code example.

  1. Copy the scenarios/csharp/dotnetcore/captioning/ sample files from GitHub. If you have Git installed, open a command prompt and run the git clone command to download the Speech SDK samples repository.
    git clone https://github.com/Azure-Samples/cognitive-services-speech-sdk.git
    
  2. Open a command prompt and change to the project directory.
    cd <your-local-path>/scenarios/csharp/dotnetcore/captioning/captioning/
    
  3. Build the project with the .NET CLI.
    dotnet build
    
  4. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
    dotnet run --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Important

    Make sure that the paths specified by --input and --output are valid; otherwise, change them.

    Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described earlier; otherwise, use the --key and --region arguments.

Check results

When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.

1
00:00:00,170 --> 00:00:00,380
The

2
00:00:00,380 --> 00:00:01,770
The rainbow

3
00:00:01,770 --> 00:00:02,560
The rainbow has seven

4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors

5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red

6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange

7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow

8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green

9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.

When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:

1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,

2
00:00:05,540 --> 00:00:07,160
indigo and Violet.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
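The cue layout shown above is plain arithmetic over millisecond offsets. A Python sketch of the formatting, assuming millisecond inputs (this isn't the samples' own code; WebVTT differs mainly in using a period instead of a comma in the timestamp):

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp, hh:mm:ss,fff."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def srt_cue(index: int, begin_ms: int, end_ms: int, text: str) -> str:
    """Assemble one numbered SRT cue like those shown above."""
    return f"{index}\n{srt_timestamp(begin_ms)} --> {srt_timestamp(end_ms)}\n{text}\n"
```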

Usage and arguments

Usage: captioning --input <input file>

Connection options include:

  • --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
  • --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.

Recognition options include:

  • --offline: Output offline results. Overrides --realTime. Default output mode is offline.
  • --realTime: Output real-time results.

Real-time output includes Recognizing event results. The default offline output is Recognized event results only. Event results are always written to the console, never to an output file; the --quiet option suppresses this console output. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases PHRASE1;PHRASE2: Specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
  • --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
  • --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
  • --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.
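The interplay of --maxLineLength and --lines can be illustrated with a greedy word-wrapping sketch in Python (illustrative only; the samples' language-aware line breaking is more sophisticated):

```python
def break_caption(text: str, max_line_length: int = 37, lines: int = 2):
    """Greedily wrap caption text into at most `lines` lines of at most
    `max_line_length` characters, breaking at word boundaries.

    A sketch of what --maxLineLength and --lines control, not the
    samples' actual line-breaking code.
    """
    out, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_line_length:
            current = candidate
        else:
            out.append(current)
            current = word
    if current:
        out.append(current)
    # Keep only the allowed number of lines; overflow text would start
    # a new caption rather than being dropped in a real implementation.
    return out[:lines]
```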

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (NuGet) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the SDK installation guide for any more requirements.

You must also install GStreamer for compressed input audio.

Set environment variables

Your application must be authenticated to access Azure AI services resources. For production, use a secure way of storing and accessing your credentials. For example, after you get a key for your Speech resource, write it to a new environment variable on the local machine that runs the application.

Tip

Don't include the key directly in your code, and never post it publicly. See Azure AI services security for more authentication options such as Azure Key Vault.

To set the environment variable for your Speech resource key, open a console window, and follow the instructions for your operating system and development environment.

  • To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
  • To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region

Note

If you only need to access the environment variables in the current console, you can set them with set instead of setx.

After you add the environment variables, you might need to restart any programs that need to read the environment variable, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.

Create captions from speech

Follow these steps to build and run the captioning quickstart code example with Visual Studio Community 2022 on Windows.

  1. Download or copy the scenarios/cpp/windows/captioning/ sample files from GitHub into a local directory.

  2. Open the captioning.sln solution file in Visual Studio Community 2022.

  3. Install the Speech SDK in your project with the NuGet package manager.

    Install-Package Microsoft.CognitiveServices.Speech
    
  4. Open Project > Properties > General. Set Configuration to All configurations. Set C++ Language Standard to ISO C++17 Standard (/std:c++17).

  5. Open Build > Configuration Manager.

    • On a 64-bit Windows installation, set Active solution platform to x64.
    • On a 32-bit Windows installation, set Active solution platform to x86.
  6. Open Project > Properties > Debugging. Enter your preferred command line arguments at Command Arguments. See usage and arguments for the available options. Here is an example:

    --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Important

    Make sure that the paths specified by --input and --output are valid; otherwise, change them.

    Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described earlier; otherwise, use the --key and --region arguments.

  7. Build and run the console application.

Check results

When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.

1
00:00:00,170 --> 00:00:00,380
The

2
00:00:00,380 --> 00:00:01,770
The rainbow

3
00:00:01,770 --> 00:00:02,560
The rainbow has seven

4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors

5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red

6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange

7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow

8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green

9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.

When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:

1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,

2
00:00:05,540 --> 00:00:07,160
indigo and Violet.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.

Usage and arguments

Usage: captioning --input <input file>

Connection options include:

  • --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
  • --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.

Recognition options include:

  • --offline: Output offline results. Overrides --realTime. Default output mode is offline.
  • --realTime: Output real-time results.

Real-time output includes Recognizing event results. The default offline output is Recognized event results only. Event results are always written to the console, never to an output file; the --quiet option suppresses this console output. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases PHRASE1;PHRASE2: Specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
  • --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
  • --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
  • --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (Go) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

Check whether there are any platform-specific installation steps.

You must also install GStreamer for compressed input audio.

Create captions from speech

Follow these steps to build and run the captioning quickstart code example.

  1. Download or copy the scenarios/go/captioning/ sample files from GitHub into a local directory.

  2. Open a command prompt in the same directory as captioning.go.

  3. Run the following commands to create a go.mod file that links to the Speech SDK components hosted on GitHub:

    go mod init captioning
    go get github.com/Microsoft/cognitive-services-speech-sdk-go
    
  4. Build the Go module.

    go build
    
  5. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    go run captioning.go helper.go --key YourSubscriptionKey --region YourServiceRegion --input caption.this.mp4 --format any --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

Check results

The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

00:00:00,180 --> 00:00:01,600
Welcome to

00:00:00,180 --> 00:00:01,820
Welcome to applied

00:00:00,180 --> 00:00:02,420
Welcome to applied mathematics

00:00:00,180 --> 00:00:02,930
Welcome to applied mathematics course

00:00:00,180 --> 00:00:03,100
Welcome to applied Mathematics course 2

00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.

Usage and arguments

Usage: go run captioning.go helper.go --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. These are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases PHRASE1;PHRASE2: Specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value is 3. For more information, see Get partial results concepts.
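The idea behind the stable partial result threshold can be sketched as keeping only the word prefix that hasn't changed across recent Recognizing results. A Python illustration (not the Speech service's actual algorithm, which runs server-side):

```python
def stable_prefix(partials: list, threshold: int = 3) -> str:
    """Return the longest word prefix that has been identical across the
    last `threshold` partial (Recognizing) results.

    An illustration of the intent of --threshold, not the Speech
    service's implementation.
    """
    if len(partials) < threshold:
        return ""
    recent = [p.split() for p in partials[-threshold:]]
    stable = []
    for words in zip(*recent):  # compare position by position
        if len(set(words)) == 1:
            stable.append(words[0])
        else:
            break
    return " ".join(stable)
```

A higher threshold demands more agreement before a word is treated as stable, so captions update less often but flicker less.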

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

Before you can do anything, you need to install the Speech SDK. The sample in this quickstart works with the Microsoft Build of OpenJDK 17.

  1. Install Apache Maven. Then run mvn -v to confirm successful installation.
  2. Create a new pom.xml file in the root of your project, and copy the following into it:
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.microsoft.cognitiveservices.speech.samples</groupId>
        <artifactId>quickstart-eclipse</artifactId>
        <version>1.0.0-SNAPSHOT</version>
        <build>
            <sourceDirectory>src</sourceDirectory>
            <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                <source>1.8</source>
                <target>1.8</target>
                </configuration>
            </plugin>
            </plugins>
        </build>
        <dependencies>
            <dependency>
            <groupId>com.microsoft.cognitiveservices.speech</groupId>
            <artifactId>client-sdk</artifactId>
            <version>1.37.0</version>
            </dependency>
        </dependencies>
    </project>
    
  3. Install the Speech SDK and dependencies.
    mvn clean dependency:copy-dependencies
    
  4. You must also install GStreamer for compressed input audio.

Set environment variables

Your application must be authenticated to access Azure AI services resources. For production, use a secure way of storing and accessing your credentials. For example, after you get a key for your Speech resource, write it to a new environment variable on the local machine that runs the application.

Tip

Don't include the key directly in your code, and never post it publicly. See Azure AI services security for more authentication options such as Azure Key Vault.

To set the environment variable for your Speech resource key, open a console window, and follow the instructions for your operating system and development environment.

  • To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
  • To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region

Note

If you only need to access the environment variables in the current console, you can set them with set instead of setx.

After you add the environment variables, you might need to restart any programs that need to read the environment variable, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.

Create captions from speech

Follow these steps to build and run the captioning quickstart code example.

  1. Copy the scenarios/java/jre/captioning/ sample files from GitHub into your project directory. The pom.xml file that you created in environment setup must also be in this directory.
  2. Open a command prompt and run this command to compile the project files.
    javac Captioning.java -cp ".;target\dependency\*" -encoding UTF-8
    
  3. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
    java -cp ".;target\dependency\*" Captioning --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Important

    Make sure that the paths specified by --input and --output are valid; otherwise, change them.

    Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described earlier; otherwise, use the --key and --region arguments.

Check results

When you use the realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.

1
00:00:00,170 --> 00:00:00,380
The

2
00:00:00,380 --> 00:00:01,770
The rainbow

3
00:00:01,770 --> 00:00:02,560
The rainbow has seven

4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors

5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red

6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange

7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow

8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green

9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.

When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:

1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,

2
00:00:05,540 --> 00:00:07,160
indigo and Violet.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.

Usage and arguments

Usage: java -cp ".;target\dependency\*" Captioning --input <input file>

Connection options include:

  • --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
  • --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.

Recognition options include:

  • --offline: Output offline results. Overrides --realTime. Default output mode is offline.
  • --realTime: Output real-time results.

Real-time output includes Recognizing event results. The default offline output is Recognized event results only. Event results are always written to the console, never to an output file; the --quiet option suppresses this console output. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases PHRASE1;PHRASE2: Specify a list of phrases to be recognized, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
  • --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
  • --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the realTime flag. Minimum is 0.0. Default is 1000.
  • --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value is 3. This option is only applicable when you use the realTime flag. For more information, see Get partial results concepts.
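The effect of --remainTime can be sketched as extending each caption's end time so it stays on screen longer, unless the next caption replaces it first. A Python illustration, with cues as (begin, end, text) tuples in milliseconds (illustrative only, not the samples' code):

```python
def apply_remain_time(cues, remain_ms=1000):
    """Extend each caption's end time by up to `remain_ms` so it stays
    on screen, unless the next caption starts before then.

    A sketch of the idea behind --remainTime; cues are
    (begin_ms, end_ms, text) tuples sorted by begin time.
    """
    adjusted = []
    for i, (begin, end, text) in enumerate(cues):
        extended = end + remain_ms
        if i + 1 < len(cues):
            # Don't overlap the next caption: it replaces this one.
            extended = min(extended, cues[i + 1][0])
        adjusted.append((begin, extended, text))
    return adjusted
```

When captions are contiguous, the next cue replaces the current one immediately and no extension happens; the extension only matters across gaps in the audio.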

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (npm) | Additional Samples on GitHub | Library source code

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

Before you can do anything, you need to install the Speech SDK for JavaScript. If you just want the package name to install, run npm install microsoft-cognitiveservices-speech-sdk. For guided installation instructions, see the SDK installation guide.

Create captions from speech

Follow these steps to build and run the captioning quickstart code example.

  1. Copy the scenarios/javascript/node/captioning/ sample files from GitHub into your project directory.

  2. Open a command prompt in the same directory as Captioning.js.

  3. Install the Speech SDK for JavaScript:

    npm install microsoft-cognitiveservices-speech-sdk
    
  4. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:

    node captioning.js --key YourSubscriptionKey --region YourServiceRegion --input caption.this.wav --output caption.output.txt --srt --recognizing --threshold 5 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Replace YourSubscriptionKey with your Speech resource key, and replace YourServiceRegion with your Speech resource region, such as westus or northeurope. Make sure that the paths specified by --input and --output are valid. Otherwise you must change the paths.

    Note

    The Speech SDK for JavaScript doesn't support compressed input audio. You must use a WAV file as shown in the example.

    Important

    Remember to remove the key from your code when you're done, and never post it publicly. For production, use a secure way of storing and accessing your credentials like Azure Key Vault. See the Azure AI services security article for more information.

Check results

The output file with complete captions is written to caption.output.txt. Intermediate results are shown in the console:

00:00:00,180 --> 00:00:01,600
Welcome to

00:00:00,180 --> 00:00:01,820
Welcome to applied

00:00:00,180 --> 00:00:02,420
Welcome to applied mathematics

00:00:00,180 --> 00:00:02,930
Welcome to applied mathematics course

00:00:00,180 --> 00:00:03,100
Welcome to applied Mathematics course 2

00:00:00,180 --> 00:00:03,230
Welcome to applied Mathematics course 201.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.

Usage and arguments

Usage: node captioning.js --key <key> --region <region> --input <input file>

Connection options include:

  • --key: Your Speech resource key.
  • --region REGION: Your Speech resource region. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --languages LANG1,LANG2: Enable language identification for specified languages. For example: en-US,ja-JP. This option is only available with the C++, C#, and Python captioning samples. For more information, see Language identification.

Recognition options include:

  • --recognizing: Output Recognizing event results. The default output is Recognized event results only. Recognizing event results are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set stable partial result threshold. The default value is 3. For more information, see Get partial results concepts.

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Reference documentation | Package (Download) | Additional Samples on GitHub

The Speech SDK for Objective-C does support getting speech recognition results for captioning, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Objective-C reference and samples linked from the beginning of this article.

Reference documentation | Package (Download) | Additional Samples on GitHub

The Speech SDK for Swift does support getting speech recognition results for captioning, but we haven't yet included a guide here. Please select another programming language to get started and learn about the concepts, or see the Swift reference and samples linked from the beginning of this article.

Reference documentation | Package (PyPi) | Additional Samples on GitHub

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

The Speech SDK for Python is available as a Python Package Index (PyPI) module. The Speech SDK for Python is compatible with Windows, Linux, and macOS.

  1. Install Python 3.10 or later. First check the SDK installation guide for any other requirements.
  2. You must also install GStreamer for compressed input audio.

Set environment variables

Your application must be authenticated to access Azure AI services resources. For production, use a secure way of storing and accessing your credentials. For example, after you get a key for your Speech resource, write it to a new environment variable on the local machine that runs the application.

Tip

Don't include the key directly in your code, and never post it publicly. See Azure AI services security for more authentication options such as Azure Key Vault.

To set the environment variable for your Speech resource key, open a console window, and follow the instructions for your operating system and development environment.

  • To set the SPEECH_KEY environment variable, replace your-key with one of the keys for your resource.
  • To set the SPEECH_REGION environment variable, replace your-region with one of the regions for your resource.
setx SPEECH_KEY your-key
setx SPEECH_REGION your-region

Note

If you only need to access the environment variables in the current console, you can set the environment variable with set instead of setx.

After you add the environment variables, you might need to restart any programs that need to read the environment variable, including the console window. For example, if you're using Visual Studio as your editor, restart Visual Studio before you run the example.
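As a sketch of how a Python app might read these variables at startup, here's a minimal helper (the function name is illustrative, not part of the captioning sample, which also accepts --key and --region arguments as a fallback):

```python
import os

def load_speech_credentials():
    """Read the Speech resource key and region from environment variables.

    Illustrative helper only; raises a clear error if either value is missing
    so the app fails fast instead of failing later at recognition time.
    """
    key = os.environ.get("SPEECH_KEY")
    region = os.environ.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError(
            "Set the SPEECH_KEY and SPEECH_REGION environment variables, "
            "or pass --key and --region on the command line."
        )
    return key, region
```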

Create captions from speech

Follow these steps to build and run the captioning quickstart code example.

  1. Download or copy the scenarios/python/console/captioning/ sample files from GitHub into a local directory.
  2. Open a command prompt in the same directory as captioning.py.
  3. Run this command to install the Speech SDK:
    pip install azure-cognitiveservices-speech
    
  4. Run the application with your preferred command line arguments. See usage and arguments for the available options. Here is an example:
    python captioning.py --input caption.this.mp4 --format any --output caption.output.txt --srt --realTime --threshold 5 --delay 0 --profanity mask --phrases "Contoso;Jessie;Rehaan"
    

    Important

    Make sure that the paths specified by --input and --output are valid; otherwise, change them.

    Make sure that you set the SPEECH_KEY and SPEECH_REGION environment variables as described above. Otherwise use the --key and --region arguments.

Check results

When you use the --realTime option in the example above, the partial results from Recognizing events are included in the output. In this example, only the final Recognized event includes the commas. Commas aren't the only differences between Recognizing and Recognized events. For more information, see Get partial results.

1
00:00:00,170 --> 00:00:00,380
The

2
00:00:00,380 --> 00:00:01,770
The rainbow

3
00:00:01,770 --> 00:00:02,560
The rainbow has seven

4
00:00:02,560 --> 00:00:03,820
The rainbow has seven colors

5
00:00:03,820 --> 00:00:05,050
The rainbow has seven colors red

6
00:00:05,050 --> 00:00:05,850
The rainbow has seven colors red
orange

7
00:00:05,850 --> 00:00:06,440
The rainbow has seven colors red
orange yellow

8
00:00:06,440 --> 00:00:06,730
The rainbow has seven colors red
orange yellow green

9
00:00:06,730 --> 00:00:07,160
orange, yellow, green, blue,
indigo and Violet.

When you use the --offline option, the results are stable from the final Recognized event. Partial results aren't included in the output:

1
00:00:00,170 --> 00:00:05,540
The rainbow has seven colors, red,
orange, yellow, green, blue,

2
00:00:05,540 --> 00:00:07,160
indigo and Violet.

The SRT (SubRip Text) timespan output format is hh:mm:ss,fff. For more information, see Caption output format.
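A minimal Python sketch of that timestamp conversion (SRT uses a comma before the milliseconds; WebVTT uses a period). The function name is illustrative, not from the sample:

```python
def format_timestamp(total_ms: int, srt: bool = True) -> str:
    """Format a duration in milliseconds as an SRT (hh:mm:ss,fff)
    or WebVTT (hh:mm:ss.fff) caption timestamp."""
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, ms = divmod(rest, 1_000)
    sep = "," if srt else "."
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{ms:03d}"
```

For example, 5540 milliseconds formats as 00:00:05,540 in SRT, matching the second caption in the offline output above.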

Usage and arguments

Usage: python captioning.py --input <input file>

Connection options include:

  • --key: Your Speech resource key. Overrides the SPEECH_KEY environment variable. You must set the environment variable (recommended) or use the --key option.
  • --region REGION: Your Speech resource region. Overrides the SPEECH_REGION environment variable. You must set the environment variable (recommended) or use the --region option. Examples: westus, northeurope

Input options include:

  • --input FILE: Input audio from file. The default input is the microphone.
  • --format FORMAT: Use compressed audio format. Valid only with --input. Valid values are alaw, any, flac, mp3, mulaw, and ogg_opus. The default value is any. To use a wav file, don't specify the format. This option is not available with the JavaScript captioning sample. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.

Language options include:

  • --language LANG: Specify a language using one of the corresponding supported locales. This is used when breaking captions into lines. Default value is en-US.

Recognition options include:

  • --offline: Output offline results. Overrides --realTime. Default output mode is offline.
  • --realTime: Output real-time results.

Real-time output includes Recognizing event results. The default offline output is Recognized event results only. Recognizing event results are always written to the console, never to an output file. The --quiet option overrides this. For more information, see Get speech recognition results.

Accuracy options include:

  • --phrases PHRASE1;PHRASE2: Specify a list of phrases to be recognized, separated by semicolons, such as Contoso;Jessie;Rehaan. For more information, see Improve recognition with phrase list.

Output options include:

  • --help: Show this help and stop
  • --output FILE: Output captions to the specified file. This flag is required.
  • --srt: Output captions in SRT (SubRip Text) format. The default format is WebVTT (Web Video Text Tracks). For more information about SRT and WebVTT caption file formats, see Caption output format.
  • --maxLineLength LENGTH: Set the maximum number of characters per line for a caption to LENGTH. Minimum is 20. Default is 37 (30 for Chinese).
  • --lines LINES: Set the number of lines for a caption to LINES. Minimum is 1. Default is 2.
  • --delay MILLISECONDS: How many MILLISECONDS to delay the display of each caption, to mimic a real-time experience. This option is only applicable when you use the --realTime flag. Minimum is 0.0. Default is 1000.
  • --remainTime MILLISECONDS: How many MILLISECONDS a caption should remain on screen if it is not replaced by another. Minimum is 0.0. Default is 1000.
  • --quiet: Suppress console output, except errors.
  • --profanity OPTION: Valid values: raw, remove, mask. For more information, see Profanity filter concepts.
  • --threshold NUMBER: Set the stable partial result threshold. The default value is 3. This option is only applicable when you use the --realTime flag. For more information, see Get partial results concepts.
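To illustrate how --maxLineLength and --lines might interact, here's a rough Python sketch of caption line-breaking. It's a simplified stand-in, not the sample's actual algorithm; in particular, the real sample carries overflow text into the next caption rather than dropping it:

```python
import textwrap

def break_caption(text: str, max_line_length: int = 37, lines: int = 2) -> list[str]:
    """Wrap caption text into at most `lines` lines of at most
    `max_line_length` characters each, breaking on word boundaries.
    Text beyond the line budget is dropped in this simplified sketch."""
    wrapped = textwrap.wrap(text, width=max_line_length)
    return wrapped[:lines]
```

With the defaults (37 characters, 2 lines), this produces captions shaped like the two-line cues in the sample output above.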

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

In this quickstart, you run a console app to create captions with speech to text.

Tip

Try the Speech Studio and choose a sample video clip to see real-time or offline processed captioning results.

Prerequisites

Set up the environment

Follow these steps and see the Speech CLI quickstart for other requirements for your platform.

  1. Run the following .NET CLI command to install the Speech CLI:

    dotnet tool install --global Microsoft.CognitiveServices.Speech.CLI
    
  2. Run the following commands to configure your Speech resource key and region. Replace SUBSCRIPTION-KEY with your Speech resource key and replace REGION with your Speech resource region.

    spx config @key --set SUBSCRIPTION-KEY
    spx config @region --set REGION
    

You must also install GStreamer for compressed input audio.

Create captions from speech

With the Speech CLI, you can output both SRT (SubRip Text) and WebVTT (Web Video Text Tracks) captions from any type of media that contains audio.

To recognize audio from a file and output both WebVTT (vtt) and SRT (srt) captions, follow these steps.

  1. Make sure that you have an input file named caption.this.mp4 in the path.

  2. Run the following command to output captions from the video file:

    spx recognize --file caption.this.mp4 --format any --output vtt file - --output srt file - --output each file - @output.each.detailed --property SpeechServiceResponse_StablePartialResultThreshold=5 --profanity masked --phrases "Contoso;Jessie;Rehaan"
    

    The SRT and WebVTT captions are output to the console as shown here:

    1
    00:00:00,180 --> 00:00:03,230
    Welcome to applied Mathematics course 201.
    WEBVTT
    
    00:00:00.180 --> 00:00:03.230
    Welcome to applied Mathematics course 201.
    {
      "ResultId": "561a0ea00cc14bb09bd294357df3270f",
      "Duration": "00:00:03.0500000"
    }
    

Usage and arguments

Here are details about the optional arguments from the previous command:

  • --file caption.this.mp4 --format any: Input audio from file. The default input is the microphone. For compressed audio files such as MP4, install GStreamer and see How to use compressed input audio.
  • --output vtt file - and --output srt file -: Outputs WebVTT and SRT captions to standard output. For more information about SRT and WebVTT caption file formats, see Caption output format. For more information about the --output argument, see Speech CLI output options.
  • @output.each.detailed: Outputs event results with text, offset, and duration. For more information, see Get speech recognition results.
  • --property SpeechServiceResponse_StablePartialResultThreshold=5: You can request that the Speech service return fewer Recognizing events that are more accurate. In this example, the Speech service must affirm recognition of a word at least five times before returning the partial results to you. For more information, see Get partial results concepts.
  • --profanity masked: You can specify whether to mask, remove, or show profanity in recognition results. For more information, see Profanity filter concepts.
  • --phrases "Contoso;Jessie;Rehaan": You can specify a list of phrases to be recognized, such as Contoso, Jessie, and Rehaan. For more information, see Improve recognition with phrase list.
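To make the stable-threshold idea concrete, here's a rough Python model (not the Speech service's actual algorithm): a word is treated as stable only after it holds the same position across the last N partial hypotheses, so raising the threshold yields fewer but steadier partial results:

```python
def stable_prefix(hypotheses: list[str], threshold: int) -> list[str]:
    """Return the leading words affirmed in the same position across the
    last `threshold` partial hypotheses. Illustrative model only."""
    if len(hypotheses) < threshold:
        return []  # Not enough hypotheses yet to affirm anything.
    recent = [h.split() for h in hypotheses[-threshold:]]
    stable = []
    for pos, word in enumerate(recent[-1]):
        # A word is stable only if every recent hypothesis agrees at pos.
        if all(len(h) > pos and h[pos] == word for h in recent):
            stable.append(word)
        else:
            break
    return stable
```

With a threshold of 5, as in the command above, a word must survive five consecutive hypotheses before it would be shown.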

Clean up resources

You can use the Azure portal or Azure Command Line Interface (CLI) to remove the Speech resource you created.

Next steps