Install and run Docker containers for the Speech service APIs

By using containers, you can run some of the Azure Cognitive Services Speech service APIs in your own environment. Containers are great for specific security and data governance requirements. In this article, you'll learn how to download, install, and run a Speech container.

With Speech containers, you can build a speech application architecture that's optimized for both robust cloud capabilities and edge locality. Several containers are available, which use the same pricing as the cloud-based Azure Speech services.

Important

We retired the standard speech synthesis voices and text-to-speech container on August 31, 2021. Consider migrating your applications to use the neural text-to-speech container instead. For more information on updating your application, see Migrate from standard voice to prebuilt neural voice.

| Container | Features | Latest | Release status |
|---|---|---|---|
| Speech-to-text | Analyzes sentiment and transcribes continuous real-time speech or batch audio recordings with intermediate results. | 3.6.0 | Generally available |
| Custom speech-to-text | Using a custom model from the Custom Speech portal, transcribes continuous real-time speech or batch audio recordings into text with intermediate results. | 3.6.0 | Generally available |
| Speech language identification | Detects the language spoken in audio files. | 1.5.0 | Preview |
| Neural text-to-speech | Converts text to natural-sounding speech by using deep neural network technology, which allows for more natural synthesized speech. | 2.5.0 | Generally available |

Prerequisites

Important

  • To use the Speech containers, you must submit an online request and have it approved. For more information, see the "Request approval to run the container" section.
  • Generally available containers meet Microsoft's stability and support requirements. Containers in preview are still under development.

You must meet the following prerequisites before you use Speech service containers. If you don't have an Azure subscription, create a free account before you begin. You need:

  • Docker installed on a host computer. Docker must be configured to allow the containers to connect with and send billing data to Azure.
    • On Windows, Docker must also be configured to support Linux containers.
    • You should have a basic understanding of Docker concepts.
  • A Speech service resource with the free (F0) or standard (S) pricing tier.

Gather required parameters

All Cognitive Services containers require three primary parameters: acceptance of the Microsoft Software License Terms (a value of accept), an endpoint URI, and an API key.

Endpoint URI

The {ENDPOINT_URI} value is available on the Azure portal Overview page of the corresponding Cognitive Services resource. Go to the Overview page, hover over the endpoint, and a Copy to clipboard icon appears. Copy and use the endpoint where needed.

Screenshot that shows gathering the endpoint URI for later use.

Keys

The {API_KEY} value is used to start the container and is available on the Azure portal's Keys page of the corresponding Cognitive Services resource. Go to the Keys page, and select the Copy to clipboard icon.

Screenshot that shows getting one of the two keys for later use.

Important

These subscription keys are used to access your Cognitive Services API. Don't share your keys. Store them securely. For example, use Azure Key Vault. We also recommend that you regenerate these keys regularly. Only one key is necessary to make an API call. When you regenerate the first key, you can use the second key for continued access to the service.

Host computer requirements and recommendations

The host is an x64-based computer that runs the Docker container. It can be a computer on your premises or a Docker hosting service in Azure, such as Azure Kubernetes Service or Azure Container Instances.

Container requirements and recommendations

The following table describes the minimum and recommended allocation of resources for each Speech container:

| Container | Minimum | Recommended |
|---|---|---|
| Speech-to-text | 4 core, 4-GB memory | 8 core, 6-GB memory |
| Custom speech-to-text | 4 core, 4-GB memory | 8 core, 6-GB memory |
| Speech language identification | 1 core, 1-GB memory | 1 core, 1-GB memory |
| Neural text-to-speech | 6 core, 12-GB memory | 8 core, 16-GB memory |

Each core must be at least 2.6 gigahertz (GHz).

Core and memory correspond to the --cpus and --memory settings, which are used as part of the docker run command.

Note

The minimum and recommended allocations are based on Docker limits, not the host machine resources. For example, the speech-to-text containers memory-map portions of a large language model. We recommend that the entire file fit in memory, which requires an additional 4 to 6 GB. Also, the first run of either container might take longer because models are being paged into memory.

Advanced Vector Extension support

The host is the computer that runs the Docker container. The host must support Advanced Vector Extensions 2 (AVX2). You can check for AVX2 support on Linux hosts with the following command:

grep -q avx2 /proc/cpuinfo && echo AVX2 supported || echo No AVX2 support detected

Warning

The host computer is required to support AVX2. The container will not function correctly without AVX2 support.

Request approval to run the container

Fill out and submit the request form to request access to the container.

The form requests information about you, your company, and the user scenario for which you'll use the container. After you submit the form, the Azure Cognitive Services team reviews it and emails you with a decision within 10 business days.

Important

  • On the form, you must use an email address associated with an Azure subscription ID.
  • The Azure resource you use to run the container must have been created with the approved Azure subscription ID.
  • Check your email (both inbox and junk folders) for updates on the status of your application from Microsoft.

After you're approved, you can run the container once you download it from the Microsoft Container Registry (MCR), as described later in the article.

You won't be able to run the container if your Azure subscription hasn't been approved.

Get the container image with docker pull

Container images for Speech are available from the Microsoft Container Registry (MCR).

| Container | Repository |
|---|---|
| Speech-to-text | mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest |

Tip

You can use the docker images command to list your downloaded container images. For example, the following command lists the ID, repository, and tag of each downloaded container image, formatted as a table:

docker images --format "table {{.ID}}\t{{.Repository}}\t{{.Tag}}"

IMAGE ID         REPOSITORY                TAG
<image-id>       <repository-path/name>    <tag-name>

Docker pull for the Speech containers

Docker pull for the speech-to-text container

Use the docker pull command to download a container image from Microsoft Container Registry:

docker pull mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest

Important

The latest tag pulls the en-US locale. For additional locales, see Speech-to-text locales.

Speech-to-text locales

All tags, except for latest, are in the following format and are case sensitive:

<major>.<minor>.<patch>-<platform>-<locale>-<prerelease>

The following tag is an example of the format:

2.6.0-amd64-en-us
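As an illustration of the tag format, the example tag can be split into its components with a short parser. This is a hypothetical helper for illustration only, not part of any SDK:

```python
def parse_speech_tag(tag: str) -> dict:
    """Split a Speech container tag of the form
    <major>.<minor>.<patch>-<platform>-<locale>[-<prerelease>]
    into its components. Illustrative helper only."""
    version, platform, *rest = tag.split("-")
    locale = "-".join(rest[:2])           # the locale itself contains a hyphen, e.g. en-us
    prerelease = "-".join(rest[2:]) or None
    return {
        "version": version,
        "platform": platform,
        "locale": locale,
        "prerelease": prerelease,
    }

print(parse_speech_tag("2.6.0-amd64-en-us"))
# {'version': '2.6.0', 'platform': 'amd64', 'locale': 'en-us', 'prerelease': None}
```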

For all the supported locales of the speech-to-text container, see Speech-to-text image tags.

Use the container

After the container is on the host computer, use the following process to work with the container.

  1. Run the container with the required billing settings. More examples of the docker run command are available.
  2. Query the container's prediction endpoint.

Run the container with docker run

Use the docker run command to run the container. For more information on how to get the {ENDPOINT_URI} and {API_KEY} values, see Gather required parameters. More examples of the docker run command are also available.

Run the container in disconnected environments

Starting in container version 3.0.0, select customers can run speech-to-text containers in an environment without internet accessibility. For more information, see Run Cognitive Services containers in disconnected environments.

Starting in container version 2.0.0, select customers can run neural-text-to-speech containers in an environment without internet accessibility. For more information, see Run Cognitive Services containers in disconnected environments.

To run the standard speech-to-text container, execute the following docker run command:

docker run --rm -it -p 5000:5000 --memory 4g --cpus 4 \
mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY}

This command:

  • Runs a speech-to-text container from the container image.
  • Allocates 4 CPU cores and 4 GB of memory.
  • Exposes TCP port 5000 and allocates a pseudo-TTY for the container.
  • Automatically removes the container after it exits. The container image is still available on the host computer.

Note

Containers support compressed audio input to the Speech SDK by using GStreamer. To install GStreamer in a container, follow Linux instructions for GStreamer in Use codec compressed audio input with the Speech SDK.

Diarization on the speech-to-text output

Diarization is enabled by default. To get diarization in your response, use diarize_speech_config.set_service_property.

  1. Set the phrase output format to Detailed.

  2. Set the mode of diarization. The supported modes are Identity and Anonymous.

    # Assumes: import azure.cognitiveservices.speech as speechsdk
    # and that diarize_speech_config is a speechsdk.SpeechConfig instance.
    diarize_speech_config.set_service_property(
        name='speechcontext-PhraseOutput.Format',
        value='Detailed',
        channel=speechsdk.ServicePropertyChannel.UriQueryParameter
    )
    
    diarize_speech_config.set_service_property(
        name='speechcontext-phraseDetection.speakerDiarization.mode',
        value='Identity',
        channel=speechsdk.ServicePropertyChannel.UriQueryParameter
    )
    

    Note

    "Identity" mode returns "SpeakerId": "Customer" or "SpeakerId": "Agent". "Anonymous" mode returns "SpeakerId": "Speaker 1" or "SpeakerId": "Speaker 2".
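As an illustration of how diarized output can be consumed, the following sketch groups recognized phrases by speaker. The SpeakerId values follow the convention described above and DisplayText appears later in this article; the overall list shape is an illustrative stand-in, not a documented response schema:

```python
# Illustrative stand-in for diarized speech-to-text results.
# Only the SpeakerId and DisplayText field names come from this article.
phrases = [
    {"SpeakerId": "Speaker 1", "DisplayText": "Hello, how can I help?"},
    {"SpeakerId": "Speaker 2", "DisplayText": "I'd like to check my order."},
    {"SpeakerId": "Speaker 1", "DisplayText": "Sure, one moment."},
]

# Group the recognized text by speaker, preserving order.
transcript = {}
for phrase in phrases:
    transcript.setdefault(phrase["SpeakerId"], []).append(phrase["DisplayText"])

for speaker, lines in transcript.items():
    print(f"{speaker}: {' '.join(lines)}")
```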

Analyze sentiment on the speech-to-text output

Starting in v2.6.0 of the speech-to-text container, use the Language service v3.0 API endpoint instead of the preview endpoint. For example:

  • https://eastus.api.cognitive.microsoft.com/text/analytics/v3.0/sentiment
  • https://localhost:5000/text/analytics/v3.0/sentiment

Note

The Language service v3.0 API isn't backward compatible with v3.0-preview.1. To get the latest sentiment feature support, use v2.6.0 of the speech-to-text container image and Language service v3.0.

Starting in v2.2.0 of the speech-to-text container, you can call the sentiment analysis v3 API on the output. To call sentiment analysis, you'll need a Language service API resource endpoint. For example:

  • https://eastus.api.cognitive.microsoft.com/text/analytics/v3.0-preview.1/sentiment
  • https://localhost:5000/text/analytics/v3.0-preview.1/sentiment

If you're accessing a Language service endpoint in the cloud, you need a key. If you're running Language service features locally, you might not need to provide a key.

The key and endpoint are passed to the Speech container as arguments, as in the following example:

docker run -it --rm -p 5000:5000 \
mcr.microsoft.com/azure-cognitive-services/speechservices/speech-to-text:latest \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY} \
CloudAI:SentimentAnalysisSettings:TextAnalyticsHost={TEXT_ANALYTICS_HOST} \
CloudAI:SentimentAnalysisSettings:SentimentAnalysisApiKey={SENTIMENT_APIKEY}

This command:

  • Performs the same steps as the preceding command.
  • Stores a Language service API endpoint and key, for sending sentiment analysis requests.

Phraselist v2 on the speech-to-text output

Starting in v2.6.0 of the speech-to-text container, you can bias the output toward your own phrases, whether a phrase spans the whole sentence or appears in the middle of one. For example, the tall man in the following sentence:

  • "This is a sentence the tall man this is another sentence."

To configure a phrase list, you need to add your own phrases when you make the call. For example:

    phrase="the tall man"
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=dict_speech_config,
        audio_config=audio_config)
    # Attach a phrase list grammar to the recognizer and add the phrase.
    phrase_list_grammar = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
    phrase_list_grammar.addPhrase(phrase)
    
    dict_speech_config.set_service_property(
        name='setflight',
        value='xonlineinterp',
        channel=speechsdk.ServicePropertyChannel.UriQueryParameter
    )

If you have multiple phrases to add, call .addPhrase() for each phrase to add it to the phrase list.

Important

The Eula, Billing, and ApiKey options must be specified to run the container. Otherwise, the container won't start. For more information, see Billing.

Query the container's prediction endpoint

Note

Use a unique port number if you're running multiple containers.

| Containers | SDK Host URL | Protocol |
|---|---|---|
| Standard speech-to-text and custom speech-to-text | ws://localhost:5000 | WS |
| Neural text-to-speech, Speech language identification | http://localhost:5000 | HTTP |

For more information on using WSS and HTTPS protocols, see Container security.

Speech-to-text (standard and custom)

The container provides WebSocket-based query endpoint APIs that are accessed through the Speech SDK. By default, the Speech SDK uses online speech services. To use the container, you need to change the initialization method.

Tip

When you use the Speech SDK with containers, you don't need to provide the Azure Speech resource subscription key or an authentication bearer token.

See the following examples.

Change from using this Azure-cloud initialization call:

var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

To using this call with the container host:

var config = SpeechConfig.FromHost(
    new Uri("ws://localhost:5000"));
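If you're using the Python Speech SDK, the equivalent host-based initialization is a one-line configuration change. This sketch assumes the container is listening on port 5000 of the local host:

```python
# Requires the Speech SDK: pip install azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk

# Point the SDK at the container host instead of the Azure cloud endpoint.
speech_config = speechsdk.SpeechConfig(host="ws://localhost:5000")
```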

Analyze sentiment

If you provided your Language service API credentials to the container, you can use the Speech SDK to send speech recognition requests with sentiment analysis. You can configure the API responses to use either a simple or detailed format.

Note

v1.13 of the Speech Service Python SDK has an identified issue with sentiment analysis. Use v1.12.x or earlier if you're using sentiment analysis in the Speech Service Python SDK.

To configure the Speech client to use a simple format, add "Sentiment" as a value for Simple.Extensions. To choose a specific Language service model version, replace latest in the speechcontext-phraseDetection.sentimentAnalysis.modelversion property.

speech_config.set_service_property(
    name='speechcontext-PhraseOutput.Simple.Extensions',
    value='["Sentiment"]',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)
speech_config.set_service_property(
    name='speechcontext-phraseDetection.sentimentAnalysis.modelversion',
    value='latest',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)

Simple.Extensions returns the sentiment result in the root layer of the response.

{
   "DisplayText":"What's the weather like?",
   "Duration":13000000,
   "Id":"6098574b79434bd4849fee7e0a50f22e",
   "Offset":4700000,
   "RecognitionStatus":"Success",
   "Sentiment":{
      "Negative":0.03,
      "Neutral":0.79,
      "Positive":0.18
   }
}
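Given a response in the simple format above, the sentiment scores can be read directly from the root of the JSON. A minimal sketch using the example response:

```python
import json

# Response body in the simple format shown above.
response_body = """
{
   "DisplayText": "What's the weather like?",
   "Duration": 13000000,
   "Id": "6098574b79434bd4849fee7e0a50f22e",
   "Offset": 4700000,
   "RecognitionStatus": "Success",
   "Sentiment": {"Negative": 0.03, "Neutral": 0.79, "Positive": 0.18}
}
"""

result = json.loads(response_body)
scores = result["Sentiment"]
# Pick the sentiment label with the highest score.
dominant = max(scores, key=scores.get)
print(result["DisplayText"], "->", dominant)
```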

If you want to completely disable sentiment analysis, set sentimentanalysis.enabled to false.

speech_config.set_service_property(
    name='speechcontext-phraseDetection.sentimentanalysis.enabled',
    value='false',
    channel=speechsdk.ServicePropertyChannel.UriQueryParameter
)

Neural text-to-speech

The container provides REST-based endpoint APIs. Many sample source code projects for platform, framework, and language variations are available.

With the neural text-to-speech containers, you should rely on the locale and voice of the image tag you downloaded. For example, if you downloaded the latest tag, the default locale is en-US and the default voice is AriaNeural. The {VOICE_NAME} argument would then be en-US-AriaNeural. See the following example SSML:

<speak version="1.0" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        This text will get converted into synthesized speech.
    </voice>
</speak>
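As one way to exercise the container's REST endpoint, the SSML above can be posted with a short script. The /cognitiveservices/v1 path, header names, and output format below mirror the cloud text-to-speech REST API and are assumed to apply to the container as well; verify them against your container's Swagger page:

```python
import urllib.request

def build_tts_request(host: str, ssml: str) -> urllib.request.Request:
    """Build a synthesis request for a neural text-to-speech container.
    The path and headers mirror the cloud TTS REST API (an assumption
    for the container; check the /swagger page to confirm)."""
    return urllib.request.Request(
        url=f"{host}/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        headers={
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        },
        method="POST",
    )

ssml = """<speak version="1.0" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        This text will get converted into synthesized speech.
    </voice>
</speak>"""

request = build_tts_request("http://localhost:5000", ssml)
# To actually synthesize (requires a running container):
# with urllib.request.urlopen(request) as response:
#     audio = response.read()  # WAV bytes
```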

Run multiple containers on the same host

If you intend to run multiple containers with exposed ports, make sure to run each container with a different exposed port. For example, run the first container on port 5000 and the second container on port 5001.

You can have this container and a different Cognitive Services container running on the HOST together. You also can have multiple containers of the same Cognitive Services container running.

Validate that a container is running

There are several ways to validate that the container is running. Locate the external IP address and exposed port of the container in question, and open your favorite web browser. Use the request URLs that follow to validate that the container is running. The example request URLs listed here use http://localhost:5000, but your container's external IP address and exposed port might differ.

| Request URL | Purpose |
|---|---|
| http://localhost:5000/ | The container provides a home page. |
| http://localhost:5000/ready | Requested with GET, this URL verifies that the container is ready to accept a query against the model. This request can be used for Kubernetes liveness and readiness probes. |
| http://localhost:5000/status | Also requested with GET, this URL verifies whether the api-key used to start the container is valid, without causing an endpoint query. This request can be used for Kubernetes liveness and readiness probes. |
| http://localhost:5000/swagger | The container provides a full set of documentation for the endpoints and a Try it out feature. With this feature, you can enter your settings into a web-based HTML form and make the query without having to write any code. After the query returns, an example CURL command is provided to demonstrate the required HTTP headers and body format. |
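These checks can also be scripted. The following is a minimal sketch using only the Python standard library; the base URL is the example value and should be replaced with your container's external IP address and exposed port:

```python
import urllib.request
import urllib.error

def container_is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the container's /ready endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # Example value; substitute your container's host and exposed port.
    print("ready:", container_is_ready("http://localhost:5000"))
```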

Container's home page

Stop the container

To shut down the container, in the command-line environment where the container is running, select Ctrl+C.

Troubleshooting

When you start or run the container, you might experience issues. Use an output mount and enable logging. Doing so allows the container to generate log files that are helpful when you troubleshoot issues.

Tip

For more troubleshooting information and guidance, see Cognitive Services containers frequently asked questions (FAQ).

If you're having trouble running a Cognitive Services container, you can try using the Microsoft diagnostics container. Use this container to diagnose common errors in your deployment environment that might prevent Cognitive Services containers from functioning as expected.

To get the container, use the following docker pull command:

docker pull mcr.microsoft.com/azure-cognitive-services/diagnostic

Then run the container. Replace {ENDPOINT_URI} with your endpoint, and replace {API_KEY} with your key to your resource:

docker run --rm mcr.microsoft.com/azure-cognitive-services/diagnostic \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY}

The container tests for network connectivity to the billing endpoint.

Billing

The Speech containers send billing information to Azure by using a Speech resource on your Azure account.

Queries to the container are billed at the pricing tier of the Azure resource that's used for the ApiKey parameter.

Azure Cognitive Services containers aren't licensed to run without being connected to the metering or billing endpoint. You must enable the containers to communicate billing information with the billing endpoint at all times. Cognitive Services containers don't send customer data, such as the image or text that's being analyzed, to Microsoft.

Connect to Azure

The container needs the billing argument values to run. These values allow the container to connect to the billing endpoint. The container reports usage about every 10 to 15 minutes. If the container doesn't connect to Azure within the allowed time window, the container continues to run but doesn't serve queries until the billing endpoint is restored. The connection is attempted 10 times at the same time interval of 10 to 15 minutes. If it can't connect to the billing endpoint within the 10 tries, the container stops serving requests. See the Cognitive Services container FAQ for an example of the information sent to Microsoft for billing.

Billing arguments

The docker run command starts the container when all three of the following options are provided with valid values:

| Option | Description |
|---|---|
| ApiKey | The API key of the Cognitive Services resource that's used to track billing information. The value must be set to an API key for the provisioned resource that's specified in Billing. |
| Billing | The endpoint of the Cognitive Services resource that's used to track billing information. The value must be set to the endpoint URI of a provisioned Azure resource. |
| Eula | Indicates that you accepted the license for the container. The value must be set to accept. |

For more information about these options, see Configure containers.

Summary

In this article, you learned the concepts and workflow for downloading, installing, and running Speech containers. In summary:

  • Speech provides four Linux containers for Docker that have various capabilities:
    • Speech-to-text
    • Custom speech-to-text
    • Neural text-to-speech
    • Speech language identification
  • Container images are downloaded from the container registry in Azure.
  • Container images run in Docker.
  • Whether you use the REST API (text-to-speech only) or the SDK (speech-to-text or text-to-speech), you specify the host URI of the container.
  • You're required to provide billing information when you instantiate a container.

Important

Cognitive Services containers aren't licensed to run without being connected to Azure for metering. Customers need to enable the containers to communicate billing information with the metering service at all times. Cognitive Services containers don't send customer data (for example, the image or text that's being analyzed) to Microsoft.

Next steps