Cuir in eagar

Comhroinn trí


Text to speech containers with Docker

The neural text to speech container converts text to natural-sounding speech by using deep neural network technology, which allows for more natural synthesized speech. In this article, you learn how to download, install, and run a Text to speech container.

For more information about prerequisites, validating that a container is running, running multiple containers on the same host, and running disconnected containers, see Install and run Speech containers with Docker.

Container images

The neural text to speech container image for all supported versions and locales can be found on the Microsoft Container Registry (MCR) syndicate. It resides within the azure-cognitive-services/speechservices/ repository and is named neural-text-to-speech.

A screenshot of the search connectors and triggers dialog.

The fully qualified container image name is, mcr.microsoft.com/azure-cognitive-services/speechservices/neural-text-to-speech. Either append a specific version or append :latest to get the most recent version.

Version Path
Latest mcr.microsoft.com/azure-cognitive-services/speechservices/neural-text-to-speech:latest

The latest tag pulls the en-US locale and en-us-arianeural voice.
3.5.0 mcr.microsoft.com/azure-cognitive-services/speechservices/neural-text-to-speech:3.5.0-amd64-en-us-arianeural

All tags, except for latest, are in the following format and are case sensitive:

<major>.<minor>.<patch>-<platform>-<voice>-<preview>

The tags are also available in JSON format for your convenience. The body includes the container path and list of tags. The tags aren't sorted by version, but "latest" is always included at the end of the list as shown in this snippet:

{
  "name": "azure-cognitive-services/speechservices/neural-text-to-speech",
  "tags": [
    <--redacted for brevity-->
    "3.5.0-amd64-uk-ua-ostapneural",
    "3.5.0-amd64-zh-cn-xiaochenneural-preview",
    "3.5.0-amd64-zh-cn-xiaohanneural",
    "3.5.0-amd64-zh-cn-xiaomoneural",
    "3.5.0-amd64-zh-cn-xiaoqiuneural-preview",
    "3.5.0-amd64-zh-cn-xiaoruineural",
    "3.5.0-amd64-zh-cn-xiaoshuangneural-preview",
    "3.5.0-amd64-zh-cn-xiaoxiaoneural",
    "3.5.0-amd64-zh-cn-xiaoyanneural-preview",
    "3.5.0-amd64-zh-cn-xiaoyouneural",
    "3.5.0-amd64-zh-cn-yunxineural",
    "3.5.0-amd64-zh-cn-yunyangneural",
    "3.5.0-amd64-zh-cn-yunyeneural",
    "latest"
  ]
}

Important

We retired the standard speech synthesis voices and standard text to speech container on August 31, 2021. You should use neural voices with the neural-text-to-speech container version 3.0 and higher instead.

Starting from February 29, 2024, the text to speech and neural text to speech container versions 2.19 and earlier aren't supported. For more information on updating your application, see Migrate from standard voice to prebuilt neural voice.

Get the container image with docker pull

You need the prerequisites including required hardware. Also see the recommended allocation of resources for each Speech container.

Use the docker pull command to download a container image from Microsoft Container Registry:

docker pull mcr.microsoft.com/azure-cognitive-services/speechservices/neural-text-to-speech:latest

Important

The latest tag pulls the en-US locale and en-us-arianeural voice. For additional locales and voices, see text to speech container images.

Run the container with docker run

Use the docker run command to run the container.

The following table represents the various docker run parameters and their corresponding descriptions:

Parameter Description
{ENDPOINT_URI} The endpoint is required for metering and billing. For more information, see billing arguments.
{API_KEY} The API key is required. For more information, see billing arguments.

When you run the text to speech container, configure the port, memory, and CPU according to the text to speech container requirements and recommendations.

Here's an example docker run command with placeholder values. You must specify the ENDPOINT_URI and API_KEY values:

docker run --rm -it -p 5000:5000 --memory 12g --cpus 6 \
mcr.microsoft.com/azure-cognitive-services/speechservices/neural-text-to-speech \
Eula=accept \
Billing={ENDPOINT_URI} \
ApiKey={API_KEY}

This command:

  • Runs a neural text to speech container from the container image.
  • Allocates 6 CPU cores and 12 GB of memory.
  • Exposes TCP port 5000 and allocates a pseudo-TTY for the container.
  • Automatically removes the container after it exits. The container image is still available on the host computer.

For more information about docker run with Speech containers, see Install and run Speech containers with Docker.

Use the container

Speech containers provide websocket-based query endpoint APIs that are accessed through the Speech SDK and Speech CLI. By default, the Speech SDK and Speech CLI use the public Speech service. To use the container, you need to change the initialization method.

Important

When you use the Speech service with containers, be sure to use host authentication. If you configure the key and region, requests will go to the public Speech service. Results from the Speech service might not be what you expect. Requests from disconnected containers will fail.

Instead of using this Azure-cloud initialization config:

var config = SpeechConfig.FromSubscription(...);

Use this config with the container host:

var config = SpeechConfig.FromHost(
    new Uri("http://localhost:5000"));

Instead of using this Azure-cloud initialization config:

auto speechConfig = SpeechConfig::FromSubscription(...);

Use this config with the container host:

auto speechConfig = SpeechConfig::FromHost("http://localhost:5000");

Instead of using this Azure-cloud initialization config:

speechConfig, err := speech.NewSpeechConfigFromSubscription(...)

Use this config with the container host:

speechConfig, err := speech.NewSpeechConfigFromHost("http://localhost:5000")

Instead of using this Azure-cloud initialization config:

SpeechConfig speechConfig = SpeechConfig.fromSubscription(...);

Use this config with the container host:

SpeechConfig speechConfig = SpeechConfig.fromHost("http://localhost:5000");

Instead of using this Azure-cloud initialization config:

const speechConfig = sdk.SpeechConfig.fromSubscription(...);

Use this config with the container host:

const speechConfig = sdk.SpeechConfig.fromHost("http://localhost:5000");

Instead of using this Azure-cloud initialization config:

SPXSpeechConfiguration *speechConfig = [[SPXSpeechConfiguration alloc] initWithSubscription:...];

Use this config with the container host:

SPXSpeechConfiguration *speechConfig = [[SPXSpeechConfiguration alloc] initWithHost:"http://localhost:5000"];

Instead of using this Azure-cloud initialization config:

let speechConfig = SPXSpeechConfiguration(subscription: "", region: "");

Use this config with the container host:

let speechConfig = SPXSpeechConfiguration(host: "http://localhost:5000");

Instead of using this Azure-cloud initialization config:

speech_config = speechsdk.SpeechConfig(
    subscription=speech_key, region=service_region)

Use this config with the container endpoint:

speech_config = speechsdk.SpeechConfig(
    host="http://localhost:5000")

When you use the Speech CLI in a container, include the --host http://localhost:5000/ option. You must also specify --key none to ensure that the CLI doesn't try to use a Speech key for authentication. For information about how to configure the Speech CLI, see Get started with the Azure AI Speech CLI.

Try the text to speech quickstart using host authentication instead of key and region.

SSML voice element

When you construct a neural text to speech HTTP POST, the SSML message requires a voice element with a name attribute. The locale of the voice must correspond to the locale of the container model. The SSML tag support is consistent for each text to speech voice both in the Azure cloud and the container environment.

For example, a model that was downloaded via the latest tag (defaults to "en-US") would have a voice name of en-US-AriaNeural.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-AriaNeural">
        This is the text that is spoken.
    </voice>
</speak>

Next steps