Text to speech avatar overview (preview)

Note

Text to speech avatar is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Text to speech avatar converts text into a digital video of a photorealistic human (either a prebuilt avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The video can be synthesized asynchronously or in real time. Developers can build applications that integrate text to speech avatar through an API, or use the content creation tool in Speech Studio to create video content without coding.

Built on advanced neural network models, text to speech avatar lets you deliver lifelike, high-quality synthetic talking avatar videos for various applications while adhering to responsible AI practices.

Note

The text to speech avatar feature is only available in the following service regions: West US 2, West Europe, and Southeast Asia.

Azure AI text to speech avatar feature capabilities include:

  • Converts text into a digital video of a photorealistic human speaking with natural-sounding voices powered by Azure AI text to speech.
  • Provides a collection of prebuilt avatars.
  • Generates the avatar's voice with Azure AI text to speech. For more information, see Avatar voice and language.
  • Synthesizes text to speech avatar video asynchronously with the batch synthesis API or in real time.
  • Provides a content creation tool in Speech Studio for creating video content without coding.
  • Enables real-time avatar conversations through the live chat avatar tool in Speech Studio.

Tip

To convert text to speech with a no-code approach, try the Text to speech avatar tool in Speech Studio.

Avatar voice and language

You can choose from a range of prebuilt voices for the avatar. The language support for text to speech avatar is the same as the language support for text to speech. For details, see Language and voice support for the Speech service. Prebuilt text to speech avatars can be accessed through the Speech Studio portal or via API.

The voice in the synthetic video can be a prebuilt neural voice available on Azure AI Speech or the custom neural voice of a voice talent that you select.

Avatar video output

Both batch synthesis and real-time synthesis produce video at a resolution of 1920 x 1080 and 25 frames per second (FPS). For batch synthesis, the codec can be h264 or h265 when the format is mp4, or vp9 when the format is webm; only webm can contain an alpha channel. For real-time synthesis, the codec is h264. The video bitrate can be configured in the request for both batch synthesis and real-time synthesis; the default value is 2000000. More detailed configurations can be found in the sample code.

             Batch synthesis    Real-time synthesis
Resolution   1920 x 1080        1920 x 1080
FPS          25                 25
Codec        h264/h265/vp9      h264
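
The codec and format rules above can be sketched as a small pre-flight check. This is only an illustration: `VideoSettings`, `validate`, and the `"batch"`/`"realtime"` mode strings are hypothetical names, not part of the avatar API, and the real request schema should be taken from the sample code.

```python
from dataclasses import dataclass

# Constraints transcribed from the text above. All names are hypothetical.
BATCH_CODECS_BY_FORMAT = {"mp4": {"h264", "h265"}, "webm": {"vp9"}}
REALTIME_CODECS = {"h264"}
DEFAULT_BITRATE = 2_000_000  # default value stated in the text


@dataclass(frozen=True)
class VideoSettings:
    mode: str                    # "batch" or "realtime" (hypothetical labels)
    codec: str
    video_format: str = "mp4"    # only checked for batch synthesis
    bitrate: int = DEFAULT_BITRATE
    alpha_channel: bool = False  # only checked for batch synthesis


def validate(settings: VideoSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if settings.mode == "batch":
        allowed = BATCH_CODECS_BY_FORMAT.get(settings.video_format)
        if allowed is None:
            problems.append(f"unsupported format: {settings.video_format}")
        elif settings.codec not in allowed:
            problems.append(
                f"codec {settings.codec} is not valid for {settings.video_format}"
            )
        if settings.alpha_channel and settings.video_format != "webm":
            problems.append("only webm can contain an alpha channel")
    elif settings.mode == "realtime":
        if settings.codec not in REALTIME_CODECS:
            problems.append("real-time synthesis supports only h264")
    else:
        problems.append(f"unknown mode: {settings.mode}")
    if settings.bitrate <= 0:
        problems.append("bitrate must be positive")
    return problems


if __name__ == "__main__":
    print(validate(VideoSettings(mode="batch", codec="vp9", video_format="mp4")))
```

Running the check before a request is submitted surfaces an invalid combination, such as vp9 with mp4, without spending a synthesis call on it.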

Custom text to speech avatar

You can create custom text to speech avatars that are unique to your product or brand. All it takes to get started is 10 minutes of video recordings. If you also create a custom neural voice for the actor, the avatar can be highly realistic. For more information, see What is custom text to speech avatar.

Custom neural voice and custom text to speech avatar are separate features. You can use them independently or together. If you plan to use custom neural voice with a text to speech avatar, you need to deploy or copy your custom neural voice model to one of the regions that support avatars.

Sample code

Sample code for text to speech avatar is available on GitHub. These samples cover the most popular scenarios.

Pricing

  • When you use the text to speech avatar feature, you're charged based on the minutes of video output. With the real-time avatar, however, you're charged based on the minutes that the avatar is active, regardless of whether the avatar is speaking or silent. To optimize costs for real-time avatar usage, see the tips in the sample code (search for "Use Local Video for Idle").
  • During a real-time avatar session or batch content creation, text to speech, speech to text, Azure OpenAI, and any other Azure services that you use are charged separately.
  • For more information, see Speech service pricing. Avatar pricing is visible only for service regions where the feature is available: West US 2, West Europe, and Southeast Asia.
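
The difference between the two billing bases can be shown with a short sketch. It assumes nothing about actual rates or rounding, which aren't stated here; the function names are hypothetical and the result is only the number of minutes that would be metered.

```python
def batch_billable_minutes(video_output_seconds: float) -> float:
    """Batch synthesis: metered on the length of the video output."""
    return video_output_seconds / 60


def realtime_billable_minutes(
    active_seconds: float, speaking_seconds: float = 0.0
) -> float:
    """Real-time avatar: metered on how long the avatar is active.

    speaking_seconds is accepted only to make the point explicit: it does
    not affect the result, because silent time is metered like speaking time.
    """
    return active_seconds / 60


if __name__ == "__main__":
    # A 10-minute session in which the avatar speaks for only 2 minutes
    # is still metered as 10 minutes.
    print(realtime_billable_minutes(active_seconds=600, speaking_seconds=120))
```

This is why ending the avatar session during idle periods, as the "Use Local Video for Idle" tip suggests, reduces the metered time.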

Available locations

The text to speech avatar feature is only available in the following service regions: West US 2, West Europe, and Southeast Asia.
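
An application can check the configured region before it attempts avatar synthesis. The sketch below hard-codes the three regions listed above; the helper names are hypothetical, and the list should be treated as a snapshot that might change while the feature is in preview.

```python
# Regions listed above, keyed by their normalized form. Names are hypothetical.
AVATAR_REGIONS = {
    "westus2": "West US 2",
    "westeurope": "West Europe",
    "southeastasia": "Southeast Asia",
}


def normalize_region(name: str) -> str:
    """Lowercase and strip whitespace, so 'West US 2' becomes 'westus2'."""
    return "".join(name.split()).lower()


def is_avatar_region(name: str) -> bool:
    """Return True if the region is one of the avatar regions listed above."""
    return normalize_region(name) in AVATAR_REGIONS


if __name__ == "__main__":
    for region in ("West US 2", "southeastasia", "East US"):
        print(region, is_avatar_region(region))
```

The same check applies to a custom neural voice model, which needs to be deployed or copied to one of these regions before it can be used with an avatar.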

Responsible AI

We care about the people who use AI and the people who will be affected by it as much as we care about technology. For more information, see the Responsible AI transparency notes and disclosure for voice and avatar talent.

Next steps