Text to speech avatar overview

Text to speech avatar converts text into a digital video of a photorealistic human (either a prebuilt avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool in Speech Studio to create video content without coding.

Built on advanced neural network models, text to speech avatar lets you deliver lifelike, high-quality synthetic talking avatar videos for various applications while adhering to responsible AI practices.

Tip

To convert text to speech with a no-code approach, try the Text to speech avatar tool in Speech Studio.

Avatar capabilities

Text to speech avatar capabilities include:

  • Converts text into a digital video of a photorealistic human speaking with a natural-sounding voice.
  • Provides a collection of prebuilt avatars.
  • Generates the avatar's voice with Azure AI text to speech. For more information, see Avatar voice and language.
  • Synthesizes text to speech avatar video asynchronously with the batch synthesis API, or in real time.
  • Provides a content creation tool in Speech Studio for creating video content without coding.
  • Enables real-time avatar conversations through the live chat avatar tool in Speech Studio.

Avatar voice and language

You can choose from a range of prebuilt voices for the avatar. The language support for text to speech avatar is the same as the language support for text to speech. For details, see Language and voice support for the Speech service. Prebuilt text to speech avatars can be accessed through the Speech Studio portal or via API.

The voice in the synthetic video can be a prebuilt neural voice available on Azure AI Speech or a custom neural voice of a voice talent that you select.
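As an illustration of selecting a voice, the following minimal sketch wraps plain text in SSML with a `voice` element, which is how Azure AI text to speech normally identifies the voice to use. The helper name `build_ssml` is hypothetical, and the voice name shown is only an example; check Language and voice support for the Speech service for the voices available to you, and check the sample code for how the avatar API expects this input.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice_name: str, lang: str = "en-US") -> str:
    """Wrap plain text in SSML that selects a specific voice.

    build_ssml is a hypothetical helper for illustration only.
    """
    # quoteattr adds the surrounding quotes and escapes attribute values;
    # escape protects characters such as & and < in the spoken text.
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # "en-US-JennyNeural" is used here purely as an example voice name.
    print(build_ssml("Welcome to our product tour.", "en-US-JennyNeural"))
```

A custom neural voice would be referenced the same way by its voice name; any extra deployment details it requires are outside the scope of this sketch.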

Avatar video output

Batch synthesis and real-time synthesis both produce video at a resolution of 1920 x 1080 and 25 frames per second (FPS). For batch synthesis, the codec can be h264, hevc, or av1 when the format is mp4, and vp9 or av1 when the format is webm; only vp9 can contain an alpha channel. For real-time synthesis, the codec is h264. You can configure the video bitrate in the request for both batch synthesis and real-time synthesis; the default value is 2000000. For more detailed configuration options, see the sample code.

              Batch synthesis        Real-time synthesis
Resolution    1920 x 1080            1920 x 1080
FPS           25                     25
Codec         h264/hevc/vp9/av1      h264
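The format and codec rules for batch synthesis can be checked before a request is sent. The following is a minimal sketch under stated assumptions: the function name `batch_video_settings` and the dictionary keys it returns are hypothetical and are not the actual request field names, which are shown in the sample code. Only the allowed combinations, the alpha-channel rule, and the default bitrate come from the description above.

```python
# Allowed codecs per container format for batch synthesis, as described above.
ALLOWED_CODECS = {
    "mp4": {"h264", "hevc", "av1"},
    "webm": {"vp9", "av1"},
}
# Only vp9 can contain an alpha channel.
ALPHA_CAPABLE_CODECS = {"vp9"}
DEFAULT_BITRATE = 2_000_000


def batch_video_settings(video_format, codec, bitrate=DEFAULT_BITRATE,
                         transparent=False):
    """Validate a format/codec pair and return illustrative settings.

    The returned keys are hypothetical placeholders, not real API fields.
    """
    video_format = video_format.lower()
    codec = codec.lower()
    if video_format not in ALLOWED_CODECS:
        raise ValueError(f"unsupported format: {video_format}")
    if codec not in ALLOWED_CODECS[video_format]:
        raise ValueError(f"codec {codec} is not valid for {video_format}")
    if transparent and codec not in ALPHA_CAPABLE_CODECS:
        raise ValueError("an alpha channel requires the vp9 codec")
    if bitrate <= 0:
        raise ValueError("bitrate must be positive")
    return {
        "format": video_format,
        "codec": codec,
        "bitrate": bitrate,
        "alpha": transparent,
        "resolution": (1920, 1080),  # fixed output resolution
        "fps": 25,                   # fixed frame rate
    }


if __name__ == "__main__":
    print(batch_video_settings("webm", "vp9", transparent=True))
```

Because vp9 is only valid with webm, asking for a transparent background implicitly requires the webm format as well; the check above rejects any other pairing.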

Custom text to speech avatar

You can create custom text to speech avatars that are unique to your product or brand. All you need to get started is 10 minutes of video recordings. If you also create a custom neural voice for the actor, the avatar can be highly realistic. For more information, see What is custom text to speech avatar.

Custom neural voice and custom text to speech avatar are separate features. You can use them independently or together. If you plan to use custom neural voice with a text to speech avatar, you need to deploy or copy your custom neural voice model to one of the regions that support text to speech avatar.

Sample code

Sample code for text to speech avatar is available on GitHub. These samples cover the most popular scenarios.

Pricing

  • During an avatar real-time session or batch content creation, text to speech, speech to text, Azure OpenAI, and any other Azure services are charged separately.
  • To learn how billing works for the text to speech avatar feature, see the text to speech avatar pricing note.
  • For detailed pricing, see Speech service pricing. Avatar pricing is visible only for service regions where the feature is available, including Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, and West US 2.

Available locations

The text to speech avatar feature is only available in the following service regions: Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, and West US 2.
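A resource's region can be checked against this list before you create avatar content or copy a custom neural voice model. The following is a minimal sketch: the helper names are hypothetical, and the set is a snapshot of the regions named above that you would need to keep in sync with the service. It assumes region identifiers follow the usual Azure short-name pattern, which is the display name in lowercase with spaces removed.

```python
# Snapshot of the regions listed above, written as Azure short names.
AVATAR_REGIONS = frozenset({
    "southeastasia",
    "northeurope",
    "westeurope",
    "swedencentral",
    "southcentralus",
    "westus2",
})


def normalize_region(region: str) -> str:
    """Turn a display name such as 'West US 2' into 'westus2'."""
    return region.replace(" ", "").lower()


def is_avatar_region(region: str) -> bool:
    """Return True if the region is in the snapshot list above."""
    return normalize_region(region) in AVATAR_REGIONS


if __name__ == "__main__":
    for name in ("West US 2", "swedencentral", "East US"):
        print(name, is_avatar_region(name))
```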

Responsible AI

We care about the people who use AI and the people who will be affected by it as much as we care about technology. For more information, see the Responsible AI transparency notes and disclosure for voice and avatar talent.

Next steps