What are OpenAI text to speech voices?

Like Azure AI Speech voices, OpenAI text to speech voices deliver high-quality speech synthesis that converts written text into natural-sounding spoken audio. This unlocks a wide range of possibilities for immersive and interactive user experiences.

OpenAI text to speech voices are available via two model variants: Neural and NeuralHD.

  • Neural: Optimized for real-time use cases with the lowest latency, but lower quality than NeuralHD.
  • NeuralHD: Optimized for quality.

For a demonstration of OpenAI voices in Azure OpenAI Studio and Speech Studio, view this introductory video.

Available text to speech voices in Azure AI services

You might ask: If I want to use an OpenAI text to speech voice, should I use it via the Azure OpenAI Service or via Azure AI Speech? What are the scenarios that guide me to use one or the other?

Each voice model offers distinct features and capabilities, so understanding the options and the differences between the text to speech voices in Azure AI services helps you choose the one that best suits your needs.

You can choose from the following text to speech voices in Azure AI services:

  • OpenAI text to speech voices in Azure OpenAI Service. Available in the following regions: North Central US and Sweden Central.
  • OpenAI text to speech voices in Azure AI Speech. Available in the following regions: North Central US and Sweden Central.
  • Azure AI Speech service text to speech voices. Available in dozens of regions. See the region list.

OpenAI text to speech voices via Azure OpenAI Service or via Azure AI Speech?

If you want to use OpenAI text to speech voices, you can choose whether to use them via Azure OpenAI or via Azure AI Speech. In either case, the speech synthesis result is the same.
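As a rough illustration of the Azure OpenAI route, the sketch below builds (but does not send) a REST request for speech synthesis. The URL path, the `api-version` value, the `api-key` header, the JSON body fields, and the `tts-1` model name are assumptions that don't come from this article, and the endpoint, deployment name, and key are placeholders. Check the Azure OpenAI reference before relying on any of them.

```python
import json
import urllib.request


def build_tts_request(endpoint, deployment, api_key, text,
                      voice="alloy", model="tts-1",
                      api_version="2024-02-15-preview"):
    """Build a POST request for text to speech without sending it.

    The path layout, header name, and body fields are assumptions
    about the Azure OpenAI REST API, not details from this article.
    """
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/audio/speech?api-version={api_version}")
    body = json.dumps({"model": model, "input": text, "voice": voice})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values only; substitute your own resource details.
    request = build_tts_request(
        "https://YOUR-RESOURCE.openai.azure.com",
        "YOUR-DEPLOYMENT",
        "YOUR-API-KEY",
        "I'm excited to try text to speech.",
    )
    print(request.get_method(), request.full_url)
```

To actually synthesize audio, you would pass the request to `urllib.request.urlopen` and write the response bytes to a file. Building the request separately keeps the sketch testable without network access.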

The following table compares OpenAI text to speech voices in Azure OpenAI Service, OpenAI text to speech voices in Azure AI Speech, and Azure AI Speech voices.

| Feature | Azure OpenAI Service (OpenAI voices) | Azure AI Speech (OpenAI voices) | Azure AI Speech voices |
|---|---|---|---|
| Region | North Central US, Sweden Central | North Central US, Sweden Central | Available in dozens of regions. See the region list. |
| Voice variety | 6 | 6 | More than 400 |
| Multilingual voice number | 6 | 6 | 14 |
| Max multilingual language coverage | 57 | 57 | 77 |
| Speech Synthesis Markup Language (SSML) support | Not supported | Support for a subset of SSML elements. | Support for the full set of SSML in Azure AI Speech. |
| Development options | REST API | Speech SDK, Speech CLI, REST API | Speech SDK, Speech CLI, REST API |
| Deployment option | Cloud only | Cloud only | Cloud, embedded, hybrid, and containers. |
| Real-time or batch synthesis | Real-time | Real-time and batch synthesis | Real-time and batch synthesis |
| Latency | greater than 500 ms | greater than 500 ms | less than 300 ms |
| Sample rate of synthesized audio | 24 kHz | 8, 16, 24, and 48 kHz | 8, 16, 24, and 48 kHz |
| Speech output audio format | opus, mp3, aac, flac | opus, mp3, pcm, truesilk | opus, mp3, pcm, truesilk |
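One way to apply the comparison is to encode a few of its rows as data and filter on your requirements. The following is a minimal sketch, not an official selection tool: the `Offering` class and `candidates` function are hypothetical names, and the values are copied from the SSML, batch synthesis, sample rate, and audio format rows of the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    name: str
    ssml: str                # "none", "subset", or "full"
    batch: bool              # True if batch synthesis is listed
    sample_rates_khz: tuple
    formats: tuple


SPEECH_RATES = (8, 16, 24, 48)
SPEECH_FORMATS = ("opus", "mp3", "pcm", "truesilk")

OFFERINGS = (
    Offering("Azure OpenAI Service (OpenAI voices)", "none", False,
             (24,), ("opus", "mp3", "aac", "flac")),
    Offering("Azure AI Speech (OpenAI voices)", "subset", True,
             SPEECH_RATES, SPEECH_FORMATS),
    Offering("Azure AI Speech voices", "full", True,
             SPEECH_RATES, SPEECH_FORMATS),
)


def candidates(need_ssml=False, need_batch=False,
               sample_rate_khz=None, audio_format=None):
    """Return the names of offerings that satisfy every stated requirement."""
    matches = []
    for offering in OFFERINGS:
        if need_ssml and offering.ssml == "none":
            continue
        if need_batch and not offering.batch:
            continue
        if sample_rate_khz is not None and sample_rate_khz not in offering.sample_rates_khz:
            continue
        if audio_format is not None and audio_format not in offering.formats:
            continue
        matches.append(offering.name)
    return matches


if __name__ == "__main__":
    print(candidates(need_batch=True, sample_rate_khz=48))
```

The filter ignores region, latency, and deployment options. Those rows could be added in the same way if they matter for your scenario.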

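Azure AI Speech voices offer features and capabilities that aren't available with OpenAI voices. For example, Azure AI Speech voices support the full set of SSML elements, whereas OpenAI voices in Azure AI Speech support only a subset.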

SSML elements supported by OpenAI text to speech voices in Azure AI Speech

Speech Synthesis Markup Language (SSML), used together with the input text, determines the structure, content, and other characteristics of the text to speech output. For example, you can use SSML to define a paragraph, a sentence, a break or a pause, or silence. You can wrap text with event tags such as bookmark or viseme that can be processed later by your application.

The following table outlines the SSML elements supported by OpenAI text to speech voices in Azure AI Speech. Only this subset of SSML tags is supported for OpenAI voices. See SSML document structure and events for more information.

| SSML element name | Description |
|---|---|
| <speak> | Encloses the entire content to be spoken. It's the root element of an SSML document. |
| <voice> | Specifies a voice used for text to speech output. |
| <sub> | Indicates that the alias attribute's text value should be pronounced instead of the element's enclosed text. |
| <say-as> | Indicates the content type, such as number or date, of the element's text. All of the interpret-as property values are supported for this element except interpret-as="name". For example, <say-as interpret-as="date" format="dmy">10-12-2016</say-as> is supported, but <say-as interpret-as="name">ED</say-as> isn't supported. For more information, see pronunciation with SSML. |
| <s> | Denotes sentences. |
| <lang> | Indicates the default locale for the language that you want the neural voice to speak. |
| <break> | Use to override the default behavior of breaks or pauses between words. |
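Because only this subset is supported, it can help to check an SSML document before sending it to an OpenAI voice. The sketch below is a hypothetical helper, not part of the Speech SDK: it parses the SSML with Python's standard library and reports any element outside the supported list, plus any `say-as` element that uses `interpret-as="name"`. The voice name `en-US-AlloyMultilingualNeural` in the sample is an assumption about how OpenAI voices are named in Azure AI Speech.

```python
import xml.etree.ElementTree as ET

# Element names taken from the supported-elements table above.
SUPPORTED = {"speak", "voice", "sub", "say-as", "s", "lang", "break"}

# Sample document that uses only supported elements.
# The voice name is an assumption, not confirmed by this article.
GOOD_SSML = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-AlloyMultilingualNeural">
    <s>The meeting is on <say-as interpret-as="date" format="dmy">10-12-2016</say-as>.</s>
    <break time="500ms"/>
    <s>Ask the <sub alias="World Wide Web Consortium">W3C</sub> for details.</s>
  </voice>
</speak>
"""


def unsupported_elements(ssml):
    """Return a list of unsupported constructs, in document order."""
    problems = []
    for element in ET.fromstring(ssml).iter():
        # Drop the "{namespace}" prefix that ElementTree adds to tag names.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag not in SUPPORTED:
            problems.append(tag)
        elif tag == "say-as" and element.get("interpret-as") == "name":
            problems.append('say-as interpret-as="name"')
    return problems


if __name__ == "__main__":
    print(unsupported_elements(GOOD_SSML))
```

An empty list means every element in the document appears in the table. The check covers element names and the one documented `say-as` exception only, so it doesn't validate attribute values or nesting rules.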

Next steps