Transparency note: text to speech

What is a Transparency Note?

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how technology works, what its capabilities and limitations are, and how to achieve the best performance. Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who will use or be affected by your system.

Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI Principles into practice. To find out more, see the Microsoft AI principles.

The basics of text to speech

Introduction

Text to speech, part of Azure AI Speech, is a versatile tool that can convert written text into natural-sounding speech audio. The feature takes input in the form of text and generates high-quality speech audio output that can be played on devices. For the speech audio output, text to speech offers a range of prebuilt neural voices or, for Limited Access customers, the option to create a custom neural voice for your product or brand.

Text to speech also has visual capabilities. Using text to speech avatar (Preview), customers can input text and create a synthetic video of an avatar speaking. Both prebuilt text to speech avatars and custom text to speech avatars are available, which can be used with both prebuilt neural voice and custom neural voice, though some features are only available for Limited Access customers.

In a text to speech system, customers can turn written information into audible speech and improve accessibility for users. Whether listening to documents or enhancing user experiences with synthesized speech, text to speech transforms text into natural-sounding spoken words.

Key terms

Term	Definition
Real-time speech synthesis	Use the Speech SDK or REST API to convert text to speech by using prebuilt neural voice, prebuilt text to speech avatar, custom neural voice, and custom text to speech avatar.
Voice model	In a text to speech system, a voice model refers to a machine learning-based model or algorithm that generates synthetic speech from written text. This model is trained to convert text input into spoken language output, mimicking the characteristics of a human voice, including pitch, tone, and pronunciation.
Prosody	Prosody refers to the modulation of speech elements such as pitch, duration, volume, and pauses to infuse synthetic voices with a natural and expressive quality, conveying emotional nuances and contextual meaning, thereby reducing the robotic quality of the generated speech, and making it more engaging and comprehensible to listeners.
Speech Synthesis Markup Language (“SSML”)	Speech Synthesis Markup Language (SSML) is an XML-based markup language that's used to customize text to speech outputs. With SSML, you can adjust pitch, add pauses, improve pronunciation, change speaking rate, adjust volume, and attribute multiple voices to a single document. You can use SSML to define your own lexicons or switch to different speaking styles.
Asynchronous synthesis of long audio	Use the batch synthesis API (preview) to asynchronously synthesize text to speech files longer than 10 minutes (for example, audio books or lectures). Unlike synthesis performed via the Speech SDK or Speech to text REST API, responses aren't returned in real-time. The expectation is that requests are sent asynchronously, responses are polled for, and synthesized audio is downloaded when the service makes it available.
Visemes	Visemes are the key poses in observed speech, including the position of the lips, jaw, and tongue in producing a particular phoneme. Visemes have a strong correlation with voices and phonemes.

Introduction

Prebuilt neural voice provides a wide range of voices, offering over 400 options in more than 140 languages and locales. These text to speech voices enable you to quickly integrate read-aloud functionality into your applications for enhanced accessibility.

Key terms

Term	Definition
Prebuilt neural voice	Microsoft offers a set of prebuilt neural voices, which use deep neural networks to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis happen simultaneously, which results in more fluid and natural-sounding outputs. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz, and the output can be upsampled or downsampled to other formats.

Term

Definition

Prebuilt neural voice

Microsoft offers a set of prebuilt neural voices, which use deep neural networks to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis happen simultaneously, which results in more fluid and natural-sounding outputs. Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz, and the output can be upsampled or downsampled to other formats.

Introduction

Custom neural voice is a Limited Access text to speech feature that allows customers to create a one-of-a-kind customized synthetic voice for their applications by providing their own audio data of their selected voice talents. For more information on custom neural voice, see Overview of custom neural voice.

Key terms

In addition to the common terms from prebuilt neural voice, the following key terms are relevant to custom neural voice:

Term	Definition
Custom neural voice	Custom neural voice is a text to speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a natural-sounding voice for your brand or character by providing human speech samples as training data.
Recording scripts	A recording script is used for building a custom neural voice. This script includes a variety of sentences or phrases carefully chosen to cover a wide range of phonemes, linguistic features, and contextual scenarios. When trained on a diverse recording script, custom neural voice is capable of accurately and naturally synthesizing speech for a broad range of applications and scenarios.
Voice talent	Individuals whose voices are recorded and used to create synthetic voice models. Customers must get sufficient consent under all relevant laws and regulations from the voice talent to use their voice to create a custom neural voice.
Personal voice	Personal voice is a new feature in preview that allows customers to create synthetic voices on behalf of their users quickly from a short human voice sample as training data (also known as the prompt).

Introduction

Prebuilt text to speech avatar offers a range of prebuilt avatar choices that can speak using either a custom neural voice or a prebuilt neural Voice. Prebuilt text to speech avatars can be accessed through a UI tool on the Speech Studio or from an API and can be synthesized in real time or in batch mode. Currently we provide a small collection of prebuilt avatars and we aim to continue to expand the number and diversity of prebuilt avatars.

Key terms

In addition to the common terms from prebuilt neural voice, the following key terms are relevant to prebuilt text to speech avatar:

Term	*Definition
Real-time synthesis	The synthetic avatar video will be generated in almost real time after the system receives the text input.
Batch mode synthesis	The synthetic avatar video will be generated asynchronously after the system receives text input. The generated video output can be downloaded in batch mode synthesis.

Introduction

Limited Access customers can also create a custom text to speech avatar for their product or brand. To do this, a customer provides their own video recording of avatar talent, which the custom text to speech avatar service uses as training data to create a synthetic video of the custom avatar speaking. Custom avatars may use either a prebuilt neural voice or a custom neural voice. If a customer pairs a custom neural voice with a custom avatar using the same individual’s voice and image/likeness, the resulting custom avatar will be very similar to that individual.

Key terms

In addition to the common terms from prebuilt neural voice, custom neural voice, and prebuilt text to speech avatar, the following key terms are relevant to custom text to speech avatar:

Term	Definition
Avatar talent	Custom text to speech avatar model building requires training on a video recording of a real human speaking. This person is the avatar talent. Customers must get sufficient consent under all relevant laws and regulations from the avatar talent to use their image/likeness to create a custom avatar.

Introduction

Video translation can efficiently localize your video content to cater to diverse audiences around the globe. This service empowers you to create immersive, localized content efficiently and effectively across various use cases such as vlogs, education, news, advertising, and more.

Video translation using prebuilt neural voices is available in preview for all users. Video translation with personal voice is a Limited Access feature in preview and is subject to use case and eligibility restrictions.

Key terms

Term	Definition
Batch mode	When customers upload video and specify the source and target language, video translation will return the translated video asynchronously.
Content editing	Customer can modify the translated content with prebuilt neural voice or with authorized personal voice.

Capabilities

System behavior

Text to speech

Text to speech converts text into natural-sounding speech.

Below are the main options for calling the text to speech service.

Real-time text to speech API

This is a common API call via the Speech SDK or REST API to send a text input and receive an audio output in real time. The Speech system uses a text to speech voice model to convert the text into human-like synthetic speech. The output audio can be saved as a file or be played back to an output device such as a speaker (learn more about how to synthesize speech from text). Users can also use SSML to fine-tune the text to speech output.

Text to speech models are trained on large amounts of diverse audio across typical usage scenarios and a wide range of speakers. For example, the text to speech service is often used for voice-enabled chat bots or for audio content creation.

Batch synthesis API

Batch synthesis is another type of API call. It’s typically used to send large text files and to receive audio outputs asynchronously (that is, at a later time). To use this API, you can specify locations for multiple text files. The text to speech technology reads the text input from the file and generates audio files that are returned to the storage location that you specify. This feature is used to support larger speech synthesis jobs in which it is not necessary to provide end users with the audio output in real time. An example is to create audio books.

Text to speech – custom neural voice

Custom neural voice is a text to speech feature that allows Limited Access customers to create a one-of-a-kind customized synthetic voice for their applications by providing their own audio data of customer’s selected voice talents.

With custom neural voice, you can record your voice talent by having them read Microsoft-provided scripts in the Speech Studio and quickly create a synthetic voice that sounds like your voice talent using a lite project (Preview). A lite project is ideal for a quick trial or a proof of concept.

With a pro project, you can upload studio-recorded high-quality voice data of your selected voice talent and create a realistic-sounding voice. Pro supports highly natural voice training that even more closely resembles your voice talent’s voice and can be adapted to speak in multiple emotions and across languages, without the need for additional emotion-specific or language-specific training data.

Once a custom neural voice is created, you can deploy the voice model with a unique endpoint and use the model to generate synthetic speech with the real-time synthesis API or the batch synthesis API described above.

For more information on custom neural voice, see Overview of custom neural voice.

Personal voice

The personal voice feature allows Limited Access customers to create a voice model from a short human voice sample (also called a prompt). The feature can create a voice model based on the prompt in as little as a few seconds. This feature is typically used to power personalized voice experiences for business customers’ applications. Personal voice models can create realistic-sounding voices that can speak in close to 100 languages.

Watermarks are added to custom neural voices created with the personal voice feature. Watermarks allow users to identify whether speech is synthesized using Azure AI Speech, and specifically, which voice was used. Eligible customers can use Azure AI Speech watermark detection capabilities. To request to add watermark detection to your applications please contact mstts[at]microsoft.com.

For more information on personal voice, see personal voice.

Text to speech avatar (preview)

Text to speech avatar converts text into a digital video of a photorealistic human (either a prebuilt avatar or a custom avatar) speaking with a natural-sounding voice powered by text to speech features like prebuilt neural voice or custom neural voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool on Speech Studio to create video content without coding.

With text to speech avatar’s advanced neural network models, the feature empowers users to deliver life-like and high-quality synthetic talking avatar videos for various applications.

Video translation (preview)

Video translation can efficiently localize your video content to cater to diverse audiences around the globe. Video translation will automatically extract dialogue audio, transcribe, translate and dub the content with prebuilt or personal voice to the target language, with accurate subtitles for better accessibility. Multi-speaker features will help identify the number of individuals speaking and recommend most suitable voices. Content editing with human in the loop allows for precise alignment with customer preference. Enhanced translation quality ensures precise audio and video alignment with GPT integration. Video translation enables authentic and personalized dubbing experiences with personal voice.

Use cases

Text to speech offers a variety of features catering to a wide range of intended uses across industries and domains. All text to speech features including video translation are subject to the terms and conditions applicable to customers’ Azure subscription, including the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech.

In addition, custom text to speech features like custom neural voice, personal voice, and custom text to speech avatar are limited to the approved use cases as outlined in the specific scenarios described below:

Intended uses for Custom Neural Voice Pro and Custom Neural Voice Lite

The following are the approved use cases for Custom Neural Voice Pro and Custom Neural Voice Lite:

Educational or interactive learning: To create a fictional brand or character voice for reading or speaking educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
Media: Entertainment: To create a fictional brand or character voice for reading or speaking entertainment content for video games, movies, TV, recorded music, podcasts, audio books, or augmented or virtual reality.
Media: Marketing: To create a fictional brand or character voice for reading or speaking marketing and product or service media, product introductions, business promotion, or advertisements.
Self-authored content: To create a voice for reading content authored by the voice talent.
Accessibility Features: For use in audio description systems and narration, including any fictional brand or character voice, or to facilitate communication by people with speech impairments.
Interactive Voice Response (IVR) Systems: To create voices, including any fictional brand or character voice, for call center operations, telephony systems, or responses for phone interactions.
Public Service and Informational Announcements: To create a fictional brand or character voice for communicating public service information, including announcements for public venues, or for informational broadcasts such as traffic, weather, event information, and schedules. This use case is not intended for journalistic or news content.
Translation and Localization: For use in translation applications for translating conversations in different languages or translating audio media.
Virtual Assistant or Chatbot: To create a fictional brand or character voice for smart assistants in or for virtual web assistants, appliances, cars, home appliances, toys, control of IoT devices, navigation systems, reading out personal messages, virtual companions, or customer service scenarios.

Intended uses for personal voice

The personal voice API (see Personal voice for more information) is available in Limited Access preview. Eligible customers can integrate the personal voice API with their applications supporting personal voices for the following use cases only:

In applications where voice output is constrained and defined by customers who meet Limited Access eligibility criteria, and where the voice does not read user-generated or open-ended content. Voice model usage must remain within the application and output must not be publishable or shareable from the application. Some examples of applications that fit this description are voice assistants in smart devices and customizing a character voice in gaming.
Dubbing voices for the purpose of translating content to different languages for films, TV, video, and audio, where customers who meet Limited Access eligibility criteria maintain sole control over the creation of, access to, and use of the voice models and their output.
Creating audio and video content for business scenarios to communicate product information, marketing materials, business promotional content, and internal business communications.

All other uses of custom neural voice, including Custom Neural Voice Pro, Custom Neural Voice Lite, and personal voice, are prohibited. In addition, custom neural voice is a Limited Access service, and registration is required for access to this service. To learn more about Microsoft’s Limited Access policy, refer to Limited Access features for Azure AI services. Certain features are only available to Microsoft managed customers and partners, and only for certain use cases approved by Microsoft at the time of registration.

Prebuilt neural voice may also be used for the custom neural voice use cases above, as well as additional use cases selected by customers and consistent with the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech. No registration or pre-approval is required for additional use cases for prebuilt neural voice that meet all applicable terms and conditions.

Intended uses for custom text to speech avatar and prebuilt text to speech avatar

The following are the approved use cases for custom text to speech avatar:

Virtual Assistant or Chatbot: To create virtual assistants, virtual companions, virtual sales assistants, or for customer service applications.
Content generation for enterprise contexts: For use to communicate product information, marketing materials, business promotional content, and internal business communications. Examples include character avatars or digital twins of a business leader to promote a brand.
Educational or interactive learning: To create a fictional brand or character avatar for presenting educational materials, online learning, interactive lesson plans, simulation learning, or guided museum tours.
Media: Entertainment: To present updates, share knowledge, create interactive media, or make talking head videos for entertainment scenarios such as videos, gaming, and augmented or virtual reality.
Accessibility Features: For use to facilitate communication by people with speech impairments.
Self-authored content: To create an avatar for reading content authored by the avatar talent.
Public Service and Informational Announcements: To create a fictional brand or character image for communicating public service information, including announcements for public venues, or for informational broadcasts such as traffic, weather, event information, and schedules. This use case is not intended for journalistic or news content.
Translation and Localization: For use in translation applications for translating conversations in different languages or translating audio media in video format.

All other uses of custom text to speech avatar are prohibited. In addition, custom text to speech avatar is a Limited Access service, and registration is required for access to this feature. To learn more about Microsoft’s Limited Access policy visit aka.ms/limitedaccesscogservices. Certain features are only available to Microsoft managed customers and partners, and only for certain use cases approved by Microsoft at the time of registration.

Prebuilt text to speech avatar may also be used for the custom avatar use cases above, as well as additional use cases selected by customers and consistent with Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech. No registration or pre-approval is required for additional use cases for prebuilt text to speech avatar that meet all applicable terms and conditions.

Considerations when choosing use cases

We encourage customers to use text to speech features in their innovative solutions or applications. All text to speech features must adhere to the Azure Acceptable Use Policy and the Code of conduct for Azure AI Speech text to speech. In addition, custom neural voice and custom text to speech avatars may only be used for the use cases approved through the Limited Access registration form. Additionally, here are some considerations when choosing a use case for any text to speech feature:

Ensure use case alignment: Ensure that the intended use of any text to speech feature aligns with the capabilities and intended purpose of the text to speech feature.
Responsible AI considerations: Prioritize responsible AI practices by avoiding the creation of misleading or harmful content. Adhere to privacy, data protection, and legal regulations when using text to speech features.
Review the code of conduct: Microsoft has established a code of conduct that prohibits certain uses of all text to speech features. Ensure compliance with the code of conduct when selecting a use case for text to speech services.
Exercise editorial control: Carefully consider using synthetic voices with content that lacks proper editorial control, as synthetic voices can sound human-like and amplify the effect of incorrect or misleading content.
Disclosure: Disclose the synthetic nature of voices, images, and/or videos to users such that users are not likely to be deceived or duped—or able to prank others—into believing they are interacting with a real person.

By adhering to these considerations, users can leverage both prebuilt and custom neural voice responsibly.

Limitations

The limitations of text to speech should be considered at the intersection of technology and the human, social, and organizational factors that influence its usage and impact. While text to speech offers advanced speech synthesis capabilities, there are certain limitations to be aware of in deploying it responsibly to minimize potential errors.

Technical limitations, operational factors, and ranges

Technical limitations to consider when using text to speech include the accuracy of pronunciation and intonation. While text to speech is designed to generate natural-sounding speech, it may encounter difficulties with certain words, names, or uncommon phrases. Users should be aware that there can be instances where the system may mispronounce or emphasize words incorrectly, especially when dealing with niche or domain-specific vocabulary.

It is important to note that certain populations may be more negatively impacted by these technical limitations. For example, individuals with hearing impairments who rely heavily on synthesized speech may face challenges in understanding unclear or distorted speech output. Similarly, users with cognitive or language-related disabilities may find it difficult to comprehend speech with unnatural intonation or mispronounced words.

Linguistic limitations: While we carefully curate and prepare training data to minimize biases, especially related to gender, ethnicity, or regional accents, and while text to speech supports multiple languages and accents, there may be variations in the quality and availability of voices across different languages. Customers should be aware of potential limitations in pronunciation accuracy, intonation, and linguistic nuances specific to certain languages or dialects.
Context and emotion: Text to speech may have limitations in accurately conveying contextual information and emotions. Customers should be mindful of the system's inability to understand the emotional nuances or subtle cues present in the input text. Considerations should be made to provide additional context or utilize other methods to convey emotions effectively.
Availability: Microsoft will provide customers with 12 months’ notice before removing any prebuilt neural voices from our catalogue, unless security, legal, or system performance considerations require an expedited removal. This does not apply to previews.

Each application is different, and our base model may not match your context or cover all scenarios required for your use case. We encourage developers to thoroughly evaluate the quality of text to speech synthetic voice and video with real-world data that reflects your use case, including testing with users from different demographic groups and with different speech characteristics. Please see the Quality of the voice model trained section for best practices for building high quality voice models.

In addition to ensuring performance, it is important to consider how to minimize risks of stereotyping and erasure that may result from synthetic voices and avatar. For example, if you are creating a custom neural voice for a smart voice assistant, carefully consider what voice is appropriate to create, and seek diverse perspectives from individuals from a variety of backgrounds. When building and evaluating your system, always seek diverse input.

Fairness considerations

At Microsoft, we strive to empower every person on the planet to do more. An essential part of this goal is working to create technologies and products that are fair and inclusive. Fairness is a multi-dimensional, socio-technical topic and impacts many different aspects of our product development. You can learn more about Microsoft’s approach to fairness here.

One important dimension to consider when using AI systems, including text to speech, is how well the system performs for different groups of people. Research has shown that without conscious effort focused on improving performance for all groups, AI systems can exhibit varying levels of performance across different demographic factors such as race, ethnicity, gender, and age.

As part of our evaluation of Azure AI text to speech, we have conducted an analysis to assess potential fairness harms. We have examined the system's performance across different demographic groups, aiming to identify any disparities or differences that may exist and could potentially impact fairness.

In some cases, there may be remaining performance disparities. It is important to note that these disparities may exceed the target, and we are actively working to address and minimize any potential biases or performance gaps, carefully consider the demographic group choice of the actor, and seek diverse perspectives from a variety of backgrounds.

Regarding representational harms, such as stereotyping, demeaning, or erasing outputs, we acknowledge the risks associated with these issues. While our evaluation process aims to mitigate such risks, we encourage users to consider their specific use cases carefully and implement additional mitigations as appropriate. Having a human in the loop can provide an extra layer of oversight to address any potential biases or unintended consequences. The use of blocklists or allowlists can also help ensure that the synthesized speech aligns with desired standards and avoids any harmful or inappropriate content.

We are committed to continuously improving our fairness evaluations to gain a deeper understanding of the system's performance across various demographic groups and potential fairness concerns. The evaluation process is ongoing, and we are actively working to enhance fairness and inclusivity, and mitigate any identified disparities. We understand the importance of addressing fairness considerations and strive to ensure that text to speech delivers reliable and equitable synthesized speech outputs.

Please note that this information represents what we know so far about fairness evaluations, and we remain dedicated to refining our evaluation methodologies and addressing any fairness concerns that may arise.

System performance

Performance for the text to speech system refers to how accurately and naturally it can convert written text into synthesized speech. This is measured using various metrics to evaluate the quality and effectiveness of the generated audio output. Some common performance metrics used include:

Mean opinion score (MOS): A rating system where judges provide a score that represents the overall quality of synthesized speech and avatar video. A higher MOS indicates better quality.
MOS gap: The difference between the MOS score of human recordings and the generated audios/videos. A smaller MOS Gap indicates a closer resemblance to human speech/human likeness.
Similarity MOS (SMOS): Measures the similarity of the generated audios/videos to the human recordings. A higher SMOS signifies better similarity.
Intelligibility: The percentage of correctly intelligible words in synthesized speech.

Even with state-of-the-art models, AI systems like text to speech can produce errors. For example, the system may produce synthesized speech with subtle unnatural intonations or pronunciation errors, leading to a less-than-ideal user experience, or the system may misinterpret text or struggle with unusual linguistic constructs, resulting in unnatural or unintelligible speech.

Best practices for improving system performance

To improve system performance and adapt system behavior in text to speech, there are several best practices that can be followed. These practices involve adjusting various components and parameters to optimize the tradeoffs and meet specific use case requirements. However, it is important to consider the potential impact on different populations to ensure fairness and inclusivity.

Using SSML (Speech Synthesis Markup Language) is considered a best practice to enhance text to speech output quality. SSML allows users to exert greater control over synthesized speech, enabling the customization of pronunciation, intonation, emphasis, and other prosodic features. By incorporating SSML tags into the text, users can add pauses, adjust speech rate, specify phonetic pronunciations, and control pitch and volume, among other parameters. This level of fine-tuning helps create more natural and expressive speech, making the text to speech output sound more human-like and engaging. All the SSML markups can be passed directly to the API. We also provide an online tool, Audio Content Creation, that allows customers to fine-tune using an intuitive user interface.

If your use case involves specialized vocabulary or domain-specific content, consider using the custom lexicon feature to improve the system's ability to accurately pronounce and convey domain-specific terms or phrases.

Evaluation of text to speech

Evaluation methods

Some commonly used metrics for evaluating text to speech overall system performance include:

Mean opinion score (MOS) gap with human recording: usually used to compare the quality of the text to speech voice model against a human recording. The quality of a voice model created by custom neural voice compared to that of a human recording is expected to be close, with a gap of no more than 0.5 in the MOS score.
For custom neural voice, you also can use Similarity MOS (SMOS) to measure how similar the custom voice sounds compared to the original human recordings. With SMOS studies, judges are asked to listen to a set of paired audios, one generated using the custom voice, the other from the original human recordings in the training data, and rate if the two audios in each pair are spoken by the same person, using a five-point scale (1 being the lowest, 5 the highest). The average score is reported as the SMOS score. We recommend that a good custom neural voice should achieve an SMOS higher than 4.0.
Besides measuring naturalness with MOS and SMOS, you can also assess the intelligibility of the voice model by checking the pronunciation accuracy of the generated speech. This is done by having judges listen to a set of testing samples, determining whether they can understand the meaning and indicate any words that were unintelligible to them. Intelligibility rate is calculated using the percentage of the correctly intelligible words among the total number of words tested (i.e., the number of intelligible words/the total number of words tested * 100%). Normally a usable text to speech engine needs to reach a score of > 98% for intelligibility.

Evaluation results

Text to speech consistently delivers high-quality and natural-sounding synthesized speech, meeting the requirements of diverse industries and domains. Our evaluations include extensive testing of the system's training and test data, ensuring that it represents the intended uses and operational factors encountered in real-world scenarios, as well as testing samples of synthesized speech outputs.

The evaluation results have influenced decisions about the constraints in the system's design, such as the maximum case size and the minimum amount of training data required. By analyzing the performance of the system across different data sets, settings, and parameters, appropriate constraints have been set to optimize the system's behavior, reliability, and safety.

While the evaluation covers a wide range of use cases, it is important to note that the results are generalizable to some extent across use cases that were not directly part of the evaluation. The robustness and performance of the system provides confidence in its ability to handle various scenarios, including those that may not have been explicitly tested.

Here are some recommended tests and score ranges based on our experience:

Measurement	Definition	How it is calculated	Recommended text size	Recommended score
MOS	Mean opinion score of the quality of the audios	Average of the rating scores of each judge on each audio	> 30 generated audios	> 4.0 (normally requires the MOS of the human recording is higher than 4.5)
MOS gap	The MOS score difference between human recordings and the generated audios	The MOS score on the human recordings minus the MOS score on the generated audios	> 10 human recordings, > 30 generated audios, > 20 judges on each audio	< 0.5
SMOS	The similarity of the generated audios to the human recordings	Average of the rating scores of the similarity level on each pair of audios	> 40 pairs, > 20 judges on each pair	> 4.0, > 3.5 (secondary language)
Intelligibility	The pronunciation accuracy of the generated speech at the word level	Percentage of the correctly intelligible words among the total number of words tested	> 60 generated audios, > 10 judges on each audio	> 98%

Evaluating and integrating text to speech for your use

Below are some best practices to help you responsibly integrate text to speech features into your use cases.

Disclose when the voice is synthetic

Disclosing that a voice is computer generated not only minimizes the risk of harmful outcomes from deception but also increases trust in the organization delivering the voice. Learn more about how to disclose.

Microsoft requires its customers to disclose the synthetic nature of text to speech voices to its users.

Make sure to provide adequate disclosure to audiences, especially when using the voice of a well-known person. People make judgments about information based in part on the person who delivers it, whether they do so consciously or unconsciously. For example, a disclosure could be verbally shared at the start of a broadcast. For more information, visit disclosure patterns.
Consider proper disclosure to parents or other parties with use cases that are designed for or may be used in situations involving minors and children. If your use case is intended for minors or children, you'll need to ensure that your disclosure is clear and transparent so that parents or legal guardians can understand the role of synthetic media and make an informed decision on behalf of minors or children about whether to use the experience.

Disclose when the avatar video is synthetic

Disclosing that an avatar speaking video is computer generated not only minimizes the risk of harmful outcomes from deception but also increases trust in the organization delivering the video. Learn more about how to disclose.

Microsoft requires its customers to disclose the synthetic nature of text to speech avatars to its users.

Make sure to provide adequate disclosure to audiences, especially when using the image (and voice) of a well-known person. People make judgments about information based in part on the person who delivers it, whether they do so consciously or unconsciously. For example, a disclosure could be made with a watermark, such as, “The voice and image in this video are AI-generated,” in text or verbally shared at the start of a video. For more information visit disclosure patterns.
Consider proper disclosure to parents or other parties with use cases that are designed for or may be used in situations involving minors and children. If your use case is intended for minors or children, you'll need to ensure that your disclosure is clear and transparent so that parents or legal guardians can understand the role of synthetic media and make an informed decision on behalf of minors or children about whether to use the experience.

Select appropriate voice types for your scenario

Carefully consider the context of use and the potential harms associated with using text to speech voices or avatars. For example, high-fidelity synthetic voices may not be appropriate in high-risk scenarios, such as for personal messaging, financial transactions, or complex situations that require human adaptability or empathy.

Users may also have different expectations for voice types and avatar expressions or gestures, depending on the context. For example, when listening to sensitive news read by a synthetic voice, some users prefer a more empathetic and human-like tone, while others prefer a neutral voice. Consider testing your application to better understand user preferences.

Be transparent about capabilities and limitations

Users are more likely to have higher expectations when interacting with high-fidelity synthetic voice agents. When system capabilities don't meet those expectations, trust can suffer, and may result in unpleasant, or even harmful experiences.

Provide optional human support

In ambiguous, transactional scenarios (for example, a call support center), users don't always trust a computer agent to appropriately respond to their requests. Human support may be necessary in these situations, regardless of the realistic quality of the voice or capability of the system.

Considerations for voice talent

When customers work with voice talent to create custom neural voice, the guidelines below apply.

Voice talent should have control over their voice model (how and where it will be used) and be compensated for its use. Microsoft requires custom neural voice customers to obtain explicit written permission from voice talent to create a synthetic voice and to ensure that the customer’s agreement with each individual contemplates the duration, use, and any content limitations. If you are creating a synthetic voice of a well-known person, you should provide a way for the voice talent to edit or approve the content of the output you plan to generate with the voice model.
Some voice talent may be unaware of potential malicious uses of technology and should be educated by system owners about the capabilities of the technology. Microsoft requires customers to share Microsoft’s Disclosure for voice and avatar talent with voice talent directly or through the voice talent’s authorized representative to describe how synthetic voices are developed and operate in conjunction with text to speech services.

Considerations for avatar talent

When customers work with avatar talent to create custom avatars, the guidelines below apply.

Avatar talent should have control over their avatar model (how and where it will be used) and be compensated for its use. Microsoft requires custom avatar customers to obtain explicit written permission from their avatar talent to create a synthetic text to speech avatar and ensure that the customer’s agreement with each individual contemplates the duration, use, and any content limitations. If you are creating a custom avatar of a well-known person, you should provide a way for the avatar talent to edit or approve the content of the output you plan to generate with the voice model.
Some avatar talent may be unaware of potential malicious uses of technology and should be educated by system owners about the capabilities of the technology. Microsoft requires customers to share Microsoft’s Disclosure for voice and avatar talent with avatar talent directly or through the avatar talent’s authorized representative to describe how synthetic avatar video is developed and operates in conjunction with text to speech services.

Considerations for people with speech disorders

When working with individuals with speech disorders to create or deploy synthetic voice technology, the following guidelines apply.

Provide guidelines for contracts with talent in accessibility scenarios

Customers should develop guidelines for establishing contracts with individuals who use synthetic voices for assistance in speaking. Customers should consider specifying in their contracts with individuals the duration of use, ownership transfer and/or license criteria, procedures for deleting the voice model, and how to prevent unauthorized access.

Account for inconsistencies in speech patterns

For individuals with speech disorders who record their own voice fonts, inconsistencies in their speech pattern (slurring or inability to pronounce certain words) may complicate the recording process. In these cases, synthetic voice technology and recording sessions should be designed with appropriate accommodations determined by the customer (for example, provide breaks or additional recording sessions).

Allow modification over time

Individuals with speech disorders may wish to update their synthetic voice to reflect changes due to aging or other factors. Individuals may also have stylistic preferences that change over time, and may want to make changes to pitch, accent, or other voice characteristics.

Transparency note: text to speech

What is a Transparency Note?

The basics of text to speech

Introduction

Key terms

Introduction

Key terms

Capabilities

System behavior

Text to speech

Real-time text to speech API

Batch synthesis API

Text to speech – custom neural voice

Personal voice

Text to speech avatar (preview)

Video translation (preview)

Use cases

Intended uses for Custom Neural Voice Pro and Custom Neural Voice Lite

Intended uses for personal voice

Intended uses for custom text to speech avatar and prebuilt text to speech avatar

Considerations when choosing use cases

Limitations

Technical limitations, operational factors, and ranges

Fairness considerations

System performance

Best practices for improving system performance

Evaluation of text to speech

Evaluation methods

Evaluation results

Evaluating and integrating text to speech for your use

Disclose when the voice is synthetic

Disclose when the avatar video is synthetic

Select appropriate voice types for your scenario

Be transparent about capabilities and limitations

Provide optional human support

Considerations for voice talent

Considerations for avatar talent

Considerations for people with speech disorders

Provide guidelines for contracts with talent in accessibility scenarios

Account for inconsistencies in speech patterns

Allow modification over time

Learn more about responsible AI

Learn more about Azure Speech