Disclosure for voice talent

The goal of this article is to help voice talent understand the technology behind the text-to-speech (TTS) capabilities that their voices help create. It also contains important privacy disclosures for voice talent about how Microsoft may process, use and retain audio files containing voice talent's recorded statements and Custom Neural Voice voice models to help Microsoft prevent, and/or respond to complaints of, misuse of Cognitive Services or Custom Neural Voice services.

Microsoft is committed to designing AI responsibly. We hope this note will foster a greater shared understanding among tech builders, voice talent, and the general public about the intended and beneficial uses of this technology.

Key TTS terms

Voice model: A text-to-speech computer model that can mimic the unique vocal characteristics of a target speaker. A voice model is also called a voice font or synthetic voice. A voice model is a set of parameters in binary format that is not human readable and does not contain audio recordings. It cannot be reverse engineered to derive or construct the audio recordings of a human being speaking.

Voice talent: Individuals or target speakers whose voices are recorded and used to create voice models that are intended to sound like the voice talent's voice.

Two broad categories of text-to-speech (TTS)

The following are the two broad categories of text-to-speech (TTS).

Standard TTS

How it works: The standard, or "traditional," method of TTS breaks down spoken language into phonetic snippets that can be remixed and matched using classical programming or statistical methods.
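
As a rough illustration of the "remix and match" idea described above, the toy Python sketch below stitches a word together from pre-recorded snippets. The snippet inventory, the unit names, and the `synthesize` function are all hypothetical, and plain strings stand in for audio. It is a simplification for intuition only, not a description of how any Microsoft voice is built.

```python
# Toy illustration only: strings stand in for recorded audio snippets.
# The inventory and the unit names below are hypothetical.
SNIPPETS = {
    "HH": "<hh>",
    "EH": "<eh>",
    "L": "<l>",
    "OW": "<ow>",
}


def synthesize(units):
    """Concatenate the recorded snippet for each phonetic unit, in order."""
    missing = [u for u in units if u not in SNIPPETS]
    if missing:
        # A unit that was never recorded cannot be produced by remixing.
        raise KeyError(f"no recorded snippet for: {missing}")
    return "".join(SNIPPETS[u] for u in units)


print(synthesize(["HH", "EH", "L", "OW"]))
```

In this simplified picture, a larger set of recorded lines covers more units and more contexts, which is one way to read the large recording volumes that standard TTS calls for.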

What to know about it: Standard TTS requires a large volume of voice data—in the range of 10,000 lines or more—to produce a more human-like voice model. With fewer recorded lines, a standard TTS voice model will tend to sound more obviously robotic.

Examples of how Microsoft uses it:

  • Platform Voice is a feature of the Speech Service on Azure that offers "off-the-shelf" voice models for public use. Platform Voices are also used in several Microsoft products including the Edge Browser, Narrator, Office, and Teams.
  • Custom Voice is a feature of the Speech Service on Azure that allows you to build a synthetic voice model using recordings from a voice talent to represent a specific persona for a corporation/enterprise.
  • Microsoft and/or Windows system voices are included in the Windows operating system. They are also used in several applications such as Narrator, Cortana, Edge Read Aloud, and Teams.

What to expect when recording: You will contribute at least 6,000 lines to produce a good-quality voice font.

Neural TTS

How it works: Neural TTS synthesizes speech using deep neural networks that have "learned" the way phonetics are combined in natural human speech rather than using classical programming or statistical methods. In addition to the recordings of a target voice talent, neural TTS uses a source library that contains voice recordings from many different speakers.

What to know about it: Because of the way it synthesizes voices, neural TTS can produce styles of speech that weren't part of the original recordings, such as changes in tone of voice and affectation. Neural TTS voices sound fluid and are good at replicating the natural pauses, idiosyncrasies, and hesitancy that people express when they're speaking. Those who hear synthetic voices made via neural TTS tend to rate them closer to human speech than standard TTS voices.

Examples of how Microsoft uses it:

  • Platform Voice is a feature of the Speech Service on Azure that offers "off-the-shelf" voice models for public use. Platform Voices are also used in several Microsoft products including the Edge Browser, Narrator, Office, and Teams.
  • Custom Neural Voice is a feature of the Speech Service on Azure that allows you to create a one-of-a-kind custom synthetic voice model for your brand. The following capabilities are used to produce Custom Neural Voices:
    • Language transfer lets the voice speak in a language different from that of the original voice recordings.
    • Style transfer lets the voice speak in a style different from that of the original voice recordings. For example, a newscaster voice.
    • Voice transformation lets the voice speak in a manner different from that of the original voice recordings. For example, modifying tone or pitch to create different character voices.
  • Other voices used in Microsoft's products and services, such as Cortana.

What to expect when recording: You will contribute at least 300 lines for a proof-of-concept voice model and about 2,000 lines to produce a new voice model for production use.
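
The line counts above can be read as rough stages. The short Python sketch below maps a number of recorded lines to the kind of neural voice model it could plausibly support. The function and constant names are hypothetical, and treating "about 2,000" as a hard cutoff is a simplifying assumption; actual requirements depend on the project.

```python
# Line counts come from the guidance above; the names are hypothetical.
PROOF_OF_CONCEPT_MIN_LINES = 300
# "About 2,000" is approximate in the text; a hard cutoff is an assumption.
PRODUCTION_APPROX_LINES = 2000


def neural_recording_stage(recorded_lines):
    """Return the stage a given number of recorded lines roughly supports."""
    if recorded_lines < 0:
        raise ValueError("recorded_lines must not be negative")
    if recorded_lines >= PRODUCTION_APPROX_LINES:
        return "production"
    if recorded_lines >= PROOF_OF_CONCEPT_MIN_LINES:
        return "proof of concept"
    return "below proof of concept"


for count in (120, 300, 2000):
    print(count, "->", neural_recording_stage(count))
```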

Voice talent and synthetic voices: an evolving relationship

Recognizing the integral relationship between voice talent and synthetic voices, Microsoft interviewed voice talent to better understand their perspectives on new developments in the technology. Research we conducted in 2019 showed that voice talent saw potential benefit from the capabilities introduced by neural TTS, such as saving studio time to complete recording jobs, and adding capacity to complete more voice acting assignments. At the same time, there were varying degrees of awareness about how developments in TTS technology could potentially impact their profession.

Overall, voice talent expressed a desire for transparency and clarity about:

  • Limits on what their voice likeness could and could not be used to express.
  • The duration of allowable use of their voice likeness.
  • Potential impact on future recording opportunities.
  • The persona that would be associated with their voice likeness.

Synthetic voice in wider use

Traditionally, TTS systems were somewhat limited in adoption due to their robotic sound. Most were used to support accessibility, for example as a screen reader for people who are blind or have low vision. TTS has also been used by people with a speech impairment. For instance, the late Stephen Hawking used a TTS-generated voice.

Now, with increasingly realistic-sounding synthetic voices and the uptick in more familiar, everyday interactions between machines and humans, the uses of this technology have proliferated and expanded. TTS systems power voice assistants across an array of devices and applications. They read out news, search results, public service announcements, educational content, and much more.

Microsoft's approach to responsible use of TTS

Every day, people find new ways to apply TTS technology, and not all are for the good of individuals or society. If misused, believably human-sounding TTS voices, especially a custom voice that mimics a real person, could cause harm. For example, a misinformation campaign could become much more potent if it used the voice of a well-known public figure.

We recognize that there's no perfect way to prevent media from being modified or to unequivocally prove where it came from. Therefore, our approach to responsible use has focused on being transparent about neural TTS, evaluating appropriate use, and demonstrating our values through action.

To use Custom Neural Voice, we contractually require you to do the following:

  • Obtain explicit written permission from voice talent to use that person's voice for the purpose of creating a custom voice.
  • Provide this document to voice talent so they can understand how TTS works, and how it may be used once they complete the audio recording process.
  • Get necessary permissions from voice talent for Microsoft's processing, use and retention of voice talent's audio files to perform speaker verification against training data and our use and retention of voice models as described below.

We also recommend that you do the following:

  • Share the intended contexts of use with voice talent so they are aware of who will hear their voice, in what scenarios, and whether/how people will be able to interact with it.
  • Ensure voice talent are aware that a voice model made from their recordings can say things they didn't specifically record in the studio.
  • Discuss whether there's anything they'd be uncomfortable with the voice model being used to say.

Microsoft's processing, use and retention of voice talent data

Microsoft's use of Voice talent audio files for Speaker Verification

You must obtain permission from voice talent to use their voice to create custom voice models for a synthetic voice. This technical safeguard is intended to help prevent misuse of our service by, for example, preventing someone from training voice models with audio recordings and using them to spoof a voice without the speaker's knowledge or consent.

In Speech Studio, you must upload an audio file with a recorded consent statement from the voice talent. Microsoft reserves the right to use Microsoft's speaker recognition technology on this recorded statement and verify it against the training audio data in order to provide some assurance that the voices came from the same speaker or as otherwise necessary to investigate misuse of the services.

The speaker's voice signatures created from the recorded statement files and training audio data are used by Microsoft solely for the purposes stated above. Microsoft will retain the recorded statement file for as long as necessary in order to preserve the security and integrity of Microsoft's Azure Cognitive Services. Learn more about how we process, use and retain this data in the Data and Privacy section.

Microsoft's use of Custom Neural Voice models

While you maintain the exclusive usage rights to your Custom Neural Voice model, Microsoft may independently retain a copy of Custom Neural Voice models for as long as necessary. Microsoft may use your Custom Neural Voice model for the sole purpose of protecting the security and integrity of Microsoft Azure Cognitive Services.

Microsoft will secure and store a copy of Voice Talent's recorded statement and Custom Neural Voice models with the same high level of security that it uses for its other Azure Services. Learn more at Microsoft Trust Center.

We will continue to identify and be explicit about the intentional, beneficial, and intended uses of TTS that are based upon existing social norms and expectations people have around media when they believe it to be real or fake. In line with Microsoft's trust principles, Microsoft does not actively monitor or moderate the audio content generated by your use of Custom Neural Voice. You are solely responsible for ensuring that your usage complies with all applicable laws and regulations and with the terms of your agreement with voice talent.

Microsoft's use of Voice Talent data with Custom Neural Voice Lite

Custom Neural Voice Lite is a project type in public preview that allows you to record 20-50 voice samples in Speech Studio and create a lightweight custom voice model for demonstration and evaluation purposes. Both the recording script and the testing script are pre-defined by Microsoft. The synthetic voice model created using the Custom Neural Voice Lite project can be deployed and used at your discretion after you apply for and are granted full access to Custom Neural Voice.

The synthetic voice and the related audio recording submitted via the Speech Studio will automatically be deleted within 90 days from the Speech Studio portal unless you decide to deploy the synthetic voice, in which case, you will control the duration of its retention. If the Voice Talent would like to have the synthetic voice and the related audio recordings deleted before 90 days, they can delete them on the portal directly, or contact their enterprise to do so.
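
To make the retention window above concrete, the Python sketch below computes the latest date by which automatic deletion would be expected. The function name is hypothetical, and counting the 90 days from the submission date is an assumption; the text above does not state exactly when the window starts.

```python
from datetime import date, timedelta

# The 90-day figure comes from the text above.
RETENTION_DAYS = 90


def latest_auto_deletion_date(submitted_on, deployed):
    """Return the latest expected automatic deletion date.

    Assumption: the 90 days are counted from the submission date.
    Returns None when the voice is deployed, because retention is then
    controlled by the customer rather than by the automatic window.
    """
    if deployed:
        return None
    return submitted_on + timedelta(days=RETENTION_DAYS)


print(latest_auto_deletion_date(date(2023, 1, 1), deployed=False))
print(latest_auto_deletion_date(date(2023, 1, 1), deployed=True))
```

Voice talent can still ask for earlier deletion, as described above; the sketch only covers the automatic case.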

Before you can deploy the Synthetic Voice model created using a Custom Neural Voice Lite project, the Voice Talent must provide an additional audio recording acknowledging that the synthetic voice will be used by their enterprise beyond demonstration and evaluation purposes.

Guidelines for responsible deployment

Because TTS is an adaptable technology, there are grey areas in determining how it should or shouldn't be used. To navigate these, we've formulated the following guidelines for using synthetic voice models:

  • Protect owners of voices from misuse or identity theft.
  • Prevent the proliferation of fake and misleading content.
  • Encourage use in scenarios where consumers expect to be interacting with synthetic content.
  • Encourage use in scenarios where consumers observe the generation of the synthetic content.

Examples of inappropriate use

TTS must not be used to:

  • Deceive people and/or intentionally misinform;
  • Claim to be from any person, company, government body, or entity without explicit permission to make that representation and/or impersonate to gain unauthorized information or privileges;
  • Create, incite, or disguise hate speech, discrimination, defamation, terrorism, or acts of violence;
  • Exploit or manipulate children;
  • Make unsolicited phone calls, bulk communications, posts, or messages;
  • Disguise policy positions or political ideologies;
  • Disseminate unattributed content or misrepresent the source.

Examples of appropriate use

Appropriate TTS use cases could include, but are not limited to:

  • Virtual agents based on fictional personas. For example, on-demand web searching, IoT control, or customer support provided by a company's branded character.
  • Entertainment media for use in fictional content. For example, movies, video games, TV, recorded music, or audio books.
  • Accredited educational institutions or educational media. For example, interactive lesson plans or guided museum tours.
  • Assistive technology and real-time translation. For example, individuals with ALS preserving their voices.
  • Public service announcements using fictional personas. For example, airport or train terminal announcements.

See also