Transparency Note and use cases for Custom Neural Voice

This Transparency Note discusses Custom Neural Voice and the key considerations for making use of this technology responsibly.

What is a Transparency Note?

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, its capabilities and limitations, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system, or share them with the people who will use or be affected by your system.

Microsoft's Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see Responsible AI principles from Microsoft.

Introduction to Custom Neural Voice

Custom Neural Voice is a Text-to-Speech (TTS) feature, part of Speech Service in Azure Cognitive Services, that allows customers to create a one-of-a-kind customized synthetic voice for their applications by providing their own audio data of their selected voice talents. For more information on Custom Neural Voice, see Overview of Custom Neural Voice.

Limited Access to Custom Neural Voice

Custom Neural Voice is a Limited Access service, and registration is required for access to some features. To learn more about Microsoft’s Limited Access policy, visit aka.ms/limitedaccesscogservices. Certain features are available only to Microsoft-managed customers and partners, and only for the use cases selected at the time of registration.

Approved use cases

The following use cases are approved for customers:

  • Media: Educational or interactive learning: For use to create a fictional brand or character voice for reading or speaking educational materials, online learning, interactive lesson plans, simulation learning, standardized testing, or guided museum tours.

  • Media: Entertainment: For use to create a fictional brand or character voice for reading or speaking entertainment content for video games, movies, TV, recorded music, podcasts, audio books, or augmented or virtual reality.

  • Media: Journalistic or news: For use to create voices for reading news or journalistic content that must be accompanied by a published, text version of the same content.

  • Media: Marketing: For use to create a fictional brand or character voice for reading or speaking marketing and product or service media, product introductions, business promotion, or advertisements.

  • Media: Self-authored content: For use to create a voice for reading content authored by the voice talent except where the voice is used to enhance the authority or credibility of the content in connection with financial, health, legal, political, or spiritual matters.

  • Accessibility Features: For use in audio description systems, narration, or to facilitate communication by individuals with speech impairments.

  • Interactive Voice Response (IVR) Systems: For use to create voices for call center operations, telephony systems, or responses for phone interactions.

  • Public Service Announcement: For use to create a fictional brand or character voice for announcements for public venues.

  • Translation and Localization: For use in real-time translation applications for translating conversations in different languages or translating audio media.

  • Virtual Assistant or Chatbot: For use to create a fictional brand or character voice for smart assistants in, or for, virtual web assistants, cars, home appliances, toys, control of IoT devices, navigation systems, reading out personal messages, virtual companions, or customer service scenarios.

Considerations when using Custom Neural Voice

The ability to produce synthetic media generatively, rapidly, and at scale offers unique opportunities for augmenting personal and creative expression. It also poses unprecedented challenges to public safety by making it easier to misappropriate, misinform, mislead, propagandize, or libel, while simultaneously undermining the believability of legitimate recordings and other digital artifacts. For this reason, Microsoft has established a Code of Conduct that prohibits certain uses of Custom Neural Voice.

In addition to reviewing the Code of Conduct when choosing a use case for Custom Neural Voice, take the following considerations into account:

  • Avoid photorealistic avatars with synthetic voices to represent real people - One of the key principles of the responsible use of Custom Neural Voice is to ensure that consumers understand and expect that the content they are interacting with is synthetic. Pairing a photorealistic avatar with the custom neural voice of a real person could give consumers the illusion that they are interacting with a real and known person. This would erode trust in the application and potentially cause harm to consumers.

  • Carefully consider using a synthetic voice with content that lacks editorial control - A synthetic voice can sound like a human, which could amplify the effect of fake or misleading content.

See also