Data, privacy, and security for Custom Neural Voice

This article provides details about how Custom Neural Voice data provided by you is processed, used, and stored. As an important reminder, you are responsible for your use and implementation of this technology. You are required to obtain all necessary permissions from voice talents for the processing of their voice data to develop a synthetic voice, as well as any licenses, permissions, or other proprietary rights required for the content you input into the text-to-speech (“TTS”) service, part of Speech in Azure Cognitive Services, to generate audio content in the synthetic voice. Some jurisdictions may impose special legal requirements for the collection, processing, and storage of certain categories of data, such as biometric data, and may mandate disclosing the use of synthetic voices to users. Before using Custom Neural Voice and the TTS service for the processing and storage of data and the creation of synthetic speech, you must ensure compliance with any such legal requirements that apply to you.

What data do Custom Neural Voice and TTS process?

Custom Neural Voice processes the following types of data:

  • Recorded statement file of voice talent. When using the Speech Studio, customers are required to upload a recorded statement of the voice talent that acknowledges that their voice will be used by the customer to create synthetic voice(s).


    When preparing your recording script, make sure you include the statement sentence to acquire the voice talent's acknowledgment. You can find the statement in multiple languages here. The language of the verbal statement must be the same as that of your recording.

  • Training data (including audio files and related text transcripts). This includes audio recordings from the voice talent who has agreed to the use of their voice for model training, and the related text transcripts. You can provide your own text transcriptions of the audio or use the automated speech recognition transcription feature available within the Speech Studio to generate a text transcription of the audio. Both the audio recordings and the text transcription files are used as the voice model training data.

  • Text as the test script. You can upload your own text-based scripts to evaluate and test the quality of the custom voice model by generating speech synthesis audio samples.

  • Text input for speech synthesis. This is the text you select and send to TTS to generate audio content using your custom neural voice.

How do Custom Neural Voice and TTS process data?

The diagram below illustrates how your data is processed. This diagram covers three different types of processing: how Microsoft verifies voice files of the voice talent prior to the custom neural voice model training, how Microsoft creates a custom neural voice model with your training data, and how TTS processes your text input to generate audio content.

How Custom Neural Voice processes data

Voice file verification

Microsoft requires customers to upload to the Speech Studio an audio file with a recorded statement from their voice talent acknowledging the customer's use of their voice. Microsoft may use its speech-to-text/speech recognition technology to transcribe this recorded statement to text and verify that the content in the recording matches the pre-defined script provided by Microsoft. This audio statement, along with the description information you provide with the audio, is used to create a voice talent profile. You must associate training data with the relevant voice talent profile when initiating custom neural voice training.
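Conceptually, this verification step amounts to comparing a normalized speech-to-text transcription against the pre-defined statement script. The sketch below is purely illustrative (the statement text, `normalize` rules, and `statement_matches` helper are assumptions, not Microsoft's actual implementation):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def statement_matches(transcription: str, expected_statement: str) -> bool:
    """Check whether the transcribed recording matches the pre-defined script."""
    return normalize(transcription) == normalize(expected_statement)

# Illustrative statement wording and speaker name (hypothetical).
expected = ("I, Jane Doe, am aware that recordings of my voice will be used by "
            "Contoso to create and use a synthetic version of my voice.")
transcribed = ("I Jane Doe am aware that recordings of my voice will be used by "
               "Contoso to create and use a synthetic version of my voice")
print(statement_matches(transcribed, expected))  # → True
```

A real speech-recognition pipeline would also need to tolerate minor transcription errors, which a strict equality check like this does not.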

Microsoft may process biometric voice signatures from the recorded voice statement file of the voice talent and from randomly selected audio files in the training datasets in order to ascertain that the voice signature in each of the audio recordings matches the same speaker with reasonable confidence, using the Speaker Verification feature of Speech in Azure Cognitive Services. A voice signature, also called a "voice template" or "voiceprint," is a numeric vector that represents an individual's voice characteristics, extracted from audio recordings of that person speaking. This technical safeguard is intended to help prevent misuse of Custom Neural Voice by, for example, preventing customers from training voice models with someone's audio recordings and using them to spoof that voice without the speaker's knowledge or consent.
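Since a voice signature is a numeric vector, comparing two signatures is conceptually a vector-similarity test. The toy sketch below illustrates the idea with cosine similarity; the vectors, threshold, and function names are invented for illustration and are not the Speaker Verification feature's actual algorithm:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two voice-signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(sig_statement, sig_training, threshold=0.85):
    """Accept a training recording only if its signature is close enough to
    the signature extracted from the recorded statement (threshold is made up)."""
    return cosine_similarity(sig_statement, sig_training) >= threshold

# Hypothetical 3-dimensional signatures (real voiceprints are much larger).
print(same_speaker([0.1, 0.9, 0.2], [0.12, 0.88, 0.21]))  # → True
```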

The voice signatures are used by Microsoft solely for the purposes of speaker verification or as otherwise necessary to investigate misuse of the services.

The Online Services Data Protection Addendum (“DPA”) sets forth customers' and Microsoft's obligations with respect to the processing and security of Customer Data and Personal Data in connection with Azure, and is incorporated by reference into customers' enterprise agreements for Azure services. Microsoft's data processing described in this section is governed under the legitimate business operations section of the DPA.

Training a custom neural voice model

The training data (speech audio) submitted to the Speech Studio is pre-processed using automated quality-checking tools, including data format checks, pronunciation scoring, noise detection, and script mapping. The training data is then imported into the model training component of the custom voice platform. During the training process, the training data (both voice audio and text transcriptions) is decomposed into fine-grained mappings between voice acoustics and text, such as sequences of phonemes. Through further complex machine learning modeling, these mappings are built into a voice model, which can then generate speech that sounds like the voice talent, including in languages different from the recording. The voice model is a text-to-speech computer model that can mimic the unique vocal characteristics of a target speaker. It consists of a set of parameters in a binary format that is not human readable and does not contain audio recordings.
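The decomposition into phoneme sequences can be pictured as a lexicon lookup that turns a transcript into the sound units the model learns to map to acoustics. This is a toy illustration only; the lexicon, ARPAbet-style symbols, and `to_phonemes` helper are assumptions, not the actual training pipeline:

```python
# Toy word-to-phoneme lexicon (ARPAbet-style symbols; illustrative only).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text: str):
    """Decompose a transcript into a flat phoneme sequence via lexicon lookup.
    Out-of-vocabulary words get a placeholder token."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, ["<UNK>"]))
    return phonemes

print(to_phonemes("Hello world"))  # → ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

A production system would instead use grapheme-to-phoneme models and align each phoneme with a time span in the audio, which is what "fine-grained mappings between voice acoustics and text" refers to.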

Customer’s training data is only used to develop customer’s custom voice model and is not used by Microsoft to train or improve any Microsoft TTS voice models.

Speech synthesis/audio content generation

Once the voice model is created, you can use it to create audio content through the TTS service with two different options.

For real-time speech synthesis, you send the input text to the TTS service via the TTS SDK or RESTful API. TTS processes the input text and returns output audio content files in real time to your application that made the request.
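The general shape of a real-time TTS REST request looks like the sketch below: an SSML body posted with a subscription key and an output-format header. Note the caveats: the code only constructs the request (it does not send it), the voice name is hypothetical, and a deployed custom neural voice is addressed through the deployment-specific endpoint shown in Speech Studio rather than this generic endpoint:

```python
def build_tts_request(region: str, subscription_key: str, voice_name: str, text: str):
    """Assemble URL, headers, and SSML body for a real-time TTS request.
    The request is only constructed here, never sent."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
    }
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice_name}'>{text}</voice>"
        "</speak>"
    )
    return url, headers, ssml

# Hypothetical custom voice name and placeholder key.
url, headers, ssml = build_tts_request("eastus", "<your-key>", "MyCustomVoiceNeural", "Hello")
```

Sending `ssml` as the POST body to `url` with those headers returns the synthesized audio bytes in the response.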

For asynchronous synthesis of long audio (batch synthesis), you submit input text files to the TTS batch service via the Long Audio API to asynchronously create audio longer than 10 minutes (for example, audiobooks or lectures). Unlike synthesis performed with the real-time text-to-speech API, responses aren't returned in real time with the Long Audio API. Audio is created asynchronously, and you can access and download the synthesized audio files once they are made available by the batch synthesis service.
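The asynchronous flow follows a standard submit-then-poll pattern: submit the job, poll its status until it finishes, then download the result. The sketch below shows only that generic polling loop; `get_status` stands in for a hypothetical Long Audio API status call and the status strings are assumptions:

```python
import time

def wait_for_batch_result(get_status, poll_seconds: float = 30):
    """Generic polling loop for an asynchronous synthesis job.
    `get_status` is a callable standing in for a batch-status API call
    (hypothetical); returns the terminal status string."""
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        time.sleep(poll_seconds)  # back off between status checks

# Simulated status sequence in place of real API responses.
statuses = iter(["NotStarted", "Running", "Succeeded"])
print(wait_for_batch_result(lambda: next(statuses), poll_seconds=0))  # → Succeeded
```

In a real client you would issue the status call over HTTP and, on success, download the audio from the URL the service returns.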

You can also use your custom voice to generate audio content through a no-code Audio Content Creation tool, and choose to save your text input or output audio content with the tool in Azure storage.

Data processing for Custom Neural Voice Lite

Custom Neural Voice Lite is a project type in public preview that allows you to record 20-50 voice samples in Speech Studio and create a lightweight custom voice model for demonstration and evaluation purposes. Both the recording script and the testing script are pre-defined by Microsoft. The synthetic voice model created using a Custom Neural Voice Lite project can be deployed and used at your discretion after you apply for, and are granted, full access to Custom Neural Voice.

The synthetic voice and the related audio recordings submitted via the Speech Studio are automatically deleted from the Speech Studio portal within 90 days unless you decide to deploy the synthetic voice, in which case you control the duration of its retention. If the voice talent would like the synthetic voice and the related audio recordings deleted before the 90 days elapse, they can delete them on the portal directly or contact their enterprise to do so.

Before you can deploy the synthetic voice model created using a Custom Neural Voice Lite project, the voice talent must provide an additional audio recording acknowledging that the synthetic voice will be used by their enterprise beyond demonstration and evaluation purposes.

Data storage and retention

Recorded statement and Speaker Verification data: The voice signatures are used by Microsoft solely for the purposes of speaker verification or as otherwise necessary to investigate misuse of the services. The voice signatures are retained only for the duration necessary to perform such speaker verification, which may occur from time to time. Microsoft may require this verification before allowing you to train or retrain custom voice models in the Speech Studio, or as otherwise necessary. Microsoft will retain the recorded statement file and voice talent profile data for as long as necessary to preserve the security and integrity of Speech in Azure Cognitive Services.

Custom Neural Voice models: While you maintain the exclusive usage rights to your Custom Neural Voice model, Microsoft may independently retain a copy of Custom Neural Voice models for as long as necessary. Microsoft may use your Custom Neural Voice model for the sole purpose of protecting the security and integrity of Microsoft Azure Cognitive Services.

Microsoft will secure and store a copy of Voice Talent's recorded statement and Custom Neural Voice models with the same high level security that it uses for its other Azure Services. Learn more at Microsoft Trust Center.

Training data: You submit voice training data to generate voice models via Speech Studio, and it is retained and stored by default in Azure storage (see Azure Storage encryption for data at rest for details). You can access and delete any of the training data used to build the voice models via Speech Studio.

You can manage storage of your training data via BYOS (Bring Your Own Storage). With this storage method, training data may be accessed only for the purposes of voice model training and will otherwise be stored via BYOS.

Text input for speech synthesis: Microsoft does not retain or store the text that you provide to the real-time synthesis TTS API. Scripts provided via the Long Audio API for TTS are stored in Azure storage to process the batch synthesis request. The input text can be deleted via the delete API at any time.

Output audio content: Microsoft does not store the audio content generated with the real-time synthesis API. If you are using the Long Audio API, the output audio content is stored in Azure storage. These audio files can be removed at any time via the delete operation.

To learn more about Microsoft's privacy and security commitments visit the Microsoft Trust Center.

See also