Data and Privacy for Speech-to-Text


This article is provided for informational purposes only and not for the purpose of providing legal advice. We strongly recommend seeking specialist legal advice when implementing Speech Services.

This article provides high-level details about how Speech-to-Text processes data provided by customers. Audio data of humans speaking and the related text transcripts may be considered personal data and/or sensitive data under various privacy regulations and laws, because the audio contains human voices and its content may also include personal information, depending on the context in which it was collected. Audio data and the related text transcripts may also be regulated under various communications laws or other laws and regulations. As an important reminder, you are responsible for the implementation of this technology and are required to obtain all necessary permissions for processing the data, as well as any licenses, permissions, or other proprietary rights required for the content you input into the Speech-to-Text service. It is your responsibility to comply with all applicable laws and regulations in your jurisdiction.

What data does Speech-to-Text process?

Speech-to-Text processes the following types of data:

  • Audio input or voice audio: All Speech-to-Text features accept voice audio as input, which is streamed to the service endpoint through the Speech SDK or REST API. In batch transcription, audio input is sent to a storage location specified by the customer, and the Speech Service accesses and processes the audio input for the purpose of providing the requested transcription services. For more information about how to specify storage, see How to use batch transcription.

  • Input transcription text: In pronunciation assessment, a transcription is sent together with the input voice audio as the "correct" text. Pronunciation is assessed against this input transcription.

  • Transcription for speech translation: When the speech translation feature is used, transcribed text that Speech-to-Text generated is translated into a specified language through Translator Service.

The text translation service is used only to convert text from one language to another. Speech Service retains no input or output data after the translation request completes. See What is the Translator service for more information about the text translation service.

If users need the transcribed or translated text in an audio format, the feature sends the output text to Text-to-Speech (TTS). Again, no data is persisted during TTS processing.

How does Speech-to-Text process data?

Real-time Speech-to-Text

When a client application sends audio input to Speech-to-Text, the speech recognition engine parses the audio and converts it to text. Relying on its acoustic and linguistic (language understanding) features, Speech-to-Text selects candidate words and phrases that may have been uttered in the audio input. The transcription output represents the best inference or prediction, in text format, of what was spoken in the audio input.

For real-time Speech-to-Text, audio input is processed only in Azure server memory, and no data is stored at rest. All data in transit is encrypted. See Trusted Cloud: security, privacy, compliance, resiliency, and IP for more information about Azure-wide security and privacy protection.

Batch transcription

In batch transcription, customers specify the storage location of both the audio input and the output transcription text files, which Speech Services accesses to process the audio and provide the transcription output. The customer controls the storage of this data, including its retention. Customers may set a retention time for generated transcription text files by using a parameter called "timeToLive". See Batch Transcription -- Configuration Properties for more detail.
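
As a minimal sketch of how a retention time might be attached to a batch job, the Python below builds a request body only and sends nothing over the network. The "timeToLive" parameter comes from the paragraph above; the other field names (contentUrls, destinationContainerUrl, displayName, locale), the ISO 8601 duration format, and the helper names are assumptions for illustration and should be checked against the Batch Transcription configuration reference for the API version in use.

```python
from datetime import timedelta


def iso8601_hours(ttl: timedelta) -> str:
    """Format a whole-hour duration as an ISO 8601 string such as 'PT12H'."""
    total_seconds = int(ttl.total_seconds())
    if total_seconds <= 0 or total_seconds % 3600 != 0:
        raise ValueError("ttl must be a positive whole number of hours")
    return f"PT{total_seconds // 3600}H"


def build_batch_request(content_urls, locale, ttl, destination_container_url=None):
    """Build a batch transcription request body (hypothetical helper).

    Field names other than 'timeToLive' are assumed, not taken from this article.
    """
    # Retention time after which generated transcription files are deleted.
    properties = {"timeToLive": iso8601_hours(ttl)}
    if destination_container_url is not None:
        # Assumed property: customer-owned storage for the output files.
        properties["destinationContainerUrl"] = destination_container_url
    return {
        "displayName": "batch job with retention time",
        "locale": locale,
        "contentUrls": list(content_urls),
        "properties": properties,
    }


# Placeholder audio URL; a real job would point at customer-controlled storage.
request_body = build_batch_request(
    ["https://example.invalid/audio/call-001.wav"],
    "en-US",
    timedelta(hours=12),
)
```

Keeping the retention time as a timedelta until the last moment makes it harder to submit a malformed duration string; the body would then be serialized as JSON and posted to the batch transcription endpoint.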

Please see the data flows for each Speech-to-Text feature:

Data flow diagram for Speech-to-Text

Speaker Separation/Diarization

This feature is available for the Batch API only. When customers enable the speaker separation (diarization) option, which is disabled by default, the Speech-to-Text engine analyzes the audio input and extracts signals of unique voice characteristics to differentiate between two speakers. These signals are used and temporarily retained for the sole purpose of annotating the transcription output with markers next to the text for Speaker 1 or Speaker 2. Upon completion of the batch process, all signal data used to separate the speakers is discarded. The speaker separation feature supports the separation of only two speakers in a single audio file. Speaker separation does not support speaker identity recognition enrollment or the ability to track unique speakers across multiple audio files.

Language Detection

Language detection is similar to speech recognition, except that the model calculates the probabilities of mappings between phonemes and languages. Each language has specific phonemes and phoneme combinations that characterize it. The language detection model identifies these characteristics in the phonemes to calculate the likelihood of each language being spoken in the input audio.

Speech Translation

When Speech Translation is used, the audio input is first used to generate machine-transcribed text with Speech-to-Text. The machine-transcribed text is then sent to the text translation service, which converts it from the source language to another language. If customers need the translated text in an audio format, this feature can send the translated text to Text-to-Speech (TTS). Customers have the option to produce translated text only or translated voice output.

Speech Containers

With Speech Containers, customers deploy Speech Services APIs to their own environment via Docker containers. Because all speech components run in the customer-controlled environment, audio data inputs and transcription outputs are processed within the customer's container and are not sent to the cloud-based Speech Service. See Install and run Docker containers for the Speech service APIs for more information.

Security for customers' data in Speech Container

The security of customer data is a shared responsibility. For details on the security model of Azure Cognitive Services containers, such as the speech container, see Azure Cognitive Services container security.

You are responsible for securing and maintaining the equipment and infrastructure required to operate speech containers located on your premises, such as your edge device and network.

To learn more about Microsoft's privacy and security commitments visit the Microsoft Trust Center.

Data storage and retention

No data trace

When using real-time Speech-to-Text, pronunciation assessment, and speech translation, Microsoft does not retain or store the data provided by customers. In batch transcription, customers specify their own storage locations to which the audio input is sent. Generated transcription text may be stored either in the customer's own storage or, if no storage is specified, in Microsoft storage. If output transcriptions are stored in Microsoft storage, customers may delete the data either by calling a deletion API or by setting the timeToLive parameter to automatically delete the data after a specified time. See more details in How to use batch transcription - Speech service - Azure Cognitive Services.
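
For the explicit deletion route, a hedged sketch follows. It only constructs the HTTP request object and does not send it. The URL pattern, API version segment, and subscription-key header name are assumptions based on the commonly documented v3.x REST API and are not stated in this article; the function name, region, transcription ID, and key are hypothetical placeholders.

```python
import urllib.request
from urllib.parse import quote


def build_delete_request(region, transcription_id, subscription_key):
    """Build (but do not send) a DELETE request for one transcription.

    The endpoint layout and header name are assumptions; verify them against
    the REST API reference for your API version before use.
    """
    url = (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/v3.1/transcriptions/{quote(transcription_id, safe='')}"
    )
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
    )


# Placeholder values only; nothing is sent until urlopen() is called.
request = build_delete_request(
    "westus",
    "00000000-0000-0000-0000-000000000000",
    "<your-subscription-key>",
)
# To actually delete the stored transcription:
# urllib.request.urlopen(request)
```

Deleting explicitly suits cases where the retention period is not known when the job is created; otherwise a timeToLive value set at creation time avoids the need for a follow-up call.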

To learn more about Microsoft's privacy and security commitments visit the Microsoft Trust Center.