Azure Batch Speech-to-text

Accurately transcribe audio to text in more than 100 languages and variants. As part of Azure AI Speech service, Batch Transcription enables you to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results.

This connector is available in the following products and regions:

Service Class Regions
Logic Apps Standard All Logic Apps regions except the following:
     -   Azure China regions
Power Automate Standard All Power Automate regions except the following:
     -   China Cloud operated by 21Vianet
Power Apps Standard All Power Apps regions except the following:
     -   China Cloud operated by 21Vianet
Contact
Name Speech Service Power Platform Team
URL https://docs.microsoft.com/azure/cognitive-services/speech-service/support
Email speechpowerplatform@microsoft.com
Connector Metadata
Publisher Microsoft
Website https://docs.microsoft.com/azure/cognitive-services/speech-service/
Privacy policy https://privacy.microsoft.com
Categories AI;Website

The Speech Services batch transcription API is a cloud-based service that asynchronously runs batch speech recognition on the audio content you provide. This connector exposes these functions as operations in Microsoft Power Automate and Power Apps.

Pre-requisites

You will need the following to proceed:

Creating a connection

The connector supports the following authentication types:

Api Key ApiKey All regions Shareable
Azure AD Integrated Use Azure Active Directory to access your speech service. US Government (GCC) only Not shareable
Azure AD Integrated (Azure Government) Use Azure Active Directory to access your speech service. Azure Government and Department of Defense (DoD) in Azure Government and US Government (GCC-High) only Not shareable
Microsoft Entra ID Integrated Use Microsoft Entra ID to access your speech service. All regions except Azure Government and Department of Defense (DoD) in Azure Government and US Government (GCC) and US Government (GCC-High) Not shareable
Default [DEPRECATED] This option is only for older connections without an explicit authentication type, and is only provided for backward compatibility. All regions Not shareable

Api Key

Auth ID: keyBasedAuth

Applicable: All regions

ApiKey

This is a shareable connection. If the power app is shared with another user, the connection is shared as well. For more information, see the Connectors overview for canvas apps - Power Apps | Microsoft Docs

Name Type Description Required
Account Key securestring Speech service key True
Region string Speech service region (Example: eastus) True

Azure AD Integrated

Auth ID: tokenBasedAuth

Applicable: US Government (GCC) only

Use Azure Active Directory to access your speech service.

This is not a shareable connection. If the power app is shared with another user, that user will be prompted to create a new connection explicitly.

Name Type Description Required
Custom Subdomain string Custom subdomain endpoint url (Example: contoso) True

Azure AD Integrated (Azure Government)

Auth ID: tokenBasedAuth

Applicable: Azure Government and Department of Defense (DoD) in Azure Government and US Government (GCC-High) only

Use Azure Active Directory to access your speech service.

This is not a shareable connection. If the power app is shared with another user, that user will be prompted to create a new connection explicitly.

Name Type Description Required
Custom Subdomain string Custom subdomain endpoint url (Example: contoso) True

Microsoft Entra ID Integrated

Auth ID: tokenBasedAuth

Applicable: All regions except Azure Government and Department of Defense (DoD) in Azure Government and US Government (GCC) and US Government (GCC-High)

Use Microsoft Entra ID to access your speech service.

This is not a shareable connection. If the power app is shared with another user, that user will be prompted to create a new connection explicitly.

Name Type Description Required
Custom Subdomain string Custom subdomain endpoint url (Example: contoso) True

Default [DEPRECATED]

Applicable: All regions

This option is only for older connections without an explicit authentication type, and is only provided for backward compatibility.

This is not a shareable connection. If the power app is shared with another user, that user will be prompted to create a new connection explicitly.

Name Type Description Required
Account Key securestring Azure Cognitive Services for Batch Speech-to-text Account Key True
Region string Speech service region (Example: eastus) True

Throttling Limits

Name Calls Renewal Period
API calls per connection 100 60 seconds
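
A client that calls the underlying API directly can stay under this limit by tracking its own call times. The following is a minimal sketch, assuming a sliding 60-second window (the table does not say whether the service counts calls in a sliding or a fixed window). SlidingWindowLimiter and its method names are hypothetical helpers, not part of the connector.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a quota of max_calls per period seconds."""

    def __init__(self, max_calls=100, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested
        self._stamps = deque()  # times of the calls still inside the window

    def wait_time(self):
        """Return the seconds to wait before the next call (0.0 if none)."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.period - (now - self._stamps[0])

    def record(self):
        """Register that a call is being made now."""
        self._stamps.append(self.clock())
```

A caller would sleep for wait_time() seconds, call record(), and then issue the request.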

Actions

Create transcription (V3.1)

Creates a new transcription.

Delete transcription (V3.1)

Deletes the specified transcription task.

Get supported locales (V3.1)

Gets a list of supported locales for offline transcriptions.

Get transcription file (V3.1)

Gets one specific file (identified with fileId) from a transcription (identified with id).

Get transcriptions (V3.1)

Gets the transcription identified by the given ID.

Get transcriptions list (V3.1)

Gets a list of transcriptions for the authenticated subscription.

Get transcriptions list files (V3.1)

Gets the files of the transcription identified by the given ID.

Update transcription (V3.1)

Updates the mutable details of the transcription identified by its ID.

Create transcription (V3.1)

Creates a new transcription.

Parameters

Name Key Required Type Description
contentUrls
contentUrls array of uri

You can provide a list of content urls to get audio files to transcribe. Up to 1000 urls are allowed. This property will not be returned in a response.

contentContainerUrl
contentContainerUrl uri

Alternatively, you can provide a URL for an Azure blob container that contains the audio files. A container is allowed to have a maximum size of 5GB and a maximum number of 10000 blobs. The maximum size for a blob is 2.5GB. Container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response.

locale
locale True string

The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected.

displayName
displayName True string

The display name of the object.

model
self uri

The location of the referenced entity.

diarizationEnabled
diarizationEnabled boolean

A value indicating whether diarization (speaker identification) is requested. The default value is false. If only this field is set to true and the improved diarization system is not enabled by specifying DiarizationProperties, a basic diarization system will distinguish between up to two speakers. No extra charges are applied in this case. The improved diarization system provides diarization for a configurable range of speakers. It can be configured in the DiarizationProperties field. DEPRECATED: The basic diarization system is deprecated and will be removed along with the diarizationEnabled setting in the next major version of the API.

wordLevelTimestampsEnabled
wordLevelTimestampsEnabled boolean

A value indicating whether word level timestamps are requested. The default value is false.

displayFormWordLevelTimestampsEnabled
displayFormWordLevelTimestampsEnabled boolean

A value indicating whether word level timestamps for the display form are requested. The default value is false.

channels
channels array of integer

A collection of the requested channel numbers. In the default case, the channels 0 and 1 are considered.

destinationContainerUrl
destinationContainerUrl uri

The requested destination container. Remarks: When a destination container is used in combination with a timeToLive, the metadata of a transcription will be deleted normally, but the data stored in the destination container, including transcription results, will remain untouched, because no delete permissions are required for this container.
To support automatic cleanup, either configure blob lifetimes on the container, or use "Bring your own Storage (BYOS)" instead of destinationContainerUrl, where blobs can be cleaned up.

punctuationMode
punctuationMode string

The mode used for punctuation.

profanityFilterMode
profanityFilterMode string

Mode of profanity filtering.

timeToLive
timeToLive string

How long the transcription will be kept in the system after it has completed. Once the transcription reaches the time to live after completion (successful or failed) it will be automatically deleted. Not setting this value or setting it to 0 will disable automatic deletion. The longest supported duration is 31 days. The duration is encoded as ISO 8601 duration ("PnYnMnDTnHnMnS", see https://en.wikipedia.org/wiki/ISO_8601#Durations).

minCount
minCount integer

A hint for the minimum number of speakers for diarization. Must be smaller than or equal to the maxCount property.

maxCount
maxCount integer

The maximum number of speakers for diarization. Must be less than 36 and larger than or equal to the minCount property.

candidateLocales
candidateLocales True array of string

The candidate locales for language identification (example ["en-US", "de-DE", "es-ES"]). A minimum of 2 and a maximum of 10 candidate locales, including the main locale for the transcription, is supported.

speechModelMapping
speechModelMapping object

An optional mapping of locales to speech model entities. If no model is given for a locale, the default base model is used. Keys must be locales contained in the candidate locales, values are entities for models of the respective locales.

email
email string

The email address to send email notifications to in case the operation completes. The value will be removed after successfully sending the email.
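
The parameters above can be assembled into a single request body. The sketch below is illustrative only: it assumes the nesting shown in the Transcription and TranscriptionProperties definitions later on this page, and it assumes that exactly one of contentUrls and contentContainerUrl is supplied. The function name build_transcription_request and the example URL are hypothetical.

```python
import json


def build_transcription_request(display_name, locale, content_urls=None,
                                content_container_url=None, properties=None):
    """Assemble a request body from the parameters described above."""
    # Assumption: exactly one audio source is given, because
    # contentContainerUrl is described as an alternative to contentUrls.
    if bool(content_urls) == bool(content_container_url):
        raise ValueError("give either content_urls or content_container_url")
    if content_urls and len(content_urls) > 1000:
        raise ValueError("at most 1000 content URLs are allowed")

    body = {"displayName": display_name, "locale": locale}
    if content_urls:
        body["contentUrls"] = list(content_urls)
    else:
        body["contentContainerUrl"] = content_container_url
    if properties:
        body["properties"] = dict(properties)
    return body


example = build_transcription_request(
    "Weekly meeting",
    "en-US",
    content_urls=["https://example.com/audio/meeting.wav"],  # placeholder URL
    properties={
        "wordLevelTimestampsEnabled": True,
        "timeToLive": "PT12H",  # ISO 8601 duration: 12 hours
        "diarization": {"speakers": {"minCount": 1, "maxCount": 4}},
    },
)
print(json.dumps(example, indent=2))
```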

Returns

Delete transcription (V3.1)

Deletes the specified transcription task.

Parameters

Name Key Required Type Description
Id
id True uuid

The identifier of the transcription.

Get supported locales (V3.1)

Gets a list of supported locales for offline transcriptions.

Returns

Name Path Type Description
array of string

Get transcription file (V3.1)

Gets one specific file (identified with fileId) from a transcription (identified with id).

Parameters

Name Key Required Type Description
Id
id True uuid

The identifier of the transcription.

File Id
fileId True uuid

The identifier of the file.

Sas Validity In Seconds
sasValidityInSeconds integer

The duration in seconds that an SAS url should be valid. The default duration is 12 hours. When using BYOS (https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-encryption-of-data-at-rest#bring-your-own-storage-byos-for-customization-and-logging): A value of 0 means that a plain blob URI without SAS token will be generated.

Returns

Body
File

Get transcriptions (V3.1)

Gets the transcription identified by the given ID.

Parameters

Name Key Required Type Description
Id
id True uuid

The identifier of the transcription.

Returns

Get transcriptions list (V3.1)

Gets a list of transcriptions for the authenticated subscription.

Parameters

Name Key Required Type Description
Skip
skip integer

Number of datasets that will be skipped.

Top
top integer

Number of datasets that will be included after skipping.

Filter
filter string

A filtering expression for selecting a subset of the available transcriptions.

  • Supported properties: displayName, description, createdDateTime, lastActionDateTime, status, locale.
  • Operators:
    - eq, ne are supported for all properties.
    - gt, ge, lt, le are supported for createdDateTime and lastActionDateTime.
    - and, or, not are supported.
  • Example: filter=createdDateTime gt 2022-02-01T11:00:00Z
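
A filter expression of this shape can be built programmatically. The following is a small sketch: the helper name transcriptions_filter is hypothetical, quoting string values in single quotes follows the file filter example on this page, and the status value 'Succeeded' used in the usage note is an assumption, since this page does not list the possible status values.

```python
from datetime import datetime, timezone


def transcriptions_filter(created_after=None, status=None):
    """Build a filter expression from the operators listed above."""
    clauses = []
    if created_after is not None:
        if created_after.tzinfo is None:
            raise ValueError("created_after must be timezone-aware")
        # Render the timestamp in UTC, in the same form as the example above.
        stamp = created_after.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ")
        clauses.append(f"createdDateTime gt {stamp}")
    if status is not None:
        # Assumption: string values are wrapped in single quotes.
        clauses.append(f"status eq '{status}'")
    return " and ".join(clauses)
```

For instance, transcriptions_filter(datetime(2022, 2, 1, 11, 0, tzinfo=timezone.utc), "Succeeded") combines both clauses with the and operator.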

Returns

Get transcriptions list files (V3.1)

Gets the files of the transcription identified by the given ID.

Parameters

Name Key Required Type Description
Id
id True uuid

The identifier of the transcription.

Sas Validity In Seconds
sasValidityInSeconds integer

The duration in seconds that an SAS url should be valid. The default duration is 12 hours. When using BYOS (https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-encryption-of-data-at-rest#bring-your-own-storage-byos-for-customization-and-logging): A value of 0 means that a plain blob URI without SAS token will be generated.

Skip
skip integer

Number of datasets that will be skipped.

Top
top integer

Number of datasets that will be included after skipping.

Filter
filter string

A filtering expression for selecting a subset of the available files.

  • Supported properties: name, createdDateTime, kind.
  • Operators:
    - eq, ne are supported for all properties.
    - gt, ge, lt, le are supported for createdDateTime.
    - and, or, not are supported.
  • Example: filter=name eq 'myaudio.wav.json' and kind eq 'Transcription'

Returns

Update transcription (V3.1)

Updates the mutable details of the transcription identified by its ID.

Parameters

Name Key Required Type Description
Id
id True uuid

The identifier of the transcription.

self
self True uri

The location of the referenced entity.

displayName
displayName string

The name of the object.

description
description string

The description of the object.

customProperties
customProperties object

The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters and the count of allowed entries is 10.
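
The customProperties limits can be checked before sending an update. This is a minimal sketch, assuming that values are strings and that lengths are counted in characters. The function name check_custom_properties is hypothetical.

```python
def check_custom_properties(props):
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    if len(props) > 10:
        problems.append(f"{len(props)} entries (limit is 10)")
    for key, value in props.items():
        if len(key) > 64:
            problems.append(f"key {key[:16]!r}... exceeds 64 characters")
        if len(str(value)) > 256:
            problems.append(f"value for {key[:16]!r} exceeds 256 characters")
    return problems
```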

Returns

Definitions

DiarizationProperties

Name Path Type Description
speakers
speakers DiarizationSpeakersProperties

DiarizationSpeakersProperties

Name Path Type Description
minCount
minCount integer

A hint for the minimum number of speakers for diarization. Must be smaller than or equal to the maxCount property.

maxCount
maxCount integer

The maximum number of speakers for diarization. Must be less than 36 and larger than or equal to the minCount property.
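
The two constraints on the speaker counts can be enforced when the object is built. The sketch below assumes that the speaker limits in the descriptions refer to the minCount and maxCount fields of this definition. The function name diarization_properties is hypothetical.

```python
def diarization_properties(min_count, max_count):
    """Build a DiarizationProperties object after checking the constraints."""
    if max_count >= 36:
        raise ValueError("maxCount must be less than 36")
    if min_count > max_count:
        raise ValueError("minCount must be smaller than or equal to maxCount")
    return {"speakers": {"minCount": min_count, "maxCount": max_count}}
```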

File

Name Path Type Description
kind
kind FileKind

Type of data.

links
links FileLinks
createdDateTime
createdDateTime date-time

The creation time of this file. The time stamp is encoded as ISO 8601 date and time format (see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations).

properties
properties FileProperties
name
name string

The name of this file.

FileKind

Type of data.

FileLinks

Name Path Type Description
contentUrl
contentUrl uri

The url to retrieve the content of this file.

FileProperties

Name Path Type Description
size
size integer

The size of the data in bytes.

duration
duration string

The duration in case this file is an audio file. The duration is encoded as ISO 8601 duration ("PnYnMnDTnHnMnS", see https://en.wikipedia.org/wiki/ISO_8601#Durations).
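
ISO 8601 durations such as this one, and the timeToLive value used when creating a transcription, can be converted to and from Python timedelta objects. The sketch below handles only the day, hour, minute and second components; years and months are deliberately left out because their length is ambiguous. The names parse_duration and format_duration are hypothetical helpers.

```python
import re
from datetime import timedelta

# Subset of ISO 8601 durations: PnDTnHnMnS (no years, months or weeks).
_DURATION = re.compile(
    r"^P(?:(?P<d>\d+)D)?"
    r"(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(text):
    """Convert a duration such as 'PT1M30.5S' into a timedelta."""
    match = _DURATION.match(text)
    if not match or text in ("P", "PT"):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = match.groupdict(default="0")
    return timedelta(days=int(parts["d"]), hours=int(parts["h"]),
                     minutes=int(parts["m"]), seconds=float(parts["s"]))


def format_duration(delta):
    """Convert a non-negative timedelta into a string such as 'P2DT12H'."""
    total = int(delta.total_seconds())  # sub-second precision is dropped
    if total < 0:
        raise ValueError("negative durations are not supported")
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    if not date_part and not time_part:
        return "PT0S"
    return "P" + date_part + ("T" + time_part if time_part else "")
```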

LanguageIdentificationProperties

Name Path Type Description
candidateLocales
candidateLocales array of string

The candidate locales for language identification (example ["en-US", "de-DE", "es-ES"]). A minimum of 2 and a maximum of 10 candidate locales, including the main locale for the transcription, is supported.

speechModelMapping
speechModelMapping object

An optional mapping of locales to speech model entities. If no model is given for a locale, the default base model is used. Keys must be locales contained in the candidate locales, values are entities for models of the respective locales.
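
The rules for these two fields can be checked together. The sketch below reads "including the main locale" as meaning that the main locale must itself appear among the 2 to 10 candidates; that reading is an assumption. The function name language_identification is hypothetical.

```python
def language_identification(locale, candidate_locales, speech_model_mapping=None):
    """Build a LanguageIdentificationProperties object after validation."""
    # Remove duplicates while keeping the original order.
    candidates = list(dict.fromkeys(candidate_locales))
    if locale not in candidates:
        raise ValueError("the main locale must be one of the candidates")
    if not 2 <= len(candidates) <= 10:
        raise ValueError("between 2 and 10 candidate locales are supported")

    mapping = dict(speech_model_mapping or {})
    unknown = sorted(set(mapping) - set(candidates))
    if unknown:
        raise ValueError(f"mapping keys are not candidate locales: {unknown}")

    result = {"candidateLocales": candidates}
    if mapping:
        result["speechModelMapping"] = mapping
    return result
```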

PaginatedFiles

Name Path Type Description
values
values array of File

A list of entities limited by either the passed query parameters 'skip' and 'top' or their default values. When iterating through a list using pagination and deleting entities in parallel, some entities will be skipped in the results. It's recommended to build a list on the client and delete after the fetching of the complete list.

@nextLink
@nextLink uri

A link to the next set of paginated results if there are more entities available; otherwise null.
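
A PaginatedFiles page, once decoded from JSON, can be reduced to the download links of the transcription results. This is a minimal sketch, assuming the page is a plain dict shaped like the File and FileLinks definitions on this page and that result files carry the kind 'Transcription', as in the file filter example. The function name transcription_result_urls is hypothetical.

```python
def transcription_result_urls(paginated_files):
    """Return the contentUrl of every file whose kind is 'Transcription'."""
    return [
        entry["links"]["contentUrl"]
        for entry in paginated_files.get("values", [])
        if entry.get("kind") == "Transcription"
    ]
```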

PaginatedTranscriptions

Name Path Type Description
values
values array of Transcription

A list of entities limited by either the passed query parameters 'skip' and 'top' or their default values. When iterating through a list using pagination and deleting entities in parallel, some entities will be skipped in the results. It's recommended to build a list on the client and delete after the fetching of the complete list.

@nextLink
@nextLink uri

A link to the next set of paginated results if there are more entities available; otherwise null.
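
The recommendation to fetch the complete list before deleting can be sketched as follows. The code takes the page-fetching and deleting steps as callables so that it stays independent of any HTTP library. fetch_page and delete_entity are hypothetical callbacks supplied by the caller, and fetch_page is assumed to return a decoded page as a dict.

```python
def iter_entities(first_url, fetch_page):
    """Yield every entity across all pages by following '@nextLink'."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("values", [])
        url = page.get("@nextLink")  # null or missing ends the loop


def delete_all(first_url, fetch_page, delete_entity):
    """Collect the complete list first, then delete each entity."""
    entities = list(iter_entities(first_url, fetch_page))
    for entity in entities:
        delete_entity(entity)
    return len(entities)
```

Building the list before the first delete avoids the skipped entries described above, at the cost of holding all entities in memory.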

ProfanityFilterMode

Mode of profanity filtering.

PunctuationMode

The mode used for punctuation.

Transcription

Name Path Type Description
contentUrls
contentUrls array of uri

You can provide a list of content urls to get audio files to transcribe. Up to 1000 urls are allowed. This property will not be returned in a response.

contentContainerUrl
contentContainerUrl uri

Alternatively, you can provide a URL for an Azure blob container that contains the audio files. A container is allowed to have a maximum size of 5GB and a maximum number of 10000 blobs. The maximum size for a blob is 2.5GB. Container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response.

locale
locale string

The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected.

displayName
displayName string

The display name of the object.

model
model.self uri

The location of the referenced entity.

properties
properties TranscriptionProperties

TranscriptionProperties

Name Path Type Description
diarizationEnabled
diarizationEnabled boolean

A value indicating whether diarization (speaker identification) is requested. The default value is false. If only this field is set to true and the improved diarization system is not enabled by specifying DiarizationProperties, a basic diarization system will distinguish between up to two speakers. No extra charges are applied in this case. The improved diarization system provides diarization for a configurable range of speakers. It can be configured in the DiarizationProperties field. DEPRECATED: The basic diarization system is deprecated and will be removed along with the diarizationEnabled setting in the next major version of the API.

wordLevelTimestampsEnabled
wordLevelTimestampsEnabled boolean

A value indicating whether word level timestamps are requested. The default value is false.

displayFormWordLevelTimestampsEnabled
displayFormWordLevelTimestampsEnabled boolean

A value indicating whether word level timestamps for the display form are requested. The default value is false.

channels
channels array of integer

A collection of the requested channel numbers. In the default case, the channels 0 and 1 are considered.

destinationContainerUrl
destinationContainerUrl uri

The requested destination container. Remarks: When a destination container is used in combination with a timeToLive, the metadata of a transcription will be deleted normally, but the data stored in the destination container, including transcription results, will remain untouched, because no delete permissions are required for this container.
To support automatic cleanup, either configure blob lifetimes on the container, or use "Bring your own Storage (BYOS)" instead of destinationContainerUrl, where blobs can be cleaned up.

punctuationMode
punctuationMode PunctuationMode

The mode used for punctuation.

profanityFilterMode
profanityFilterMode ProfanityFilterMode

Mode of profanity filtering.

timeToLive
timeToLive string

How long the transcription will be kept in the system after it has completed. Once the transcription reaches the time to live after completion (successful or failed) it will be automatically deleted. Not setting this value or setting it to 0 will disable automatic deletion. The longest supported duration is 31 days. The duration is encoded as ISO 8601 duration ("PnYnMnDTnHnMnS", see https://en.wikipedia.org/wiki/ISO_8601#Durations).

diarization
diarization DiarizationProperties
Language Identification -
languageIdentification LanguageIdentificationProperties
email
email string

The email address to send email notifications to in case the operation completes. The value will be removed after successfully sending the email.