Create a batch transcription
With batch transcriptions, you submit audio data in a batch. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.
Important
New pricing is in effect for batch transcription that uses the speech to text REST API v3.2. For more information, see the pricing guide.
Prerequisites
You need a standard (S0) Speech resource. Free resources (F0) aren't supported.
Create a transcription job
To create a batch transcription job, use the Transcriptions_Create operation of the speech to text REST API. Construct the request body according to the following instructions:
- You must set either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
- Set the required locale property. This value should match the expected locale of the audio data to transcribe. You can't change the locale later.
- Set the required displayName property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later.
- Optionally, to use a model other than the base model, set the model property to the model ID. For more information, see Use a custom model and Use a Whisper model.
- Optionally, set the wordLevelTimestampsEnabled property to true to enable word-level timestamps in the transcription results. The default value is false. For Whisper models, set the displayFormWordLevelTimestampsEnabled property instead. Whisper is a display-only model, so the lexical field isn't populated in the transcription.
- Optionally, set the languageIdentification property. Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the languageIdentification property, then you must also set languageIdentification.candidateLocales with candidate locales.
For more information, see Request configuration options.
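The rules above can be sketched as a small Python helper that assembles and checks the request body before it's sent. This is only a sketch under stated assumptions: build_transcription_body and its parameters are hypothetical names, not part of any Speech SDK, and the helper treats contentUrls and contentContainerUrl as mutually exclusive. The property names are the ones listed in this article.

```python
import json


def build_transcription_body(locale, display_name, content_urls=None,
                             content_container_url=None, model_uri=None,
                             word_level_timestamps=False,
                             candidate_locales=None):
    """Assemble a Transcriptions_Create request body (hypothetical helper)."""
    # Assumption: exactly one audio location is given, files or a container.
    if bool(content_urls) == bool(content_container_url):
        raise ValueError("Set either contentUrls or contentContainerUrl.")
    if not locale or not display_name:
        raise ValueError("locale and displayName are required.")

    body = {"locale": locale, "displayName": display_name}
    if content_urls:
        body["contentUrls"] = list(content_urls)
    else:
        body["contentContainerUrl"] = content_container_url

    # Leave "model" out entirely to use the default base model for the locale.
    if model_uri:
        body["model"] = {"self": model_uri}

    properties = {}
    if word_level_timestamps:
        properties["wordLevelTimestampsEnabled"] = True
    if candidate_locales is not None:
        # languageIdentification is only valid together with candidateLocales.
        properties["languageIdentification"] = {
            "candidateLocales": list(candidate_locales)
        }
    if properties:
        body["properties"] = properties
    return body


body = build_transcription_body(
    locale="en-US",
    display_name="My Transcription",
    content_urls=["https://crbn.us/hello.wav"],
    word_level_timestamps=True,
    candidate_locales=["en-US", "de-DE", "es-ES"],
)
print(json.dumps(body, indent=2))
```

Building the body as a dictionary and serializing it with json.dumps also avoids hand-written JSON mistakes such as trailing commas.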
Make an HTTP POST request that uses the URI as shown in the following Transcriptions_Create example.
- Replace YourSubscriptionKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Set the request body properties as previously described.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
"contentUrls": [
"https://crbn.us/hello.wav",
"https://crbn.us/whatstheweatherlike.wav"
],
"locale": "en-US",
"displayName": "My Transcription",
"model": null,
"properties": {
"wordLevelTimestampsEnabled": true,
"languageIdentification": {
"candidateLocales": [
"en-US", "de-DE", "es-ES"
]
}
}
}' "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
You should receive a response body in the following format:
{
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/5988d691-0893-472c-851e-8e36a0fe7aaf"
},
"links": {
"files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": true,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"languageIdentification": {
"candidateLocales": [
"en-US",
"de-DE",
"es-ES"
]
}
},
"lastActionDateTime": "2024-05-21T14:18:06Z",
"status": "NotStarted",
"createdDateTime": "2024-05-21T14:18:06Z",
"locale": "en-US",
"displayName": "My Transcription"
}
The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URIs of the transcription and transcription report files. You also use this URI to update or delete a transcription.
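As a rough Python counterpart to the curl command, the following sketch builds the same POST request with the standard library and reads the two URIs out of a create response. The helper names make_create_request and transcription_uris are hypothetical, the request is only constructed and not sent, and the sample response is trimmed from the response body shown in this article.

```python
import json
import urllib.request


def make_create_request(region, subscription_key, body):
    """Build (but don't send) the Transcriptions_Create POST request."""
    url = (f"https://{region}.api.cognitive.microsoft.com"
           "/speechtotext/v3.2/transcriptions")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcription_uris(response_body):
    """Return the transcription URI and the URI that lists its files."""
    return response_body["self"], response_body["links"]["files"]


request = make_create_request(
    "YourServiceRegion",
    "YourSubscriptionKey",
    {
        "contentUrls": ["https://crbn.us/hello.wav"],
        "locale": "en-US",
        "displayName": "My Transcription",
    },
)
# To send it, something like:
#   with urllib.request.urlopen(request) as resp:
#       created = json.load(resp)

# Trimmed copy of the sample response from this article.
sample_response = {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba",
    "links": {
        "files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba/files"
    },
    "status": "NotStarted",
}
transcription_uri, files_uri = transcription_uris(sample_response)
print(transcription_uri)
print(files_uri)
```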
You can query the status of your transcriptions with the Transcriptions_Get operation.
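Because a new job starts in the NotStarted state, a client usually polls until the job finishes. The following is a minimal polling sketch, not an official client: wait_for_transcription and the get_status callable are hypothetical, and the Running, Succeeded, and Failed status values are assumed from the REST API reference (only NotStarted appears in the sample response here). The caller supplies get_status, which would issue the Transcriptions_Get request and return the status string.

```python
import time


def wait_for_transcription(get_status, poll_seconds=30, max_polls=120,
                           sleep=time.sleep):
    """Poll until the job reaches a final state, then return that state.

    get_status is a caller-supplied callable (hypothetical) that performs
    the Transcriptions_Get request and returns the "status" string.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        # Still NotStarted or Running: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError("Transcription did not finish in time.")


# Demonstration with canned statuses instead of real HTTP calls.
fake_statuses = iter(["NotStarted", "Running", "Succeeded"])
final_status = wait_for_transcription(
    lambda: next(fake_statuses),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_status)
```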
To remove transcriptions from the service, call Transcriptions_Delete regularly after you retrieve the results. Alternatively, set the timeToLive property to ensure the eventual deletion of the results.
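Both cleanup options can be sketched in Python. The helper names below are hypothetical. The sketch assumes that timeToLive is sent inside the properties object like the other options in this article, and that Transcriptions_Delete is an HTTP DELETE on the transcription's self URI. The delete request is only built, not sent.

```python
import urllib.request
from datetime import timedelta


def iso8601_duration(delta):
    """Format a timedelta as an ISO 8601 duration such as PT12H or P2D."""
    total = int(delta.total_seconds())
    if total <= 0:
        raise ValueError("timeToLive must be a positive duration.")
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    return "P" + date_part + ("T" + time_part if time_part else "")


def make_delete_request(transcription_uri, subscription_key):
    """Build (but don't send) a DELETE request for one transcription."""
    return urllib.request.Request(
        transcription_uri,
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
        method="DELETE",
    )


# Option 1: ask the service to delete the results 12 hours after creation.
properties = {"timeToLive": iso8601_duration(timedelta(hours=12))}
print(properties)

# Option 2: delete explicitly once the results are downloaded.
delete_request = make_delete_request(
    "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba",
    "YourSubscriptionKey",
)
```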
Tip
You can also try the Batch Transcription API using Python, C#, or Node.js on GitHub.
To create a transcription, use the spx batch transcription create command. Construct the request parameters according to the following instructions:
- Set the required content parameter. You can specify a comma-delimited list of individual files or the URL for an entire container. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
- Set the required language parameter. This value should match the expected locale of the audio data to transcribe. You can't change the locale later. The Speech CLI language parameter corresponds to the locale property in the JSON request and response.
- Set the required name parameter. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later. The Speech CLI name parameter corresponds to the displayName property in the JSON request and response.
Here's an example Speech CLI command that creates a transcription job:
spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav
You should receive a response body in the following format:
{
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/5988d691-0893-472c-851e-8e36a0fe7aaf"
},
"links": {
"files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked"
},
"lastActionDateTime": "2024-05-21T14:21:59Z",
"status": "NotStarted",
"createdDateTime": "2024-05-21T14:21:59Z",
"locale": "en-US",
"displayName": "My Transcription",
"description": ""
}
The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URIs of the transcription and transcription report files. You also use this URI to update or delete a transcription.
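The correspondence between the JSON properties and the Speech CLI parameters can be illustrated with a short Python sketch that turns a request body into an spx argument list. spx_create_args is a hypothetical helper. It uses only the --name, --language, --content, and --model parameters that appear in this article, and it handles contentUrls only.

```python
import shlex


def spx_create_args(body):
    """Translate a JSON-style request body into spx arguments."""
    args = [
        "spx", "batch", "transcription", "create",
        "--name", body["displayName"],   # displayName maps to --name
        "--language", body["locale"],    # locale maps to --language
        "--content", ",".join(body["contentUrls"]),
    ]
    model = body.get("model")
    if model:
        # The CLI takes the model URI directly.
        args += ["--model", model["self"]]
    return args


args = spx_create_args({
    "contentUrls": [
        "https://crbn.us/hello.wav",
        "https://crbn.us/whatstheweatherlike.wav",
    ],
    "locale": "en-US",
    "displayName": "My Transcription",
})
# shlex.join quotes arguments that contain spaces for display.
print(shlex.join(args))
```

Passing the list to subprocess.run, rather than a single string, avoids shell quoting issues with names that contain spaces.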
For Speech CLI help with transcriptions, run the following command:
spx help batch transcription
Request configuration options
Here are some property options to configure a transcription when you call the Transcriptions_Create operation. You can find more examples on the same page, such as creating a transcription with language identification.
Property | Description |
---|---|
channels | An array of channel numbers to process. Channels 0 and 1 are transcribed by default. |
contentContainerUrl | You can submit individual audio files or a whole storage container. You must specify the audio data location by using either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription. This property isn't returned in the response. |
contentUrls | You can submit individual audio files or a whole storage container. You must specify the audio data location by using either the contentContainerUrl or contentUrls property. For more information, see Locate audio files for batch transcription. This property isn't returned in the response. |
destinationContainerUrl | The result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. When the transcription job is deleted, the transcription result data is also deleted. For more information, such as the supported security scenarios, see Specify a destination container URL. |
diarization | Indicates that the Speech service should attempt diarization analysis on the input, which is expected to be a mono channel that contains multiple voices. The feature isn't available with stereo recordings. Diarization is the process of separating speakers in audio data. The batch pipeline can recognize and separate multiple speakers on mono channel recordings. Specify the minimum and maximum number of people who might be speaking. You must also set the diarizationEnabled property to true. The transcription file contains a speaker entry for each transcribed phrase. You need to use this property when you expect three or more speakers. For two speakers, setting the diarizationEnabled property to true is enough. The maximum number of speakers for diarization must be less than 36 and greater than or equal to the minSpeakers property. For an example of the property usage, see Transcriptions_Create. When this property is selected, source audio length can't exceed 240 minutes per file. Note: This property is only available with Speech to text REST API version 3.1 and later. If you set this property with any previous version, such as version 3.0, it's ignored and only two speakers are identified. |
diarizationEnabled | Specifies that the Speech service should attempt diarization analysis on the input, which is expected to be a mono channel that contains two voices. The default value is false. For three or more voices, you also need to use the diarization property. Use only with Speech to text REST API version 3.1 and later. When this property is selected, source audio length can't exceed 240 minutes per file. |
displayName | The name of the batch transcription. Choose a name that you can refer to later. The display name doesn't have to be unique. This property is required. |
displayFormWordLevelTimestampsEnabled | Specifies whether to include word-level timestamps on the display form of the transcription results. The results are returned in the displayWords property of the transcription file. The default value is false. Note: This property is only available with Speech to text REST API version 3.1 and later. |
languageIdentification | Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the languageIdentification property, then you must also set its enclosed candidateLocales property. |
languageIdentification.candidateLocales | The candidate locales for language identification, such as "properties": { "languageIdentification": { "candidateLocales": ["en-US", "de-DE", "es-ES"]}}. A minimum of two and a maximum of ten candidate locales, including the main locale for the transcription, is supported. |
locale | The locale of the batch transcription. This value should match the expected locale of the audio data to transcribe. The locale can't be changed later. This property is required. |
model | You can set the model property to use a specific base model or custom speech model. If you don't specify the model, the default base model for the locale is used. For more information, see Use a custom model and Use a Whisper model. |
profanityFilterMode | Specifies how to handle profanity in recognition results. Accepted values are None to disable profanity filtering, Masked to replace profanity with asterisks, Removed to remove all profanity from the result, or Tags to add profanity tags. The default value is Masked. |
punctuationMode | Specifies how to handle punctuation in recognition results. Accepted values are None to disable punctuation, Dictated to imply explicit (spoken) punctuation, Automatic to let the decoder deal with punctuation, or DictatedAndAutomatic to use dictated and automatic punctuation. The default value is DictatedAndAutomatic. This property isn't applicable for Whisper models. |
timeToLive | A duration after the transcription job is created, when the transcription results are automatically deleted. The value is an ISO 8601 encoded duration. For example, specify PT12H for 12 hours. As an alternative, you can call Transcriptions_Delete regularly after you retrieve the transcription results. |
wordLevelTimestampsEnabled | Specifies whether word-level timestamps should be included in the output. The default value is false. This property isn't applicable for Whisper models. Whisper is a display-only model, so the lexical field isn't populated in the transcription. |
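Several of the constraints in the table can be checked on the client before a request is sent. The following Python sketch shows one way to do that. check_options and its parameter names (including min_speakers and max_speakers) are hypothetical and don't mirror the JSON layout of the diarization property. The limits and accepted values are taken from the table.

```python
PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}
PUNCTUATION_MODES = {"None", "Dictated", "Automatic", "DictatedAndAutomatic"}


def check_options(candidate_locales=None, min_speakers=None, max_speakers=None,
                  profanity_filter_mode="Masked",
                  punctuation_mode="DictatedAndAutomatic"):
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    # Language identification: between two and ten candidate locales.
    if candidate_locales is not None and not 2 <= len(candidate_locales) <= 10:
        problems.append("candidateLocales needs between 2 and 10 locales.")
    # Diarization: maximum speakers below 36 and not below the minimum.
    if max_speakers is not None:
        if max_speakers >= 36:
            problems.append("The maximum number of speakers must be less than 36.")
        if min_speakers is not None and max_speakers < min_speakers:
            problems.append("The maximum number of speakers must be >= the minimum.")
    if profanity_filter_mode not in PROFANITY_MODES:
        problems.append(f"Unknown profanityFilterMode: {profanity_filter_mode}")
    if punctuation_mode not in PUNCTUATION_MODES:
        problems.append(f"Unknown punctuationMode: {punctuation_mode}")
    return problems


print(check_options(candidate_locales=["en-US", "de-DE", "es-ES"]))
print(check_options(candidate_locales=["en-US"], min_speakers=3, max_speakers=40))
```

The service remains the authority on what it accepts; a check like this only catches obvious mistakes earlier.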
For Speech CLI help with transcription configuration options, run the following command:
spx help batch transcription create advanced
Use a custom model
Batch transcription uses the default base model for the locale that you specify. You don't need to set any properties to use the default base model.
Optionally, you can modify the previous create transcription example by setting the model property to use a specific base model or custom speech model.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
"contentUrls": [
"https://crbn.us/hello.wav",
"https://crbn.us/whatstheweatherlike.wav"
],
"locale": "en-US",
"displayName": "My Transcription",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/5988d691-0893-472c-851e-8e36a0fe7aaf"
},
"properties": {
"wordLevelTimestampsEnabled": true
}
}' "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/5988d691-0893-472c-851e-8e36a0fe7aaf"
To use a custom speech model for batch transcription, you need the model's URI. The top-level self property in the response body is the model's URI. You can retrieve the model location when you create or get a model. For more information, see the JSON response example in Create a model.
Tip
A hosted deployment endpoint isn't required to use custom speech with the batch transcription service. You can conserve resources if you use the custom speech model only for batch transcription.
Batch transcription requests for expired models fail with a 4xx error. Set the model property to a base model or custom model that isn't expired. Otherwise, don't include the model property, so that the latest base model is always used. For more information, see Choose a model and Custom speech model lifecycle.
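To avoid submitting a job against an expired model, you could check the model's deprecation date first. The following is a sketch under an assumption: it treats the transcriptionDateTime value under properties.deprecationDates (a field that appears in the model JSON returned by the service) as the date after which the model can no longer be used for transcription. model_is_expired is a hypothetical helper.

```python
from datetime import datetime, timezone


def model_is_expired(model, now=None):
    """Return True if the model's transcription deprecation date has passed."""
    now = now or datetime.now(timezone.utc)
    expires = model["properties"]["deprecationDates"]["transcriptionDateTime"]
    # Timestamps look like 2026-04-15T00:00:00Z (UTC).
    expires_at = datetime.strptime(expires, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return now >= expires_at


model = {
    "properties": {
        "deprecationDates": {
            "adaptationDateTime": "2025-04-15T00:00:00Z",
            "transcriptionDateTime": "2026-04-15T00:00:00Z",
        }
    }
}
print(model_is_expired(model, now=datetime(2025, 1, 1, tzinfo=timezone.utc)))
print(model_is_expired(model, now=datetime(2027, 1, 1, tzinfo=timezone.utc)))
```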
Use a Whisper model
Azure AI Speech supports OpenAI's Whisper model through the batch transcription API.
Note
Azure OpenAI Service also supports OpenAI's Whisper model for speech to text with a synchronous REST API. To learn more, see Speech to text with the Azure OpenAI Whisper model. For more information about when to use Azure AI Speech vs. Azure OpenAI Service, see What is the Whisper model?
To use a Whisper model for batch transcription, you need to set the model property. Whisper is a display-only model, so the lexical field isn't populated in the response.
Important
For Whisper models, you should always use version 3.2 of the speech to text API.
Batch transcription using Whisper models is supported in the Australia East, Central US, East US, North Central US, South Central US, Southeast Asia, and West Europe regions.
You can make a Models_ListBaseModels request to get available base models for all locales.
Make an HTTP GET request as shown in the following example for the eastus region. Replace YourSubscriptionKey with your Speech resource key. Replace eastus if you're using a different region.
curl -v -X GET "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base" -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey"
By default, only the 100 oldest base models are returned. Use the skip and top query parameters to page through the results. For example, the following request returns the next 100 base models after the first 100.
curl -v -X GET "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base?skip=100&top=100" -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey"
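The paging pattern generalizes to any number of pages. This Python sketch generates one URL per page by using the skip and top query parameters. base_model_page_urls is a hypothetical helper, and total is an assumed count used only for illustration. In practice you would typically keep requesting pages until one comes back with fewer items than the page size.

```python
from urllib.parse import urlencode


def base_model_page_urls(region, total, page_size=100):
    """Yield one Models_ListBaseModels URL per page, using skip and top."""
    base = (f"https://{region}.api.cognitive.microsoft.com"
            "/speechtotext/v3.2/models/base")
    for skip in range(0, total, page_size):
        yield f"{base}?{urlencode({'skip': skip, 'top': page_size})}"


urls = list(base_model_page_urls("eastus", total=250))
for url in urls:
    print(url)
```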
Make sure that you set the configuration variables for a Speech resource in one of the supported regions. You can run the spx csr list --base command to get available base models for all locales.
spx csr list --base --api-version v3.2
The displayName property of a Whisper model contains "Whisper" as shown in this example. Whisper is a display-only model, so the lexical field isn't populated in the transcription.
{
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/e418c4a9-9937-4db7-b2c9-8afbff72d950",
"links": {
"manifest": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/e418c4a9-9937-4db7-b2c9-8afbff72d950/manifest"
},
"properties": {
"deprecationDates": {
"adaptationDateTime": "2025-04-15T00:00:00Z",
"transcriptionDateTime": "2026-04-15T00:00:00Z"
},
"features": {
"supportsTranscriptions": true,
"supportsEndpoints": false,
"supportsTranscriptionsOnSpeechContainers": false,
"supportsAdaptationsWith": [
"Acoustic"
],
"supportedOutputFormats": [
"Display"
]
},
"chargeForAdaptation": true
},
"lastActionDateTime": "2024-02-29T15:53:28Z",
"status": "Succeeded",
"createdDateTime": "2024-02-29T15:46:07Z",
"locale": "en-US",
"displayName": "20240228 Whisper Large V2",
"description": "OpenAI Whisper Model in Azure AI Speech (Whisper v2-large)"
}
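Given a list of base model entries like the one above, picking out the Whisper models is a matter of filtering on displayName. The following Python sketch assumes the entries have already been downloaded and parsed into dictionaries. find_whisper_models is a hypothetical helper, and the second entry in the sample list uses a made-up placeholder name purely to show a non-matching model.

```python
def find_whisper_models(models, locale=None):
    """Return (displayName, self URI) pairs for models named "Whisper"."""
    matches = []
    for model in models:
        if "Whisper" not in model.get("displayName", ""):
            continue
        if locale and model.get("locale") != locale:
            continue
        matches.append((model["displayName"], model["self"]))
    return matches


models = [
    {
        "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/e418c4a9-9937-4db7-b2c9-8afbff72d950",
        "locale": "en-US",
        "displayName": "20240228 Whisper Large V2",
    },
    {
        # Placeholder display name, invented for this illustration.
        "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/5988d691-0893-472c-851e-8e36a0fe7aaf",
        "locale": "en-US",
        "displayName": "Example base model (placeholder name)",
    },
]
whisper_models = find_whisper_models(models, locale="en-US")
for name, uri in whisper_models:
    print(name, uri)
```

The self URI of a match is the value to put in the model property of the transcription request.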
Set the full model URI as shown in this example for the eastus region. Replace YourSubscriptionKey with your Speech resource key. Replace eastus if you're using a different region.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
"contentUrls": [
"https://crbn.us/hello.wav",
"https://crbn.us/whatstheweatherlike.wav"
],
"locale": "en-US",
"displayName": "My Transcription",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/e418c4a9-9937-4db7-b2c9-8afbff72d950"
},
"properties": {
"displayFormWordLevelTimestampsEnabled": true
}
}' "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
Set the full model URI as shown in this example for the eastus region. Replace eastus if you're using a different region.
spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav,https://crbn.us/whatstheweatherlike.wav --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/e418c4a9-9937-4db7-b2c9-8afbff72d950" --api-version v3.2
Specify a destination container URL
The transcription result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. In that case, when the transcription job is deleted, the transcription result data is also deleted.
You can store the results of a batch transcription in a writable Azure Blob storage container by using the destinationContainerUrl option in the batch transcription creation request. This option supports only an ad hoc SAS URI. It doesn't support the Trusted Azure services security mechanism or access policy based SAS. The Storage account resource of the destination container must allow all external traffic.
If you want to store the transcription results in an Azure Blob storage container by using the Trusted Azure services security mechanism, consider using Bring-your-own-storage (BYOS). For more information, see Use the Bring your own storage (BYOS) Speech resource for speech to text.
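Before you submit a job, a quick client-side look at the SAS URI can catch the unsupported cases described above. The following Python sketch is a heuristic only and doesn't validate the token itself. check_destination_sas is a hypothetical helper. It relies on the standard Azure Storage SAS query parameters: sig for the signature, sp for permissions, and si for a stored access policy identifier. The URLs are placeholders.

```python
from urllib.parse import parse_qs, urlsplit


def check_destination_sas(url):
    """Return a list of likely problems with a destinationContainerUrl value."""
    query = parse_qs(urlsplit(url).query)
    problems = []
    if "sig" not in query:
        problems.append("No SAS signature (sig) found; an ad hoc SAS URI is required.")
    if "si" in query:
        # si points at a stored access policy, which this option doesn't support.
        problems.append("The SAS uses a stored access policy (si), which isn't supported.")
    elif "w" not in "".join(query.get("sp", [])):
        problems.append("The SAS permissions (sp) don't include write (w).")
    return problems


ad_hoc = "https://example.blob.core.windows.net/results?sv=2022-11-02&sr=c&sp=rwl&sig=PLACEHOLDER"
policy_based = "https://example.blob.core.windows.net/results?sv=2022-11-02&sr=c&si=mypolicy&sig=PLACEHOLDER"
print(check_destination_sas(ad_hoc))
print(check_destination_sas(policy_based))
```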