Create a batch transcription
Important
New pricing is in effect for batch transcription via Speech to text REST API v3.2. For more information, see the pricing guide.
With batch transcriptions, you submit the audio data, and then retrieve transcription results asynchronously. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.
Note
To use batch transcription, you need to use a standard (S0) Speech resource. Free resources (F0) aren't supported. For more information, see pricing and limits.
Create a transcription job
To create a transcription, use the Transcriptions_Create operation of the Speech to text REST API. Construct the request body according to the following instructions:
- You must set either the `contentContainerUrl` or `contentUrls` property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
- Set the required `locale` property. This should match the expected locale of the audio data to transcribe. The locale can't be changed later.
- Set the required `displayName` property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later.
- Optionally, to use a model other than the base model, set the `model` property to the model ID. For more information, see Using custom models and Using Whisper models.
- Optionally, set the `wordLevelTimestampsEnabled` property to `true` to enable word-level timestamps in the transcription results. The default value is `false`.
- Optionally, set the `languageIdentification` property. Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the `languageIdentification` property, then you must also set `languageIdentification.candidateLocales` with candidate locales.
For more information, see request configuration options.
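The either/or rule for `contentUrls` and `contentContainerUrl`, plus the required and optional properties above, can be assembled programmatically before you send the request. Here's a minimal Python sketch; the helper name and signature are hypothetical (not part of any Speech SDK), but the property names match the REST API request body:

```python
import json

def build_transcription_request(locale, display_name, content_urls=None,
                                content_container_url=None, model_id=None,
                                word_level_timestamps=False, candidate_locales=None):
    """Assemble a Transcriptions_Create request body (hypothetical helper)."""
    # Exactly one of contentUrls / contentContainerUrl must be set.
    if bool(content_urls) == bool(content_container_url):
        raise ValueError("Set exactly one of contentUrls or contentContainerUrl.")
    body = {"locale": locale, "displayName": display_name,
            "model": None, "properties": {}}
    if content_urls:
        body["contentUrls"] = list(content_urls)
    else:
        body["contentContainerUrl"] = content_container_url
    if model_id:
        body["model"] = {"self": model_id}
    if word_level_timestamps:
        body["properties"]["wordLevelTimestampsEnabled"] = True
    if candidate_locales:
        body["properties"]["languageIdentification"] = {
            "candidateLocales": list(candidate_locales)}
    return json.dumps(body)
```

Serializing with `json.dumps` also avoids the trailing-comma mistakes that are easy to make when writing the JSON body by hand.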
Make an HTTP POST request using the URI as shown in the following Transcriptions_Create example. Replace `YourSubscriptionKey` with your Speech resource key, replace `YourServiceRegion` with your Speech resource region, and set the request body properties as previously described.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
"contentUrls": [
"https://crbn.us/hello.wav",
"https://crbn.us/whatstheweatherlike.wav"
],
"locale": "en-US",
"displayName": "My Transcription",
"model": null,
"properties": {
"wordLevelTimestampsEnabled": true,
"languageIdentification": {
"candidateLocales": [
"en-US", "de-DE", "es-ES"
]
}
}
}' "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
You should receive a response body in the following format:
{
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/13fb305e-09ad-4bce-b3a1-938c9124dda3"
},
"links": {
"files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": true,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"languageIdentification": {
"candidateLocales": [
"en-US",
"de-DE",
"es-ES"
]
}
},
"lastActionDateTime": "2022-10-21T14:18:06Z",
"status": "NotStarted",
"createdDateTime": "2022-10-21T14:18:06Z",
"locale": "en-US",
"displayName": "My Transcription"
}
The top-level `self` property in the response body is the transcription's URI. Use this URI to get details such as the URI of the transcription and transcription report files. You also use this URI to update or delete a transcription.
You can query the status of your transcriptions with the Transcriptions_Get operation.
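A client typically polls Transcriptions_Get until the job reaches a terminal state. The sketch below captures just the decision logic, assuming the status values shown in the response examples in this article (`NotStarted`, `Running`, `Succeeded`, `Failed`); the network call itself is omitted:

```python
# Terminal states: once reached, the job's status no longer changes.
TERMINAL_STATUSES = {"Succeeded", "Failed"}
KNOWN_STATUSES = {"NotStarted", "Running"} | TERMINAL_STATUSES

def should_keep_polling(status: str) -> bool:
    """Return True while the transcription job is still queued or running."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"Unexpected transcription status: {status!r}")
    return status not in TERMINAL_STATUSES
```

In a real poller, pair this with a delay between requests so you don't hammer the service while the job is queued.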
To remove transcriptions from the service, call the Transcriptions_Delete operation regularly after you retrieve the results. Alternatively, set the `timeToLive` property to ensure the eventual deletion of the results.
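The `timeToLive` value is an ISO 8601 duration (for example, `PT12H` for 12 hours). A small helper like the following, which covers only the day/hour subset of the format used in these examples, can keep the string well-formed:

```python
def time_to_live(hours: int = 0, days: int = 0) -> str:
    """Format an ISO 8601 duration (days/hours only) for the timeToLive property."""
    if days < 0 or hours < 0 or (days == 0 and hours == 0):
        raise ValueError("Specify a positive duration.")
    if days and hours:
        return f"P{days}DT{hours}H"   # e.g. P1DT6H
    if days:
        return f"P{days}D"            # e.g. P2D
    return f"PT{hours}H"              # e.g. PT12H
```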
To create a transcription, use the `spx batch transcription create` command. Construct the request parameters according to the following instructions:
- Set the required `content` parameter. You can specify either a semicolon-delimited list of individual files or the URL for an entire container. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
- Set the required `language` parameter. This should match the expected locale of the audio data to transcribe. The locale can't be changed later. The Speech CLI `language` parameter corresponds to the `locale` property in the JSON request and response.
- Set the required `name` parameter. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later. The Speech CLI `name` parameter corresponds to the `displayName` property in the JSON request and response.
Here's an example Speech CLI command that creates a transcription job:
spx batch transcription create --name "My Transcription" --language "en-US" --content "https://crbn.us/hello.wav;https://crbn.us/whatstheweatherlike.wav"
You should receive a response body in the following format:
{
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/13fb305e-09ad-4bce-b3a1-938c9124dda3"
},
"links": {
"files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked"
},
"lastActionDateTime": "2022-10-21T14:21:59Z",
"status": "NotStarted",
"createdDateTime": "2022-10-21T14:21:59Z",
"locale": "en-US",
"displayName": "My Transcription",
"description": ""
}
The top-level `self` property in the response body is the transcription's URI. Use this URI to get details such as the URI of the transcription and transcription report files. You also use this URI to update or delete a transcription.
For Speech CLI help with transcriptions, run the following command:
spx help batch transcription
Request configuration options
Here are some property options that you can use to configure a transcription when you call the Transcriptions_Create operation.
| Property | Description |
|---|---|
| `channels` | An array of channel numbers to process. Channels 0 and 1 are transcribed by default. |
| `contentContainerUrl` | You can submit individual audio files or a whole storage container. You must specify the audio data location by using either the `contentContainerUrl` or `contentUrls` property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription. This property isn't returned in the response. |
| `contentUrls` | You can submit individual audio files or a whole storage container. You must specify the audio data location by using either the `contentContainerUrl` or `contentUrls` property. For more information, see Locate audio files for batch transcription. This property isn't returned in the response. |
| `destinationContainerUrl` | The result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. When the transcription job is deleted, the transcription result data is also deleted. For more information, such as the supported security scenarios, see Destination container URL. |
| `diarization` | Indicates that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains multiple voices. Specify the minimum and maximum number of people who might be speaking. You must also set the `diarizationEnabled` property to `true`. The transcription file then contains a `speaker` entry for each transcribed phrase. Use this property when you expect three or more speakers. For two speakers, setting the `diarizationEnabled` property to `true` is enough. For an example of the property usage, see the Transcriptions_Create operation description. Diarization is the process of separating speakers in audio data. The batch pipeline can recognize and separate multiple speakers on mono channel recordings. The maximum number of speakers for diarization must be less than 36 and greater than or equal to the `minSpeakers` property. The feature isn't available with stereo recordings. When this property is set, the source audio length can't exceed 240 minutes per file. Note: This property is only available with Speech to text REST API version 3.1 and later. |
| `diarizationEnabled` | Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains two voices. The default value is `false`. For three or more voices, you also need to use the `diarization` property (only with Speech to text REST API version 3.1 and later). When this property is set, the source audio length can't exceed 240 minutes per file. |
| `displayName` | The name of the batch transcription. Choose a name that you can refer to later. The display name doesn't have to be unique. This property is required. |
| `displayFormWordLevelTimestampsEnabled` | Specifies whether to include word-level timestamps on the display form of the transcription results. The results are returned in the `displayWords` property of the transcription file. The default value is `false`. Note: This property is only available with Speech to text REST API version 3.1 and later. |
| `languageIdentification` | Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the `languageIdentification` property, then you must also set its enclosed `candidateLocales` property. |
| `languageIdentification.candidateLocales` | The candidate locales for language identification, such as `"properties": { "languageIdentification": { "candidateLocales": ["en-US", "de-DE", "es-ES"]}}`. A minimum of 2 and a maximum of 10 candidate locales, including the main locale for the transcription, is supported. |
| `locale` | The locale of the batch transcription. This should match the expected locale of the audio data to transcribe. The locale can't be changed later. This property is required. |
| `model` | You can set the `model` property to use a specific base model or Custom Speech model. If you don't specify the `model`, the default base model for the locale is used. For more information, see Using custom models and Using Whisper models. |
| `profanityFilterMode` | Specifies how to handle profanity in recognition results. Accepted values are `None` to disable profanity filtering, `Masked` to replace profanity with asterisks, `Removed` to remove all profanity from the result, or `Tags` to add profanity tags. The default value is `Masked`. |
| `punctuationMode` | Specifies how to handle punctuation in recognition results. Accepted values are `None` to disable punctuation, `Dictated` to imply explicit (spoken) punctuation, `Automatic` to let the decoder deal with punctuation, or `DictatedAndAutomatic` to use dictated and automatic punctuation. The default value is `DictatedAndAutomatic`. |
| `timeToLive` | A duration after the transcription job is created, after which the transcription results are automatically deleted. The value is an ISO 8601 encoded duration. For example, specify `PT12H` for 12 hours. As an alternative, you can call Transcriptions_Delete regularly after you retrieve the transcription results. |
| `wordLevelTimestampsEnabled` | Specifies whether word-level timestamps should be included in the output. The default value is `false`. This property isn't applicable for Whisper models. Whisper is a display-only model, so the lexical field isn't populated in the transcription. |
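The `languageIdentification.candidateLocales` constraints in the table (2 to 10 candidates, including the main locale) are easy to validate before submitting the job. A minimal sketch, with a hypothetical helper name:

```python
def validate_candidate_locales(candidates, main_locale):
    """Check the documented languageIdentification constraints:
    2-10 candidate locales, including the transcription's main locale."""
    if not 2 <= len(candidates) <= 10:
        raise ValueError("languageIdentification requires 2 to 10 candidate locales.")
    if main_locale not in candidates:
        raise ValueError("Candidate locales must include the transcription's main locale.")
    return True
```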
For Speech CLI help with transcription configuration options, run the following command:
spx help batch transcription create advanced
Using custom models
Batch transcription uses the default base model for the locale that you specify. You don't need to set any properties to use the default base model.
Optionally, you can modify the previous create transcription example by setting the `model` property to use a specific base model or Custom Speech model.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
"contentUrls": [
"https://crbn.us/hello.wav",
"https://crbn.us/whatstheweatherlike.wav"
],
"locale": "en-US",
"displayName": "My Transcription",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/1aae1070-7972-47e9-a977-87e3b05c457d"
},
"properties": {
"wordLevelTimestampsEnabled": true
}
}' "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
spx batch transcription create --name "My Transcription" --language "en-US" --content "https://crbn.us/hello.wav;https://crbn.us/whatstheweatherlike.wav" --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/1aae1070-7972-47e9-a977-87e3b05c457d"
To use a Custom Speech model for batch transcription, you need the model's URI. You can retrieve the model location when you create or get a model. The top-level `self` property in the response body is the model's URI. For an example, see the JSON response example in the Create a model guide.
Tip
A hosted deployment endpoint isn't required to use custom speech with the batch transcription service. You can conserve resources if the custom speech model is only used for batch transcription.
Batch transcription requests for expired models fail with a 4xx error. Set the `model` property to a base model or custom model that hasn't yet expired. Otherwise, don't include the `model` property, so the latest base model is always used. For more information, see Choose a model and Custom Speech model lifecycle.
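Since expired models cause 4xx failures, it can help to check the model's deprecation date before submitting a job. This sketch assumes the model JSON shape shown in the base model response later in this article (`properties.deprecationDates.transcriptionDateTime`, with a trailing `Z` UTC timestamp); the helper name is hypothetical:

```python
from datetime import datetime, timezone

def model_usable_for_transcription(model: dict, now: datetime = None) -> bool:
    """Return True if the model's transcription deprecation date is in the future."""
    now = now or datetime.now(timezone.utc)
    deadline = model["properties"]["deprecationDates"]["transcriptionDateTime"]
    # datetime.fromisoformat doesn't accept a trailing 'Z' on older Pythons,
    # so convert it to an explicit UTC offset first.
    expires = datetime.fromisoformat(deadline.replace("Z", "+00:00"))
    return now < expires
```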
Using Whisper models
Azure AI Speech supports OpenAI's Whisper model via the batch transcription API.
Note
Azure OpenAI Service also supports OpenAI's Whisper model for speech to text with a synchronous REST API. To learn more, check out the quickstart. Check out What is the Whisper model? to learn more about when to use Azure AI Speech vs. Azure OpenAI Service.
To use a Whisper model for batch transcription, you also need to set the `model` property. Whisper is a display-only model, so the lexical field isn't populated in the response.
Important
Whisper models are currently in preview. You should always use version 3.2 of the Speech to text REST API (available in a separate preview) for Whisper models.
Whisper models via batch transcription are supported in the East US, Southeast Asia, and West Europe regions.
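Because Whisper is display-only, result-processing code shouldn't assume the lexical field is present. A small sketch of that fallback, assuming result entries carry `lexical` and `display` fields as the lexical/display distinction in this article implies (field names are an assumption here, not confirmed by this article):

```python
def best_text(nbest_entry: dict) -> str:
    """Prefer lexical text when present; fall back to display text.

    For Whisper models the lexical field is empty, so display is used."""
    return nbest_entry.get("lexical") or nbest_entry.get("display", "")
```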
You can make a Models_ListBaseModels request to get available base models for all locales.
Make an HTTP GET request as shown in the following example for the `eastus` region. Replace `YourSubscriptionKey` with your Speech resource key. Replace `eastus` if you're using a different region.
curl -v -X GET "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base" -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey"
Make sure that you set the configuration variables for a Speech resource in one of the supported regions. You can run the `spx csr list --base` command to get available base models for all locales.
spx csr list --base --api-version v3.2-preview.1
The `displayName` property of a Whisper model contains "Whisper Preview", as shown in this example. Whisper is a display-only model, so the lexical field isn't populated in the transcription.
{
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6",
"links": {
"manifest": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6/manifest"
},
"properties": {
"deprecationDates": {
"adaptationDateTime": "2024-10-15T00:00:00Z",
"transcriptionDateTime": "2025-10-15T00:00:00Z"
},
"features": {
"supportsTranscriptions": true,
"supportsEndpoints": false,
"supportsTranscriptionsOnSpeechContainers": false,
"supportsAdaptationsWith": [],
"supportedOutputFormats": [
"Display"
]
},
"chargeForAdaptation": false
},
"lastActionDateTime": "2023-07-19T12:46:27Z",
"status": "Succeeded",
"createdDateTime": "2023-07-19T12:39:52Z",
"locale": "en-US",
"displayName": "20230707 Whisper Preview",
"description": "en-US base model"
}
You set the full model URI as shown in this example for the `eastus` region. Replace `YourSubscriptionKey` with your Speech resource key. Replace `eastus` if you're using a different region.
curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
"contentUrls": [
"https://crbn.us/hello.wav",
"https://crbn.us/whatstheweatherlike.wav"
],
"locale": "en-US",
"displayName": "My Transcription",
"model": {
"self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6"
},
"properties": {
"wordLevelTimestampsEnabled": true
}
}' "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/transcriptions"
spx batch transcription create --name "My Transcription" --language "en-US" --content "https://crbn.us/hello.wav;https://crbn.us/whatstheweatherlike.wav" --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6" --api-version v3.2-preview.1
Destination container URL
The transcription result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. In that case, when the transcription job is deleted, the transcription result data is also deleted.
You can store the results of a batch transcription in a writable Azure Blob storage container by using the `destinationContainerUrl` option in the batch transcription creation request. Note, however, that this option uses only an ad hoc SAS URI and doesn't support the Trusted Azure services security mechanism. This option also doesn't support access policy based SAS. The storage account resource of the destination container must allow all external traffic.
If you want to store the transcription results in an Azure Blob storage container via the Trusted Azure services security mechanism, consider using bring-your-own-storage (BYOS). For more information, see how to use a BYOS-enabled Speech resource for batch transcription.
Next steps