Create a batch transcription

Important

New pricing is in effect for batch transcription via Speech to text REST API v3.2. For more information, see the pricing guide.

With batch transcriptions, you submit the audio data, and then retrieve transcription results asynchronously. The service transcribes the audio data and stores the results in a storage container. You can then retrieve the results from the storage container.

Note

To use batch transcription, you need to use a standard (S0) Speech resource. Free resources (F0) aren't supported. For more information, see pricing and limits.

Create a transcription job

To create a transcription, use the Transcriptions_Create operation of the Speech to text REST API. Construct the request body according to the following instructions:

  • You must set either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
  • Set the required locale property. This should match the expected locale of the audio data to transcribe. The locale can't be changed later.
  • Set the required displayName property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later.
  • Optionally, to use a model other than the base model, set the model property to the model ID. For more information, see Using custom models and Using Whisper models.
  • Optionally, set the wordLevelTimestampsEnabled property to true to enable word-level timestamps in the transcription results. The default value is false.
  • Optionally, set the languageIdentification property. Language identification is used to identify languages spoken in audio when compared against a list of supported languages. If you set the languageIdentification property, then you must also set languageIdentification.candidateLocales with candidate locales.

For more information, see request configuration options.

Make an HTTP POST request using the URI as shown in the following Transcriptions_Create example. Replace YourSubscriptionKey with your Speech resource key, replace YourServiceRegion with your Speech resource region, and set the request body properties as previously described.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": null,
  "properties": {
    "wordLevelTimestampsEnabled": true,
    "languageIdentification": {
      "candidateLocales": [
        "en-US", "de-DE", "es-ES"
      ]
    }
  }
}'  "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
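If you're submitting the request from code rather than curl, the same call can be sketched in Python with only the standard library. This is a minimal sketch, not an official SDK sample: the key and region values are placeholders, and the request is built but not sent so you can inspect it first.

```python
import json
import urllib.request

# Placeholder values -- substitute your own Speech resource key and region.
SPEECH_KEY = "YourSubscriptionKey"
SPEECH_REGION = "YourServiceRegion"

def build_create_request(key: str, region: str, body: dict) -> urllib.request.Request:
    """Build (but don't send) the Transcriptions_Create POST request."""
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

request_body = {
    "contentUrls": [
        "https://crbn.us/hello.wav",
        "https://crbn.us/whatstheweatherlike.wav",
    ],
    "locale": "en-US",
    "displayName": "My Transcription",
    "properties": {"wordLevelTimestampsEnabled": True},
}

req = build_create_request(SPEECH_KEY, SPEECH_REGION, request_body)
# To actually submit the job:
# with urllib.request.urlopen(req) as resp:
#     transcription = json.load(resp)
```

Building the request separately from sending it keeps the sketch testable without hitting the service.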

You should receive a response body in the following format:

{
  "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/13fb305e-09ad-4bce-b3a1-938c9124dda3"
  },
  "links": {
    "files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/db474955-ab85-4c6c-ba6e-3bfe63d041ba/files"
  },
  "properties": {
    "diarizationEnabled": false,
    "wordLevelTimestampsEnabled": true,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "languageIdentification": {
      "candidateLocales": [
        "en-US",
        "de-DE",
        "es-ES"
      ]
    }
  },
  "lastActionDateTime": "2022-10-21T14:18:06Z",
  "status": "NotStarted",
  "createdDateTime": "2022-10-21T14:18:06Z",
  "locale": "en-US",
  "displayName": "My Transcription"
}

The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URI of the transcriptions and transcription report files. You also use this URI to update or delete a transcription.

You can query the status of your transcriptions with the Transcriptions_Get operation.

Call Transcriptions_Delete regularly to remove transcriptions from the service after you retrieve the results. Alternatively, set the timeToLive property to ensure the eventual deletion of the results.
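The get-then-delete workflow can be sketched as a simple polling loop. This is a sketch under assumptions, not an official sample: the fetch_status callable stands in for an HTTP GET on the transcription's self URI (via Transcriptions_Get), which lets the loop be exercised with a stub.

```python
import time
from typing import Callable

# Terminal states reported in a transcription's `status` field.
TERMINAL_STATES = {"Succeeded", "Failed"}

def wait_for_transcription(fetch_status: Callable[[], str],
                           poll_seconds: float = 5.0,
                           max_polls: int = 120) -> str:
    """Poll until the job reaches a terminal state; return that state.

    fetch_status is any callable returning the job's current `status`
    field (for example, by GETting the transcription's `self` URI)."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("transcription did not finish in time")

# Example with a stubbed status sequence instead of real HTTP calls:
states = iter(["NotStarted", "Running", "Succeeded"])
final = wait_for_transcription(lambda: next(states), poll_seconds=0)
# final == "Succeeded"
```

Once the job reaches Succeeded, retrieve the result files and then delete the transcription (or rely on timeToLive).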

To create a transcription, use the spx batch transcription create command. Construct the request parameters according to the following instructions:

  • Set the required content parameter. You can specify either a semicolon-delimited list of individual files or the URL for an entire container. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.
  • Set the required language property. This should match the expected locale of the audio data to transcribe. The locale can't be changed later. The Speech CLI language parameter corresponds to the locale property in the JSON request and response.
  • Set the required name property. Choose a transcription name that you can refer to later. The transcription name doesn't have to be unique and can be changed later. The Speech CLI name parameter corresponds to the displayName property in the JSON request and response.

Here's an example Speech CLI command that creates a transcription job:

spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav;https://crbn.us/whatstheweatherlike.wav

You should receive a response body in the following format:

{
  "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/13fb305e-09ad-4bce-b3a1-938c9124dda3"
  },
  "links": {
    "files": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions/7f4232d5-9873-47a7-a6f7-4a3f00d00dc0/files"
  },
  "properties": {
    "diarizationEnabled": false,
    "wordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  },
  "lastActionDateTime": "2022-10-21T14:21:59Z",
  "status": "NotStarted",
  "createdDateTime": "2022-10-21T14:21:59Z",
  "locale": "en-US",
  "displayName": "My Transcription",
  "description": ""
}

The top-level self property in the response body is the transcription's URI. Use this URI to get details such as the URI of the transcriptions and transcription report files. You also use this URI to update or delete a transcription.

For Speech CLI help with transcriptions, run the following command:

spx help batch transcription

Request configuration options

Here are some property options that you can use to configure a transcription when you call the Transcriptions_Create operation.

channels
An array of channel numbers to process. Channels 0 and 1 are transcribed by default.

contentContainerUrl
You can submit individual audio files or a whole storage container.

You must specify the audio data location via either the contentContainerUrl or contentUrls property. For more information about Azure blob storage for batch transcription, see Locate audio files for batch transcription.

This property isn't returned in the response.

contentUrls
You can submit individual audio files or a whole storage container.

You must specify the audio data location via either the contentContainerUrl or contentUrls property. For more information, see Locate audio files for batch transcription.

This property isn't returned in the response.

destinationContainerUrl
The result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. When the transcription job is deleted, the transcription result data is also deleted. For more information, such as the supported security scenarios, see Destination container URL.

diarization
Indicates that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains multiple voices. Specify the minimum and maximum number of people who might be speaking. You must also set the diarizationEnabled property to true. The transcription file contains a speaker entry for each transcribed phrase.

You need to use this property when you expect three or more speakers. For two speakers, setting the diarizationEnabled property to true is enough. For an example of the property usage, see the Transcriptions_Create operation description.

Diarization is the process of separating speakers in audio data. The batch pipeline can recognize and separate multiple speakers in mono channel recordings. The maximum number of speakers for diarization must be less than 36 and greater than or equal to the minSpeakers property (see example). The feature isn't available with stereo recordings.

When this property is set, source audio length can't exceed 240 minutes per file.

Note: This property is only available with Speech to text REST API version 3.1 and later.

diarizationEnabled
Specifies that diarization analysis should be carried out on the input, which is expected to be a mono channel that contains two voices. The default value is false.

For three or more voices, you also need to use the diarization property (only with Speech to text REST API version 3.1 and later).

When this property is set, source audio length can't exceed 240 minutes per file.

displayName
The name of the batch transcription. Choose a name that you can refer to later. The display name doesn't have to be unique.

This property is required.

displayFormWordLevelTimestampsEnabled
Specifies whether to include word-level timestamps on the display form of the transcription results. The results are returned in the displayWords property of the transcription file. The default value is false.

Note: This property is only available with Speech to text REST API version 3.1 and later.

languageIdentification
Language identification is used to identify languages spoken in audio when compared against a list of supported languages.

If you set the languageIdentification property, then you must also set its enclosed candidateLocales property.

languageIdentification.candidateLocales
The candidate locales for language identification, such as "properties": { "languageIdentification": { "candidateLocales": ["en-US", "de-DE", "es-ES"]}}. A minimum of two and a maximum of ten candidate locales, including the main locale for the transcription, are supported.

locale
The locale of the batch transcription. This value should match the expected locale of the audio data to transcribe. The locale can't be changed later.

This property is required.

model
You can set the model property to use a specific base model or Custom Speech model. If you don't specify a model, the default base model for the locale is used. For more information, see Using custom models and Using Whisper models.

profanityFilterMode
Specifies how to handle profanity in recognition results. Accepted values are None to disable profanity filtering, Masked to replace profanity with asterisks, Removed to remove all profanity from the result, or Tags to add profanity tags. The default value is Masked.

punctuationMode
Specifies how to handle punctuation in recognition results. Accepted values are None to disable punctuation, Dictated to imply explicit (spoken) punctuation, Automatic to let the decoder deal with punctuation, or DictatedAndAutomatic to use dictated and automatic punctuation. The default value is DictatedAndAutomatic.

timeToLive
A duration after the transcription job is created, after which the transcription results are automatically deleted. The value is an ISO 8601 encoded duration. For example, specify PT12H for 12 hours. As an alternative, you can call Transcriptions_Delete regularly after you retrieve the transcription results.

wordLevelTimestampsEnabled
Specifies whether word-level timestamps should be included in the output. The default value is false.

This property isn't applicable for Whisper models. Whisper is a display-only model, so the lexical field isn't populated in the transcription.
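Because timeToLive is an ISO 8601 duration and Python's standard library has no ISO 8601 duration parser, here's a minimal hand-rolled sketch (my own helper, not part of any Azure SDK) that handles simple day/hour/minute forms like PT12H, enough to compute when results expire:

```python
import re
from datetime import datetime, timedelta, timezone

def parse_simple_duration(value: str) -> timedelta:
    """Parse a small subset of ISO 8601 durations (days/hours/minutes only)."""
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?)?", value)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)

# Using the sample job's createdDateTime and a timeToLive of PT12H:
created = datetime(2022, 10, 21, 14, 18, 6, tzinfo=timezone.utc)
expires = created + parse_simple_duration("PT12H")
# expires == 2022-10-22 02:18:06+00:00
```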

For Speech CLI help with transcription configuration options, run the following command:

spx help batch transcription create advanced

Using custom models

Batch transcription uses the default base model for the locale that you specify. You don't need to set any properties to use the default base model.

Optionally, you can modify the previous create transcription example by setting the model property to use a specific base model or Custom Speech model.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/1aae1070-7972-47e9-a977-87e3b05c457d"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true
  }
}'  "https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"

spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav;https://crbn.us/whatstheweatherlike.wav --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.1/models/base/1aae1070-7972-47e9-a977-87e3b05c457d"

To use a Custom Speech model for batch transcription, you need the model's URI. You can retrieve the model location when you create or get a model. The top-level self property in the response body is the model's URI. For an example, see the JSON response example in the Create a model guide.

Tip

A hosted deployment endpoint isn't required to use custom speech with the batch transcription service. You can conserve resources if the custom speech model is only used for batch transcription.

Batch transcription requests for expired models will fail with a 4xx error. You'll want to set the model property to a base model or custom model that hasn't yet expired. Otherwise don't include the model property to always use the latest base model. For more information, see Choose a model and Custom Speech model lifecycle.
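You can guard against expired models before submitting a job by checking the model's deprecationDates.transcriptionDateTime field (returned by the models API, as in the sample response later in this article). This is a hedged sketch, not official SDK code:

```python
from datetime import datetime, timezone
from typing import Optional

def model_usable_for_transcription(model: dict,
                                   now: Optional[datetime] = None) -> bool:
    """Check a model's deprecationDates.transcriptionDateTime against now."""
    now = now or datetime.now(timezone.utc)
    deadline = model["properties"]["deprecationDates"]["transcriptionDateTime"]
    # API timestamps look like "2025-10-15T00:00:00Z".
    expires = datetime.fromisoformat(deadline.replace("Z", "+00:00"))
    return now < expires

model = {"properties": {"deprecationDates": {
    "transcriptionDateTime": "2025-10-15T00:00:00Z"}}}
usable = model_usable_for_transcription(
    model, now=datetime(2024, 1, 1, tzinfo=timezone.utc))
# usable is True before the deprecation date
```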

Using Whisper models

Azure AI Speech supports OpenAI's Whisper model via the batch transcription API.

Note

Azure OpenAI Service also supports OpenAI's Whisper model for speech to text with a synchronous REST API. To learn more, check out the quickstart. Check out What is the Whisper model? to learn more about when to use Azure AI Speech vs. Azure OpenAI Service.

To use a Whisper model for batch transcription, you also need to set the model property. Whisper is a display-only model, so the lexical field isn't populated in the response.

Important

Whisper models are currently in preview. You should always use version 3.2 of the Speech to text REST API (available in a separate preview) for Whisper models.

Whisper models via batch transcription are supported in the East US, Southeast Asia, and West Europe regions.

You can make a Models_ListBaseModels request to get available base models for all locales.

Make an HTTP GET request as shown in the following example for the eastus region. Replace YourSubscriptionKey with your Speech resource key. Replace eastus if you're using a different region.

curl -v -X GET "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base" -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey"

Make sure that you set the configuration variables for a Speech resource in one of the supported regions. You can run the spx csr list --base command to get available base models for all locales.

spx csr list --base --api-version v3.2-preview.1

The displayName property of a Whisper model will contain "Whisper Preview" as shown in this example. Whisper is a display-only model, so the lexical field isn't populated in the transcription.

{
  "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6",
  "links": {
    "manifest": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6/manifest"
  },
  "properties": {
    "deprecationDates": {
      "adaptationDateTime": "2024-10-15T00:00:00Z",
      "transcriptionDateTime": "2025-10-15T00:00:00Z"
    },
    "features": {
      "supportsTranscriptions": true,
      "supportsEndpoints": false,
      "supportsTranscriptionsOnSpeechContainers": false,
      "supportsAdaptationsWith": [],
      "supportedOutputFormats": [
        "Display"
      ]
    },
    "chargeForAdaptation": false
  },
  "lastActionDateTime": "2023-07-19T12:46:27Z",
  "status": "Succeeded",
  "createdDateTime": "2023-07-19T12:39:52Z",
  "locale": "en-US",
  "displayName": "20230707 Whisper Preview",
  "description": "en-US base model"
}
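Given the parsed model list, you can pick out Whisper models by displayName, as described above. A small sketch over already-parsed JSON (the second list entry is a hypothetical non-Whisper model added for contrast):

```python
def find_whisper_models(models: list) -> list:
    """Return base models whose displayName marks them as Whisper previews
    and that support batch transcriptions."""
    return [
        m for m in models
        if "Whisper" in m.get("displayName", "")
        and m.get("properties", {}).get("features", {}).get("supportsTranscriptions")
    ]

models = [
    {"displayName": "20230707 Whisper Preview",
     "properties": {"features": {"supportsTranscriptions": True}},
     "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6"},
    # Hypothetical non-Whisper entry for illustration:
    {"displayName": "Example base model",
     "properties": {"features": {"supportsTranscriptions": True}}},
]
whisper = find_whisper_models(models)
# whisper contains only the "20230707 Whisper Preview" entry
```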

You set the full model URI as shown in this example for the eastus region. Replace YourSubscriptionKey with your Speech resource key. Replace eastus if you're using a different region.

curl -v -X POST -H "Ocp-Apim-Subscription-Key: YourSubscriptionKey" -H "Content-Type: application/json" -d '{
  "contentUrls": [
    "https://crbn.us/hello.wav",
    "https://crbn.us/whatstheweatherlike.wav"
  ],
  "locale": "en-US",
  "displayName": "My Transcription",
  "model": {
    "self": "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6"
  },
  "properties": {
    "wordLevelTimestampsEnabled": true
  }
}'  "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/transcriptions"

spx batch transcription create --name "My Transcription" --language "en-US" --content https://crbn.us/hello.wav;https://crbn.us/whatstheweatherlike.wav --model "https://eastus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/models/base/d9cbeee6-582b-47ad-b5c1-6226583c92b6" --api-version v3.2-preview.1

Destination container URL

The transcription result can be stored in an Azure container. If you don't specify a container, the Speech service stores the results in a container managed by Microsoft. In that case, when the transcription job is deleted, the transcription result data is also deleted.

You can store the results of a batch transcription in a writable Azure Blob storage container by using the destinationContainerUrl option in the batch transcription creation request. Note, however, that this option uses only an ad hoc SAS URI and doesn't support the Trusted Azure services security mechanism. This option also doesn't support access policy based SAS. The storage account resource of the destination container must allow all external traffic.

If you want to store the transcription results in an Azure Blob storage container via the Trusted Azure services security mechanism, consider using Bring-your-own-storage (BYOS). For details on how to use a BYOS-enabled Speech resource for batch transcription, see this article.
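Because an ad hoc SAS URI is required here, a quick local sanity check can catch obviously malformed destinationContainerUrl values before a request fails. This sketch only checks for the standard Azure Storage SAS query parameters (sig, sp, se); it can't verify that the token is actually valid, and the example account and token values are hypothetical:

```python
from urllib.parse import urlparse, parse_qs

def looks_like_sas_url(url: str) -> bool:
    """Rough check that a destinationContainerUrl carries an ad hoc SAS
    token: https scheme plus the signature (sig), permissions (sp), and
    expiry (se) query parameters used by Azure Storage SAS tokens."""
    parts = urlparse(url)
    qs = parse_qs(parts.query)
    return parts.scheme == "https" and all(k in qs for k in ("sig", "sp", "se"))

# Hypothetical example URL (not a working SAS token):
ok = looks_like_sas_url(
    "https://myaccount.blob.core.windows.net/results"
    "?sp=w&se=2024-01-01T00:00:00Z&sv=2022-11-02&sig=abc123"
)
# ok == True
```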

Next steps