Transcriptions - Submit

Submits a new transcription job.

POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15

URI Parameters

Name In Required Type Description
endpoint
path True

string

Supported Cognitive Services endpoints (protocol and hostname, for example: https://westus.api.cognitive.microsoft.com).

api-version
query True

string

The requested API version.

Request Header

Name Required Type Description
Ocp-Apim-Subscription-Key True

string

Provide your Cognitive Services account key here.

Request Body

Name Required Type Description
displayName True

string

minLength: 1

The display name of the object.

locale True

string

minLength: 1

The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected.

properties True

TranscriptionProperties

TranscriptionProperties

contentContainerUrl

string (uri)

A URL for an Azure blob container that contains the audio files. A container can have a maximum size of 5 GB and a maximum of 10,000 blobs. The maximum size of a blob is 2.5 GB. The container SAS should contain 'r' (read) and 'l' (list) permissions (a sketch for generating such a SAS follows this table). This property will not be returned in a response.

contentUrls

string[] (uri)

A list of content URLs from which to get the audio files to transcribe. Up to 1,000 URLs are allowed. This property will not be returned in a response.

customProperties

object

The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters and the count of allowed entries is 10.

dataset

EntityReference

EntityReference

description

string

The description of the object.

model

EntityReference

EntityReference
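
The container SAS for contentContainerUrl needs the 'r' (read) and 'l' (list) permissions. Below is a minimal sketch, assuming the azure-storage-blob Python package and placeholder storage account values, of how such a SAS URL could be generated; treat it as illustrative rather than an official sample.

# Minimal sketch (not an official sample) of building a container SAS URL with
# read and list permissions for contentContainerUrl. Account, key, and container
# names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

account_name = "mystorageaccount"      # placeholder
account_key = "<storage-account-key>"  # placeholder
container_name = "audiofiles"          # placeholder

sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=12),
)

content_container_url = (
    f"https://{account_name}.blob.core.windows.net/{container_name}?{sas_token}"
)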

Responses

Name Type Description
201 Created

Transcription

The response contains information about the entity in the payload and its location in the Location header.

Headers

Location: string

Other Status Codes

Error

An error occurred.

Security

Ocp-Apim-Subscription-Key

Provide your Cognitive Services account key here.

Type: apiKey
In: header
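
As a minimal sketch of how the key header is sent and the job is submitted (assuming the Python requests package; the endpoint, key, and content URL values are placeholders), a client could call the operation like this. The JSON body mirrors the first example below.

# Minimal sketch, not an official sample: endpoint and key are placeholders.
import requests

endpoint = "https://westus.api.cognitive.microsoft.com"
subscription_key = "<your-speech-resource-key>"

body = {
    "displayName": "Transcription using default model for en-US",
    "locale": "en-US",
    "contentUrls": ["https://contoso.com/mystoragelocation"],
    "properties": {"punctuationMode": "DictatedAndAutomatic", "timeToLiveHours": 48},
}

response = requests.post(
    f"{endpoint}/speechtotext/transcriptions:submit",
    params={"api-version": "2024-11-15"},
    headers={"Ocp-Apim-Subscription-Key": subscription_key},
    json=body,
)
response.raise_for_status()                        # expect 201 Created
transcription = response.json()                    # the Transcription entity payload
transcription_url = response.headers["Location"]   # URL of the created entity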

Examples

Create a transcription for URIs
Create a transcription from blob container
Create a transcription with language identification
Create a transcription with multispeaker diarization

Create a transcription for URIs

Sample request

POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15


{
  "displayName": "Transcription using default model for en-US",
  "locale": "en-US",
  "contentUrls": [
    "https://contoso.com/mystoragelocation",
    "https://contoso.com/myotherstoragelocation"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48
  }
}

Sample response

{
  "self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
  "displayName": "Transcription using adapted model en-US",
  "customProperties": {
    "key": "value"
  },
  "locale": "en-US",
  "createdDateTime": "2019-01-07T11:34:12Z",
  "lastActionDateTime": "2019-01-07T11:36:07Z",
  "model": {
    "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
  },
  "links": {
    "files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "durationMilliseconds": 42000
  },
  "status": "Succeeded"
}

Create a transcription from blob container

Sample request

POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15


{
  "displayName": "Transcription of storage container using default model for en-US",
  "locale": "en-US",
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48
  },
  "contentContainerUrl": "https://customspeech-usw.blob.core.windows.net/artifacts/audiofiles/"
}

Sample response

Location: https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15
{
  "self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
  "displayName": "Transcription using adapted model en-US",
  "customProperties": {
    "key": "value"
  },
  "locale": "en-US",
  "createdDateTime": "2019-01-07T11:34:12Z",
  "lastActionDateTime": "2019-01-07T11:36:07Z",
  "model": {
    "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
  },
  "links": {
    "files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "durationMilliseconds": 42000
  },
  "status": "Succeeded"
}

Create a transcription with language identification

Sample request

POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15


{
  "displayName": "Transcription using language identification with three candidate languages, 'fr-FR' as fallback locale and a custom model for transcribing utterances that were classified as 'nl-NL' locale.",
  "locale": "fr-FR",
  "contentUrls": [
    "https://contoso.com/mystoragelocation"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "languageIdentification": {
      "candidateLocales": [
        "fr-FR",
        "nl-NL",
        "el-GR"
      ],
      "speechModelMapping": {
        "nl-NL": {
          "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
        }
      },
      "mode": "Single"
    }
  }
}

Sample response

{
  "self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
  "displayName": "Transcription using language identification with three candidate languages, 'fr-FR' as fallback locale and a custom model for transcribing utterances that were classified as 'nl-NL' locale.",
  "customProperties": {
    "key": "value"
  },
  "locale": "fr-FR",
  "createdDateTime": "2019-01-07T11:34:12Z",
  "lastActionDateTime": "2019-01-07T11:36:07Z",
  "model": {
    "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
  },
  "links": {
    "files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "languageIdentification": {
      "candidateLocales": [
        "fr-FR",
        "nl-NL",
        "el-GR"
      ],
      "speechModelMapping": {
        "nl-NL": {
          "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
        }
      },
      "mode": "Single"
    },
    "durationMilliseconds": 42000
  },
  "status": "Succeeded"
}

Create a transcription with multispeaker diarization

Sample request

POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15


{
  "displayName": "Transcription using diarization for audio that is known to contain speech from up to 5 speakers",
  "locale": "en-US",
  "contentUrls": [
    "https://contoso.com/mystoragelocation"
  ],
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "diarization": {
      "enabled": true,
      "maxSpeakers": 5
    }
  }
}

Sample response

{
  "self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
  "displayName": "Transcription using diarization for audio that is known to contain speech from up to 5 speakers",
  "customProperties": {
    "key": "value"
  },
  "locale": "en-US",
  "createdDateTime": "2019-01-07T11:34:12Z",
  "lastActionDateTime": "2019-01-07T11:36:07Z",
  "model": {
    "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
  },
  "links": {
    "files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
  },
  "properties": {
    "wordLevelTimestampsEnabled": false,
    "displayFormWordLevelTimestampsEnabled": false,
    "channels": [
      0,
      1
    ],
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked",
    "timeToLiveHours": 48,
    "diarization": {
      "enabled": true,
      "maxSpeakers": 5
    },
    "durationMilliseconds": 42000
  },
  "status": "Succeeded"
}

Definitions

DetailedErrorCode
DiarizationProperties
EntityError
EntityReference
Error
ErrorCode
InnerError
LanguageIdentificationMode
LanguageIdentificationProperties
ProfanityFilterMode
PunctuationMode
Status
Transcription
TranscriptionLinks
TranscriptionProperties

DetailedErrorCode

Value Description
InvalidParameterValue

Invalid parameter value.

InvalidRequestBodyFormat

Invalid request body format.

EmptyRequest

Empty Request.

MissingInputRecords

Missing Input Records.

InvalidDocument

Invalid Document.

ModelVersionIncorrect

Model Version Incorrect.

InvalidDocumentBatch

Invalid Document Batch.

UnsupportedLanguageCode

Unsupported language code.

DataImportFailed

Data import failed.

InUseViolation

In use violation.

InvalidLocale

Invalid locale.

InvalidBaseModel

Invalid base model.

InvalidAdaptationMapping

Invalid adaptation mapping.

InvalidDataset

Invalid dataset.

InvalidTest

Invalid test.

FailedDataset

Failed dataset.

InvalidModel

Invalid model.

InvalidTranscription

Invalid transcription.

InvalidPayload

Invalid payload.

InvalidParameter

Invalid parameter.

EndpointWithoutLogging

Endpoint without logging.

InvalidPermissions

Invalid permissions.

InvalidPrerequisite

Invalid prerequisite.

InvalidProductId

Invalid product id.

InvalidSubscription

Invalid subscription.

InvalidProject

Invalid project.

InvalidProjectKind

Invalid project kind.

InvalidRecordingsUri

Invalid recordings uri.

OnlyOneOfUrlsOrContainerOrDataset

Only one of urls or container or dataset.

ExceededNumberOfRecordingsUris

Exceeded number of recordings uris.

InvalidChannels

Invalid channels.

ModelMismatch

Model mismatch.

ProjectGenderMismatch

Project gender mismatch.

ModelDeprecated

Model deprecated.

ModelExists

Model exists.

ModelNotDeployable

Model not deployable.

EndpointNotUpdatable

Endpoint not updatable.

SingleDefaultEndpoint

Single default endpoint.

EndpointCannotBeDefault

Endpoint cannot be default.

InvalidModelUri

Invalid model uri.

SubscriptionNotFound

Subscription not found.

QuotaViolation

Quota violation.

UnsupportedDelta

Unsupported delta.

UnsupportedFilter

Unsupported filter.

UnsupportedPagination

Unsupported pagination.

UnsupportedDynamicConfiguration

Unsupported dynamic configuration.

UnsupportedOrderBy

Unsupported order by.

NoUtf8WithBom

No utf8 with bom.

ModelDeploymentNotCompleteState

Model deployment not complete state.

SkuLimitsExist

Sku limits exist.

DeployingFailedModel

Deploying failed model.

UnsupportedTimeRange

Unsupported time range.

InvalidLogDate

Invalid log date.

InvalidLogId

Invalid log id.

InvalidLogStartTime

Invalid log start time.

InvalidLogEndTime

Invalid log end time.

InvalidTopForLogs

Invalid top for logs.

InvalidSkipTokenForLogs

Invalid skip token for logs.

DeleteNotAllowed

Delete not allowed.

Forbidden

Forbidden.

DeployNotAllowed

Deploy not allowed.

UnexpectedError

Unexpected error.

InvalidCollection

Invalid collection.

InvalidCallbackUri

Invalid callback uri.

InvalidSasValidityDuration

Invalid sas validity duration.

InaccessibleCustomerStorage

Inaccessible customer storage.

UnsupportedClassBasedAdaptation

Unsupported class based adaptation.

InvalidWebHookEventKind

Invalid web hook event kind.

InvalidTimeToLive

Invalid time to live.

InvalidSourceAzureResourceId

Invalid source Azure resource ID.

ModelCopyAuthorizationExpired

Expired ModelCopyAuthorization.

EndpointLoggingNotSupported

Endpoint logging not supported.

NoLanguageIdentified

Language Identification did not recognize any language.

MultipleLanguagesIdentified

Language Identification recognized multiple languages. No dominant language could be determined.

InvalidAudioFormat

The format of input audio is not supported.

BadChannelConfiguration

There is a mismatch between the audio channels in the data, in the configuration, or in the requirements of the application.

InvalidChannelSpecification

The selection of channels in the transcription request is not supported (for example, neither channel 0 nor channel 1 was selected).

AudioLengthLimitExceeded

The audio file is longer than the maximum allowed duration.

EmptyAudioFile

The audio file is empty.

DiarizationProperties

Name Type Description
enabled

boolean

A value indicating whether speaker diarization is enabled.

maxSpeakers

integer (int32)

minimum: 2
maximum: 35

A hint for the maximum number of speakers for diarization. Must be between 2 and 35, inclusive.

EntityError

Name Type Description
code

string

The code of this error.

message

string

The message for this error.

EntityReference

Name Type Description
self

string (uri)

The location of the referenced entity.

Error

Name Type Description
code

ErrorCode

ErrorCode
High level error codes.

details

Error[]

Additional supportive details regarding the error and/or expected policies.

innerError

InnerError

InnerError
New inner error format, which conforms to the Cognitive Services API guidelines (available at https://microsoft.sharepoint.com/%3Aw%3A/t/CognitiveServicesPMO/EUoytcrjuJdKpeOKIK_QRC8BPtUYQpKBi8JsWyeDMRsWlQ?e=CPq8ow). It contains the required properties ErrorCode and message, and the optional properties target, details (key-value pairs), and innerError (which can be nested).

message

string

High level error message.

target

string

The source of the error. For example, it would be "documents" or "document id" in the case of an invalid document.
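
For illustration only (the specific values are hypothetical), an error payload assembled from these fields and the error code enumerations below might look like:

{
  "code": "InvalidRequest",
  "message": "The request body is invalid.",
  "target": "properties.timeToLiveHours",
  "innerError": {
    "code": "InvalidTimeToLive",
    "message": "The requested time to live is outside the supported range."
  }
}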

ErrorCode

Value Description
InvalidRequest

Representing the invalid request error code.

InvalidArgument

Representing the invalid argument error code.

InternalServerError

Representing the internal server error error code.

ServiceUnavailable

Representing the service unavailable error code.

NotFound

Representing the not found error code.

PipelineError

Representing the pipeline error error code.

Conflict

Representing the conflict error code.

InternalCommunicationFailed

Representing the internal communication failed error code.

Forbidden

Representing the forbidden error code.

NotAllowed

Representing the not allowed error code.

Unauthorized

Representing the unauthorized error code.

UnsupportedMediaType

Representing the unsupported media type error code.

TooManyRequests

Representing the too many requests error code.

UnprocessableEntity

Representing the unprocessable entity error code.

InnerError

Name Type Description
code

DetailedErrorCode

DetailedErrorCode
Detailed error code enum.

details

object

Additional supportive details regarding the error and/or expected policies.

innerError

InnerError

InnerError
New inner error format, which conforms to the Cognitive Services API guidelines (available at https://microsoft.sharepoint.com/%3Aw%3A/t/CognitiveServicesPMO/EUoytcrjuJdKpeOKIK_QRC8BPtUYQpKBi8JsWyeDMRsWlQ?e=CPq8ow). It contains the required properties ErrorCode and message, and the optional properties target, details (key-value pairs), and innerError (which can be nested).

message

string

High level error message.

target

string

The source of the error. For example, it would be "documents" or "document id" in the case of an invalid document.

LanguageIdentificationMode

Value Description
Continuous

Continuous language identification (Default).

Single

Single language identification. If no language can be identified, the error code NoLanguageIdentified is returned to the user. If there is ambiguity between multiple languages, the error code MultipleLanguagesIdentified is returned to the user.

LanguageIdentificationProperties

Name Type Default value Description
candidateLocales

string[]

The candidate locales for language identification (for example, ["en-US", "de-DE", "es-ES"]). A minimum of 2 and a maximum of 10 candidate locales, including the main locale of the transcription, are supported in continuous mode. For single language identification, the maximum number of candidate locales is unbounded.

mode

LanguageIdentificationMode

Continuous

LanguageIdentificationMode
The mode used for language identification.

speechModelMapping

<string,  EntityReference>

An optional mapping of locales to speech model entities. If no model is given for a locale, the default base model is used. Keys must be locales contained in the candidate locales; values are entity references to the models for the respective locales.

ProfanityFilterMode

Value Description
None

Disable profanity filtering.

Removed

Remove profanity.

Tags

Add "profanity" XML tags</Profanity>

Masked

Mask profanity with asterisks, except for the first letter, e.g., f***

PunctuationMode

Value Description
None

No punctuation.

Dictated

Dictated punctuation marks only, i.e., explicit punctuation.

Automatic

Automatic punctuation.

DictatedAndAutomatic

Dictated punctuation marks or automatic punctuation.

Status

Value Description
NotStarted

The long running operation has not yet started.

Running

The long running operation is currently processing.

Succeeded

The long running operation has successfully completed.

Failed

The long running operation has failed.
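
Because transcription is a long-running operation, clients typically poll the entity until it reaches a terminal status. Below is a minimal sketch, assuming the Python requests package and the self/Location URL returned on creation (which already contains the api-version query parameter); the key value is a placeholder.

# Minimal sketch of polling a submitted transcription until it finishes.
import time

import requests

headers = {"Ocp-Apim-Subscription-Key": "<your-speech-resource-key>"}  # placeholder

def wait_for_transcription(transcription_url: str) -> dict:
    # Poll the Transcription entity until its status is Succeeded or Failed.
    while True:
        entity = requests.get(transcription_url, headers=headers).json()
        if entity["status"] in ("Succeeded", "Failed"):
            return entity
        time.sleep(30)  # the polling interval is a client choice, not a service requirement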

Transcription

Name Type Description
contentContainerUrl

string (uri)

A URL for an Azure blob container that contains the audio files. A container can have a maximum size of 5 GB and a maximum of 10,000 blobs. The maximum size of a blob is 2.5 GB. The container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response.

contentUrls

string[] (uri)

A list of content URLs from which to get the audio files to transcribe. Up to 1,000 URLs are allowed. This property will not be returned in a response.

createdDateTime

string (date-time)

The time-stamp when the object was created. The time stamp is encoded as ISO 8601 date and time format ("YYYY-MM-DDThh:mm:ssZ", see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations).

customProperties

object

The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters and the count of allowed entries is 10.

dataset

EntityReference

EntityReference

description

string

The description of the object.

displayName

string

minLength: 1

The display name of the object.

lastActionDateTime

string (date-time)

The time-stamp when the current status was entered. The time stamp is encoded as ISO 8601 date and time format ("YYYY-MM-DDThh:mm:ssZ", see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations).

links

TranscriptionLinks

TranscriptionLinks

locale

string

minLength: 1

The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected.

model

EntityReference

EntityReference

properties

TranscriptionProperties

TranscriptionProperties

self

string (uri)

The location of this entity.

status

Status

Status
Describes the current status of the long-running operation.

TranscriptionLinks

Name Type Description
files

string (uri)

The location to get all files of this entity. See operation "Transcriptions_ListFiles" for more details.
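
Once the status is Succeeded, the links.files URL can be queried for the result files (operation "Transcriptions_ListFiles"). The sketch below assumes the Python requests package and that the file list is a paged collection whose entries expose "kind" and "links.contentUrl" fields; verify the exact shape against the Transcriptions_ListFiles reference.

# Minimal sketch of retrieving result files for a finished transcription.
import requests

headers = {"Ocp-Apim-Subscription-Key": "<your-speech-resource-key>"}  # placeholder

def download_results(entity: dict) -> list[dict]:
    # Fetch every result file of kind "Transcription" referenced by the entity.
    files = requests.get(entity["links"]["files"], headers=headers).json()
    results = []
    for item in files.get("values", []):               # assumed collection field
        if item.get("kind") == "Transcription":        # assumed kind value
            content_url = item["links"]["contentUrl"]  # assumed pre-signed result URL
            results.append(requests.get(content_url).json())
    return results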

TranscriptionProperties

Name Type Default value Description
channels

integer[] (int32)

A collection of the requested channel numbers. By default, channels 0 and 1 are considered.

destinationContainerUrl

string (uri)

The requested destination container.

Remarks

When a destination container is used in combination with a timeToLive, the metadata of a transcription will be deleted normally, but the data stored in the destination container, including transcription results, will remain untouched, because no delete permissions are required for this container.

To support automatic cleanup, either configure blob lifetimes on the container, or use "Bring your own Storage (BYOS)" instead of destinationContainerUrl, where blobs can be cleaned up.

diarization

DiarizationProperties

DiarizationProperties

displayFormWordLevelTimestampsEnabled

boolean

A value indicating whether word level timestamps for the display form are requested. The default value is false.

durationMilliseconds

integer (int64)

0

The duration in milliseconds of the transcription. Durations larger than 2^53-1 are not supported to ensure compatibility with JavaScript integers.

error

EntityError

EntityError

languageIdentification

LanguageIdentificationProperties

LanguageIdentificationProperties

profanityFilterMode

ProfanityFilterMode

ProfanityFilterMode
Mode of profanity filtering.

punctuationMode

PunctuationMode

PunctuationMode
The mode used for punctuation.

timeToLiveHours

integer (int32)

How long the transcription will be kept in the system after it has completed. Once the transcription reaches the time to live after completion (successful or failed), it will be automatically deleted.

Note: When using BYOS (bring your own storage), the result files on the customer-owned storage account will also be deleted. Use either destinationContainerUrl to specify a separate container for result files, which will not be deleted when the timeToLive expires, or retrieve the result files through the API and store them as needed. An illustrative request fragment that combines timeToLiveHours and destinationContainerUrl follows at the end of this section.

The shortest supported duration is 6 hours, and the longest supported duration is 31 days. 2 days (48 hours) is the recommended default value when data is consumed directly.

wordLevelTimestampsEnabled

boolean

A value indicating whether word level timestamps are requested. The default value is false.
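
As an illustrative request body fragment (the display name, content URL, and container SAS URL are placeholders), timeToLiveHours and destinationContainerUrl can be combined so that transcription metadata expires while the result files remain in the customer container:

{
  "displayName": "Transcription with results copied to a customer container",
  "locale": "en-US",
  "contentUrls": [
    "https://contoso.com/mystoragelocation"
  ],
  "properties": {
    "timeToLiveHours": 48,
    "destinationContainerUrl": "https://contoso.blob.core.windows.net/results?<container-SAS>"
  }
}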