Transcriptions - Submit
Submits a new transcription job.
POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15
URI Parameters
Name | In | Required | Type | Description |
---|---|---|---|---|
endpoint | path | True | string | Supported Cognitive Services endpoints (protocol and hostname, for example: https://westus.api.cognitive.microsoft.com). |
api-version | query | True | string | The requested API version. |
Request Header
Name | Required | Type | Description |
---|---|---|---|
Ocp-Apim-Subscription-Key | True | string | Provide your Cognitive Services account key here. |
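As a rough illustration of how the URI parameters and the request header fit together, the following Python sketch builds the POST request without sending it. The helper name `build_submit_request`, the placeholder key, and the `Content-Type: application/json` header are assumptions made for this example; only the URL template and the `Ocp-Apim-Subscription-Key` header come from this page.

```python
import json
import urllib.request

API_VERSION = "2024-11-15"


def build_submit_request(endpoint: str, subscription_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for Transcriptions - Submit."""
    url = f"{endpoint.rstrip('/')}/speechtotext/transcriptions:submit?api-version={API_VERSION}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            # Assumption: the JSON body is sent with this content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_submit_request(
    "https://westus.api.cognitive.microsoft.com",
    "YOUR_SUBSCRIPTION_KEY",  # placeholder, not a real key
    {
        "displayName": "Transcription using default model for en-US",
        "locale": "en-US",
        "contentUrls": ["https://contoso.com/mystoragelocation"],
        "properties": {"timeToLiveHours": 48},
    },
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send the request; that step is left out here so the sketch runs without network access.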
Request Body
Name | Required | Type | Description |
---|---|---|---|
displayName | True | string minLength: 1 | The display name of the object. |
locale | True | string minLength: 1 | The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected. |
properties | True | TranscriptionProperties | |
contentContainerUrl | | string (uri) | A URL for an Azure blob container that contains the audio files. A container is allowed to have a maximum size of 5 GB and a maximum number of 10000 blobs. The maximum size for a blob is 2.5 GB. Container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response. |
contentUrls | | string[] (uri) | A list of content URLs to get audio files to transcribe. Up to 1000 URLs are allowed. This property will not be returned in a response. |
customProperties | | object | The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters, and the count of allowed entries is 10. |
dataset | | EntityReference | |
description | | string | The description of the object. |
model | | EntityReference | |
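The limits in the table above can be checked on the client before submitting. The sketch below is one possible pre-flight check, not part of the service: `validate_submit_body` is a hypothetical helper, and the rule that exactly one of `contentUrls`, `contentContainerUrl`, or `dataset` must be present is an assumption inferred from the `OnlyOneOfUrlsOrContainerOrDataset` error code rather than something the table states.

```python
def validate_submit_body(body: dict) -> list:
    """Return human-readable problems found in a submit request body (empty if none)."""
    problems = []
    # displayName and locale are required strings with minLength 1.
    for field in ("displayName", "locale"):
        value = body.get(field)
        if not isinstance(value, str) or len(value) < 1:
            problems.append(f"{field} is required and must have at least 1 character")
    if not isinstance(body.get("properties"), dict):
        problems.append("properties is required")
    if len(body.get("contentUrls") or []) > 1000:
        problems.append("contentUrls allows up to 1000 URLs")
    custom = body.get("customProperties") or {}
    if len(custom) > 10:
        problems.append("customProperties allows at most 10 entries")
    for key, value in custom.items():
        if len(key) > 64:
            problems.append(f"customProperties key {key!r} is longer than 64 characters")
        if len(str(value)) > 256:
            problems.append(f"customProperties value for {key!r} is longer than 256 characters")
    # Assumption: exactly one audio source is expected (see lead-in).
    sources = [n for n in ("contentUrls", "contentContainerUrl", "dataset") if body.get(n)]
    if len(sources) != 1:
        problems.append("provide exactly one of contentUrls, contentContainerUrl, or dataset")
    return problems


good_body = {
    "displayName": "Transcription using default model for en-US",
    "locale": "en-US",
    "contentUrls": ["https://contoso.com/mystoragelocation"],
    "properties": {"timeToLiveHours": 48},
}
print(validate_submit_body(good_body))
```

The service remains the authority on what it accepts; a check like this only catches the documented limits early.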
Responses
Name | Type | Description |
---|---|---|
201 Created | Transcription | The response contains information about the entity as payload and its location as header. Headers: Location: string |
Other Status Codes | Error | An error occurred. |
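A client typically needs the location of the newly created transcription after a 201 response. The following sketch shows one way to pick it up, assuming the caller already has the status code, the response headers as a dict, and the parsed JSON payload. Both helper names are hypothetical, and treating the last path segment as the transcription ID is an assumption based on the sample URLs on this page.

```python
from urllib.parse import urlparse


def created_transcription_location(status: int, headers: dict, payload: dict) -> str:
    """Return the URL of the created transcription from a 201 response."""
    if status != 201:
        raise RuntimeError(f"expected 201 Created, got {status}")
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Fall back to the payload's "self" link if the header is missing.
    return lowered.get("location") or payload["self"]


def transcription_id_from_url(url: str) -> str:
    """Return the last path segment of a transcription URL, ignoring the query string."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


location = created_transcription_location(
    201,
    {"Location": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15"},
    {},
)
print(transcription_id_from_url(location))
```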
Security
Ocp-Apim-Subscription-Key
Provide your Cognitive Services account key here.
Type: apiKey
In: header
Examples
Create a transcription for URIs
Create a transcription from blob container
Create a transcription with language identification
Create a transcription with multispeaker diarization
Create a transcription for URIs
Sample request
POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15
{
"displayName": "Transcription using default model for en-US",
"locale": "en-US",
"contentUrls": [
"https://contoso.com/mystoragelocation",
"https://contoso.com/myotherstoragelocation"
],
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48
}
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
"displayName": "Transcription using adapted model en-US",
"customProperties": {
"key": "value"
},
"locale": "en-US",
"createdDateTime": "2019-01-07T11:34:12Z",
"lastActionDateTime": "2019-01-07T11:36:07Z",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
},
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48,
"durationMilliseconds": 42000
},
"status": "Succeeded"
}
Create a transcription from blob container
Sample request
POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15
{
"displayName": "Transcription of storage container using default model for en-US",
"locale": "en-US",
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48
},
"contentContainerUrl": "https://customspeech-usw.blob.core.windows.net/artifacts/audiofiles/"
}
Sample response
Location: https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
"displayName": "Transcription using adapted model en-US",
"customProperties": {
"key": "value"
},
"locale": "en-US",
"createdDateTime": "2019-01-07T11:34:12Z",
"lastActionDateTime": "2019-01-07T11:36:07Z",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
},
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48,
"durationMilliseconds": 42000
},
"status": "Succeeded"
}
Create a transcription with language identification
Sample request
POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15
{
"displayName": "Transcription using language identification with three candidate languages, 'fr-FR' as fallback locale and a custom model for transcribing utterances that were classified as 'nl-NL' locale.",
"locale": "fr-FR",
"contentUrls": [
"https://contoso.com/mystoragelocation"
],
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48,
"languageIdentification": {
"candidateLocales": [
"fr-FR",
"nl-NL",
"el-GR"
],
"speechModelMapping": {
"nl-NL": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
}
},
"mode": "Single"
}
}
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
"displayName": "Transcription using language identification with three candidate languages, 'fr-FR' as fallback locale and a custom model for transcribing utterances that were classified as 'nl-NL' locale.",
"customProperties": {
"key": "value"
},
"locale": "fr-FR",
"createdDateTime": "2019-01-07T11:34:12Z",
"lastActionDateTime": "2019-01-07T11:36:07Z",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
},
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48,
"languageIdentification": {
"candidateLocales": [
"fr-FR",
"nl-NL",
"el-GR"
],
"speechModelMapping": {
"nl-NL": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
}
},
"mode": "Single"
},
"durationMilliseconds": 42000
},
"status": "Succeeded"
}
Create a transcription with multispeaker diarization
Sample request
POST {endpoint}/speechtotext/transcriptions:submit?api-version=2024-11-15
{
"displayName": "Transcription using diarization for audio that is known to contain speech from up to 5 speakers",
"locale": "en-US",
"contentUrls": [
"https://contoso.com/mystoragelocation"
],
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48,
"diarization": {
"enabled": true,
"maxSpeakers": 5
}
}
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683?api-version=2024-11-15",
"displayName": "Transcription using diarization for audio that is known to contain speech from up to 5 speakers",
"customProperties": {
"key": "value"
},
"locale": "en-US",
"createdDateTime": "2019-01-07T11:34:12Z",
"lastActionDateTime": "2019-01-07T11:36:07Z",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files?api-version=2024-11-15"
},
"properties": {
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"timeToLiveHours": 48,
"diarization": {
"enabled": true,
"maxSpeakers": 5
},
"durationMilliseconds": 42000
},
"status": "Succeeded"
}
Definitions
Name | Description |
---|---|
DetailedErrorCode | DetailedErrorCode |
DiarizationProperties | DiarizationProperties |
EntityError | EntityError |
EntityReference | EntityReference |
Error | Error |
ErrorCode | ErrorCode |
InnerError | InnerError |
LanguageIdentificationMode | LanguageIdentificationMode |
LanguageIdentificationProperties | LanguageIdentificationProperties |
ProfanityFilterMode | ProfanityFilterMode |
PunctuationMode | PunctuationMode |
Status | Status |
Transcription | Transcription |
TranscriptionLinks | TranscriptionLinks |
TranscriptionProperties | TranscriptionProperties |
DetailedErrorCode
Value | Description |
---|---|
InvalidParameterValue | Invalid parameter value. |
InvalidRequestBodyFormat | Invalid request body format. |
EmptyRequest | Empty Request. |
MissingInputRecords | Missing Input Records. |
InvalidDocument | Invalid Document. |
ModelVersionIncorrect | Model Version Incorrect. |
InvalidDocumentBatch | Invalid Document Batch. |
UnsupportedLanguageCode | Unsupported language code. |
DataImportFailed | Data import failed. |
InUseViolation | In use violation. |
InvalidLocale | Invalid locale. |
InvalidBaseModel | Invalid base model. |
InvalidAdaptationMapping | Invalid adaptation mapping. |
InvalidDataset | Invalid dataset. |
InvalidTest | Invalid test. |
FailedDataset | Failed dataset. |
InvalidModel | Invalid model. |
InvalidTranscription | Invalid transcription. |
InvalidPayload | Invalid payload. |
InvalidParameter | Invalid parameter. |
EndpointWithoutLogging | Endpoint without logging. |
InvalidPermissions | Invalid permissions. |
InvalidPrerequisite | Invalid prerequisite. |
InvalidProductId | Invalid product id. |
InvalidSubscription | Invalid subscription. |
InvalidProject | Invalid project. |
InvalidProjectKind | Invalid project kind. |
InvalidRecordingsUri | Invalid recordings uri. |
OnlyOneOfUrlsOrContainerOrDataset | Only one of urls or container or dataset. |
ExceededNumberOfRecordingsUris | Exceeded number of recordings uris. |
InvalidChannels | Invalid channels. |
ModelMismatch | Model mismatch. |
ProjectGenderMismatch | Project gender mismatch. |
ModelDeprecated | Model deprecated. |
ModelExists | Model exists. |
ModelNotDeployable | Model not deployable. |
EndpointNotUpdatable | Endpoint not updatable. |
SingleDefaultEndpoint | Single default endpoint. |
EndpointCannotBeDefault | Endpoint cannot be default. |
InvalidModelUri | Invalid model uri. |
SubscriptionNotFound | Subscription not found. |
QuotaViolation | Quota violation. |
UnsupportedDelta | Unsupported delta. |
UnsupportedFilter | Unsupported filter. |
UnsupportedPagination | Unsupported pagination. |
UnsupportedDynamicConfiguration | Unsupported dynamic configuration. |
UnsupportedOrderBy | Unsupported order by. |
NoUtf8WithBom | No utf8 with bom. |
ModelDeploymentNotCompleteState | Model deployment not complete state. |
SkuLimitsExist | Sku limits exist. |
DeployingFailedModel | Deploying failed model. |
UnsupportedTimeRange | Unsupported time range. |
InvalidLogDate | Invalid log date. |
InvalidLogId | Invalid log id. |
InvalidLogStartTime | Invalid log start time. |
InvalidLogEndTime | Invalid log end time. |
InvalidTopForLogs | Invalid top for logs. |
InvalidSkipTokenForLogs | Invalid skip token for logs. |
DeleteNotAllowed | Delete not allowed. |
Forbidden | Forbidden. |
DeployNotAllowed | Deploy not allowed. |
UnexpectedError | Unexpected error. |
InvalidCollection | Invalid collection. |
InvalidCallbackUri | Invalid callback uri. |
InvalidSasValidityDuration | Invalid sas validity duration. |
InaccessibleCustomerStorage | Inaccessible customer storage. |
UnsupportedClassBasedAdaptation | Unsupported class based adaptation. |
InvalidWebHookEventKind | Invalid web hook event kind. |
InvalidTimeToLive | Invalid time to live. |
InvalidSourceAzureResourceId | Invalid source Azure resource ID. |
ModelCopyAuthorizationExpired | Expired ModelCopyAuthorization. |
EndpointLoggingNotSupported | Endpoint logging not supported. |
NoLanguageIdentified | Language Identification did not recognize any language. |
MultipleLanguagesIdentified | Language Identification recognized multiple languages. No dominant language could be determined. |
InvalidAudioFormat | The format of input audio is not supported. |
BadChannelConfiguration | There is a mismatch between audio channels in the data, in the configuration, or the requirements of the application. |
InvalidChannelSpecification | The selection of channels in the transcription request is not supported (e.g., neither 0 nor 1 have been selected). |
AudioLengthLimitExceeded | The audio file is longer than the maximum allowed duration. |
EmptyAudioFile | The audio file is empty. |
DiarizationProperties
Name | Type | Description |
---|---|---|
enabled | boolean | A value indicating whether speaker diarization is enabled. |
maxSpeakers | integer (int32) minimum: 2, maximum: 35 | A hint for the maximum number of speakers for diarization. Must be greater than 1 and less than 36. |
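The `maxSpeakers` range can be enforced when the `diarization` object is assembled. The helper below, `diarization_properties`, is a hypothetical name used only for this sketch; the 2 to 35 range is taken from the table above.

```python
def diarization_properties(max_speakers: int, enabled: bool = True) -> dict:
    """Build a diarization object, rejecting maxSpeakers outside the documented 2..35 range."""
    if not 2 <= max_speakers <= 35:
        raise ValueError("maxSpeakers must be greater than 1 and less than 36")
    return {"enabled": enabled, "maxSpeakers": max_speakers}


print(diarization_properties(5))
```

The result can be placed under `properties.diarization` in the request body, as in the multispeaker diarization example.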
EntityError
Name | Type | Description |
---|---|---|
code | string | The code of this error. |
message | string | The message for this error. |
EntityReference
Name | Type | Description |
---|---|---|
self | string (uri) | The location of the referenced entity. |
Error
Name | Type | Description |
---|---|---|
code | ErrorCode | |
details | Error[] | Additional supportive details regarding the error and/or expected policies. |
innerError | InnerError | |
message | string | High level error message. |
target | string | The source of the error. For example, it would be "documents" or "document id" in case of an invalid document. |
ErrorCode
Value | Description |
---|---|
InvalidRequest | Representing the invalid request error code. |
InvalidArgument | Representing the invalid argument error code. |
InternalServerError | Representing the internal server error error code. |
ServiceUnavailable | Representing the service unavailable error code. |
NotFound | Representing the not found error code. |
PipelineError | Representing the pipeline error error code. |
Conflict | Representing the conflict error code. |
InternalCommunicationFailed | Representing the internal communication failed error code. |
Forbidden | Representing the forbidden error code. |
NotAllowed | Representing the not allowed error code. |
Unauthorized | Representing the unauthorized error code. |
UnsupportedMediaType | Representing the unsupported media type error code. |
TooManyRequests | Representing the too many requests error code. |
UnprocessableEntity | Representing the unprocessable entity error code. |
InnerError
Name | Type | Description |
---|---|---|
code | DetailedErrorCode | |
details | object | Additional supportive details regarding the error and/or expected policies. |
innerError | InnerError | |
message | string | High level error message. |
target | string | The source of the error. For example, it would be "documents" or "document id" in case of an invalid document. |
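Because `InnerError` can itself contain an `innerError`, the most specific code sits at the end of a chain. The sketch below walks that chain. The helper name `most_specific_error` and the sample payload are made up for illustration; the payload only follows the field names in the Error and InnerError tables and is not a captured service response.

```python
def most_specific_error(error: dict) -> tuple:
    """Follow innerError links and return (code, message) of the deepest entry."""
    current = error
    while isinstance(current.get("innerError"), dict):
        current = current["innerError"]
    return current.get("code", ""), current.get("message", "")


# Hypothetical error object shaped like the Error definition.
sample_error = {
    "code": "InvalidArgument",
    "message": "The request is invalid.",
    "innerError": {
        "code": "InvalidChannels",
        "message": "Invalid channels.",
    },
}
print(most_specific_error(sample_error))
```

The top-level `code` is an ErrorCode value, while nested codes are DetailedErrorCode values, so the deepest entry usually gives the more actionable reason.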
LanguageIdentificationMode
Value | Description |
---|---|
Continuous | Continuous language identification (Default). |
Single | Single language identification. If no language can be identified, the error code NoLanguageIdentified is returned to the user. If there is ambiguity between multiple languages, the error code MultipleLanguagesIdentified is returned to the user. |
LanguageIdentificationProperties
Name | Type | Default value | Description |
---|---|---|---|
candidateLocales | string[] | | The candidate locales for language identification (example ["en-US", "de-DE", "es-ES"]). A minimum of 2 and a maximum of 10 candidate locales, including the main locale for the transcription, is supported for continuous mode. For single language identification, the maximum number of candidate locales is unbounded. |
mode | LanguageIdentificationMode | Continuous | |
speechModelMapping | <string, EntityReference> | | An optional mapping of locales to speech model entities. If no model is given for a locale, the default base model is used. Keys must be locales contained in the candidate locales, values are entities for models of the respective locales. |
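The two constraints described above (the candidate count in continuous mode, and mapping keys drawn from the candidate locales) can be checked locally. This is a sketch with a hypothetical helper name; it counts only the entries of `candidateLocales` and assumes the main locale is already among them.

```python
def validate_language_identification(props: dict) -> list:
    """Return problems found in a languageIdentification object (empty if none)."""
    problems = []
    candidates = props.get("candidateLocales") or []
    mode = props.get("mode", "Continuous")  # Continuous is the documented default
    if mode == "Continuous" and not 2 <= len(candidates) <= 10:
        problems.append("continuous mode supports between 2 and 10 candidate locales")
    for locale in props.get("speechModelMapping") or {}:
        if locale not in candidates:
            problems.append(f"speechModelMapping key {locale!r} is not a candidate locale")
    return problems


# Values taken from the language identification sample request on this page.
sample_props = {
    "candidateLocales": ["fr-FR", "nl-NL", "el-GR"],
    "speechModelMapping": {
        "nl-NL": {
            "self": "https://westus.api.cognitive.microsoft.com/speechtotext/models/827712a5-f942-4997-91c3-7c6cde35600b?api-version=2024-11-15"
        }
    },
    "mode": "Single",
}
print(validate_language_identification(sample_props))
```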
ProfanityFilterMode
Value | Description |
---|---|
None | Disable profanity filtering. |
Removed | Remove profanity. |
Tags | Add "profanity" XML tags, i.e., wrap profanity in <Profanity> and </Profanity>. |
Masked | Mask the profanity with * except for the first letter, e.g., f*** |
PunctuationMode
Value | Description |
---|---|
None | No punctuation. |
Dictated | Dictated punctuation marks only, i.e., explicit punctuation. |
Automatic | Automatic punctuation. |
DictatedAndAutomatic | Dictated punctuation marks or automatic punctuation. |
Status
Value | Description |
---|---|
NotStarted | The long running operation has not yet started. |
Running | The long running operation is currently processing. |
Succeeded | The long running operation has successfully completed. |
Failed | The long running operation has failed. |
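Since a transcription is a long running operation, a client usually polls until the status is `Succeeded` or `Failed`. The loop below is a minimal sketch of that idea. `get_status` is a hypothetical callable supplied by the caller (for example, one that fetches the transcription and returns its `status` field); the polling interval and the poll limit are arbitrary values chosen for the example.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_completion(
    get_status: Callable[[], str],
    interval_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Call get_status until it returns a terminal status, then return that status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


# Simulated status sequence so the sketch runs without calling the service.
simulated = iter(["NotStarted", "Running", "Running", "Succeeded"])
print(wait_for_completion(lambda: next(simulated), interval_seconds=0))
```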
Transcription
Name | Type | Description |
---|---|---|
contentContainerUrl | string (uri) | A URL for an Azure blob container that contains the audio files. A container is allowed to have a maximum size of 5 GB and a maximum number of 10000 blobs. The maximum size for a blob is 2.5 GB. Container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response. |
contentUrls | string[] (uri) | A list of content URLs to get audio files to transcribe. Up to 1000 URLs are allowed. This property will not be returned in a response. |
createdDateTime | string (date-time) | The timestamp when the object was created. The timestamp is encoded as ISO 8601 date and time format ("YYYY-MM-DDThh:mm:ssZ", see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations). |
customProperties | object | The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters, and the count of allowed entries is 10. |
dataset | EntityReference | |
description | string | The description of the object. |
displayName | string minLength: 1 | The display name of the object. |
lastActionDateTime | string (date-time) | The timestamp when the current status was entered. The timestamp is encoded as ISO 8601 date and time format ("YYYY-MM-DDThh:mm:ssZ", see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations). |
links | TranscriptionLinks | |
locale | string minLength: 1 | The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected. |
model | EntityReference | |
properties | TranscriptionProperties | |
self | string (uri) | The location of this entity. |
status | Status | |
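The `createdDateTime` and `lastActionDateTime` values can be parsed with the standard library. This sketch assumes the exact "YYYY-MM-DDThh:mm:ssZ" shape given in the table (no fractional seconds) and uses the timestamps from the sample responses on this page; `parse_timestamp` is a name chosen for the example.

```python
from datetime import datetime, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse a 'YYYY-MM-DDThh:mm:ssZ' timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


created = parse_timestamp("2019-01-07T11:34:12Z")
last_action = parse_timestamp("2019-01-07T11:36:07Z")
# Time between creation and the most recent status change, in seconds.
print((last_action - created).total_seconds())  # 115.0
```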
TranscriptionLinks
Name | Type | Description |
---|---|---|
files | string (uri) | The location to get all files of this entity. See operation "Transcriptions_ListFiles" for more details. |
TranscriptionProperties
Name | Type | Default value | Description |
---|---|---|---|
channels | integer[] (int32) | | A collection of the requested channel numbers. In the default case, the channels 0 and 1 are considered. |
destinationContainerUrl | string (uri) | | The requested destination container. Remarks: Result files written to a destination container are not deleted when the time to live expires (see timeToLiveHours). To support automatic cleanup, either configure blob lifetimes on the container, or use "Bring your own Storage (BYOS)" instead of destinationContainerUrl. |
diarization | DiarizationProperties | | |
displayFormWordLevelTimestampsEnabled | boolean | | A value indicating whether word level timestamps for the display form are requested. The default value is false. |
durationMilliseconds | integer (int64) | 0 | The duration in milliseconds of the transcription. Durations larger than 2^53-1 are not supported to ensure compatibility with JavaScript integers. |
error | EntityError | | |
languageIdentification | LanguageIdentificationProperties | | |
profanityFilterMode | ProfanityFilterMode | | |
punctuationMode | PunctuationMode | | |
timeToLiveHours | integer (int32) | | How long the transcription will be kept in the system after it has completed. Once the transcription reaches the time to live after completion (successful or failed), it will be automatically deleted. Note: When using BYOS (bring your own storage), the result files on the customer-owned storage account will also be deleted. Use either destinationContainerUrl to specify a separate container for result files, which will not be deleted when the time to live expires, or retrieve the result files through the API and store them as needed. The shortest supported duration is 6 hours, the longest supported duration is 31 days. 2 days (48 hours) is the recommended default value when data is consumed directly. |
wordLevelTimestampsEnabled | boolean | | A value indicating whether word level timestamps are requested. The default value is false. |
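The numeric limits in this table translate into simple constants. The sketch below converts the 31-day upper bound for `timeToLiveHours` into hours and spells out the 2^53-1 ceiling for `durationMilliseconds`. The constant and function names are hypothetical, and treating both time-to-live bounds as inclusive is an assumption.

```python
MIN_TTL_HOURS = 6
MAX_TTL_HOURS = 31 * 24  # 31 days expressed in hours
RECOMMENDED_TTL_HOURS = 48  # 2 days, the recommended default
MAX_DURATION_MS = 2**53 - 1  # largest integer JavaScript represents exactly


def check_time_to_live_hours(hours: int) -> int:
    """Return hours unchanged if it is within the documented range, else raise ValueError."""
    # Assumption: both bounds are inclusive.
    if not MIN_TTL_HOURS <= hours <= MAX_TTL_HOURS:
        raise ValueError(
            f"timeToLiveHours must be between {MIN_TTL_HOURS} and {MAX_TTL_HOURS}, got {hours}"
        )
    return hours


print(check_time_to_live_hours(RECOMMENDED_TTL_HOURS), MAX_TTL_HOURS, MAX_DURATION_MS)
```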