Transcriptions - Create
Creates a new transcription.
POST {endpoint}/speechtotext/v3.2-preview.2/transcriptions
URI Parameters
Name | In | Required | Type | Description |
---|---|---|---|---|
endpoint
|
path | True |
string |
Supported Cognitive Services endpoints (protocol and hostname, for example: https://westus.api.cognitive.microsoft.com). |
Request Body
Name | Required | Type | Description |
---|---|---|---|
displayName | True |
string |
The display name of the object. |
locale | True |
string |
The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected. |
contentContainerUrl |
string |
A URL for an Azure blob container that contains the audio files. A container is allowed to have a maximum size of 5GB and a maximum number of 10000 blobs. The maximum size for a blob is 2.5GB. Container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response. |
|
contentUrls |
string[] |
A list of content urls to get audio files to transcribe. Up to 1000 urls are allowed. This property will not be returned in a response. |
|
customProperties |
object |
The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters and the count of allowed entries is 10. |
|
dataset |
EntityReference |
||
description |
string |
The description of the object. |
|
model |
EntityReference |
||
project |
EntityReference |
||
properties |
TranscriptionProperties |
Responses
Name | Type | Description |
---|---|---|
201 Created |
The response contains information about the entity as payload and its location as header. Headers Location: string |
|
Other Status Codes |
An error occurred. |
Security
Ocp-Apim-Subscription-Key
Provide your cognitive services account key here.
Type:
apiKey
In:
header
Authorization
Provide an access token from the JWT returned by the STS of this region. Make sure to add the management scope to the token by adding the following query string to the STS URL: ?scope=speechservicesmanagement
Type:
apiKey
In:
header
Examples
Create a transcription for URIs
Sample request
POST {endpoint}/speechtotext/v3.2-preview.2/transcriptions
{
"contentUrls": [
"https://contoso.com/mystoragelocation",
"https://contoso.com/myotherstoragelocation"
],
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked"
},
"locale": "en-US",
"displayName": "Transcription using default model for en-US"
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"duration": "PT42S"
},
"lastActionDateTime": "2019-01-07T11:36:07Z",
"status": "Succeeded",
"createdDateTime": "2019-01-07T11:34:12Z",
"locale": "en-US",
"displayName": "Transcription using adapted model en-US",
"customProperties": {
"key": "value"
}
}
Create a transcription from blob container
Sample request
POST {endpoint}/speechtotext/v3.2-preview.2/transcriptions
{
"contentContainerUrl": "https://customspeech-usw.blob.core.windows.net/artifacts/audiofiles/",
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked"
},
"locale": "en-US",
"displayName": "Transcription of storage container using default model for en-US"
}
Sample response
Location: https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"duration": "PT42S"
},
"lastActionDateTime": "2019-01-07T11:36:07Z",
"status": "Succeeded",
"createdDateTime": "2019-01-07T11:34:12Z",
"locale": "en-US",
"displayName": "Transcription using adapted model en-US",
"customProperties": {
"key": "value"
}
}
Create a transcription with basic two-speaker diarization
Sample request
POST {endpoint}/speechtotext/v3.2-preview.2/transcriptions
{
"contentUrls": [
"https://contoso.com/mystoragelocation"
],
"properties": {
"diarizationEnabled": true,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked"
},
"locale": "en-US",
"displayName": "Transcription using basic two-speaker diarization"
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files"
},
"properties": {
"diarizationEnabled": true,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"duration": "PT42S"
},
"lastActionDateTime": "2019-01-07T11:36:07Z",
"status": "Succeeded",
"createdDateTime": "2019-01-07T11:34:12Z",
"locale": "en-US",
"displayName": "Transcription using basic two-speaker diarization",
"customProperties": {
"key": "value"
}
}
Create a transcription with language identification
Sample request
POST {endpoint}/speechtotext/v3.2-preview.2/transcriptions
{
"contentUrls": [
"https://contoso.com/mystoragelocation"
],
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"languageIdentification": {
"mode": "Single",
"candidateLocales": [
"fr-FR",
"nl-NL",
"el-GR"
],
"speechModelMapping": {
"nl-NL": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
}
}
}
},
"locale": "fr-FR",
"displayName": "Transcription using language identification with three candidate languages, 'fr-FR' as fallback locale and a custom model for transcribing utterances that were classified as 'nl-NL' locale."
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files"
},
"properties": {
"diarizationEnabled": false,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"duration": "PT42S",
"languageIdentification": {
"mode": "Single",
"candidateLocales": [
"fr-FR",
"nl-NL",
"el-GR"
],
"speechModelMapping": {
"nl-NL": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
}
}
}
},
"lastActionDateTime": "2019-01-07T11:36:07Z",
"status": "Succeeded",
"createdDateTime": "2019-01-07T11:34:12Z",
"locale": "fr-FR",
"displayName": "Transcription using language identification with three candidate languages, 'fr-FR' as fallback locale and a custom model for transcribing utterances that were classified as 'nl-NL' locale.",
"customProperties": {
"key": "value"
}
}
Create a transcription with multispeaker diarization
Sample request
POST {endpoint}/speechtotext/v3.2-preview.2/transcriptions
{
"contentUrls": [
"https://contoso.com/mystoragelocation"
],
"properties": {
"diarizationEnabled": true,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"diarization": {
"speakers": {
"minCount": 3,
"maxCount": 5
}
}
},
"locale": "en-US",
"displayName": "Transcription using diarization for audio that is known to contain speech from 3-5 speakers"
}
Sample response
{
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683",
"model": {
"self": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/models/827712a5-f942-4997-91c3-7c6cde35600b"
},
"links": {
"files": "https://westus.api.cognitive.microsoft.com/speechtotext/v3.2-preview.2/transcriptions/ba7ea6f5-3065-40b7-b49a-a90f48584683/files"
},
"properties": {
"diarizationEnabled": true,
"wordLevelTimestampsEnabled": false,
"displayFormWordLevelTimestampsEnabled": false,
"channels": [
0,
1
],
"punctuationMode": "DictatedAndAutomatic",
"profanityFilterMode": "Masked",
"duration": "PT42S",
"diarization": {
"speakers": {
"minCount": 3,
"maxCount": 5
}
}
},
"lastActionDateTime": "2019-01-07T11:36:07Z",
"status": "Succeeded",
"createdDateTime": "2019-01-07T11:34:12Z",
"locale": "en-US",
"displayName": "Transcription using diarization for audio that is known to contain speech from 3-5 speakers",
"customProperties": {
"key": "value"
}
}
Definitions
Name | Description |
---|---|
Detailed |
DetailedErrorCode |
Diarization |
DiarizationProperties |
Diarization |
DiarizationSpeakersProperties |
Entity |
EntityError |
Entity |
EntityReference |
Error |
Error |
Error |
ErrorCode |
Inner |
InnerError |
Language |
LanguageIdentificationMode |
Language |
LanguageIdentificationProperties |
Profanity |
ProfanityFilterMode |
Punctuation |
PunctuationMode |
Status |
Status |
Transcription |
Transcription |
Transcription |
TranscriptionLinks |
Transcription |
TranscriptionProperties |
DetailedErrorCode
DetailedErrorCode
Name | Type | Description |
---|---|---|
DataImportFailed |
string |
Data import failed. |
DeleteNotAllowed |
string |
Delete not allowed. |
DeployNotAllowed |
string |
Deploy not allowed. |
DeployingFailedModel |
string |
Deploying failed model. |
EmptyRequest |
string |
Empty Request. |
EndpointCannotBeDefault |
string |
Endpoint cannot be default. |
EndpointNotUpdatable |
string |
Endpoint not updatable. |
EndpointWithoutLogging |
string |
Endpoint without logging. |
ExceededNumberOfRecordingsUris |
string |
Exceeded number of recordings uris. |
FailedDataset |
string |
Failed dataset. |
Forbidden |
string |
Forbidden. |
InUseViolation |
string |
In use violation. |
InaccessibleCustomerStorage |
string |
Inaccessible customer storage. |
InvalidAdaptationMapping |
string |
Invalid adaptation mapping. |
InvalidBaseModel |
string |
Invalid base model. |
InvalidCallbackUri |
string |
Invalid callback uri. |
InvalidCollection |
string |
Invalid collection. |
InvalidDataset |
string |
Invalid dataset. |
InvalidDocument |
string |
Invalid Document. |
InvalidDocumentBatch |
string |
Invalid Document Batch. |
InvalidLocale |
string |
Invalid locale. |
InvalidLogDate |
string |
Invalid log date. |
InvalidLogEndTime |
string |
Invalid log end time. |
InvalidLogId |
string |
Invalid log id. |
InvalidLogStartTime |
string |
Invalid log start time. |
InvalidModel |
string |
Invalid model. |
InvalidModelUri |
string |
Invalid model uri. |
InvalidParameter |
string |
Invalid parameter. |
InvalidParameterValue |
string |
Invalid parameter value. |
InvalidPayload |
string |
Invalid payload. |
InvalidPermissions |
string |
Invalid permissions. |
InvalidPrerequisite |
string |
Invalid prerequisite. |
InvalidProductId |
string |
Invalid product id. |
InvalidProject |
string |
Invalid project. |
InvalidProjectKind |
string |
Invalid project kind. |
InvalidRecordingsUri |
string |
Invalid recordings uri. |
InvalidRequestBodyFormat |
string |
Invalid request body format. |
InvalidSasValidityDuration |
string |
Invalid sas validity duration. |
InvalidSkipTokenForLogs |
string |
Invalid skip token for logs. |
InvalidSourceAzureResourceId |
string |
Invalid source Azure resource ID. |
InvalidSubscription |
string |
Invalid subscription. |
InvalidTest |
string |
Invalid test. |
InvalidTimeToLive |
string |
Invalid time to live. |
InvalidTopForLogs |
string |
Invalid top for logs. |
InvalidTranscription |
string |
Invalid transcription. |
InvalidWebHookEventKind |
string |
Invalid web hook event kind. |
MissingInputRecords |
string |
Missing Input Records. |
ModelCopyOperationExists |
string |
Model copy operation exists. |
ModelDeploymentNotCompleteState |
string |
Model deployment not complete state. |
ModelDeprecated |
string |
Model deprecated. |
ModelExists |
string |
Model exists. |
ModelMismatch |
string |
Model mismatch. |
ModelNotDeployable |
string |
Model not deployable. |
ModelVersionIncorrect |
string |
Model Version Incorrect. |
NoUtf8WithBom |
string |
No utf8 with bom. |
OnlyOneOfUrlsOrContainerOrDataset |
string |
Only one of urls or container or dataset. |
ProjectGenderMismatch |
string |
Project gender mismatch. |
QuotaViolation |
string |
Quota violation. |
SingleDefaultEndpoint |
string |
Single default endpoint. |
SkuLimitsExist |
string |
Sku limits exist. |
SubscriptionNotFound |
string |
Subscription not found. |
UnexpectedError |
string |
Unexpected error. |
UnsupportedClassBasedAdaptation |
string |
Unsupported class based adaptation. |
UnsupportedDelta |
string |
Unsupported delta. |
UnsupportedDynamicConfiguration |
string |
Unsupported dynamic configuration. |
UnsupportedFilter |
string |
Unsupported filter. |
UnsupportedLanguageCode |
string |
Unsupported language code. |
UnsupportedOrderBy |
string |
Unsupported order by. |
UnsupportedPagination |
string |
Unsupported pagination. |
UnsupportedTimeRange |
string |
Unsupported time range. |
DiarizationProperties
DiarizationProperties
Name | Type | Description |
---|---|---|
speakers |
DiarizationSpeakersProperties |
DiarizationSpeakersProperties
DiarizationSpeakersProperties
Name | Type | Description |
---|---|---|
maxCount |
integer |
The maximum number of speakers for diarization. Must be less than 36 and larger than or equal to the minSpeakers property. |
minCount |
integer |
A hint for the minimum number of speakers for diarization. Must be smaller than or equal to the maxSpeakers property. |
EntityError
EntityError
Name | Type | Description |
---|---|---|
code |
string |
The code of this error. |
message |
string |
The message for this error. |
EntityReference
EntityReference
Name | Type | Description |
---|---|---|
self |
string |
The location of the referenced entity. |
Error
Error
Name | Type | Description |
---|---|---|
code |
ErrorCode |
|
details |
Error[] |
Additional supportive details regarding the error and/or expected policies. |
innerError |
InnerError |
|
message |
string |
High level error message. |
target |
string |
The source of the error. For example it would be "documents" or "document id" in case of invalid document. |
ErrorCode
ErrorCode
Name | Type | Description |
---|---|---|
Conflict |
string |
Representing the conflict error code. |
Forbidden |
string |
Representing the forbidden error code. |
InternalCommunicationFailed |
string |
Representing the internal communication failed error code. |
InternalServerError |
string |
Representing the internal server error error code. |
InvalidArgument |
string |
Representing the invalid argument error code. |
InvalidRequest |
string |
Representing the invalid request error code. |
NotAllowed |
string |
Representing the not allowed error code. |
NotFound |
string |
Representing the not found error code. |
PipelineError |
string |
Representing the pipeline error error code. |
ServiceUnavailable |
string |
Representing the service unavailable error code. |
TooManyRequests |
string |
Representing the too many requests error code. |
Unauthorized |
string |
Representing the unauthorized error code. |
UnprocessableEntity |
string |
Representing the unprocessable entity error code. |
UnsupportedMediaType |
string |
Representing the unsupported media type error code. |
InnerError
InnerError
Name | Type | Description |
---|---|---|
code |
DetailedErrorCode |
|
details |
object |
Additional supportive details regarding the error and/or expected policies. |
innerError |
InnerError |
|
message |
string |
High level error message. |
target |
string |
The source of the error. For example it would be "documents" or "document id" in case of invalid document. |
LanguageIdentificationMode
LanguageIdentificationMode
Name | Type | Description |
---|---|---|
Continuous |
string |
Continuous language identification (Default). |
Single |
string |
Single language identification. |
LanguageIdentificationProperties
LanguageIdentificationProperties
Name | Type | Default value | Description |
---|---|---|---|
candidateLocales |
string[] |
The candidate locales for language identification (example ["en-US", "de-DE", "es-ES"]). A minimum of 2 and a maximum of 10 candidate locales, including the main locale for the transcription, is supported. |
|
mode | Continuous |
LanguageIdentificationMode |
|
speechModelMapping |
<string,
Entity |
An optional mapping of locales to speech model entities. If no model is given for a locale, the default base model is used. Keys must be locales contained in the candidate locales, values are entities for models of the respective locales. |
ProfanityFilterMode
ProfanityFilterMode
Name | Type | Description |
---|---|---|
Masked |
string |
Mask the profanity with * except of the first letter, e.g., f*** |
None |
string |
Disable profanity filtering. |
Removed |
string |
Remove profanity. |
Tags |
string |
Add "profanity" XML tags</Profanity> |
PunctuationMode
PunctuationMode
Name | Type | Description |
---|---|---|
Automatic |
string |
Automatic punctuation. |
Dictated |
string |
Dictated punctuation marks only, i.e., explicit punctuation. |
DictatedAndAutomatic |
string |
Dictated punctuation marks or automatic punctuation. |
None |
string |
No punctuation. |
Status
Status
Name | Type | Description |
---|---|---|
Failed |
string |
The long running operation has failed. |
NotStarted |
string |
The long running operation has not yet started. |
Running |
string |
The long running operation is currently processing. |
Succeeded |
string |
The long running operation has successfully completed. |
Transcription
Transcription
Name | Type | Description |
---|---|---|
contentContainerUrl |
string |
A URL for an Azure blob container that contains the audio files. A container is allowed to have a maximum size of 5GB and a maximum number of 10000 blobs. The maximum size for a blob is 2.5GB. Container SAS should contain 'r' (read) and 'l' (list) permissions. This property will not be returned in a response. |
contentUrls |
string[] |
A list of content urls to get audio files to transcribe. Up to 1000 urls are allowed. This property will not be returned in a response. |
createdDateTime |
string |
The time-stamp when the object was created. The time stamp is encoded as ISO 8601 date and time format ("YYYY-MM-DDThh:mm:ssZ", see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations). |
customProperties |
object |
The custom properties of this entity. The maximum allowed key length is 64 characters, the maximum allowed value length is 256 characters and the count of allowed entries is 10. |
dataset |
EntityReference |
|
description |
string |
The description of the object. |
displayName |
string |
The display name of the object. |
lastActionDateTime |
string |
The time-stamp when the current status was entered. The time stamp is encoded as ISO 8601 date and time format ("YYYY-MM-DDThh:mm:ssZ", see https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations). |
links |
TranscriptionLinks |
|
locale |
string |
The locale of the contained data. If Language Identification is used, this locale is used to transcribe speech for which no language could be detected. |
model |
EntityReference |
|
project |
EntityReference |
|
properties |
TranscriptionProperties |
|
self |
string |
The location of this entity. |
status |
Status |
TranscriptionLinks
TranscriptionLinks
Name | Type | Description |
---|---|---|
files |
string |
The location to get all files of this entity. See operation "Transcriptions_ListFiles" for more details. |
TranscriptionProperties
TranscriptionProperties
Name | Type | Description |
---|---|---|
channels |
integer[] |
A collection of the requested channel numbers. In the default case, the channels 0 and 1 are considered. |
destinationContainerUrl |
string |
The requested destination container. RemarksWhen a destination container is used in combination with a |
diarization |
DiarizationProperties |
|
diarizationEnabled |
boolean |
A value indicating whether diarization (speaker identification) is requested. The default value
is The basic diarization system is deprecated and will be removed in the next major version of the API.
This |
displayFormWordLevelTimestampsEnabled |
boolean |
A value indicating whether word level timestamps for the display form are requested. The default value is |
duration |
string |
The duration of the transcription. The duration is encoded as ISO 8601 duration ("PnYnMnDTnHnMnS", see https://en.wikipedia.org/wiki/ISO_8601#Durations). |
string |
The email address to send email notifications to in case the operation completes. The value will be removed after successfully sending the email. |
|
error |
EntityError |
|
languageIdentification |
LanguageIdentificationProperties |
|
profanityFilterMode |
ProfanityFilterMode |
|
punctuationMode |
PunctuationMode |
|
timeToLive |
string |
How long the transcription will be kept in the system after it has completed. Once the transcription reaches the time to live after completion (successful or failed) it will be automatically deleted. Not setting this value or setting it to 0 will disable automatic deletion. The longest supported duration is 31 days. The duration is encoded as ISO 8601 duration ("PnYnMnDTnHnMnS", see https://en.wikipedia.org/wiki/ISO_8601#Durations). |
wordLevelTimestampsEnabled |
boolean |
A value indicating whether word level timestamps are requested. The default value is
|