Transcriptions - Transcribe
Transcribes the provided audio stream.
POST {endpoint}/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview
URI Parameters
Name | In | Required | Type | Description |
---|---|---|---|---|
audio | formData | True | file binary | The audio as a stream of bytes. |
definition | formData | True | string | Metadata for a fast transcription request. This field contains a JSON-serialized object of type |
endpoint | path | True | string | Supported Cognitive Services endpoints (protocol and hostname, for example: https://westus.api.cognitive.microsoft.com). |
api-version | query | True | string | The requested API version. |
Responses
Name | Type | Description |
---|---|---|
200 OK | TranscribeResult | OK |
Security
Ocp-Apim-Subscription-Key
Provide your Cognitive Services account key here.
Type: apiKey
In: header
Authorization
Provide an access token from the JWT returned by the STS of this region. Make sure to add the management scope to the token by adding the following query string to the STS URL: ?scope=speechservicesmanagement
Type: apiKey
In: header
Examples
Transcribe an audio file
Sample request
POST {endpoint}/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview
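The sample request above shows only the request line. As a rough sketch of how the two form fields and the key header might be assembled, the following Python builds (but does not send) a multipart/form-data request using only the standard library. The helper name `build_transcribe_request`, the file name `audio.wav`, and the `locales` field inside the definition are assumptions made for illustration; this section does not list the fields of the definition object.

```python
import json
import urllib.request
import uuid

API_VERSION = "2024-05-15-preview"


def build_transcribe_request(endpoint, key, audio_bytes, definition,
                             filename="audio.wav"):
    """Build, without sending, a multipart POST for the transcribe operation.

    `build_transcribe_request` and `filename` are hypothetical names used
    only for this sketch.
    """
    url = (f"{endpoint.rstrip('/')}/speechtotext/transcriptions:transcribe"
           f"?api-version={API_VERSION}")
    boundary = uuid.uuid4().hex
    body = b"".join([
        # 'definition' form field: a JSON-serialized string
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="definition"\r\n\r\n',
        json.dumps(definition).encode("utf-8"),
        b"\r\n",
        # 'audio' form field: the raw audio bytes
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="audio"; '
        f'filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes,
        b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")


# Assumption: the definition carries a list of candidate locales.
request = build_transcribe_request(
    "https://westus.api.cognitive.microsoft.com",
    "<your-key>",
    b"",  # placeholder; real audio bytes would go here
    {"locales": ["en-US"]},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access or a valid key.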
Sample response
{
  "duration": 2000,
  "combinedPhrases": [
    {
      "text": "Weather"
    }
  ],
  "phrases": [
    {
      "offset": 40,
      "duration": 240,
      "text": "Weather",
      "words": [
        {
          "text": "Weather",
          "offset": 40,
          "duration": 240
        }
      ],
      "locale": "en-US",
      "confidence": 0.7881154
    }
  ]
}
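To make the millisecond fields in the sample response concrete, here is a small Python sketch that loads the same JSON and prints one line per phrase with its start and end time. The helper names `format_ms` and `summarize` are illustrative and not part of the API.

```python
import json

# The sample response from above, embedded so the sketch is self-contained.
SAMPLE = """
{
  "duration": 2000,
  "combinedPhrases": [{"text": "Weather"}],
  "phrases": [
    {
      "offset": 40,
      "duration": 240,
      "text": "Weather",
      "words": [{"text": "Weather", "offset": 40, "duration": 240}],
      "locale": "en-US",
      "confidence": 0.7881154
    }
  ]
}
"""


def format_ms(ms):
    """Render a millisecond count as MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def summarize(result):
    """Return one formatted line per phrase: time range, text, confidence."""
    lines = []
    for phrase in result.get("phrases", []):
        start = phrase["offset"]
        end = start + phrase["duration"]  # end = start offset + duration
        lines.append(
            f"[{format_ms(start)} - {format_ms(end)}] {phrase['text']} "
            f"(confidence {phrase['confidence']:.2f})"
        )
    return lines


for line in summarize(json.loads(SAMPLE)):
    print(line)
# prints: [00:00.040 - 00:00.280] Weather (confidence 0.79)
```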
Definitions
Name | Description |
---|---|
CombinedPhrases | |
Phrase | A transcribed phrase. |
TranscribeResult | The result of the transcribe operation. |
Word | Time-stamped word in the display form. |
CombinedPhrases
Name | Type | Description |
---|---|---|
channel | integer | The 0-based channel index. Only present if channel separation is enabled. |
text | string | The complete transcribed text for the channel. |
Phrase
A transcribed phrase.
Name | Type | Description |
---|---|---|
channel | integer | The 0-based channel index. Only present if channel separation is enabled. |
confidence | number | The confidence value for the phrase. |
duration | integer | The duration of the phrase in milliseconds. |
locale | string | The locale of the phrase. |
offset | integer | The start offset of the phrase in milliseconds. |
speaker | integer | The speaker number. Only present if speaker diarization is enabled. |
text | string | The transcribed text of the phrase. |
words | Word[] | The words that make up the phrase. Only present if word-level timestamps are enabled. |
TranscribeResult
The result of the transcribe operation.
Name | Type | Description |
---|---|---|
combinedPhrases | CombinedPhrases[] | The combined transcription results for each channel. |
duration | integer | The duration of the audio in milliseconds. |
phrases | Phrase[] | The transcription results segmented into phrases. |
Word
Time-stamped word in the display form.
Name | Type | Description |
---|---|---|
duration | integer | The duration of the word in milliseconds. |
offset | integer | The start offset of the word in milliseconds. |
text | string | The recognized word, including punctuation. |
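The definitions above could be mirrored on the client side roughly as follows. This is a sketch, not an official SDK model: the class layout, the snake_case attribute `combined_phrases`, and the `parse_result` helper are choices made for this example. Fields described as "only present if" a feature is enabled are modeled as optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    offset: int    # milliseconds
    duration: int  # milliseconds


@dataclass
class Phrase:
    text: str
    offset: int    # milliseconds
    duration: int  # milliseconds
    locale: str
    confidence: float
    channel: Optional[int] = None   # only with channel separation
    speaker: Optional[int] = None   # only with speaker diarization
    words: List[Word] = field(default_factory=list)  # only with word timestamps


@dataclass
class CombinedPhrases:
    text: str
    channel: Optional[int] = None   # only with channel separation


@dataclass
class TranscribeResult:
    duration: int  # milliseconds
    combined_phrases: List[CombinedPhrases]
    phrases: List[Phrase]


def parse_result(data: dict) -> TranscribeResult:
    """Convert a decoded JSON response into the dataclasses above."""
    phrases = [
        Phrase(
            text=p["text"],
            offset=p["offset"],
            duration=p["duration"],
            locale=p["locale"],
            confidence=p["confidence"],
            channel=p.get("channel"),
            speaker=p.get("speaker"),
            words=[Word(w["text"], w["offset"], w["duration"])
                   for w in p.get("words", [])],
        )
        for p in data.get("phrases", [])
    ]
    combined = [
        CombinedPhrases(text=c["text"], channel=c.get("channel"))
        for c in data.get("combinedPhrases", [])
    ]
    return TranscribeResult(data["duration"], combined, phrases)
```

Using `dict.get` for `channel`, `speaker`, and `words` lets the same parser handle responses with or without those optional features.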