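The fast transcription API transcribes audio files and returns the results synchronously, faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency.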
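Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following example shows how to transcribe an audio file with a specified locale. If you know the locale of the audio file, specify it to improve transcription accuracy and minimize latency.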
- Replace YourSpeechResoureKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
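Important
For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.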
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
"locales":["en-US"]}"'
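Construct the form definition according to the following instructions:
- Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. For more information about the supported locales, see the speech to text supported languages.
For more information about locales and other properties of the fast transcription API, see the request configuration options section later in this guide.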
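The request parts used by the curl command above can also be sketched in Python. This is a minimal illustration under stated assumptions, not an official client: the helper names build_url, build_headers, and build_definition are hypothetical, and only the URL pattern, the header names, and the locales property come from this guide. The sketch builds the parts without sending a request.

```python
import json
from typing import Optional, Sequence


def build_url(region: str, api_version: str = "2025-10-15") -> str:
    """Return the transcribe endpoint URL in the form used by the curl example."""
    return (
        f"https://{region}.api.cognitive.microsoft.com"
        f"/speechtotext/transcriptions:transcribe?api-version={api_version}"
    )


def build_headers(key: Optional[str] = None, token: Optional[str] = None) -> dict:
    """Return one authentication header: resource key or Microsoft Entra ID token."""
    if (key is None) == (token is None):
        raise ValueError("provide exactly one of key or token")
    if token is not None:
        return {"Authorization": f"Bearer {token}"}
    return {"Ocp-Apim-Subscription-Key": key}


def build_definition(locales: Optional[Sequence[str]] = None) -> str:
    """Serialize the `definition` form field; `locales` is optional."""
    definition = {}
    if locales is not None:
        definition["locales"] = list(locales)
    return json.dumps(definition)


# Placeholders are the same ones used in the curl command (hypothetical values).
print(build_url("YourServiceRegion"))
print(build_headers(key="YourSpeechResoureKey"))
print(build_definition(["en-US"]))
```

The Content-Type header is left out on purpose: HTTP client libraries typically generate the multipart/form-data header, including its boundary, when they encode the form fields.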
The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.
{
"durationMilliseconds": 182439,
"combinedPhrases": [
{
"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And you're currently paying those loans off monthly, is that right? 
Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
}
],
"phrases": [
{
"offsetMilliseconds": 960,
"durationMilliseconds": 640,
"text": "Good afternoon.",
"words": [
{
"text": "Good",
"offsetMilliseconds": 960,
"durationMilliseconds": 240
},
{
"text": "afternoon.",
"offsetMilliseconds": 1200,
"durationMilliseconds": 400
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 1600,
"durationMilliseconds": 640,
"text": "This is Sam.",
"words": [
{
"text": "This",
"offsetMilliseconds": 1600,
"durationMilliseconds": 240
},
{
"text": "is",
"offsetMilliseconds": 1840,
"durationMilliseconds": 120
},
{
"text": "Sam.",
"offsetMilliseconds": 1960,
"durationMilliseconds": 280
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 2240,
"durationMilliseconds": 1040,
"text": "Thank you for calling Contoso.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 2240,
"durationMilliseconds": 200
},
{
"text": "you",
"offsetMilliseconds": 2440,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 2520,
"durationMilliseconds": 120
},
{
"text": "calling",
"offsetMilliseconds": 2640,
"durationMilliseconds": 200
},
{
"text": "Contoso.",
"offsetMilliseconds": 2840,
"durationMilliseconds": 440
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 3280,
"durationMilliseconds": 640,
"text": "How can I help?",
"words": [
{
"text": "How",
"offsetMilliseconds": 3280,
"durationMilliseconds": 120
},
{
"text": "can",
"offsetMilliseconds": 3440,
"durationMilliseconds": 120
},
{
"text": "I",
"offsetMilliseconds": 3560,
"durationMilliseconds": 40
},
{
"text": "help?",
"offsetMilliseconds": 3600,
"durationMilliseconds": 320
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 5040,
"durationMilliseconds": 400,
"text": "Hi there.",
"words": [
{
"text": "Hi",
"offsetMilliseconds": 5040,
"durationMilliseconds": 240
},
{
"text": "there.",
"offsetMilliseconds": 5280,
"durationMilliseconds": 160
}
],
"locale": "en-US",
"confidence": 0.93554276
},
{
"offsetMilliseconds": 5440,
"durationMilliseconds": 800,
"text": "My name is Mary.",
"words": [
{
"text": "My",
"offsetMilliseconds": 5440,
"durationMilliseconds": 80
},
{
"text": "name",
"offsetMilliseconds": 5520,
"durationMilliseconds": 120
},
{
"text": "is",
"offsetMilliseconds": 5640,
"durationMilliseconds": 80
},
{
"text": "Mary.",
"offsetMilliseconds": 5720,
"durationMilliseconds": 520
}
],
"locale": "en-US",
"confidence": 0.93554276
},
// More transcription results...
// Redacted for brevity
{
"offsetMilliseconds": 180320,
"durationMilliseconds": 680,
"text": "Thank you for your help.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 180320,
"durationMilliseconds": 160
},
{
"text": "you",
"offsetMilliseconds": 180480,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 180560,
"durationMilliseconds": 120
},
{
"text": "your",
"offsetMilliseconds": 180680,
"durationMilliseconds": 120
},
{
"text": "help.",
"offsetMilliseconds": 180800,
"durationMilliseconds": 200
}
],
"locale": "en-US",
"confidence": 0.92022026
},
{
"offsetMilliseconds": 181960,
"durationMilliseconds": 280,
"text": "Thank you.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 181960,
"durationMilliseconds": 200
},
{
"text": "you.",
"offsetMilliseconds": 182160,
"durationMilliseconds": 80
}
],
"locale": "en-US",
"confidence": 0.92022026
}
]
}
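To make the response fields concrete, the following Python sketch reads a trimmed copy of the response above and prints the full transcript plus one timed line per phrase. The helper names format_offset, full_transcript, and phrase_lines are hypothetical, and the HH:MM:SS.mmm layout is an illustrative choice rather than a format defined by the API.

```python
def format_offset(ms: int) -> str:
    """Convert an offset in milliseconds to an HH:MM:SS.mmm string."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def full_transcript(response: dict) -> str:
    """Join the text of every combinedPhrases entry."""
    return " ".join(p["text"] for p in response.get("combinedPhrases", []))


def phrase_lines(response: dict) -> list:
    """Return one 'start --> end  text' line per phrase."""
    lines = []
    for phrase in response.get("phrases", []):
        start = phrase["offsetMilliseconds"]
        end = start + phrase["durationMilliseconds"]
        lines.append(
            f"{format_offset(start)} --> {format_offset(end)}  {phrase['text']}"
        )
    return lines


# Trimmed copy of the sample response (two phrases only, words omitted).
sample = {
    "durationMilliseconds": 182439,
    "combinedPhrases": [{"text": "Good afternoon. This is Sam."}],
    "phrases": [
        {"offsetMilliseconds": 960, "durationMilliseconds": 640, "text": "Good afternoon."},
        {"offsetMilliseconds": 1600, "durationMilliseconds": 640, "text": "This is Sam."},
    ],
}

print(full_transcript(sample))
for line in phrase_lines(sample):
    print(line)
```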
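Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following example shows how to transcribe an audio file with language identification. If you're not sure about the locale, you can specify multiple locales. If you don't specify any locale, or if the locales that you specify aren't in the audio file, the Speech service tries to identify the locale.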
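Note
Language identification in fast transcription is designed to identify one main language locale per audio file. If you need to transcribe multilingual content in the audio, consider multilingual transcription (preview).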
- Replace YourSpeechResoureKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
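Important
For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.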
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
"locales":["en-US","ja-JP"]}"'
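Construct the form definition according to the following instructions:
- Set the optional (but recommended) locales property to match the expected locales of the audio data to transcribe. In this example, the locales are set to en-US and ja-JP. The locales that you can specify are within the scope of all supported languages.
For more information about locales and other properties of the fast transcription API, see the request configuration options section later in this guide.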
The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.
{
"durationMilliseconds": 185079,
"combinedPhrases": [
{
"text": "Hello, thank you for calling Contoso. Who am I speaking with today? Hi, my name is Mary Rondo. I'm trying to enroll myself with Contoso. Hi, Mary. Are you calling because you need health insurance? Yes. Yeah, I'm calling to sign up for insurance. Great. Uh If you can answer a few questions, we can get you signed up in a Jiffy. Okay. So what's your full name? uh So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. Got it. And what's the best callback number in case we get disconnected? I only have a cell phone, so I can give you that. Yep, that'll be fine. Sure. So it's 234-554 and then 9312. Got it. So to confirm, it's 234-554-9312. Yep, that's right. Excellent. Let's get some additional information for your application. Do you have a job? Uh Yes, I am self-employed. Okay, so then you have a social security number as well? Uh Yes, I do. Okay, and what is your social security number, please? Uh Sure, so it's 412-253-4931. 6789. Sorry, was that a 25 or a 225? You cut out for a bit. It's double two, so 412, then another two, then five. Thank you so much. And could I have your e-mail address, please? Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. That sounds good. Thank you. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Actually, so I have one more question. Yes, of course. I'm curious, will I be getting a physical card as proof of coverage? So the default is a digital membership card, but we can send you a physical card if you prefer. Uh Yes. 
Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? Uh Yeah. uh So it's 2660 Unit A on Maple Avenue, Southeast Lansing, and then zip code is 48823. Absolutely. I've made a note on your file. Awesome. Thanks so much. You're very welcome. Thank you for calling Contoso and have a great day."
}
],
"phrases": [
{
"offsetMilliseconds": 720,
"durationMilliseconds": 1600,
"text": "Hello, thank you for calling Contoso.",
"words": [
{
"text": "Hello,",
"offsetMilliseconds": 720,
"durationMilliseconds": 480
},
{
"text": "thank",
"offsetMilliseconds": 1200,
"durationMilliseconds": 200
},
{
"text": "you",
"offsetMilliseconds": 1400,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 1480,
"durationMilliseconds": 120
},
{
"text": "calling",
"offsetMilliseconds": 1600,
"durationMilliseconds": 240
},
{
"text": "Contoso.",
"offsetMilliseconds": 1840,
"durationMilliseconds": 480
}
],
"locale": "en-US",
"confidence": 0.93265927
},
{
"offsetMilliseconds": 2320,
"durationMilliseconds": 1120,
"text": "Who am I speaking with today?",
"words": [
{
"text": "Who",
"offsetMilliseconds": 2320,
"durationMilliseconds": 160
},
{
"text": "am",
"offsetMilliseconds": 2480,
"durationMilliseconds": 80
},
{
"text": "I",
"offsetMilliseconds": 2560,
"durationMilliseconds": 80
},
{
"text": "speaking",
"offsetMilliseconds": 2640,
"durationMilliseconds": 320
},
{
"text": "with",
"offsetMilliseconds": 2960,
"durationMilliseconds": 160
},
{
"text": "today?",
"offsetMilliseconds": 3120,
"durationMilliseconds": 320
}
],
"locale": "en-US",
"confidence": 0.93265927
},
{
"offsetMilliseconds": 4480,
"durationMilliseconds": 1600,
"text": "Hi, my name is Mary Rondo.",
"words": [
{
"text": "Hi,",
"offsetMilliseconds": 4480,
"durationMilliseconds": 400
},
{
"text": "my",
"offsetMilliseconds": 4880,
"durationMilliseconds": 120
},
{
"text": "name",
"offsetMilliseconds": 5000,
"durationMilliseconds": 120
},
{
"text": "is",
"offsetMilliseconds": 5120,
"durationMilliseconds": 160
},
{
"text": "Mary",
"offsetMilliseconds": 5280,
"durationMilliseconds": 240
},
{
"text": "Rondo.",
"offsetMilliseconds": 5520,
"durationMilliseconds": 560
}
],
"locale": "en-US",
"confidence": 0.93265927
},
{
"offsetMilliseconds": 6120,
"durationMilliseconds": 1800,
"text": "I'm trying to enroll myself with Contoso.",
"words": [
{
"text": "I'm",
"offsetMilliseconds": 6120,
"durationMilliseconds": 120
},
{
"text": "trying",
"offsetMilliseconds": 6240,
"durationMilliseconds": 200
},
{
"text": "to",
"offsetMilliseconds": 6440,
"durationMilliseconds": 80
},
{
"text": "enroll",
"offsetMilliseconds": 6520,
"durationMilliseconds": 200
},
{
"text": "myself",
"offsetMilliseconds": 6720,
"durationMilliseconds": 360
},
{
"text": "with",
"offsetMilliseconds": 7080,
"durationMilliseconds": 120
},
{
"text": "Contoso.",
"offsetMilliseconds": 7200,
"durationMilliseconds": 720
}
],
"locale": "en-US",
"confidence": 0.93265927
},
// More transcription results...
// Redacted for brevity
{
"offsetMilliseconds": 181520,
"durationMilliseconds": 720,
"text": "You're very welcome.",
"words": [
{
"text": "You're",
"offsetMilliseconds": 181520,
"durationMilliseconds": 160
},
{
"text": "very",
"offsetMilliseconds": 181680,
"durationMilliseconds": 200
},
{
"text": "welcome.",
"offsetMilliseconds": 181880,
"durationMilliseconds": 360
}
],
"locale": "en-US",
"confidence": 0.90571773
},
{
"offsetMilliseconds": 182320,
"durationMilliseconds": 1840,
"text": "Thank you for calling Contoso and have a great day.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 182320,
"durationMilliseconds": 200
},
{
"text": "you",
"offsetMilliseconds": 182520,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 182600,
"durationMilliseconds": 120
},
{
"text": "calling",
"offsetMilliseconds": 182720,
"durationMilliseconds": 280
},
{
"text": "Contoso",
"offsetMilliseconds": 183000,
"durationMilliseconds": 520
},
{
"text": "and",
"offsetMilliseconds": 183520,
"durationMilliseconds": 160
},
{
"text": "have",
"offsetMilliseconds": 183680,
"durationMilliseconds": 120
},
{
"text": "a",
"offsetMilliseconds": 183800,
"durationMilliseconds": 40
},
{
"text": "great",
"offsetMilliseconds": 183840,
"durationMilliseconds": 200
},
{
"text": "day.",
"offsetMilliseconds": 184040,
"durationMilliseconds": 120
}
],
"locale": "en-US",
"confidence": 0.90571773
}
]
}
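When several candidate locales are sent, a caller usually wants to know which one the service settled on. The sketch below reads the locale field of each phrase and returns the locale that covers the most speech time. The helper name dominant_locale is hypothetical. Because language identification is described above as finding one main locale per file, the phrases are expected to share a single locale, so the tally is a defensive assumption and not documented behavior.

```python
from collections import Counter


def dominant_locale(response: dict):
    """Return the locale covering the most speech time, or None if no phrases."""
    totals = Counter()
    for phrase in response.get("phrases", []):
        totals[phrase["locale"]] += phrase["durationMilliseconds"]
    if not totals:
        return None
    return totals.most_common(1)[0][0]


# Trimmed copy of the sample response (words omitted).
sample = {
    "phrases": [
        {"durationMilliseconds": 1600, "locale": "en-US",
         "text": "Hello, thank you for calling Contoso."},
        {"durationMilliseconds": 1120, "locale": "en-US",
         "text": "Who am I speaking with today?"},
    ]
}

print(dominant_locale(sample))
```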
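Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following example shows how to transcribe an audio file with the latest multilingual speech transcription model. If your audio contains multilingual content that you want to transcribe continuously and accurately, you can use the latest multilingual speech transcription model without specifying locale codes.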
- Replace YourSpeechResoureKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
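Important
For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.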
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
"locales":[]}"'
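Construct the form definition according to the following instructions:
- Leave the locales property empty (as shown in the preceding example) or omit it.
- The multilingual model currently supports the following audio input locales: de-DE, en-AU, en-CA, en-GB, en-IN, en-US, es-ES, es-MX, fr-CA, fr-FR, it-IT, ja-JP, ko-KR, and zh-CN.
- Transcription results are distinguished at the language level and follow the "major locale of this language". For example, the output uses the "en-US" locale code even if the audio has a British or Indian English accent.
For more information about locales and other properties of the fast transcription API, see the request configuration options section later in this guide.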
The response includes durationMilliseconds, offsetMilliseconds, and more. The combinedPhrases property contains the full transcription for all speakers.
{
"durationMilliseconds": 57187,
"combinedPhrases": [
{
"text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products 现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。 Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut. Le modèle de base fonctionne très bien dans la plupart des scénarios de reconnaissance vocale. A custom model can be used to augment the base model to improve recognition of domain specific vocabulary specified to the application by providing text data to train the model. It can also be used to improve recognition based for the specific audio conditions of the application by providing audio data with reference transcriptions."
}
],
"phrases": [
{
"offsetMilliseconds": 80,
"durationMilliseconds": 6960,
"text": "With custom speech,you can evaluate and improve the microsoft speech to text accuracy for your applications and products.",
"words": [
{
"text": "with",
"offsetMilliseconds": 80,
"durationMilliseconds": 160
},
{
"text": "custom",
"offsetMilliseconds": 240,
"durationMilliseconds": 480
},
{
"text": "speech",
"offsetMilliseconds": 720,
"durationMilliseconds": 360
},
{
"text": ",",
"offsetMilliseconds": 1080,
"durationMilliseconds": 10
},
{
"text": "you",
"offsetMilliseconds": 1200,
"durationMilliseconds": 240
},
{
"text": "can",
"offsetMilliseconds": 1440,
"durationMilliseconds": 160
},
{
"text": "evaluate",
"offsetMilliseconds": 1600,
"durationMilliseconds": 640
},
{
"text": "and",
"offsetMilliseconds": 2240,
"durationMilliseconds": 200
},
{
"text": "improve",
"offsetMilliseconds": 2440,
"durationMilliseconds": 280
},
{
"text": "the",
"offsetMilliseconds": 2720,
"durationMilliseconds": 160
},
{
"text": "microsoft",
"offsetMilliseconds": 2880,
"durationMilliseconds": 640
},
{
"text": "speech",
"offsetMilliseconds": 3520,
"durationMilliseconds": 320
},
{
"text": "to",
"offsetMilliseconds": 3840,
"durationMilliseconds": 200
},
{
"text": "text",
"offsetMilliseconds": 4040,
"durationMilliseconds": 360
},
{
"text": "accuracy",
"offsetMilliseconds": 4400,
"durationMilliseconds": 560
},
{
"text": "for",
"offsetMilliseconds": 4960,
"durationMilliseconds": 160
},
{
"text": "your",
"offsetMilliseconds": 5120,
"durationMilliseconds": 200
},
{
"text": "applications",
"offsetMilliseconds": 5320,
"durationMilliseconds": 760
},
{
"text": "and",
"offsetMilliseconds": 6080,
"durationMilliseconds": 200
},
{
"text": "products",
"offsetMilliseconds": 6280,
"durationMilliseconds": 680
}
],
"locale": "en-us",
"confidence": 0.9539559
},
{
"offsetMilliseconds": 8000,
"durationMilliseconds": 8600,
"text": "现成的语音转文本,利用通用语言模型作为一个基本模型,使用microsoft自有数据进行训练,并反映常用的口语。此基础模型使用那些代表各常见领域的方言和发音进行了预先训练。",
"words": [
{
"text": "现",
"offsetMilliseconds": 8000,
"durationMilliseconds": 40
},
{
"text": "成",
"offsetMilliseconds": 8040,
"durationMilliseconds": 40
},
{
"text": "的",
"offsetMilliseconds": 8160,
"durationMilliseconds": 40
},
{
"text": "语",
"offsetMilliseconds": 8200,
"durationMilliseconds": 40
},
{
"text": "音",
"offsetMilliseconds": 8240,
"durationMilliseconds": 40
},
{
"text": "转",
"offsetMilliseconds": 8280,
"durationMilliseconds": 40
},
{
"text": "文",
"offsetMilliseconds": 8320,
"durationMilliseconds": 40
},
{
"text": "本,",
"offsetMilliseconds": 8360,
"durationMilliseconds": 40
},
{
"text": "利",
"offsetMilliseconds": 8400,
"durationMilliseconds": 40
},
{
"text": "用",
"offsetMilliseconds": 8440,
"durationMilliseconds": 40
},
{
"text": "通",
"offsetMilliseconds": 8480,
"durationMilliseconds": 40
},
{
"text": "用",
"offsetMilliseconds": 8520,
"durationMilliseconds": 40
},
{
"text": "语",
"offsetMilliseconds": 8560,
"durationMilliseconds": 40
},
{
"text": "言",
"offsetMilliseconds": 8600,
"durationMilliseconds": 40
},
{
"text": "模",
"offsetMilliseconds": 8640,
"durationMilliseconds": 40
},
{
"text": "型",
"offsetMilliseconds": 8680,
"durationMilliseconds": 40
},
{
"text": "作",
"offsetMilliseconds": 8800,
"durationMilliseconds": 40
},
{
"text": "为",
"offsetMilliseconds": 8840,
"durationMilliseconds": 40
},
{
"text": "一",
"offsetMilliseconds": 9520,
"durationMilliseconds": 40
},
{
"text": "个",
"offsetMilliseconds": 9560,
"durationMilliseconds": 40
},
{
"text": "基",
"offsetMilliseconds": 9600,
"durationMilliseconds": 40
},
{
"text": "本",
"offsetMilliseconds": 9640,
"durationMilliseconds": 40
},
{
"text": "模",
"offsetMilliseconds": 9680,
"durationMilliseconds": 40
},
{
"text": "型,",
"offsetMilliseconds": 9720,
"durationMilliseconds": 40
},
{
"text": "使",
"offsetMilliseconds": 9760,
"durationMilliseconds": 40
},
{
"text": "用",
"offsetMilliseconds": 10080,
"durationMilliseconds": 320
},
{
"text": "microsoft",
"offsetMilliseconds": 10400,
"durationMilliseconds": 3600
},
{
"text": "自",
"offsetMilliseconds": 14000,
"durationMilliseconds": 40
},
{
"text": "有",
"offsetMilliseconds": 14040,
"durationMilliseconds": 40
},
{
"text": "数",
"offsetMilliseconds": 14160,
"durationMilliseconds": 40
},
{
"text": "据",
"offsetMilliseconds": 14200,
"durationMilliseconds": 40
},
{
"text": "进",
"offsetMilliseconds": 14320,
"durationMilliseconds": 40
},
{
"text": "行",
"offsetMilliseconds": 14360,
"durationMilliseconds": 40
},
{
"text": "训",
"offsetMilliseconds": 14400,
"durationMilliseconds": 40
},
{
"text": "练,",
"offsetMilliseconds": 14440,
"durationMilliseconds": 40
},
{
"text": "并",
"offsetMilliseconds": 14480,
"durationMilliseconds": 40
},
{
"text": "反",
"offsetMilliseconds": 14520,
"durationMilliseconds": 40
},
{
"text": "映",
"offsetMilliseconds": 14560,
"durationMilliseconds": 40
},
{
"text": "常",
"offsetMilliseconds": 14600,
"durationMilliseconds": 40
},
{
"text": "用",
"offsetMilliseconds": 14640,
"durationMilliseconds": 40
},
{
"text": "的",
"offsetMilliseconds": 14680,
"durationMilliseconds": 40
},
{
"text": "口",
"offsetMilliseconds": 14720,
"durationMilliseconds": 40
},
{
"text": "语",
"offsetMilliseconds": 14760,
"durationMilliseconds": 40
},
{
"text": "。",
"offsetMilliseconds": 14800,
"durationMilliseconds": 40
},
{
"text": "此",
"offsetMilliseconds": 14840,
"durationMilliseconds": 40
},
{
"text": "基",
"offsetMilliseconds": 14880,
"durationMilliseconds": 40
},
{
"text": "础",
"offsetMilliseconds": 14920,
"durationMilliseconds": 40
},
{
"text": "模",
"offsetMilliseconds": 14960,
"durationMilliseconds": 40
},
{
"text": "型",
"offsetMilliseconds": 15000,
"durationMilliseconds": 40
},
{
"text": "使",
"offsetMilliseconds": 15040,
"durationMilliseconds": 40
},
{
"text": "用",
"offsetMilliseconds": 15080,
"durationMilliseconds": 40
},
{
"text": "那",
"offsetMilliseconds": 15120,
"durationMilliseconds": 40
},
{
"text": "些",
"offsetMilliseconds": 15160,
"durationMilliseconds": 40
},
{
"text": "代",
"offsetMilliseconds": 15200,
"durationMilliseconds": 40
},
{
"text": "表",
"offsetMilliseconds": 15240,
"durationMilliseconds": 40
},
{
"text": "各",
"offsetMilliseconds": 15280,
"durationMilliseconds": 40
},
{
"text": "常",
"offsetMilliseconds": 15320,
"durationMilliseconds": 40
},
{
"text": "见",
"offsetMilliseconds": 15360,
"durationMilliseconds": 40
},
{
"text": "领",
"offsetMilliseconds": 15400,
"durationMilliseconds": 40
},
{
"text": "域",
"offsetMilliseconds": 15760,
"durationMilliseconds": 40
},
{
"text": "的",
"offsetMilliseconds": 15800,
"durationMilliseconds": 40
},
{
"text": "方",
"offsetMilliseconds": 15920,
"durationMilliseconds": 40
},
{
"text": "言",
"offsetMilliseconds": 15960,
"durationMilliseconds": 40
},
{
"text": "和",
"offsetMilliseconds": 16000,
"durationMilliseconds": 40
},
{
"text": "发",
"offsetMilliseconds": 16040,
"durationMilliseconds": 40
},
{
"text": "音",
"offsetMilliseconds": 16080,
"durationMilliseconds": 40
},
{
"text": "进",
"offsetMilliseconds": 16120,
"durationMilliseconds": 40
},
{
"text": "行",
"offsetMilliseconds": 16160,
"durationMilliseconds": 40
},
{
"text": "了",
"offsetMilliseconds": 16200,
"durationMilliseconds": 40
},
{
"text": "预",
"offsetMilliseconds": 16320,
"durationMilliseconds": 40
},
{
"text": "先",
"offsetMilliseconds": 16360,
"durationMilliseconds": 40
},
{
"text": "训",
"offsetMilliseconds": 16400,
"durationMilliseconds": 40
},
{
"text": "练",
"offsetMilliseconds": 16560,
"durationMilliseconds": 40
}
],
"locale": "zh-cn",
"confidence": 0.9241725
},
{
"offsetMilliseconds": 24320,
"durationMilliseconds": 6640,
"text": "Quand vous effectuez une demande de reconnaissance vocale, le modèle de base le plus récent pour chaque langue prise en charge est utilisé par défaut.",
"words": [
{
"text": "Quand",
"offsetMilliseconds": 24320,
"durationMilliseconds": 160
},
{
"text": "vous",
"offsetMilliseconds": 24480,
"durationMilliseconds": 80
},
// More transcription results...
// Redacted for brevity
{
"text": "scénarios",
"offsetMilliseconds": 34200,
"durationMilliseconds": 400
},
{
"text": "de",
"offsetMilliseconds": 34600,
"durationMilliseconds": 120
},
{
"text": "reconnaissance",
"offsetMilliseconds": 34720,
"durationMilliseconds": 640
},
{
"text": "vocale.",
"offsetMilliseconds": 35360,
"durationMilliseconds": 480
}
],
"locale": "fr-fr",
"confidence": 0.9308314
},
{
"offsetMilliseconds": 36720,
"durationMilliseconds": 10320,
"text": "A custom model can be used to augment the base model to improve recognition of domain specific vocabulary spécifique to the application by providing text data to train the model.",
"words": [
{
"text": "A",
"offsetMilliseconds": 36720,
"durationMilliseconds": 80
},
{
"text": "custom",
"offsetMilliseconds": 36880,
"durationMilliseconds": 400
},
{
"text": "model",
"offsetMilliseconds": 37280,
"durationMilliseconds": 480
},
// More transcription results...
// Redacted for brevity
{
"text": "with",
"offsetMilliseconds": 54720,
"durationMilliseconds": 200
},
{
"text": "reference",
"offsetMilliseconds": 54920,
"durationMilliseconds": 360
},
{
"text": "transcriptions.",
"offsetMilliseconds": 55280,
"durationMilliseconds": 1200
}
],
"locale": "en-us",
"confidence": 0.92155737
}
]
}
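In a multilingual response, each phrase carries its own locale, so the transcript can be split into per-language segments. The following sketch merges consecutive phrases that share a locale. The helper name locale_segments is hypothetical, and lowercasing the locale before comparing is an assumption based on the lowercase codes (for example "en-us") in the sample response above.

```python
from itertools import groupby


def locale_segments(response: dict) -> list:
    """Merge consecutive phrases with the same locale into (locale, text) pairs."""
    segments = []
    phrases = response.get("phrases", [])
    for locale, group in groupby(phrases, key=lambda p: p["locale"].lower()):
        segments.append((locale, " ".join(p["text"] for p in group)))
    return segments


# Shortened, illustrative phrases shaped like the sample response.
sample = {
    "phrases": [
        {"locale": "en-us", "text": "With custom speech."},
        {"locale": "zh-cn", "text": "现成的语音转文本。"},
        {"locale": "fr-fr", "text": "Quand vous effectuez une demande."},
        {"locale": "fr-fr", "text": "Le modèle de base fonctionne très bien."},
        {"locale": "en-us", "text": "A custom model can be used."},
    ]
}

for locale, text in locale_segments(sample):
    print(locale, text)
```

Joining phrases with a space suits space-delimited languages; for Chinese or Japanese segments, an empty separator may read better.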
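Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following example shows how to transcribe an audio file with diarization enabled. Diarization distinguishes between the different speakers in a conversation. The Speech service reports which speaker was speaking during each part of the transcribed speech.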
- Replace YourSpeechResoureKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
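Important
For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.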
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
"locales":["en-US"],
"diarization": {"maxSpeakers": 2,"enabled": true}}"'
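Construct the form definition according to the following instructions:
- Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US.
- Set the diarization property to recognize and separate multiple speakers in one audio channel. For example, specify "diarization": {"maxSpeakers": 2, "enabled": true}. The transcription then contains a speaker entry for each transcribed phrase.
For more information about locales, diarization, and other properties of the fast transcription API, see the request configuration options section later in this guide.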
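The diarization definition can be generated with a JSON serializer instead of being typed by hand, which avoids quoting mistakes. This is a minimal sketch: the helper name build_diarization_definition is hypothetical, and the positive-integer check is a generic sanity check, not a limit documented in this guide.

```python
import json


def build_diarization_definition(locales, max_speakers: int = 2) -> str:
    """Serialize a `definition` form field with diarization enabled."""
    if not isinstance(max_speakers, int) or max_speakers < 1:
        raise ValueError("max_speakers must be a positive integer")
    definition = {
        "locales": list(locales),
        "diarization": {"maxSpeakers": max_speakers, "enabled": True},
    }
    return json.dumps(definition)


print(build_diarization_definition(["en-US"], max_speakers=2))
```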
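The response includes durationMilliseconds, offsetMilliseconds, and more. In this example, diarization is enabled, so the response includes speaker information for each transcribed phrase. The combinedPhrases property contains the full transcription for all speakers in a single channel.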
{
"durationMilliseconds": 182439,
"combinedPhrases": [
{
"channel": 0,
"text": "Good afternoon. This is Sam. Thank you for calling Contoso. How can I help? Hi there. My name is Mary. I'm currently living in Los Angeles, but I'm planning to move to Las Vegas. I would like to apply for a loan. Okay. I see you're currently living in California. Let me make sure I understand you correctly. Uh You'd like to apply for a loan even though you'll be moving soon. Is that right? Yes, exactly. So I'm planning to relocate soon, but I would like to apply for the loan first so that I can purchase a new home once I move there. And are you planning to sell your current home? Yes, I will be listing it on the market soon and hopefully it'll sell quickly. That's why I'm applying for a loan now, so that I can purchase a new house in Nevada and close on it quickly as well once my current home sells. I see. Would you mind holding for a moment while I take your information down? Yeah, no problem. Thank you for your help. Mm-hmm. Just one moment. All right. Thank you for your patience, ma'am. May I have your first and last name, please? Yes, my name is Mary Smith. Thank you, Ms. Smith. May I have your current address, please? Yes. So my address is 123 Main Street in Los Angeles, California, and the zip code is 90923. Sorry, that was a 90 what? 90923. 90923 on Main Street. Got it. Thank you. May I have your phone number as well, please? Uh. Yes, my phone number is 504-529-2351 and then yeah. 2351. Got it. And do you have an e-mail address we I can associate with this application? Uh Yes, so my e-mail address is mary.a.sm78@gmail.com. Mary.a, was that a S-N as in November or M as in Mike? M as in Mike. Mike78, got it. Thank you. Ms. Smith, do you currently have any other loans? Uh Yes, so I currently have two other loans through Contoso. So my first one is my car loan and then my other is my student loan. They total about 1400 per month combined and my interest rate is 8%. I see. And. You're currently paying those loans off monthly, is that right? 
Yes, of course I do. OK, thank you. Here's what I suggest we do. Let me place you on a brief hold again so that I can talk with one of our loan officers and get this started for you immediately. In the meantime, it would be great if you could take a few minutes and complete the remainder of the secure application online at www.contosoloans.com. Yeah, that sounds good. I can go ahead and get started. Thank you for your help. Thank you."
}
],
"phrases": [
{
"channel": 0,
"speaker": 1,
"offsetMilliseconds": 960,
"durationMilliseconds": 640,
"text": "Good afternoon.",
"words": [
{
"text": "Good",
"offsetMilliseconds": 960,
"durationMilliseconds": 240
},
{
"text": "afternoon.",
"offsetMilliseconds": 1200,
"durationMilliseconds": 400
}
],
"locale": "en-US",
"confidence": 0.93616915
},
{
"channel": 0,
"speaker": 1,
"offsetMilliseconds": 1600,
"durationMilliseconds": 640,
"text": "This is Sam.",
"words": [
{
"text": "This",
"offsetMilliseconds": 1600,
"durationMilliseconds": 240
},
{
"text": "is",
"offsetMilliseconds": 1840,
"durationMilliseconds": 120
},
{
"text": "Sam.",
"offsetMilliseconds": 1960,
"durationMilliseconds": 280
}
],
"locale": "en-US",
"confidence": 0.93616915
},
{
"channel": 0,
"speaker": 1,
"offsetMilliseconds": 2240,
"durationMilliseconds": 1040,
"text": "Thank you for calling Contoso.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 2240,
"durationMilliseconds": 200
},
{
"text": "you",
"offsetMilliseconds": 2440,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 2520,
"durationMilliseconds": 120
},
{
"text": "calling",
"offsetMilliseconds": 2640,
"durationMilliseconds": 200
},
{
"text": "Contoso.",
"offsetMilliseconds": 2840,
"durationMilliseconds": 440
}
],
"locale": "en-US",
"confidence": 0.93616915
},
{
"channel": 0,
"speaker": 1,
"offsetMilliseconds": 3280,
"durationMilliseconds": 640,
"text": "How can I help?",
"words": [
{
"text": "How",
"offsetMilliseconds": 3280,
"durationMilliseconds": 120
},
{
"text": "can",
"offsetMilliseconds": 3440,
"durationMilliseconds": 120
},
{
"text": "I",
"offsetMilliseconds": 3560,
"durationMilliseconds": 40
},
{
"text": "help?",
"offsetMilliseconds": 3600,
"durationMilliseconds": 320
}
],
"locale": "en-US",
"confidence": 0.93616915
},
{
"channel": 0,
"speaker": 0,
"offsetMilliseconds": 5040,
"durationMilliseconds": 400,
"text": "Hi there.",
"words": [
{
"text": "Hi",
"offsetMilliseconds": 5040,
"durationMilliseconds": 240
},
{
"text": "there.",
"offsetMilliseconds": 5280,
"durationMilliseconds": 160
}
],
"locale": "en-US",
"confidence": 0.93616915
},
{
"channel": 0,
"speaker": 0,
"offsetMilliseconds": 5440,
"durationMilliseconds": 800,
"text": "My name is Mary.",
"words": [
{
"text": "My",
"offsetMilliseconds": 5440,
"durationMilliseconds": 80
},
{
"text": "name",
"offsetMilliseconds": 5520,
"durationMilliseconds": 120
},
{
"text": "is",
"offsetMilliseconds": 5640,
"durationMilliseconds": 80
},
{
"text": "Mary.",
"offsetMilliseconds": 5720,
"durationMilliseconds": 520
}
],
"locale": "en-US",
"confidence": 0.93616915
},
// More transcription results...
// Redacted for brevity
{
"channel": 0,
"speaker": 0,
"offsetMilliseconds": 180320,
"durationMilliseconds": 680,
"text": "Thank you for your help.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 180320,
"durationMilliseconds": 160
},
{
"text": "you",
"offsetMilliseconds": 180480,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 180560,
"durationMilliseconds": 120
},
{
"text": "your",
"offsetMilliseconds": 180680,
"durationMilliseconds": 120
},
{
"text": "help.",
"offsetMilliseconds": 180800,
"durationMilliseconds": 200
}
],
"locale": "en-US",
"confidence": 0.9314801
},
{
"channel": 0,
"speaker": 1,
"offsetMilliseconds": 181960,
"durationMilliseconds": 280,
"text": "Thank you.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 181960,
"durationMilliseconds": 200
},
{
"text": "you.",
"offsetMilliseconds": 182160,
"durationMilliseconds": 80
}
],
"locale": "en-US",
"confidence": 0.9314801
}
]
}
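As a minimal sketch of working with a response like the one above, the following Python snippet sums the phrase durations for each speaker. The helper name `talk_time_by_speaker` is hypothetical and not part of the API. It assumes each phrase carries the `speaker` and `durationMilliseconds` fields shown in the sample, and it reuses a few sample phrases with the word-level detail omitted.

```python
from collections import defaultdict


def talk_time_by_speaker(phrases):
    """Sum durationMilliseconds per speaker ID (hypothetical helper)."""
    totals = defaultdict(int)
    for phrase in phrases:
        # Skip phrases that carry no "speaker" field.
        if "speaker" in phrase:
            totals[phrase["speaker"]] += phrase["durationMilliseconds"]
    return dict(totals)


# Phrases copied from the sample response above, without the "words" arrays.
sample_phrases = [
    {"speaker": 1, "offsetMilliseconds": 3280, "durationMilliseconds": 640,
     "text": "How can I help?"},
    {"speaker": 0, "offsetMilliseconds": 5040, "durationMilliseconds": 400,
     "text": "Hi there."},
    {"speaker": 0, "offsetMilliseconds": 5440, "durationMilliseconds": 800,
     "text": "My name is Mary."},
    {"speaker": 0, "offsetMilliseconds": 180320, "durationMilliseconds": 680,
     "text": "Thank you for your help."},
    {"speaker": 1, "offsetMilliseconds": 181960, "durationMilliseconds": 280,
     "text": "Thank you."},
]

for speaker, total_ms in sorted(talk_time_by_speaker(sample_phrases).items()):
    print(f"speaker {speaker}: {total_ms} ms")
```

Because the sample is shortened, the totals only cover the phrases listed here, not the whole recording.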
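Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties.
The following example shows how to transcribe an audio file that has one or two channels. Multi-channel transcription is useful for audio files with multiple channels, such as audio files with multiple speakers or audio files with background noise. By default, the fast transcription API merges all input channels into a single channel and then performs the transcription. If you don't want this behavior, you can transcribe the channels independently without merging them.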
- Replace YourSpeechResoureKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
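Important
For the recommended keyless authentication with Microsoft Entra ID, replace --header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' with --header "Authorization: Bearer YourAccessToken". For more information about keyless authentication, see the role-based access control how-to guide.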
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2025-10-15' \
--header 'Content-Type: multipart/form-data' \
--header 'Ocp-Apim-Subscription-Key: YourSpeechResoureKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
"locales":["en-US"],
"channels": [0,1]}"'
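Construct the form definition according to the following instructions:
- Set the optional (but recommended) locales property to match the expected locale of the audio data to transcribe. In this example, the locale is set to en-US. The locales that you can specify include de-DE, en-GB, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN.
- Set the channels property to the zero-based indices of the channels to transcribe separately. Up to two channels are supported unless diarization is enabled. In this example, channels 0 and 1 are specified.
For more information about locales, channels, and other properties of the fast transcription API, see the request configuration options section later in this guide.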
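If you build the request from a script instead of typing the curl command by hand, the definition form field can be generated rather than hand-quoted. The following Python sketch only builds the JSON string. The helper name `build_definition` is hypothetical, and the two-channel check assumes that diarization isn't enabled, as described in the instructions above.

```python
import json


def build_definition(locales=None, channels=None):
    """Return the JSON string for the `definition` form field (hypothetical helper)."""
    definition = {}
    if locales:
        definition["locales"] = list(locales)
    if channels is not None:
        channels = list(channels)
        # Channel indices are zero-based, so negative values are rejected.
        if any(not isinstance(c, int) or c < 0 for c in channels):
            raise ValueError("channel indices must be non-negative integers")
        # Assumption: diarization is off, so at most two channels are allowed.
        if len(channels) > 2:
            raise ValueError("at most two channels are supported without diarization")
        definition["channels"] = channels
    return json.dumps(definition)


# Same definition as in the curl example: en-US, channels 0 and 1.
print(build_definition(["en-US"], [0, 1]))
```

The printed string can then be passed as the value of the definition form field in the same way as in the curl example.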
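The response includes durationMilliseconds, offsetMilliseconds, and more. If the audio file contains multiple channels, the channel property identifies the channel that each result belongs to.
The combinedPhrases property contains the full transcription, separated by audio channel. Look for "channel": 0,"text" and "channel": 1,"text" to find the full transcription of each channel.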
{
"durationMilliseconds": 185079,
"combinedPhrases": [
{
"channel": 0,
"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
},
{
"channel": 1,
"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
}
],
"phrases": [
{
"channel": 0,
"offsetMilliseconds": 720,
"durationMilliseconds": 480,
"text": "Hello.",
"words": [
{
"text": "Hello.",
"offsetMilliseconds": 720,
"durationMilliseconds": 480
}
],
"locale": "en-US",
"confidence": 0.9177142
},
{
"channel": 0,
"offsetMilliseconds": 1200,
"durationMilliseconds": 1120,
"text": "Thank you for calling Contoso.",
"words": [
{
"text": "Thank",
"offsetMilliseconds": 1200,
"durationMilliseconds": 200
},
{
"text": "you",
"offsetMilliseconds": 1400,
"durationMilliseconds": 80
},
{
"text": "for",
"offsetMilliseconds": 1480,
"durationMilliseconds": 120
},
{
"text": "calling",
"offsetMilliseconds": 1600,
"durationMilliseconds": 240
},
{
"text": "Contoso.",
"offsetMilliseconds": 1840,
"durationMilliseconds": 480
}
],
"locale": "en-US",
"confidence": 0.9177142
},
{
"channel": 0,
"offsetMilliseconds": 2320,
"durationMilliseconds": 1120,
"text": "Who am I speaking with today?",
"words": [
{
"text": "Who",
"offsetMilliseconds": 2320,
"durationMilliseconds": 160
},
{
"text": "am",
"offsetMilliseconds": 2480,
"durationMilliseconds": 80
},
{
"text": "I",
"offsetMilliseconds": 2560,
"durationMilliseconds": 80
},
{
"text": "speaking",
"offsetMilliseconds": 2640,
"durationMilliseconds": 320
},
{
"text": "with",
"offsetMilliseconds": 2960,
"durationMilliseconds": 160
},
{
"text": "today?",
"offsetMilliseconds": 3120,
"durationMilliseconds": 320
}
],
"locale": "en-US",
"confidence": 0.9177142
},
{
"channel": 0,
"offsetMilliseconds": 9520,
"durationMilliseconds": 400,
"text": "Hi, Mary.",
"words": [
{
"text": "Hi,",
"offsetMilliseconds": 9520,
"durationMilliseconds": 80
},
{
"text": "Mary.",
"offsetMilliseconds": 9600,
"durationMilliseconds": 320
}
],
"locale": "en-US",
"confidence": 0.9177142
},
// More transcription results...
// Redacted for brevity
{
"channel": 1,
"offsetMilliseconds": 4480,
"durationMilliseconds": 1600,
"text": "Hi, my name is Mary Rondo.",
"words": [
{
"text": "Hi,",
"offsetMilliseconds": 4480,
"durationMilliseconds": 400
},
{
"text": "my",
"offsetMilliseconds": 4880,
"durationMilliseconds": 120
},
{
"text": "name",
"offsetMilliseconds": 5000,
"durationMilliseconds": 120
},
{
"text": "is",
"offsetMilliseconds": 5120,
"durationMilliseconds": 160
},
{
"text": "Mary",
"offsetMilliseconds": 5280,
"durationMilliseconds": 240
},
{
"text": "Rondo.",
"offsetMilliseconds": 5520,
"durationMilliseconds": 560
}
],
"locale": "en-US",
"confidence": 0.8989456
},
{
"channel": 1,
"offsetMilliseconds": 6080,
"durationMilliseconds": 1920,
"text": "I'm trying to enroll myself with Contuso.",
"words": [
{
"text": "I'm",
"offsetMilliseconds": 6080,
"durationMilliseconds": 160
},
{
"text": "trying",
"offsetMilliseconds": 6240,
"durationMilliseconds": 200
},
{
"text": "to",
"offsetMilliseconds": 6440,
"durationMilliseconds": 80
},
{
"text": "enroll",
"offsetMilliseconds": 6520,
"durationMilliseconds": 200
},
{
"text": "myself",
"offsetMilliseconds": 6720,
"durationMilliseconds": 360
},
{
"text": "with",
"offsetMilliseconds": 7080,
"durationMilliseconds": 120
},
{
"text": "Contuso.",
"offsetMilliseconds": 7200,
"durationMilliseconds": 800
}
],
"locale": "en-US",
"confidence": 0.8989456
},
// More transcription results...
// Redacted for brevity
]
}
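Because each channel is transcribed separately, the phrases for channel 0 appear before the phrases for channel 1 in the sample above. To read the call as a conversation, the phrases can be interleaved by their start offset. The following Python sketch shows one way to do that. The function names are hypothetical, the default of channel 0 for a missing channel field is an assumption, and the sample data is a shortened copy of the response above.

```python
def full_text_by_channel(response):
    """Map each channel index to its full transcription from combinedPhrases."""
    return {
        item.get("channel", 0): item["text"]
        for item in response.get("combinedPhrases", [])
    }


def conversation_order(response):
    """Return (offset_ms, channel, text) tuples sorted by start time."""
    turns = [
        # Assumption: a phrase without a "channel" field belongs to channel 0.
        (p["offsetMilliseconds"], p.get("channel", 0), p["text"])
        for p in response.get("phrases", [])
    ]
    return sorted(turns)


# Shortened copy of the sample response: texts truncated, "words" omitted.
sample_response = {
    "combinedPhrases": [
        {"channel": 0, "text": "Hello. Thank you for calling Contoso."},
        {"channel": 1, "text": "Hi, my name is Mary Rondo."},
    ],
    "phrases": [
        {"channel": 0, "offsetMilliseconds": 720, "text": "Hello."},
        {"channel": 0, "offsetMilliseconds": 1200, "text": "Thank you for calling Contoso."},
        {"channel": 0, "offsetMilliseconds": 2320, "text": "Who am I speaking with today?"},
        {"channel": 0, "offsetMilliseconds": 9520, "text": "Hi, Mary."},
        {"channel": 1, "offsetMilliseconds": 4480, "text": "Hi, my name is Mary Rondo."},
        {"channel": 1, "offsetMilliseconds": 6080, "text": "I'm trying to enroll myself with Contuso."},
    ],
}

for offset_ms, channel, text in conversation_order(sample_response):
    print(f"[{offset_ms:>6} ms] channel {channel}: {text}")
```

Sorting by offsetMilliseconds is enough here because the offsets of both channels are measured from the start of the same audio file.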