Use the fast transcription API (preview) with Azure AI Speech
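Note
This feature is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
The fast transcription API is only available through the speech to text REST API version 2024-05-15-preview. This preview version is subject to change and isn't recommended for production use. It will be retired without notice 90 days after a subsequent preview version or the general availability (GA) of the API.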
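The fast transcription API transcribes audio files and returns the results synchronously, much faster than real time. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, for example:
- Quick audio or video transcription, subtitles, and editing.
- Video translation.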
Tip
Try out fast transcription in Azure AI Studio.
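Prerequisites
- An Azure AI Speech resource in one of the regions where the fast transcription API is available. The supported regions are: Australia East, Brazil South, Central India, East US, East US 2, France Central, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, West Europe, West US, West US 2, and West US 3. For more information about the regions supported for other Speech service features, see Speech service regions.
- An audio file (less than 2 hours long and less than 200 MB in size) in one of the formats and codecs supported by the batch transcription API. For more information about supported audio formats, see supported audio formats.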
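The size and length limits above can be checked locally before uploading. The following is a minimal sketch, not part of the service: it handles only WAV files (through Python's standard wave module), the function name check_wav_limits is hypothetical, and treating "MB" as 1024 * 1024 bytes is an assumption because the text doesn't say which convention applies.

```python
import os
import wave
from typing import List

# Limits from the prerequisites. The binary interpretation of "MB" is an assumption.
MAX_BYTES = 200 * 1024 * 1024
MAX_SECONDS = 2 * 60 * 60


def check_wav_limits(path: str,
                     max_bytes: int = MAX_BYTES,
                     max_seconds: float = MAX_SECONDS) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []

    # The file must be smaller than the size limit.
    size = os.path.getsize(path)
    if size >= max_bytes:
        problems.append(f"file is {size} bytes; it must be smaller than {max_bytes}")

    # Duration of a WAV file is the frame count divided by the sample rate.
    with wave.open(path, "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
    if seconds >= max_seconds:
        problems.append(f"audio is {seconds:.1f} s; it must be shorter than {max_seconds} s")

    return problems
```

Other container formats would need a different way to read the duration; the size check applies to any file.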
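Use the fast transcription API
The fast transcription API is a REST API that uses multipart/form-data to submit audio files for transcription. The API returns the transcription results synchronously.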
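Construct the request body according to the following instructions:
- Set the required locales property. This value should match the expected locale of the audio data to transcribe. The supported locales are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it, ja-JP, ko-KR, pt-BR, and zh-CN. For details, see Speech service language support. You can get the latest supported languages through the REST API operation Transcriptions - List Supported Locales.
- Optionally, set the profanityFilterMode property to specify how to handle profanity in recognition results. Accepted values are None (disables profanity filtering), Masked (replaces profanity with asterisks), Removed (removes all profanity from the result), or Tags (adds profanity tags). The default value is Masked. The profanityFilterMode property works the same way as in the batch transcription API.
- Optionally, set the channels property to specify the zero-based indices of the channels to transcribe separately. If not specified, multiple channels are merged and transcribed jointly. At most two channels are supported. To transcribe the channels of a stereo audio file separately, specify [0,1] here. Otherwise, stereo audio is merged to mono, mono audio is left as is, and only a single channel is transcribed. In that case, the output has no channel index for the transcribed text, because only a single audio stream is transcribed.
- Optionally, set the diarizationSettings property to recognize and separate multiple speakers in a mono-channel audio file. Specify the minimum and maximum number of people who might be speaking in the audio file (for example, "diarizationSettings": {"minSpeakers": 1, "maxSpeakers": 4}). The transcription then contains a speaker entry for each transcribed phrase. This feature isn't available for stereo audio when the channels property is set to [0,1].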
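The property rules above can be collected into a small helper that produces the JSON string for the definition form field. This is a sketch under stated assumptions: build_definition and its parameter names are hypothetical, and the validation only mirrors the constraints listed in this article (it doesn't check the locale against the supported list, and the service may enforce more).

```python
import json
from typing import List, Optional

# Accepted profanityFilterMode values, as listed in the text.
PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(locales: List[str],
                     profanity_filter_mode: str = "Masked",
                     channels: Optional[List[int]] = None,
                     min_speakers: Optional[int] = None,
                     max_speakers: Optional[int] = None) -> str:
    """Return the JSON string to send as the 'definition' form field."""
    if not locales:
        raise ValueError("locales is required")
    if profanity_filter_mode not in PROFANITY_MODES:
        raise ValueError(f"profanityFilterMode must be one of {sorted(PROFANITY_MODES)}")

    definition = {"locales": list(locales),
                  "profanityFilterMode": profanity_filter_mode}

    if channels is not None:
        if len(channels) > 2:
            raise ValueError("at most two channels are supported")
        definition["channels"] = list(channels)

    if min_speakers is not None or max_speakers is not None:
        if min_speakers is None or max_speakers is None:
            raise ValueError("set both min_speakers and max_speakers")
        # Diarization applies to mono audio, not to separately transcribed stereo channels.
        if channels is not None and sorted(channels) == [0, 1]:
            raise ValueError("diarization isn't available when channels is [0,1]")
        definition["diarizationSettings"] = {"minSpeakers": min_speakers,
                                             "maxSpeakers": max_speakers}

    return json.dumps(definition)
```

For example, build_definition(["en-US"], channels=[0, 1]) produces a definition equivalent to the one used in the request example in this article.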
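Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties. The following example shows how to create a transcription using the fast transcription API.
- Replace YourSubscriptionKey with your Speech resource key.
- Replace YourServiceRegion with your Speech resource region.
- Replace YourAudioFile with the path to your audio file.
- Set the form definition properties as previously described.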
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview' \
--header 'Content-Type: multipart/form-data' \
--header 'Accept: application/json' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
\"locales\":[\"en-US\"],
\"profanityFilterMode\": \"Masked\",
\"channels\": [0,1]}"'
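The same request can be assembled without curl. The sketch below uses only the Python standard library and mirrors the curl command above: same URL, same headers, and the same two form fields (audio and definition). The function name build_transcribe_request is hypothetical, and sending the audio part as application/octet-stream is an assumption rather than something this article specifies.

```python
import json
import urllib.request
import uuid


def build_transcribe_request(region: str, key: str, audio_bytes: bytes,
                             audio_name: str, definition: dict) -> urllib.request.Request:
    """Build (but don't send) the multipart/form-data POST request."""
    url = (f"https://{region}.api.cognitive.microsoft.com"
           "/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview")
    boundary = uuid.uuid4().hex  # random separator between form parts

    # Part 1: the audio file. The part content type is an assumption.
    audio_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{audio_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8") + audio_bytes + b"\r\n"

    # Part 2: the definition, serialized as JSON text.
    definition_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="definition"\r\n\r\n'
        f"{json.dumps(definition)}\r\n"
    ).encode("utf-8")

    body = audio_part + definition_part + f"--{boundary}--\r\n".encode("utf-8")
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

To send it, pass the returned object to urllib.request.urlopen and parse the response body with json.load. Because the API is synchronous, the call blocks until the transcription is returned.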
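The response includes duration, channel, and more. The combinedPhrases property contains the full transcription for each channel separately. For example, everything the first speaker said is in the first element of the combinedPhrases array, and everything the second speaker said is in the second element.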
{
"duration": 185079,
"combinedPhrases": [
{
"channel": 0,
"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
},
{
"channel": 1,
"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
}
],
"phrases": [
{
"channel": 0,
"offset": 720,
"duration": 480,
"text": "Hello.",
"words": [
{
"text": "Hello.",
"offset": 720,
"duration": 480
}
],
"locale": "en-US",
"confidence": 0.9177142
},
{
"channel": 0,
"offset": 1200,
"duration": 1120,
"text": "Thank you for calling Contoso.",
"words": [
{
"text": "Thank",
"offset": 1200,
"duration": 200
},
{
"text": "you",
"offset": 1400,
"duration": 80
},
{
"text": "for",
"offset": 1480,
"duration": 120
},
{
"text": "calling",
"offset": 1600,
"duration": 240
},
{
"text": "Contoso.",
"offset": 1840,
"duration": 480
}
],
"locale": "en-US",
"confidence": 0.9177142
},
{
"channel": 0,
"offset": 2320,
"duration": 1120,
"text": "Who am I speaking with today?",
"words": [
{
"text": "Who",
"offset": 2320,
"duration": 160
},
{
"text": "am",
"offset": 2480,
"duration": 80
},
{
"text": "I",
"offset": 2560,
"duration": 80
},
{
"text": "speaking",
"offset": 2640,
"duration": 320
},
{
"text": "with",
"offset": 2960,
"duration": 160
},
{
"text": "today?",
"offset": 3120,
"duration": 320
}
],
"locale": "en-US",
"confidence": 0.9177142
},
// More transcription results removed for brevity
// {...},
{
"channel": 1,
"offset": 4480,
"duration": 1600,
"text": "Hi, my name is Mary Rondo.",
"words": [
{
"text": "Hi,",
"offset": 4480,
"duration": 400
},
{
"text": "my",
"offset": 4880,
"duration": 120
},
{
"text": "name",
"offset": 5000,
"duration": 120
},
{
"text": "is",
"offset": 5120,
"duration": 160
},
{
"text": "Mary",
"offset": 5280,
"duration": 240
},
{
"text": "Rondo.",
"offset": 5520,
"duration": 560
}
],
"locale": "en-US",
"confidence": 0.8989456
},
{
"channel": 1,
"offset": 6080,
"duration": 1920,
"text": "I'm trying to enroll myself with Contuso.",
"words": [
{
"text": "I'm",
"offset": 6080,
"duration": 160
},
{
"text": "trying",
"offset": 6240,
"duration": 200
},
{
"text": "to",
"offset": 6440,
"duration": 80
},
{
"text": "enroll",
"offset": 6520,
"duration": 200
},
{
"text": "myself",
"offset": 6720,
"duration": 360
},
{
"text": "with",
"offset": 7080,
"duration": 120
},
{
"text": "Contuso.",
"offset": 7200,
"duration": 800
}
],
"locale": "en-US",
"confidence": 0.8989456
},
// More transcription results removed for brevity
// {...},
]
}
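A response shaped like the sample above can be post-processed in a few lines. The following sketch reads the full text per channel from combinedPhrases and prints the phrases in time order. The function names are hypothetical, and reading offset and duration as milliseconds is an assumption inferred from the sample values, not something this article states.

```python
from typing import Dict, List, Optional


def text_by_channel(response: dict) -> Dict[Optional[int], str]:
    """Map each channel index to its full transcript (key None when no channel is reported)."""
    return {item.get("channel"): item["text"]
            for item in response.get("combinedPhrases", [])}


def format_timestamp(ms: int) -> str:
    """Format a millisecond offset (assumed unit) as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def phrase_lines(response: dict) -> List[str]:
    """Return one line per phrase, ordered by start offset across all channels."""
    lines = []
    for phrase in sorted(response.get("phrases", []), key=lambda p: p["offset"]):
        start = format_timestamp(phrase["offset"])
        end = format_timestamp(phrase["offset"] + phrase["duration"])
        label = f"channel {phrase['channel']}" if "channel" in phrase else "mono"
        lines.append(f"[{start} - {end}] {label}: {phrase['text']}")
    return lines


if __name__ == "__main__":
    # Trimmed copy of the sample response above.
    sample = {
        "duration": 185079,
        "combinedPhrases": [{"channel": 0, "text": "Hello."},
                            {"channel": 1, "text": "Hi, my name is Mary Rondo."}],
        "phrases": [
            {"channel": 1, "offset": 4480, "duration": 1600,
             "text": "Hi, my name is Mary Rondo."},
            {"channel": 0, "offset": 720, "duration": 480, "text": "Hello."},
        ],
    }
    for line in phrase_lines(sample):
        print(line)
```

Sorting by offset interleaves the two channels, which reconstructs the order of the conversation when each speaker is recorded on a separate channel.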