Use the fast transcription API (preview) with Azure AI Speech

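Note

This feature is currently in public preview. This preview is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.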

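The fast transcription API is available only through the speech to text REST API version 2024-05-15-preview. This preview version is subject to change and isn't recommended for production use. It will be retired without notice 90 days after a successor preview version or the general availability (GA) of the API.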

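The fast transcription API transcribes audio files synchronously and returns results faster than real-time audio. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, for example:

  • Quick audio or video transcription, subtitles, and editing.
  • Video translation.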

Tip

Try out fast transcription in Azure AI Studio.

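Prerequisites

  • An Azure AI Speech resource in one of the regions where the fast transcription API is available. The supported regions are: Australia East, Brazil South, Central India, East US, East US 2, France Central, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, West Europe, West US 2, and West US 3. For more information about the regions supported for other Speech service features, see Speech service regions.

  • An audio file (less than 2 hours long and smaller than 200 MB) in one of the formats and codecs supported by the batch transcription API. For more information about supported audio formats, see Supported audio formats.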
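The size and duration limits above can be checked locally before uploading. The following is a minimal sketch, not part of the API: the function name `check_audio_limits` is hypothetical, "200 MB" is interpreted as 200 × 1024 × 1024 bytes (an assumption), and the duration check only works for PCM WAV files because it relies on Python's standard `wave` module.

```python
import os
import wave

MAX_BYTES = 200 * 1024 * 1024   # assumed reading of the 200 MB size limit
MAX_SECONDS = 2 * 60 * 60       # the 2 hour duration limit


def check_audio_limits(path, max_bytes=MAX_BYTES, max_seconds=MAX_SECONDS):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    size = os.path.getsize(path)
    if size >= max_bytes:
        problems.append(f"file is {size} bytes; must be smaller than {max_bytes}")
    if path.lower().endswith(".wav"):
        # The stdlib wave module reads PCM WAV headers only; other formats
        # would need a different tool to measure their duration.
        with wave.open(path, "rb") as wav:
            seconds = wav.getnframes() / wav.getframerate()
        if seconds >= max_seconds:
            problems.append(f"audio is {seconds:.1f} s; must be shorter than {max_seconds} s")
    return problems
```

For files in other formats, only the size check runs, so an empty result does not guarantee that the duration limit is met.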

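Use the fast transcription API

The fast transcription API is a REST API that uses multipart/form-data to submit audio files for transcription. The API returns the transcription results synchronously.

Construct the request body according to the following instructions: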

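  • Set the required locales property. This value should match the expected locale of the audio data to transcribe. The supported locales are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. Learn more about Speech service language support. You can get the latest supported languages through the Transcriptions - List Supported Locales REST API.
  • Optionally, set the profanityFilterMode property to specify how to handle profanity in recognition results. Accepted values are None (disables profanity filtering), Masked (replaces profanity with asterisks), Removed (removes all profanity from the result), or Tags (adds profanity tags). The default value is Masked. The profanityFilterMode property works the same way as in the batch transcription API.
  • Optionally, set the channels property to specify the zero-based indices of the channels to transcribe separately. If you don't specify it, multiple channels are merged and transcribed jointly. At most two channels are supported. To transcribe the channels of a stereo audio file separately, specify [0,1] here. Otherwise, stereo audio is merged to mono, mono audio is left as is, and only a single channel is transcribed. In both of these cases, the output has no channel indices for the transcribed text, because only a single audio stream is transcribed.
  • Optionally, set the diarizationSettings property to recognize and separate multiple speakers in a mono audio file. You must specify the minimum and maximum number of people who might be speaking in the audio file (for example, "diarizationSettings": {"minSpeakers": 1, "maxSpeakers": 4}). The transcription then contains a speaker entry for each transcribed phrase. This feature isn't available with stereo audio when you set the channels property to [0,1].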
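The rules in the list above can be captured in a small helper that assembles the definition JSON and rejects combinations the text rules out. This is an illustrative sketch: `build_definition` and its parameter names are hypothetical, and the checks reflect only the constraints stated above (required locales, the four profanity modes, at most two channels, and no diarization together with channels [0,1]).

```python
import json

PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(locales, profanity_filter_mode=None, channels=None,
                     min_speakers=None, max_speakers=None):
    """Assemble the JSON string for the definition form field."""
    if not locales:
        raise ValueError("locales is required")
    definition = {"locales": list(locales)}
    if profanity_filter_mode is not None:
        if profanity_filter_mode not in PROFANITY_MODES:
            raise ValueError(f"unknown profanityFilterMode: {profanity_filter_mode}")
        definition["profanityFilterMode"] = profanity_filter_mode
    if channels is not None:
        if len(channels) > 2:
            raise ValueError("at most two channels are supported")
        definition["channels"] = list(channels)
    if min_speakers is not None or max_speakers is not None:
        if min_speakers is None or max_speakers is None:
            raise ValueError("diarization needs both minSpeakers and maxSpeakers")
        if channels is not None and sorted(channels) == [0, 1]:
            raise ValueError("diarization isn't available when channels is [0,1]")
        definition["diarizationSettings"] = {
            "minSpeakers": min_speakers,
            "maxSpeakers": max_speakers,
        }
    return json.dumps(definition)
```

For example, `build_definition(["en-US"], "Masked", [0, 1])` yields the same three properties as the request sample in this article. Properties that are left unset are omitted so that the service defaults apply.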

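Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties. The following example shows how to create a transcription by using the fast transcription API.

  • Replace YourSubscriptionKey with your Speech resource key.
  • Replace YourServiceRegion with your Speech resource region.
  • Replace YourAudioFile with the path to your audio file.
  • Set the form definition properties as previously described.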
curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview' \
--header 'Content-Type: multipart/form-data' \
--header 'Accept: application/json' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    \"locales\":[\"en-US\"], 
    \"profanityFilterMode\": \"Masked\", 
    \"channels\": [0,1]}"'
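For callers who prefer a script over curl, the same request can be assembled with Python's standard library. The sketch below only builds the request object and doesn't send it. The function name `build_transcribe_request` is hypothetical, and labeling the audio part as application/octet-stream is an assumption rather than a documented requirement. The URL, header names, and form field names are copied from the curl command above.

```python
import json
import urllib.request
import uuid


def build_transcribe_request(region, key, audio_bytes, filename, definition):
    """Build (but do not send) the multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    audio_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed part type
    ).encode() + audio_bytes + b"\r\n"
    definition_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="definition"\r\n\r\n'
        f"{json.dumps(definition)}\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    url = (
        f"https://{region}.api.cognitive.microsoft.com/speechtotext/"
        "transcriptions:transcribe?api-version=2024-05-15-preview"
    )
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(
        url, data=audio_part + definition_part + closing,
        headers=headers, method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, and the response body can then be decoded with `json.load`. Here `definition` is a plain dictionary such as `{"locales": ["en-US"]}`.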

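The response includes duration, channel, and more. The combinedPhrases property contains the full transcription for each channel separately. For example, in a two-channel call recording, everything the first speaker said is in the first element of the combinedPhrases array, and everything the second speaker said is in the second element.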

{
	"duration": 185079,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
		},
		{
			"channel": 1,
			"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"offset": 720,
			"duration": 480,
			"text": "Hello.",
			"words": [
				{
					"text": "Hello.",
					"offset": 720,
					"duration": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offset": 1200,
			"duration": 1120,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offset": 1200,
					"duration": 200
				},
				{
					"text": "you",
					"offset": 1400,
					"duration": 80
				},
				{
					"text": "for",
					"offset": 1480,
					"duration": 120
				},
				{
					"text": "calling",
					"offset": 1600,
					"duration": 240
				},
				{
					"text": "Contoso.",
					"offset": 1840,
					"duration": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offset": 2320,
			"duration": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offset": 2320,
					"duration": 160
				},
				{
					"text": "am",
					"offset": 2480,
					"duration": 80
				},
				{
					"text": "I",
					"offset": 2560,
					"duration": 80
				},
				{
					"text": "speaking",
					"offset": 2640,
					"duration": 320
				},
				{
					"text": "with",
					"offset": 2960,
					"duration": 160
				},
				{
					"text": "today?",
					"offset": 3120,
					"duration": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
        // More transcription results removed for brevity
        // {...},
		{
			"channel": 1,
			"offset": 4480,
			"duration": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offset": 4480,
					"duration": 400
				},
				{
					"text": "my",
					"offset": 4880,
					"duration": 120
				},
				{
					"text": "name",
					"offset": 5000,
					"duration": 120
				},
				{
					"text": "is",
					"offset": 5120,
					"duration": 160
				},
				{
					"text": "Mary",
					"offset": 5280,
					"duration": 240
				},
				{
					"text": "Rondo.",
					"offset": 5520,
					"duration": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		{
			"channel": 1,
			"offset": 6080,
			"duration": 1920,
			"text": "I'm trying to enroll myself with Contuso.",
			"words": [
				{
					"text": "I'm",
					"offset": 6080,
					"duration": 160
				},
				{
					"text": "trying",
					"offset": 6240,
					"duration": 200
				},
				{
					"text": "to",
					"offset": 6440,
					"duration": 80
				},
				{
					"text": "enroll",
					"offset": 6520,
					"duration": 200
				},
				{
					"text": "myself",
					"offset": 6720,
					"duration": 360
				},
				{
					"text": "with",
					"offset": 7080,
					"duration": 120
				},
				{
					"text": "Contuso.",
					"offset": 7200,
					"duration": 800
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
        // More transcription results removed for brevity
        // {...},
	]
}
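Once the response is parsed into a dictionary, the per-channel transcripts and a time-ordered view of the phrases can be pulled out with a few lines. This is a sketch under assumptions: the helper names are hypothetical, and offset and duration are treated as milliseconds, which the sample values suggest but the text above doesn't state.

```python
def format_offset(ms):
    """Render an offset (assumed to be milliseconds) as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def transcript_by_channel(response):
    """Map each channel index to its full transcript from combinedPhrases."""
    # .get() is used because the channel index is absent when the channels
    # property wasn't set in the request; the key is then None.
    return {item.get("channel"): item["text"] for item in response["combinedPhrases"]}


def timeline(response):
    """List (start, channel, text) tuples for all phrases, ordered by offset."""
    phrases = sorted(response["phrases"], key=lambda p: p["offset"])
    return [(format_offset(p["offset"]), p.get("channel"), p["text"]) for p in phrases]
```

Sorting by offset interleaves the two channels, so a call recording reads as a conversation instead of two separate monologues.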