Use the fast transcription API (preview) with Azure AI Speech

[Article]
11/04/2024

Note

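This feature is currently in public preview. The preview is provided without a service-level agreement and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

The fast transcription API is available only through the speech to text REST API version 2024-05-15-preview. This preview version is subject to change and is not recommended for production use. It will be retired without notice 90 days after a subsequent preview version or the general availability (GA) of the API is released.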

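The fast transcription API transcribes audio files and returns the results synchronously, much faster than real-time audio. Use fast transcription when you need the transcript of an audio recording as quickly as possible with predictable latency, in scenarios such as:

Quick audio or video transcription, subtitles, and editing.
Video translation.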

Tip

Try out fast transcription in Azure AI Studio.

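Prerequisites

An Azure AI Speech resource in one of the regions where the fast transcription API is available. The supported regions are: Australia East, Brazil South, Central India, East US, East US 2, France Central, Japan East, North Central US, North Europe, South Central US, Southeast Asia, Sweden Central, West Europe, West US, West US 2, and West US 3. For more information about regions supported for other Speech service features, see Speech service regions.
An audio file (less than 2 hours long and less than 200 MB in size) in one of the formats and codecs supported by the batch transcription API. For more information about supported audio formats, see the supported audio formats section.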

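Use the fast transcription API

The fast transcription API is a REST API that uses multipart/form-data to submit audio files for transcription. The API returns the transcription results synchronously.

Construct the request body according to the following instructions: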

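Set the required locales property. This value should match the expected locale of the audio data to transcribe. The supported locales are: de-DE, en-IN, en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. For details, see the Speech service language support. You can get the latest supported languages from the Transcriptions - List Supported Locales REST API operation.
Optionally, set the profanityFilterMode property to specify how to handle profanity in recognition results. Accepted values are None (disable profanity filtering), Masked (replace profanity with asterisks), Removed (remove all profanity from the result), or Tags (add profanity tags). The default value is Masked. The profanityFilterMode property works the same way as in the batch transcription API.
Optionally, set the channels property to specify the zero-based indices of the channels to transcribe separately. If it is not specified, multiple channels are merged and transcribed jointly. At most two channels are supported. To transcribe the channels of a stereo audio file separately, specify [0,1] here. Otherwise, stereo audio is merged to mono, mono audio is left as is, and only a single channel is transcribed. In either of the latter cases, only one audio stream is transcribed, so the output has no channel indices for the transcribed text.
Optionally, set the diarizationSettings property to recognize and separate multiple speakers in a mono channel audio file. You need to specify the minimum and maximum number of people who might be speaking in the audio file (for example, specify "diarizationSettings": {"minSpeakers": 1, "maxSpeakers": 4}). The transcription file then contains a speaker entry for each transcribed phrase. This feature isn't available with stereo audio when you set the channels property to [0,1].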
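
The rules above can be sketched as a small client-side builder for the definition form field. This is a minimal illustration rather than part of the API: the helper name build_definition and its parameters are hypothetical, the locale list is copied from the text above and may be out of date, and the check that minSpeakers does not exceed maxSpeakers is an assumption. The service remains the authority on what it accepts.

```python
import json

# Locales listed in this article for API version 2024-05-15-preview.
SUPPORTED_LOCALES = {
    "de-DE", "en-IN", "en-US", "es-ES", "es-MX", "fr-FR",
    "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN",
}
PROFANITY_MODES = {"None", "Masked", "Removed", "Tags"}


def build_definition(locales, profanity_filter_mode="Masked",
                     channels=None, diarization=None):
    """Return the JSON string for the `definition` form field.

    Hypothetical helper: `diarization` is a (min_speakers, max_speakers) tuple.
    """
    unsupported = [loc for loc in locales if loc not in SUPPORTED_LOCALES]
    if not locales or unsupported:
        raise ValueError(f"locales is required; unsupported: {unsupported}")
    if profanity_filter_mode not in PROFANITY_MODES:
        raise ValueError(f"unknown profanityFilterMode: {profanity_filter_mode}")

    definition = {
        "locales": list(locales),
        "profanityFilterMode": profanity_filter_mode,
    }
    if channels is not None:
        if len(channels) > 2:
            raise ValueError("at most two channels are supported")
        definition["channels"] = list(channels)
    if diarization is not None:
        # Diarization is described for mono audio; it is not available
        # when channels is set to [0,1].
        if channels is not None and list(channels) == [0, 1]:
            raise ValueError("diarizationSettings cannot be combined with channels [0,1]")
        min_speakers, max_speakers = diarization
        if not 1 <= min_speakers <= max_speakers:  # assumption, not from the article
            raise ValueError("expected 1 <= minSpeakers <= maxSpeakers")
        definition["diarizationSettings"] = {
            "minSpeakers": min_speakers,
            "maxSpeakers": max_speakers,
        }
    return json.dumps(definition)
```

For example, build_definition(["en-US"], channels=[0, 1]) produces the same properties as the definition used in the curl example, while build_definition(["en-US"], diarization=(1, 4)) requests speaker separation for a mono file.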

Make a multipart/form-data POST request to the transcriptions endpoint with the audio file and the request body properties. The following example shows how to create a transcription using the fast transcription API.

Replace YourSubscriptionKey with your Speech resource key.
Replace YourServiceRegion with the region of your Speech resource.
Replace YourAudioFile with the path to your audio file.
Set the form definition properties as previously described.

curl --location 'https://YourServiceRegion.api.cognitive.microsoft.com/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview' \
--header 'Content-Type: multipart/form-data' \
--header 'Accept: application/json' \
--header 'Ocp-Apim-Subscription-Key: YourSubscriptionKey' \
--form 'audio=@"YourAudioFile"' \
--form 'definition="{
    \"locales\":[\"en-US\"], 
    \"profanityFilterMode\": \"Masked\", 
    \"channels\": [0,1]}"'

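For readers who prefer Python to curl, the same request can be sketched with only the standard library. This is an illustrative equivalent under stated assumptions, not an official client: the function names build_request and transcribe, the default filename, and the application/octet-stream part type are all hypothetical choices. The endpoint, headers, and form field names (audio, definition) are taken from the curl command above.

```python
import json
import urllib.request
import uuid

API_VERSION = "2024-05-15-preview"


def build_request(region, key, audio_bytes, definition, filename="audio.wav"):
    """Build (but do not send) the multipart/form-data POST request."""
    boundary = uuid.uuid4().hex  # random delimiter between the form parts
    url = (
        f"https://{region}.api.cognitive.microsoft.com/speechtotext/"
        f"transcriptions:transcribe?api-version={API_VERSION}"
    )
    # Part 1: the audio file (the part content type is an assumption).
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    # Part 2: the definition JSON, followed by the closing boundary.
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="definition"\r\n\r\n'
        f"{json.dumps(definition)}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    }
    return urllib.request.Request(
        url, data=head + audio_bytes + tail, headers=headers, method="POST"
    )


def transcribe(region, key, audio_path, definition):
    """Send the request and return the parsed JSON response."""
    with open(audio_path, "rb") as audio_file:
        audio_bytes = audio_file.read()
    request = build_request(region, key, audio_bytes, definition)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as transcribe("YourServiceRegion", "YourSubscriptionKey", "YourAudioFile", {"locales": ["en-US"], "profanityFilterMode": "Masked", "channels": [0, 1]}) mirrors the curl command. Because the API is synchronous, the call blocks until the transcription is returned.
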
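The response includes duration, channel, and more. The combinedPhrases property contains the full transcription for each channel separately. For example, everything the first speaker said is in the first element of the combinedPhrases array, and everything the second speaker said is in the second element of the array.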

{
	"duration": 185079,
	"combinedPhrases": [
		{
			"channel": 0,
			"text": "Hello. Thank you for calling Contoso. Who am I speaking with today? Hi, Mary. Are you calling because you need health insurance? Great. If you can answer a few questions, we can get you signed up in the Jiffy. So what's your full name? Got it. And what's the best callback number in case we get disconnected? Yep, that'll be fine. Got it. So to confirm, it's 234-554-9312. Excellent. Let's get some additional information for your application. Do you have a job? OK, so then you have a Social Security number as well. OK, and what is your Social Security number please? Sorry, what was that, a 25 or a 225? You cut out for a bit. Alright, thank you so much. And could I have your e-mail address please? Great. Uh That is the last question. So let me take your information and I'll be able to get you signed up right away. Thank you for calling Contoso and I'll be able to get you signed up immediately. One of our agents will call you back in about 24 hours or so to confirm your application. Absolutely. If you need anything else, please give us a call at 1-800-555-5564, extension 123. Thank you very much for calling Contoso. Uh Yes, of course. So the default is a digital membership card, but we can send you a physical card if you prefer. Uh, yeah. Absolutely. I've made a note on your file. You're very welcome. Thank you for calling Contoso and have a great day."
		},
		{
			"channel": 1,
			"text": "Hi, my name is Mary Rondo. I'm trying to enroll myself with Contuso. Yes, yeah, I'm calling to sign up for insurance. Okay. So Mary Beth Rondo, last name is R like Romeo, O like Ocean, N like Nancy D, D like Dog, and O like Ocean again. Rondo. I only have a cell phone so I can give you that. Sure, so it's 234-554 and then 9312. Yep, that's right. Uh Yes, I am self-employed. Yes, I do. Uh Sure, so it's 412256789. It's double two, so 412, then another two, then five. Yeah, it's maryrondo@gmail.com. So my first and last name at gmail.com. No periods, no dashes. That was quick. Thank you. Actually, so I have one more question. I'm curious, will I be getting a physical card as proof of coverage? uh Yes. Could you please mail it to me when it's ready? I'd like to have it shipped to, are you ready for my address? So it's 2660 Unit A on Maple Avenue SE, Lansing, and then zip code is 48823. Awesome. Thanks so much."
		}
	],
	"phrases": [
		{
			"channel": 0,
			"offset": 720,
			"duration": 480,
			"text": "Hello.",
			"words": [
				{
					"text": "Hello.",
					"offset": 720,
					"duration": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offset": 1200,
			"duration": 1120,
			"text": "Thank you for calling Contoso.",
			"words": [
				{
					"text": "Thank",
					"offset": 1200,
					"duration": 200
				},
				{
					"text": "you",
					"offset": 1400,
					"duration": 80
				},
				{
					"text": "for",
					"offset": 1480,
					"duration": 120
				},
				{
					"text": "calling",
					"offset": 1600,
					"duration": 240
				},
				{
					"text": "Contoso.",
					"offset": 1840,
					"duration": 480
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
		{
			"channel": 0,
			"offset": 2320,
			"duration": 1120,
			"text": "Who am I speaking with today?",
			"words": [
				{
					"text": "Who",
					"offset": 2320,
					"duration": 160
				},
				{
					"text": "am",
					"offset": 2480,
					"duration": 80
				},
				{
					"text": "I",
					"offset": 2560,
					"duration": 80
				},
				{
					"text": "speaking",
					"offset": 2640,
					"duration": 320
				},
				{
					"text": "with",
					"offset": 2960,
					"duration": 160
				},
				{
					"text": "today?",
					"offset": 3120,
					"duration": 320
				}
			],
			"locale": "en-US",
			"confidence": 0.9177142
		},
        // More transcription results removed for brevity
        // {...},
		{
			"channel": 1,
			"offset": 4480,
			"duration": 1600,
			"text": "Hi, my name is Mary Rondo.",
			"words": [
				{
					"text": "Hi,",
					"offset": 4480,
					"duration": 400
				},
				{
					"text": "my",
					"offset": 4880,
					"duration": 120
				},
				{
					"text": "name",
					"offset": 5000,
					"duration": 120
				},
				{
					"text": "is",
					"offset": 5120,
					"duration": 160
				},
				{
					"text": "Mary",
					"offset": 5280,
					"duration": 240
				},
				{
					"text": "Rondo.",
					"offset": 5520,
					"duration": 560
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
		{
			"channel": 1,
			"offset": 6080,
			"duration": 1920,
			"text": "I'm trying to enroll myself with Contuso.",
			"words": [
				{
					"text": "I'm",
					"offset": 6080,
					"duration": 160
				},
				{
					"text": "trying",
					"offset": 6240,
					"duration": 200
				},
				{
					"text": "to",
					"offset": 6440,
					"duration": 80
				},
				{
					"text": "enroll",
					"offset": 6520,
					"duration": 200
				},
				{
					"text": "myself",
					"offset": 6720,
					"duration": 360
				},
				{
					"text": "with",
					"offset": 7080,
					"duration": 120
				},
				{
					"text": "Contuso.",
					"offset": 7200,
					"duration": 800
				}
			],
			"locale": "en-US",
			"confidence": 0.8989456
		},
        // More transcription results removed for brevity
        // {...},
	]
}
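
As a rough sketch of consuming such a response, the following Python reads the per-channel text from combinedPhrases and interleaves the phrases from both channels by start time. The function names are hypothetical, and treating offset and duration as milliseconds is an assumption that the values in the sample suggest but this article does not state. The sample response above elides entries behind comments, so the sketch works on an already parsed dictionary rather than on that text.

```python
def transcript_by_channel(response):
    """Map each channel index to its full text from combinedPhrases.

    The key is None when the response carries no channel index
    (for example, when channels was not set in the request).
    """
    return {
        item.get("channel"): item["text"]
        for item in response["combinedPhrases"]
    }


def phrase_timeline(response):
    """Return (start_seconds, channel, text) tuples ordered by start time.

    Assumes offsets are in milliseconds.
    """
    rows = [
        (phrase["offset"] / 1000.0, phrase.get("channel"), phrase["text"])
        for phrase in response["phrases"]
    ]
    return sorted(rows, key=lambda row: row[0])


# A trimmed response using values from the example above.
sample = {
    "duration": 185079,
    "combinedPhrases": [
        {"channel": 0, "text": "Hello. Thank you for calling Contoso."},
        {"channel": 1, "text": "Hi, my name is Mary Rondo."},
    ],
    "phrases": [
        {"channel": 0, "offset": 720, "duration": 480, "text": "Hello."},
        {"channel": 0, "offset": 1200, "duration": 1120,
         "text": "Thank you for calling Contoso."},
        {"channel": 1, "offset": 4480, "duration": 1600,
         "text": "Hi, my name is Mary Rondo."},
    ],
}

for start, channel, text in phrase_timeline(sample):
    print(f"[{start:7.2f}s] channel {channel}: {text}")
```

Sorting by offset turns the per-channel phrase lists into a single conversation-ordered view, which is often what subtitle or call-review tooling needs.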
