Lower speech synthesis latency using the Speech SDK

02/02/2024

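Synthesis latency is critical to your applications. This article introduces best practices for lowering latency and delivering the best performance to your end users.

Latency is normally measured as first byte latency and finish latency, as follows: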

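Latency	Description	SpeechSynthesisResult property key
first byte latency	Indicates the time delay between the start of the synthesis task and receipt of the first chunk of audio data.	SpeechServiceResponse_SynthesisFirstByteLatencyMs
finish latency	Indicates the time delay between the start of the synthesis task and receipt of the whole synthesized audio data.	SpeechServiceResponse_SynthesisFinishLatencyMs

The Speech SDK puts the latency durations in the Properties collection of SpeechSynthesisResult. The following C# sample code shows how to read these values.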

var result = await synthesizer.SpeakTextAsync(text);
Console.WriteLine($"first byte latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs)} ms");
Console.WriteLine($"finish latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs)} ms");
// you can also get the result id, and send to us when you need help for diagnosis
var resultId = result.ResultId;

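Latency	Description	SpeechSynthesisResult property key
`first byte latency`	Indicates the time delay between the start of synthesis and receipt of the first audio chunk.	`SpeechServiceResponse_SynthesisFirstByteLatencyMs`
`finish latency`	Indicates the time delay between the start of synthesis and receipt of the whole synthesized audio.	`SpeechServiceResponse_SynthesisFinishLatencyMs`

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. Refer to the following C++ code to get them.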

auto result = synthesizer->SpeakTextAsync(text).get();
auto firstByteLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFirstByteLatencyMs));
auto finishedLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFinishLatencyMs));
// you can also get the result id, and send to us when you need help for diagnosis
auto resultId = result->ResultId;

In Java, the same property keys are read as follows.

SpeechSynthesisResult result = synthesizer.SpeakTextAsync(text).get();
System.out.println("first byte latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.");
System.out.println("finish latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.");
// you can also get the result id, and send to us when you need help for diagnosis
String resultId = result.getResultId();

In Python, the same property keys are read as follows.

result = synthesizer.speak_text_async(text).get()
first_byte_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs))
finished_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs))
# you can also get the result id, and send to us when you need help for diagnosis
result_id = result.result_id

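Latency	Description	SPXSpeechSynthesisResult property key
`first byte latency`	Indicates the time delay between the start of synthesis and receipt of the first audio chunk.	`SPXSpeechServiceResponseSynthesisFirstByteLatencyMs`
`finish latency`	Indicates the time delay between the start of synthesis and receipt of the whole synthesized audio.	`SPXSpeechServiceResponseSynthesisFinishLatencyMs`

The Speech SDK measures the latencies and puts them in the property bag of SPXSpeechSynthesisResult. Refer to the following Objective-C code to get them.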

SPXSpeechSynthesisResult *speechResult = [speechSynthesizer speakText:text];
// getPropertyById: returns an NSString, so convert it with intValue
int firstByteLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFirstByteLatencyMs] intValue];
int finishedLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFinishLatencyMs] intValue];
// you can also get the result ID and send it to us when you need help with diagnosis
NSString *resultId = speechResult.resultId;

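In most cases, the first byte latency is lower than the finish latency. The first byte latency is independent of text length, while the finish latency increases with text length.

Ideally, the latency the user experiences (the wait before the user hears sound) is reduced to one network round-trip time plus the first audio chunk latency of the speech synthesis service.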
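As a rough illustration of that target, the sketch below adds one network round trip to the service's first audio chunk latency and compares the sum with a measured first byte latency. The function names and every millisecond value are hypothetical, chosen only to show the arithmetic; they are not measurements from the Speech service.

```python
def ideal_perceived_latency_ms(round_trip_ms: float, first_chunk_ms: float) -> float:
    """Best case before the user hears sound: one round trip plus first-chunk latency."""
    return round_trip_ms + first_chunk_ms


def excess_latency_ms(measured_first_byte_ms: float,
                      round_trip_ms: float,
                      first_chunk_ms: float) -> float:
    """How far a measured first byte latency sits above the best case (never negative)."""
    ideal = ideal_perceived_latency_ms(round_trip_ms, first_chunk_ms)
    return max(0.0, measured_first_byte_ms - ideal)


# Hypothetical inputs: 40 ms round trip, 150 ms first-chunk latency, 260 ms measured.
print("ideal (ms):", ideal_perceived_latency_ms(40, 150))
print("excess (ms):", excess_latency_ms(260, 40, 150))
```

A large excess suggests that time is being lost outside the service itself, for example by waiting for the whole audio before playback starts.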

Streaming

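Streaming is the key to lowering latency. Client code can start playback as soon as the first audio chunk is received. In a service scenario, you can forward the audio chunks to your clients immediately instead of waiting for the whole audio.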
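The effect of forwarding chunks as they arrive can be sketched without the SDK. In the example below, `fake_synthesis` is a hypothetical stand-in for a synthesis stream (it is not a Speech SDK API), and `forward_chunks` passes every chunk to a sink immediately while recording when the first chunk and the last chunk were handled.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_synthesis(chunk_count: int = 5, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a synthesis stream: yields chunks as they become ready."""
    for i in range(chunk_count):
        time.sleep(chunk_delay_s)      # simulated time to produce one chunk
        yield bytes([i]) * 320         # 320 bytes of dummy audio


def forward_chunks(chunks: Iterable[bytes], sink: Callable[[bytes], None]) -> dict:
    """Forward each chunk to the sink immediately and record simple timings."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        sink(chunk)                    # e.g. write to a player or a client connection
        total_bytes += len(chunk)
    finish_s = time.perf_counter() - start
    return {"first_chunk_s": first_chunk_s, "finish_s": finish_s, "total_bytes": total_bytes}


received = []
stats = forward_chunks(fake_synthesis(), received.append)
print(stats)
```

With this pattern the sink sees audio after roughly one chunk's delay, whereas buffering the whole result would make it wait for the full finish time.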

You can use the PullAudioOutputStream, PushAudioOutputStream, Synthesizing event, and AudioDataStream of the Speech SDK to enable streaming.
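Of these options, the Synthesizing event is probably the simplest to sketch. The `ChunkForwarder` class below is a hypothetical helper, not part of the Speech SDK. It assumes that each Synthesizing event exposes the newly received audio as `evt.result.audio_data`, as in the Python SDK samples. So that the example runs without credentials, it is exercised here with fake events; the commented lines show how the wiring might look against the real SDK, where `speech_config` and `text` are assumed to be defined elsewhere.

```python
from types import SimpleNamespace
from typing import Callable


class ChunkForwarder:
    """Hypothetical helper: forwards audio chunks from Synthesizing events to a sink."""

    def __init__(self, sink: Callable[[bytes], None]) -> None:
        self._sink = sink
        self.bytes_forwarded = 0

    def on_synthesizing(self, evt) -> None:
        # Assumption: evt.result.audio_data holds the chunk delivered with this event.
        chunk = evt.result.audio_data
        if chunk:
            self._sink(chunk)
            self.bytes_forwarded += len(chunk)

    def attach(self, synthesizer) -> None:
        # Subscribe the callback to the synthesizer's Synthesizing event.
        synthesizer.synthesizing.connect(self.on_synthesizing)


# Wiring against the real SDK might look like this (not executed here):
#   import azure.cognitiveservices.speech as speechsdk
#   synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
#   forwarder.attach(synthesizer)
#   synthesizer.speak_text_async(text).get()

# Self-contained demonstration with fake events instead of the service.
forwarded = []
forwarder = ChunkForwarder(forwarded.append)
for payload in (b"abc", b"", b"def"):
    forwarder.on_synthesizing(SimpleNamespace(result=SimpleNamespace(audio_data=payload)))
print(forwarder.bytes_forwarded, "bytes forwarded in", len(forwarded), "chunks")
```

Keeping the callback small matters in this design, because any slow work in it delays delivery of the chunks that follow.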