Lower speech synthesis latency using the Speech SDK

Article
09/24/2024

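This article introduces best practices for lowering text to speech synthesis latency and delivering the best performance to your end users.

We normally measure latency as first byte latency and finish latency, as follows:

Latency	Description	SpeechSynthesisResult property key
first byte latency	The time delay between the start of the synthesis task and receipt of the first chunk of audio data.	SpeechServiceResponse_SynthesisFirstByteLatencyMs
finish latency	The time delay between the start of the synthesis task and receipt of the whole synthesized audio data.	SpeechServiceResponse_SynthesisFinishLatencyMs

The Speech SDK puts the latency durations in the Properties collection of SpeechSynthesisResult. The following C# sample shows these values.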

var result = await synthesizer.SpeakTextAsync(text);
Console.WriteLine($"first byte latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs)} ms");
Console.WriteLine($"finish latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs)} ms");
// you can also get the result id, and send to us when you need help for diagnosis
var resultId = result.ResultId;

In C++, read the same two properties from the result as follows:

auto result = synthesizer->SpeakTextAsync(text).get();
auto firstByteLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFirstByteLatencyMs));
auto finishedLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFinishLatencyMs));
// you can also get the result id, and send to us when you need help for diagnosis
auto resultId = result->ResultId;

In Java, read the same two properties from the result as follows:

SpeechSynthesisResult result = synthesizer.SpeakTextAsync(text).get();
System.out.println("first byte latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.");
System.out.println("finish latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.");
// you can also get the result id, and send to us when you need help for diagnosis
String resultId = result.getResultId();

In Python, read the same two properties from the result as follows:

result = synthesizer.speak_text_async(text).get()
first_byte_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs))
finished_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs))
# you can also get the result id, and send to us when you need help for diagnosis
result_id = result.result_id

In Objective-C, the result type is SPXSpeechSynthesisResult and the property keys are SPXSpeechServiceResponseSynthesisFirstByteLatencyMs and SPXSpeechServiceResponseSynthesisFinishLatencyMs:

SPXSpeechSynthesisResult *speechResult = [speechSynthesizer speakText:text];
int firstByteLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFirstByteLatencyMs] intValue];
int finishedLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFinishLatencyMs] intValue];
// you can also get the result id, and send to us when you need help for diagnosis
NSString *resultId = speechResult.resultId;

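In most cases, the first byte latency is lower than the finish latency. The first byte latency is independent of text length, while the finish latency increases with text length.

Ideally, we want to reduce the latency the user experiences (the delay before the user hears sound) to one network round-trip time plus the first audio chunk latency of the speech synthesis service.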

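Streaming

Streaming is critical to lowering latency. Client code can start playback as soon as the first audio chunk is received. In a service scenario, you can forward the audio chunks to your clients immediately instead of waiting for the whole audio.

You can use the PullAudioOutputStream, PushAudioOutputStream, Synthesizing event, and AudioDataStream of the Speech SDK to enable streaming.

Taking AudioDataStream as an example (C#):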

using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
{
    using (var result = await synthesizer.StartSpeakingTextAsync(text))
    {
        using (var audioDataStream = AudioDataStream.FromResult(result))
        {
            byte[] buffer = new byte[16000];
            uint filledSize = 0;
            while ((filledSize = audioDataStream.ReadData(buffer)) > 0)
            {
                Console.WriteLine($"{filledSize} bytes received.");
            }
        }
    }
}

The same AudioDataStream example in C++:

auto synthesizer = SpeechSynthesizer::FromConfig(config, nullptr);
auto result = synthesizer->StartSpeakingTextAsync(text).get();
auto audioDataStream = AudioDataStream::FromResult(result);
uint8_t buffer[16000];
uint32_t filledSize = 0;
while ((filledSize = audioDataStream->ReadData(buffer, sizeof(buffer))) > 0)
{
    cout << filledSize << " bytes received." << endl;
}

The same AudioDataStream example in Java:

SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
SpeechSynthesisResult result = synthesizer.StartSpeakingTextAsync(text).get();
AudioDataStream audioDataStream = AudioDataStream.fromResult(result);
byte[] buffer = new byte[16000];
long filledSize = audioDataStream.readData(buffer);
while (filledSize > 0) {
    System.out.println(filledSize + " bytes received.");
    filledSize = audioDataStream.readData(buffer);
}

The same AudioDataStream example in Python:

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = speech_synthesizer.start_speaking_text_async(text).get()
audio_data_stream = speechsdk.AudioDataStream(result)
audio_buffer = bytes(16000)
filled_size = audio_data_stream.read_data(audio_buffer)
while filled_size > 0:
    print("{} bytes received.".format(filled_size))
    filled_size = audio_data_stream.read_data(audio_buffer)

In Objective-C, the corresponding types are SPXPullAudioOutputStream, SPXPushAudioOutputStream, and SPXAudioDataStream. The same example using SPXAudioDataStream:

SPXSpeechSynthesizer *synthesizer = [[SPXSpeechSynthesizer alloc] initWithSpeechConfiguration:speechConfig audioConfiguration:nil];
SPXSpeechSynthesisResult *speechResult = [synthesizer startSpeakingText:inputText];
SPXAudioDataStream *stream = [[SPXAudioDataStream alloc] initFromSynthesisResult:speechResult];
NSMutableData* data = [[NSMutableData alloc]initWithCapacity:16000];
while ([stream readData:data length:16000] > 0) {
    // Read data here
}
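
For the service scenario described at the start of this section, forwarding can be sketched as a small relay loop. This is a minimal Python sketch, not part of the Speech SDK: `relay_chunks` and the `send` callback are hypothetical names, and the chunk source is assumed to be any iterable of byte chunks, for example one filled by a read loop like those above.

```python
from typing import Callable, Iterable


def relay_chunks(chunks: Iterable[bytes], send: Callable[[bytes], None]) -> int:
    """Forward each audio chunk to the client as soon as it arrives.

    `chunks` and `send` are illustrative stand-ins (assumptions), not SDK types.
    Returns the total number of bytes forwarded.
    """
    total = 0
    for chunk in chunks:
        if not chunk:
            continue          # skip empty reads
        send(chunk)           # forward immediately instead of buffering the whole audio
        total += len(chunk)
    return total
```

Because each chunk is sent before the next one is requested, the delay the client sees before the first audio follows the first byte latency rather than the finish latency.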

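Pre-connect and reuse SpeechSynthesizer

The Speech SDK uses a websocket to communicate with the service. Ideally, the network latency should be one round-trip time (RTT). If the connection is newly established, the network latency also includes the time needed to establish the connection. Establishing a websocket connection requires a TCP handshake, an SSL handshake, an HTTP connection, and a protocol upgrade, all of which add delay. To avoid this connection latency, we recommend pre-connecting and reusing the SpeechSynthesizer.

Pre-connect

To pre-connect, establish a connection to the Speech service when you know the connection will be needed soon. For example, if you're building a voice chatbot in a client, you can pre-connect to the speech synthesis service when the user starts talking, and call SpeakTextAsync when the bot's reply text is ready.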

using (var synthesizer = new SpeechSynthesizer(uspConfig, null as AudioConfig))
{
    using (var connection = Connection.FromSpeechSynthesizer(synthesizer))
    {
        connection.Open(true);
    }
    await synthesizer.SpeakTextAsync(text);
}

auto synthesizer = SpeechSynthesizer::FromConfig(config, nullptr);
auto connection = Connection::FromSpeechSynthesizer(synthesizer);
connection->Open(true);

SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, (AudioConfig) null);
Connection connection = Connection.fromSpeechSynthesizer(synthesizer);
connection.openConnection(true);

synthesizer = speechsdk.SpeechSynthesizer(config, None)
connection = speechsdk.Connection.from_speech_synthesizer(synthesizer)
connection.open(True)

SPXSpeechSynthesizer* synthesizer = [[SPXSpeechSynthesizer alloc]initWithSpeechConfiguration:self.speechConfig audioConfiguration:nil];
SPXConnection* connection = [[SPXConnection alloc]initFromSpeechSynthesizer:synthesizer];
[connection open:true];

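Note

If the text is already available, just call SpeakTextAsync to synthesize the audio. The SDK handles the connection.

Reuse SpeechSynthesizer

Another way to reduce connection latency is to reuse the SpeechSynthesizer, so that you don't need to create a new SpeechSynthesizer for each synthesis. We recommend using an object pool in service scenarios. See the sample code for C# and Java.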
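
As a rough illustration of the object pool idea, the following Python sketch keeps a fixed number of pre-built synthesizers and lends them out per request. It is not the C# or Java sample referred to above: `SynthesizerPool`, `borrow`, and the `factory` callable are hypothetical names introduced here, and the pool works with any object the factory returns.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Generic, Iterator, Optional, TypeVar

T = TypeVar("T")


class SynthesizerPool(Generic[T]):
    """Fixed-size pool that lends out pre-built synthesizer objects (illustrative)."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        # Assumption: in real use, `factory` would build a Speech SDK synthesizer,
        # for example the way the Python streaming sample above constructs one.
        self._items: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._items.put(factory())  # create each synthesizer once, up front

    @contextmanager
    def borrow(self, timeout: Optional[float] = None) -> Iterator[T]:
        item = self._items.get(timeout=timeout)  # blocks until a synthesizer is free
        try:
            yield item
        finally:
            self._items.put(item)  # hand it back for the next request
```

A request handler would wrap each synthesis in `with pool.borrow() as synthesizer:`, so no synthesizer is constructed on the request path.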

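Transmit compressed audio over the network

When the network is unstable or bandwidth is limited, the payload size also affects latency. A compressed audio format also saves network bandwidth, which is especially valuable for mobile users.

We support many compressed formats, including opus, webm, mp3, and silk; see the full list in SpeechSynthesisOutputFormat. For example, the bitrate of the Riff24Khz16BitMonoPcm format is 384 kbps, while Audio24Khz48KBitRateMonoMp3 needs only 48 kbps. When a pcm output format is set, the Speech SDK automatically uses a compressed format for transmission. On Linux and Windows, GStreamer is required to enable this feature. See the instructions for installing and configuring GStreamer for the Speech SDK. On Android, iOS, and macOS, no extra configuration is needed starting with version 1.20.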
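
The 384 kbps figure follows from the PCM parameters in the format name (24 kHz sample rate, 16 bits per sample, one channel). The short Python sketch below reproduces that arithmetic and compares approximate payload sizes for 10 seconds of audio. The helper names are hypothetical, and headers and container overhead are ignored.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def payload_bytes(bitrate_kbps: float, seconds: float) -> int:
    """Approximate payload size in bytes for audio of the given duration."""
    return int(bitrate_kbps * 1000 * seconds / 8)


pcm = pcm_bitrate_kbps(24000, 16)   # Riff24Khz16BitMonoPcm: 24 kHz, 16-bit, mono
mp3 = 48.0                          # Audio24Khz48KBitRateMonoMp3: 48 kbps
print(pcm)                          # 384.0
print(payload_bytes(pcm, 10))       # 480000
print(payload_bytes(mp3, 10))       # 60000
```

For the same duration, the MP3 payload is one eighth of the PCM payload.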

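Input text streaming

Text streaming allows text to be processed in real time so that audio is generated quickly. It is well suited to vocalizing dynamic text, such as reading the output of AI models like GPT as it is produced. This feature minimizes latency and improves the fluidity and responsiveness of the audio output, which makes it a good fit for interactive applications, live events, and responsive AI-driven dialogues.

How to use text streaming

Text streaming is supported by the Speech SDK in C#, C++, and Python.

To use the text streaming feature, connect to the WebSocket V2 endpoint: wss://{region}.tts.speech.microsoft.com/cognitiveservices/websocket/v2

See the following C# sample code for setting the endpoint: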

// IMPORTANT: MUST use the websocket v2 endpoint
var ttsEndpoint = $"wss://{Environment.GetEnvironmentVariable("AZURE_TTS_REGION")}.tts.speech.microsoft.com/cognitiveservices/websocket/v2";
var speechConfig = SpeechConfig.FromEndpoint(
    new Uri(ttsEndpoint),
    Environment.GetEnvironmentVariable("AZURE_TTS_API_KEY"));

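Key steps

Create a text stream request: Use SpeechSynthesisRequestInputType.TextStream to initiate a text stream.
Set global properties: Adjust settings such as the output format and voice name directly, because the feature handles partial text input and doesn't support SSML. See the following sample code for how to set them. OpenAI text to speech voices aren't supported by the text streaming feature. For full language support, see this language table.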
```
// Set output format
speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm);

// Set a voice name
speechConfig.SetProperty(PropertyId.SpeechServiceConnection_SynthVoice, "en-US-AvaMultilingualNeural");
```
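Stream your text: For each text chunk generated by a GPT model, use request.InputStream.Write(text); to send the text to the stream.
Close the stream: After the GPT model completes its output, close the stream with request.InputStream.Close();.

For a detailed implementation, see the sample code on GitHub.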

In Python, set the same WebSocket V2 endpoint as follows:

# IMPORTANT: MUST use the websocket v2 endpoint
speech_config = speechsdk.SpeechConfig(endpoint=f"wss://{os.getenv('AZURE_TTS_REGION')}.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
                                       subscription=os.getenv("AZURE_TTS_API_KEY"))

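Key steps

Create a text stream request: Use speechsdk.SpeechSynthesisRequestInputType.TextStream to initiate a text stream.
Set global properties: Adjust settings such as the output format and voice name directly, because the feature handles partial text input and doesn't support SSML. See the following sample code for how to set them. OpenAI text to speech voices aren't supported by the text streaming feature. For full language support, see this language table.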
```
# set a voice name
speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"
```
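Stream your text: For each text chunk generated by a GPT model, use request.input_stream.write(text) to send the text to the stream.
Close the stream: After the GPT model completes its output, close the stream with request.input_stream.close().

For a detailed implementation, see the sample code on GitHub.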
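
The last two steps (write each chunk, then close) can be sketched independently of the model and of the SDK. In this minimal Python sketch, `stream_text` and `TextInputStream` are hypothetical names; the only assumption about the stream is that it exposes `write(text)` and `close()`, as `request.input_stream` does in the steps above.

```python
from typing import Iterable, Protocol


class TextInputStream(Protocol):
    """Shape of the stream assumed here: write(text) and close()."""

    def write(self, text: str) -> None: ...
    def close(self) -> None: ...


def stream_text(chunks: Iterable[str], input_stream: TextInputStream) -> int:
    """Write each text chunk as it arrives, then close the stream.

    Returns the number of chunks written.
    """
    written = 0
    try:
        for chunk in chunks:
            if chunk:                    # skip empty chunks
                input_stream.write(chunk)
                written += 1
    finally:
        input_stream.close()             # always close, even if the chunk source fails
    return written
```

Closing in a `finally` block is a design choice of this sketch, so that the stream is not left open if the model output is interrupted.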

C++ sample code is not currently available. For sample code that shows how to use text streaming, see the C# and Python samples on GitHub.

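Other tips

Cache CRL files

The Speech SDK uses certificate revocation list (CRL) files to check certificates. Caching the CRL files until they expire avoids downloading them every time. For details, see How to configure OpenSSL for Linux.

Use the latest Speech SDK

We keep improving the performance of the Speech SDK, so try to use the latest Speech SDK in your application.

Load test guidelines

You can use load testing to test the capacity and latency of the speech synthesis service. Here are some guidelines:

The speech synthesis service can autoscale, but scaling out takes time. If concurrency increases within a short time, the client might see long latency or a 429 error code (too many requests). We therefore recommend increasing concurrency step by step in your load test. See this article for details, especially this example of workload patterns.
You can use our object pool samples (C# and Java) for load testing and for collecting latency numbers. You can modify the test turns and concurrency in the sample to match your target concurrency.
The service's quota limit is based on real traffic, so if you want to run a load test with concurrency higher than your real traffic, connect before your test.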
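
The step-by-step ramp recommended in the first guideline can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the C# or Java load test sample: `ramp_steps` and `run_ramp` are hypothetical helpers, and `synthesize` is assumed to be any callable that performs one synthesis and returns a latency in milliseconds.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def ramp_steps(target: int, steps: int) -> List[int]:
    """Distinct, increasing concurrency levels that end at `target`."""
    return sorted({max(1, round(target * (i + 1) / steps)) for i in range(steps)})


def run_ramp(synthesize: Callable[[], float], target: int, steps: int,
             pause_s: float = 0.0) -> Dict[int, List[float]]:
    """Call `synthesize` concurrently at each level and collect the returned latencies."""
    results: Dict[int, List[float]] = {}
    for level in ramp_steps(target, steps):
        with ThreadPoolExecutor(max_workers=level) as pool:
            futures = [pool.submit(synthesize) for _ in range(level)]
            results[level] = [f.result() for f in futures]
        time.sleep(pause_s)  # wait between levels so the service has time to scale out
    return results
```

For example, `ramp_steps(8, 4)` yields the levels 2, 4, 6, and 8, and `run_ramp` returns the latencies grouped by level so that they can be compared as concurrency grows.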

Next steps

See the samples on GitHub
