Lower speech synthesis latency using Speech SDK

Article
09/20/2024

In this article, we introduce the best practices to lower the text to speech synthesis latency and bring the best performance to your end users.

Latency is normally measured as first byte latency and finish latency, defined as follows:

Latency	Description	SpeechSynthesisResult property key
first byte latency	Indicates the time delay between the start of the synthesis task and receipt of the first chunk of audio data.	SpeechServiceResponse_SynthesisFirstByteLatencyMs
finish latency	Indicates the time delay between the start of the synthesis task and the receipt of the whole synthesized audio data.	SpeechServiceResponse_SynthesisFinishLatencyMs

The Speech SDK puts the latency durations in the Properties collection of SpeechSynthesisResult. The following C# sample code reads these values.

var result = await synthesizer.SpeakTextAsync(text);
Console.WriteLine($"first byte latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs)} ms");
Console.WriteLine($"finish latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs)} ms");
// you can also get the result id, and send to us when you need help for diagnosis
var resultId = result.ResultId;

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. In C++, you can read them as follows.

auto result = synthesizer->SpeakTextAsync(text).get();
auto firstByteLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFirstByteLatencyMs));
auto finishedLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFinishLatencyMs));
// you can also get the result id, and send to us when you need help for diagnosis
auto resultId = result->ResultId;

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. In Java, you can read them as follows.

SpeechSynthesisResult result = synthesizer.SpeakTextAsync(text).get();
System.out.println("first byte latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.");
System.out.println("finish latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.");
// you can also get the result id, and send to us when you need help for diagnosis
String resultId = result.getResultId();

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. In Python, you can read them as follows.

result = synthesizer.speak_text_async(text).get()
first_byte_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs))
finished_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs))
# you can also get the result id, and send to us when you need help for diagnosis
result_id = result.result_id

Latency	Description	SPXSpeechSynthesisResult property key
first byte latency	Indicates the time delay between the start of the synthesis task and receipt of the first chunk of audio data.	SPXSpeechServiceResponseSynthesisFirstByteLatencyMs
finish latency	Indicates the time delay between the start of the synthesis task and the receipt of the whole synthesized audio data.	SPXSpeechServiceResponseSynthesisFinishLatencyMs

The Speech SDK measures the latencies and puts them in the property bag of SPXSpeechSynthesisResult. In Objective-C, you can read them as follows.

SPXSpeechSynthesisResult *speechResult = [speechSynthesizer speakText:text];
// getPropertyById: returns an NSString, so convert it with intValue
int firstByteLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFirstByteLatencyMs] intValue];
int finishedLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFinishLatencyMs] intValue];
// you can also get the result id, and send it to us when you need help with diagnosis
NSString *resultId = speechResult.resultId;

The first byte latency is lower than the finish latency in most cases. The first byte latency is independent of text length, while the finish latency increases with text length.

Ideally, we want to minimize the user-experienced latency (the latency before the user hears the sound) to one network round-trip time plus the first audio chunk latency of the speech synthesis service.
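
As a rough illustration of that target, the following Python sketch adds a network round trip to a service-side latency. It is a simplified model rather than anything the Speech SDK computes; the function name and all of the numbers are hypothetical and chosen only to show the arithmetic.

```python
def perceived_latency_ms(network_rtt_ms: float, service_latency_ms: float) -> float:
    """Estimate the delay before the user hears sound (simplified model)."""
    return network_rtt_ms + service_latency_ms


# Hypothetical values for illustration only, not measurements.
rtt_ms = 50.0           # one network round trip
first_chunk_ms = 150.0  # service time to produce the first audio chunk
finish_ms = 900.0       # service time to produce the whole audio

# Playback starts on the first chunk versus after the whole audio arrives.
streaming = perceived_latency_ms(rtt_ms, first_chunk_ms)
non_streaming = perceived_latency_ms(rtt_ms, finish_ms)

print(f"playback on first chunk: {streaming} ms")
print(f"playback after full audio: {non_streaming} ms")
```

Under these assumed numbers, starting playback on the first chunk keeps the wait close to the ideal, while waiting for the whole audio makes the wait grow with text length.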

Streaming

Streaming is critical to lowering latency. Client code can start playback when the first audio chunk is received. In a service scenario, you can forward the audio chunks immediately to your clients instead of waiting for the whole audio.

You can use the PullAudioOutputStream, PushAudioOutputStream, Synthesizing event, and AudioDataStream of the Speech SDK to enable streaming.
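
A minimal Python sketch of the Synthesizing event approach follows. It assumes you already created a `SpeechSynthesizer` elsewhere; the helper name `speak_with_streaming` and the `on_chunk` callback are hypothetical and not part of the Speech SDK. The sketch only relies on the synthesizer's `synthesizing` event, the event's `result.audio_data`, and `speak_text_async(text).get()`.

```python
import time


def speak_with_streaming(synthesizer, text, on_chunk):
    """Forward each audio chunk to on_chunk as soon as it is reported.

    Returns the final result and the locally measured time, in
    milliseconds, until the first chunk arrived (None if no chunk came).
    """
    start = time.monotonic()
    first_chunk_ms = []

    def handle_synthesizing(evt):
        # Record the arrival time of the first chunk only.
        if not first_chunk_ms:
            first_chunk_ms.append((time.monotonic() - start) * 1000.0)
        # Hand the chunk over immediately instead of waiting for the
        # whole audio, for example to a player or a network client.
        on_chunk(evt.result.audio_data)

    synthesizer.synthesizing.connect(handle_synthesizing)
    # Block until synthesis finishes; chunks are delivered meanwhile.
    result = synthesizer.speak_text_async(text).get()
    return result, (first_chunk_ms[0] if first_chunk_ms else None)
```

In a service scenario, `on_chunk` could write the bytes to the client connection. The locally measured first-chunk time is only a rough cross-check and may differ from the SDK-reported first byte latency.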