In this article, we introduce best practices for lowering text to speech synthesis latency and delivering the best performance to your end users.
Normally, we measure latency by first byte latency and finish latency, as follows:
| Latency | Description | SpeechSynthesisResult property key |
| --- | --- | --- |
| first byte latency | Indicates the time delay between the start of the synthesis task and receipt of the first chunk of audio data. | SpeechServiceResponse_SynthesisFirstByteLatencyMs |
| finish latency | Indicates the time delay between the start of the synthesis task and receipt of the whole synthesized audio data. | SpeechServiceResponse_SynthesisFinishLatencyMs |
The Speech SDK puts the latency durations in the Properties collection of SpeechSynthesisResult. The following C# sample code shows these values:
```csharp
var result = await synthesizer.SpeakTextAsync(text);
Console.WriteLine($"first byte latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs)} ms");
Console.WriteLine($"finish latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs)} ms");
// You can also get the result ID and send it to us when you need help with diagnosis.
var resultId = result.ResultId;
```
In C++, the Speech SDK likewise puts the latencies in the Properties collection of SpeechSynthesisResult. Refer to the following code to get them:
```cpp
auto result = synthesizer->SpeakTextAsync(text).get();
auto firstByteLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFirstByteLatencyMs));
auto finishedLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFinishLatencyMs));
// You can also get the result ID and send it to us when you need help with diagnosis.
auto resultId = result->ResultId;
```
In Java, the latencies are likewise available in the result properties:
```java
SpeechSynthesisResult result = synthesizer.SpeakTextAsync(text).get();
System.out.println("first byte latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.");
System.out.println("finish latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.");
// You can also get the result ID and send it to us when you need help with diagnosis.
String resultId = result.getResultId();
```
In Python, the latencies are likewise available in the result properties:
```python
result = synthesizer.speak_text_async(text).get()
first_byte_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs))
finished_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs))
# You can also get the result ID and send it to us when you need help with diagnosis.
result_id = result.result_id
```
In Objective-C, the property keys differ:

| Latency | Description | SPXSpeechSynthesisResult property key |
| --- | --- | --- |
| first byte latency | Indicates the time delay between the start of the synthesis task and receipt of the first chunk of audio data. | SPXSpeechServiceResponseSynthesisFirstByteLatencyMs |
| finish latency | Indicates the time delay between the start of the synthesis task and receipt of the whole synthesized audio data. | SPXSpeechServiceResponseSynthesisFinishLatencyMs |

The Speech SDK puts the latencies in the properties of SPXSpeechSynthesisResult. Refer to the following code to get them:
```objectivec
SPXSpeechSynthesisResult *speechResult = [speechSynthesizer speakText:text];
// getPropertyById: returns the latency as a string; convert it to an integer.
int firstByteLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFirstByteLatencyMs] intValue];
int finishedLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFinishLatencyMs] intValue];
// You can also get the result ID and send it to us when you need help with diagnosis.
NSString *resultId = speechResult.resultId;
```
The first byte latency is lower than the finish latency in most cases. The first byte latency is independent of text length, while the finish latency increases with text length. Ideally, we want to minimize the user-experienced latency (the latency before the user hears the sound) to one network round trip time plus the first audio chunk latency of the speech synthesis service.
Streaming is critical to lowering latency. Client code can start playback when the first audio chunk is received. In a service scenario, you can forward the audio chunks immediately to your clients instead of waiting for the whole audio.
You can use the PullAudioOutputStream, PushAudioOutputStream, Synthesizing event, and AudioDataStream of the Speech SDK to enable streaming. Taking AudioDataStream as an example, in C#:
```csharp
using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
{
    using (var result = await synthesizer.StartSpeakingTextAsync(text))
    {
        using (var audioDataStream = AudioDataStream.FromResult(result))
        {
            byte[] buffer = new byte[16000];
            uint filledSize = 0;
            while ((filledSize = audioDataStream.ReadData(buffer)) > 0)
            {
                Console.WriteLine($"{filledSize} bytes received.");
            }
        }
    }
}
```
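If you prefer an event-driven approach, the Synthesizing event delivers audio as it arrives. Here's a minimal C# sketch (the use of e.Result.AudioData for the audio received in the current event is an assumption to verify against your SDK version):

```csharp
using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
{
    // Synthesizing fires repeatedly while audio chunks are received.
    synthesizer.Synthesizing += (s, e) =>
    {
        // Forward the received audio to your player or client here.
        Console.WriteLine($"{e.Result.AudioData.Length} bytes received in this event.");
    };

    await synthesizer.SpeakTextAsync(text);
}
```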
The same AudioDataStream example in C++:
```cpp
auto synthesizer = SpeechSynthesizer::FromConfig(config, nullptr);
// StartSpeakingTextAsync returns as soon as synthesis starts, so audio can be consumed while it's produced.
auto result = synthesizer->StartSpeakingTextAsync(text).get();
auto audioDataStream = AudioDataStream::FromResult(result);
uint8_t buffer[16000];
uint32_t filledSize = 0;
while ((filledSize = audioDataStream->ReadData(buffer, sizeof(buffer))) > 0)
{
    cout << filledSize << " bytes received." << endl;
}
```
In Java:
```java
SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, (AudioConfig) null);
SpeechSynthesisResult result = synthesizer.StartSpeakingTextAsync(text).get();
AudioDataStream audioDataStream = AudioDataStream.fromResult(result);
byte[] buffer = new byte[16000];
long filledSize = audioDataStream.readData(buffer);
while (filledSize > 0) {
    System.out.println(filledSize + " bytes received.");
    filledSize = audioDataStream.readData(buffer);
}
```
In Python:
```python
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = speech_synthesizer.start_speaking_text_async(text).get()
audio_data_stream = speechsdk.AudioDataStream(result)
audio_buffer = bytes(16000)
filled_size = audio_data_stream.read_data(audio_buffer)
while filled_size > 0:
    print("{} bytes received.".format(filled_size))
    filled_size = audio_data_stream.read_data(audio_buffer)
```
In Objective-C, where the corresponding types are SPXPullAudioOutputStream, SPXPushAudioOutputStream, and SPXAudioDataStream:
```objectivec
SPXSpeechSynthesizer *synthesizer = [[SPXSpeechSynthesizer alloc] initWithSpeechConfiguration:speechConfig audioConfiguration:nil];
SPXSpeechSynthesisResult *speechResult = [synthesizer startSpeakingText:inputText];
SPXAudioDataStream *stream = [[SPXAudioDataStream alloc] initFromSynthesisResult:speechResult];
NSMutableData *data = [[NSMutableData alloc] initWithCapacity:16000];
while ([stream readData:data length:16000] > 0) {
    // Process the audio chunk in data here.
}
```
The Speech SDK uses a websocket to communicate with the service.
Ideally, the network latency should be one round trip time (RTT). If the connection is newly established, the network latency includes extra time to establish the connection. Establishing a websocket connection requires a TCP handshake, SSL handshake, HTTP connection, and protocol upgrade, each of which adds delay.
To avoid the connection latency, we recommend pre-connecting and reusing the SpeechSynthesizer.
To pre-connect, establish a connection to the Speech service when you know the connection is needed soon. For example, if you're building a speech bot on the client, you can pre-connect to the speech synthesis service when the user starts to talk, and call SpeakTextAsync when the bot reply text is ready.
```csharp
using (var synthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig))
{
    using (var connection = Connection.FromSpeechSynthesizer(synthesizer))
    {
        connection.Open(true);
    }
    await synthesizer.SpeakTextAsync(text);
}
```
```cpp
auto synthesizer = SpeechSynthesizer::FromConfig(config, nullptr);
auto connection = Connection::FromSpeechSynthesizer(synthesizer);
connection->Open(true);
```
```java
SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, (AudioConfig) null);
Connection connection = Connection.fromSpeechSynthesizer(synthesizer);
connection.openConnection(true);
```
```python
synthesizer = speechsdk.SpeechSynthesizer(config, None)
connection = speechsdk.Connection.from_speech_synthesizer(synthesizer)
connection.open(True)
```
```objectivec
SPXSpeechSynthesizer *synthesizer = [[SPXSpeechSynthesizer alloc] initWithSpeechConfiguration:self.speechConfig audioConfiguration:nil];
SPXConnection *connection = [[SPXConnection alloc] initFromSpeechSynthesizer:synthesizer];
[connection open:true];
```
Note
If the text is available, just call SpeakTextAsync to synthesize the audio. The SDK handles the connection.
Another way to reduce the connection latency is to reuse the SpeechSynthesizer, so you don't need to create a new SpeechSynthesizer for each synthesis. We recommend using an object pool in service scenarios; see our sample code for C# and Java, and the minimal sketch below.
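To illustrate the pooling idea, here's a minimal C# sketch (not the official sample; the SynthesizerPool type and its unbounded ConcurrentBag are illustrative assumptions):

```csharp
using System;
using System.Collections.Concurrent;
using Microsoft.CognitiveServices.Speech;

// A minimal synthesizer pool: rent an idle instance when one is available,
// otherwise create a new one; return instances after use so their
// connections are reused across syntheses.
public class SynthesizerPool
{
    private readonly ConcurrentBag<SpeechSynthesizer> _pool = new ConcurrentBag<SpeechSynthesizer>();
    private readonly Func<SpeechSynthesizer> _factory;

    public SynthesizerPool(Func<SpeechSynthesizer> factory)
    {
        _factory = factory;
    }

    public SpeechSynthesizer Rent()
    {
        return _pool.TryTake(out var synthesizer) ? synthesizer : _factory();
    }

    public void Return(SpeechSynthesizer synthesizer)
    {
        _pool.Add(synthesizer);
    }
}
```

Rent a synthesizer per request, call SpeakTextAsync on it, and return it once the synthesis completes, so the underlying connection stays warm.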
When the network is unstable or bandwidth is limited, the payload size also affects latency. Meanwhile, a compressed audio format helps to save the users' network bandwidth, which is especially valuable for mobile users.
We support many compressed formats, including opus, webm, mp3, and silk; see the full list in SpeechSynthesisOutputFormat.
For example, the bitrate of the Riff24Khz16BitMonoPcm format is 384 kbps, while Audio24Khz48KBitRateMonoMp3 costs only 48 kbps.
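For instance, to request that MP3 format (a minimal sketch, assuming a speechConfig created as in the earlier samples):

```csharp
// Request compressed MP3 output (48 kbps) instead of raw PCM (384 kbps).
speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Audio24Khz48KBitRateMonoMp3);
```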
The Speech SDK automatically uses a compressed format for transmission when a pcm output format is set.
For Linux and Windows, GStreamer is required to enable this feature. Refer to this instruction to install and configure GStreamer for the Speech SDK.
For Android, iOS, and macOS, no extra configuration is needed starting with version 1.20.
Text streaming allows real-time text processing for rapid audio generation. It's perfect for dynamic text vocalization, such as reading outputs from AI models like GPT in real time. This feature minimizes latency and improves the fluidity and responsiveness of audio outputs, making it ideal for interactive applications, live events, and responsive AI-driven dialogues.
Text streaming is supported in C#, C++, and Python with the Speech SDK.
To use the text streaming feature, connect to the websocket V2 endpoint: wss://{region}.tts.speech.microsoft.com/cognitiveservices/websocket/v2
See the following C# sample code for setting the endpoint:
```csharp
// IMPORTANT: MUST use the websocket v2 endpoint
var ttsEndpoint = $"wss://{Environment.GetEnvironmentVariable("AZURE_TTS_REGION")}.tts.speech.microsoft.com/cognitiveservices/websocket/v2";
var speechConfig = SpeechConfig.FromEndpoint(
    new Uri(ttsEndpoint),
    Environment.GetEnvironmentVariable("AZURE_TTS_API_KEY"));
```
Create a text stream request: Use SpeechSynthesisRequestInputType.TextStream to initiate a text stream.
Set global properties: Adjust settings such as output format and voice name directly, as the feature handles partial text inputs and doesn't support SSML. Refer to the following sample code for instructions on how to set them. OpenAI text to speech voices aren't supported by the text streaming feature. See this language table for full language support.
```csharp
// Set output format
speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm);

// Set a voice name
speechConfig.SetProperty(PropertyId.SpeechServiceConnection_SynthVoice, "en-US-AvaMultilingualNeural");
```
Stream your text: For each text chunk generated from a GPT model, use request.InputStream.Write(text); to send the text to the stream.
Close the stream: Once the GPT model completes its output, close the stream using request.InputStream.Close(). A minimal end-to-end sketch follows these steps.
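Putting the steps together, here's a minimal end-to-end C# sketch (not the official sample; the SpeechSynthesisRequest constructor, the SpeakAsync overload, and the GetGptChunks helper are assumptions to verify against your SDK version and the GitHub sample):

```csharp
// Assumes speechConfig was created with the websocket v2 endpoint shown earlier.
using var synthesizer = new SpeechSynthesizer(speechConfig, null as AudioConfig);

// 1. Create a text stream request.
using var request = new SpeechSynthesisRequest(SpeechSynthesisRequestInputType.TextStream);

// 2. Start synthesis; audio is produced while text is still arriving.
var synthesisTask = synthesizer.SpeakAsync(request);

// 3. Stream text chunks as your model generates them.
//    GetGptChunks() is a hypothetical placeholder for your GPT streaming output.
foreach (var chunk in GetGptChunks())
{
    request.InputStream.Write(chunk);
}

// 4. Close the stream once the model completes its output.
request.InputStream.Close();

using var result = await synthesisTask;
```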
For detailed implementation, see the sample code on GitHub.
In Python, connect to the same websocket V2 endpoint. See the following sample code for setting it:
```python
# IMPORTANT: MUST use the websocket v2 endpoint
speech_config = speechsdk.SpeechConfig(endpoint=f"wss://{os.getenv('AZURE_TTS_REGION')}.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
                                       subscription=os.getenv("AZURE_TTS_API_KEY"))
```
Create a text stream request: Use speechsdk.SpeechSynthesisRequestInputType.TextStream to initiate a text stream.
Set global properties: As in C#, adjust settings such as output format and voice name directly, because the feature handles partial text inputs and doesn't support SSML.
```python
# set a voice name
speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"
```
Stream your text: For each text chunk generated from a GPT model, use request.input_stream.write(text) to send the text to the stream.
Close the stream: Once the GPT model completes its output, close the stream using request.input_stream.close().
For detailed implementation, see the sample code on GitHub.
The C++ sample code isn't available yet. For sample code that shows how to use text streaming, see the C# and Python samples on GitHub.
The Speech SDK uses CRL files to check the certificate. Caching the CRL files until they expire helps you avoid downloading CRL files every time. See How to configure OpenSSL for Linux for details.
We keep improving the Speech SDK's performance, so try to use the latest Speech SDK in your application.
You can run a load test to measure the speech synthesis service's capacity and latency. Here are some guidelines:
The service might return a 429 error code (too many requests) when the load increases faster than it can scale, so we recommend increasing your concurrency step by step in the load test; a minimal sketch follows. See this article for more details, especially this example of workload patterns.
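As a sketch of the step-by-step ramp-up (the step sizes, the hold time, and RunOneSynthesisAsync are hypothetical placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Increase concurrency in steps so the service has time to scale out,
// which reduces the chance of 429 (too many requests) responses.
foreach (var concurrency in new[] { 1, 2, 4, 8, 16 })
{
    var tasks = new List<Task>();
    for (var i = 0; i < concurrency; i++)
    {
        // RunOneSynthesisAsync: hypothetical helper that performs one
        // synthesis and records first byte and finish latency.
        tasks.Add(RunOneSynthesisAsync());
    }
    await Task.WhenAll(tasks);

    // Hold at this level before stepping up.
    await Task.Delay(TimeSpan.FromSeconds(30));
}
```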