Reduce latency in text to speech with the Microsoft Speech SDK

Question

I am using microsoft-cognitiveservices-speech-sdk in a React codebase. What is the best way to reduce latency and get audio output as fast as possible?

When I give it some text, it takes too many seconds before the output starts playing, and I need a way to make this closer to real time. Is there a way to start playing the sound while synthesis is still in progress, instead of waiting for the entire text to be synthesized?

  
    speechSynthesizer.synthesizing = () => {
        // Start playing audio
    };

Right now I am playing the sound with the sample code below. It works, but it takes a long time to start speaking:

    speechSynthesizerRef.current.speakTextAsync(
        text,
        (result) => {
            // Skip decoding unless synthesis completed successfully.
            if (result.reason !== ResultReason.SynthesizingAudioCompleted) {
                return;
            }
            // The whole utterance is decoded here, so playback cannot start earlier.
            audioContext.current.decodeAudioData(result.audioData, (buffer) => {
                const newBufferSource = audioContext.current.createBufferSource();
                newBufferSource.buffer = buffer;
                newBufferSource.connect(gainNode);
                gainNode.connect(audioContext.current.destination);
                newBufferSource.start(0);
            });
        },
        (err) => {
            console.error('Speech synthesis error:', err);
        }
    );

Answer

@Nas, welcome to the Microsoft Q&A forum, and thank you for posting your query here!

Please note that Microsoft does not publish any SLA for latency. Latency is a combination of many factors, including your network and client performance, and it can be higher when accessing lesser-used voices in text-to-speech.

Suggestions:

Try the most recent version of the SDK and check whether you still encounter the same issue.
Measure the latency: The Speech SDK provides properties to measure latency. You can use SpeechServiceResponse_SynthesisFirstByteLatencyMs to measure the delay between the start of the synthesis task and receipt of the first chunk of audio data. Similarly, SpeechServiceResponse_SynthesisFinishLatencyMs measures the delay between the start of the synthesis task and receipt of the whole synthesized audio data.

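// In the JavaScript SDK, speakTextAsync delivers its result through a callback rather than a promise.
synthesizer.speakTextAsync(text, (result) => {
  console.log(`first byte latency: ${result.properties.getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs)} ms`);
  console.log(`finish latency: ${result.properties.getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs)} ms`);
}, (err) => console.error('Speech synthesis error:', err));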

Reuse the SpeechSynthesizer: Instead of creating a new SpeechSynthesizer for each synthesis, you can reuse the same SpeechSynthesizer. This can help reduce the connection latency.
Real-time speech synthesis: Use the Speech SDK or REST API to convert text to speech by using prebuilt neural voices or custom neural voices.
You can also confirm whether there is latency on the service side by checking the metrics of your Speech resource in the Azure portal and applying splitting to the metrics chart.
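Separately from the portal metrics, the latency measurement suggested above can also be approximated on the client. The following is a minimal sketch, not an SDK feature: `measureSynthesisLatency`, `onDone` and `now` are hypothetical names introduced for illustration. It assumes the synthesizer raises `synthesizing` once per audio chunk and reports completion through the `speakTextAsync` callback, as in the code from the question.

```javascript
// Times how long the first audio chunk and the full result take to arrive.
// `synthesizer` is assumed to be an SDK SpeechSynthesizer (or anything with
// the same `synthesizing` / `speakTextAsync` shape).
// `now` is injectable so the helper can be exercised without real time passing.
function measureSynthesisLatency(synthesizer, text, onDone, now = () => Date.now()) {
  const startedAt = now();
  let firstChunkMs = null;

  // Record the delay only for the first chunk; later chunks are ignored.
  synthesizer.synthesizing = () => {
    if (firstChunkMs === null) {
      firstChunkMs = now() - startedAt;
    }
  };

  synthesizer.speakTextAsync(
    text,
    (result) => onDone({ firstChunkMs, finishMs: now() - startedAt, result }),
    (err) => console.error('Speech synthesis error:', err)
  );
}
```

A call might look like `measureSynthesisLatency(speechSynthesizerRef.current, text, (report) => console.log(report))`. These numbers include client-side overhead, so they will not necessarily match the values reported by the service properties.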

There are a few best practices for lowering speech synthesis latency with the Speech SDK and bringing the best performance to your end users. Please follow the recommendations available here:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-lower-speech-synthesis-latency?pivots=programming-language-csharp
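One of the techniques that page covers is streaming, which is what the question asks for: playing audio while synthesis is still running. The following is a rough sketch of how that could look in the browser, under stated assumptions. It assumes the synthesizer is configured for a raw 16-bit mono PCM output format, that each `synthesizing` event carries one chunk in `event.result.audioData`, and that the SDK is not also playing the audio itself. `pcm16ToFloat32`, `createChunkPlayer` and `streamSpeech` are hypothetical helper names, not part of the SDK.

```javascript
// Convert one chunk of 16-bit little-endian mono PCM into Web Audio float samples.
function pcm16ToFloat32(arrayBuffer) {
  const sampleCount = Math.floor(arrayBuffer.byteLength / 2); // ignore a trailing odd byte
  const view = new DataView(arrayBuffer);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}

// Returns a function that schedules each chunk directly after the previous one.
function createChunkPlayer(audioContext, sampleRate) {
  let nextStartTime = 0;
  return function playChunk(arrayBuffer) {
    const samples = pcm16ToFloat32(arrayBuffer);
    if (samples.length === 0) return;

    const buffer = audioContext.createBuffer(1, samples.length, sampleRate);
    buffer.getChannelData(0).set(samples);

    const source = audioContext.createBufferSource();
    source.buffer = buffer;
    source.connect(audioContext.destination);

    // Never schedule in the past; otherwise queue behind the previous chunk.
    nextStartTime = Math.max(nextStartTime, audioContext.currentTime);
    source.start(nextStartTime);
    nextStartTime += buffer.duration;
  };
}

// Wire the chunk player to a synthesizer so playback starts with the first chunk.
function streamSpeech(synthesizer, text, playChunk) {
  synthesizer.synthesizing = (_sender, event) => playChunk(event.result.audioData);
  synthesizer.speakTextAsync(
    text,
    () => {}, // all audio has already been handed to playChunk
    (err) => console.error('Speech synthesis error:', err)
  );
}
```

With a 24 kHz raw PCM format such as `SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm`, usage could be `streamSpeech(speechSynthesizerRef.current, text, createChunkPlayer(audioContext.current, 24000))`. The sample rate passed to `createChunkPlayer` has to match the configured output format, and a compressed or RIFF format would need decoding instead of the direct PCM conversion shown here.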

If the above suggestions don't help, you can enable JS SDK logging as shown below:

sdk.Diagnostics.SetLoggingLevel(sdk.LogLevel.Debug);
sdk.Diagnostics.SetLogOutputPath("LogfilePathAndName");

Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.
