Azure Text to Speech streaming incurs very high latency at times

Question

We're using the C# Microsoft Cognitive Services Speech SDK in a conversational assistant application for streaming Text to Speech output. Whilst the latency is usually very good (less than 1000 msec), we've noticed that sometimes the latency can be much greater, e.g., 5 secs, 16 secs, and in the worst case we've witnessed, 24 seconds. This makes the service unusable for real-time conversation.

Examination of both the SDK logs and the Azure Portal metrics confirm that the requests are not being throttled and are not erroring; they are just taking a long time to produce any output. In our case (currently) we are only sending a single request to the service at a time, so it's not as if we're rapidly increasing the transaction per second count.

Here's some example output from the SDK logging, which suggests that the issue is within the service, and not a network issue for example;

[121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxNamedProperties::GetStringValue: this=0x000001A50872D580; name='RESULT-SynthesisBackend'; value='online (websocket)' [121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x000001A57C310288; name='RESULT-SynthesisFirstByteLatencyMs'; value='24667' [121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x000001A57C310288; name='RESULT-SynthesisFinishLatencyMs'; value='24689' [121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x000001A57C310288; name='RESULT-SynthesisConnectionLatencyMs'; value='0' [121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x000001A57C310288; name='RESULT-SynthesisNetworkLatencyMs'; value='15' [121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x000001A57C310288; name='RESULT-SynthesisServiceLatencyMs'; value='24652' [121198]: 18080987ms SPX_DBG_TRACE_VERBOSE: named_properties.h:479 ISpxPropertyBagImpl::SetStringValue: this=0x000001A57C310288; name='RESULT-SynthesisUnderrunTimeMs'; value='0'

Fort the Speech Service we're using a Standard tier resource in the UK South region. We've recently been using the en-US-AndrewMultilingualNeural and en-US-EmmaMultilingualNeural voices.

This issue looks to be similar to this Q&A post long-delay-on-tts-first-response but we've not noticed any pattern to the slow calls; it could be any request that suddenly exhibits really high latency.

Is there anything that can be done to alleviate this issue? We are aware we could run the containerised versions ourselves but that is a somewhat more expensive route to take.

Any guidance will be most welcome.

Accepted Answer

Hello,

Thanks for reaching out to us, it seems like you're encountering intermittent high latency issues with Azure Text to Speech (TTS) service, which is impacting the real-time nature of your conversational assistant application. Since this issue seems not related to your network but more possible related to the server or the service, I would high recommend you raise a support ticket to check with more backend data to see what is the root cause for it.

Please let us know if you have no support plan, we are happy to enable you a free ticket.

Generally, there are some different reasons may cause the high latency issue and below are some of the workarounds you may want to try -

Service Tier and Region Considerations

Service Tier: Ensure you are using an appropriate service tier for your workload. Since you mentioned you're using the Standard tier in the UK South region, generally, this should provide reliable performance. However, occasionally even in Standard tiers, there can be spikes in latency due to various reasons.

Region: The region you choose can affect latency. Using a region closer to your users can reduce latency. Since you're already using UK South, this should typically provide good performance for users in the UK and nearby regions.

Voice Selection

Voice Type: Different voices may have varying performance characteristics. You mentioned using en-US-AndrewMultilingualNeural and en-US-EmmaMultilingualNeural. If the issue persists with these voices, consider trying other voices within the same or different locales to see if there's any improvement.

Optimization Steps

Concurrency: Ensure that you're not exceeding concurrency limits. Even though you're sending single requests at a time, ensure that there are no unexpected spikes or concurrency issues that could cause delays.

SDK Configuration: Review your SDK configuration settings to ensure they are optimized for performance. Sometimes adjusting parameters like buffer sizes or timeout settings can help.

Monitoring and Troubleshooting

Azure Portal Metrics: Continuously monitor Azure portal metrics for your Cognitive Services resource. Look for patterns or spikes in latency around the times you observe delays. This can provide insights into potential causes.

Logging and Diagnostics: Use extensive logging in your application to capture detailed information around the times of latency spikes. The logs you provided indicate high latency in various components (RESULT-SynthesisFinishLatencyMs at 24689 ms), confirming that the issue is within the service.

Alternative Solutions

Containerized Deployment: While you mentioned it's more expensive, running the containerized version of the service could potentially offer more predictable performance if you have specific latency requirements. This option provides more control over the environment but requires managing infrastructure and updates.

Failover and Redundancy: Consider implementing failover mechanisms or redundancy across different regions or services if high availability is critical for your application.

Please have a try on above workarounds, and also please feel free to let us know if you have any other questions or you need a support ticket.

I hope this helps.

Regards,

Yutong

-Please kindly accept the answer if you feel helpful to support the community, thanks a lot.

Share via

Azure Text to Speech streaming incurs very high latency at times

0 additional answers

Your answer