Hello Saeed Zidan,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are running the Web Avatar code sample, but in audio-only mode.
While reviewing your scenario, I identified that the avatar endpoint does not support audio-only streaming, which explains why SDP negotiation fails when the avatar is disabled. As an alternative, I suggest using the Azure Speech SDK's standard TTS service to generate audio responses and stream them to the browser; this is a valid and cost-effective option. Below are client-side and server-side implementations that replace the WebRTC-based setup with WebSockets and the Web Audio API.
However, considering your latency-critical scenario, a naive approach that generates the full TTS response first and only then streams it would introduce higher latency than WebRTC. Instead of sending a fully synthesized response over WebSockets, use the Azure Speech SDK's push audio output stream, which delivers audio in real time as it is synthesized.
- Client-side (browser) – use the Azure Speech SDK for JavaScript to synthesize speech and start playback as soon as audio begins arriving:
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("SUBSCRIPTION_KEY", "REGION");
// Default speaker output: the SDK begins playback while audio is still streaming in
const audioConfig = SpeechSDK.AudioConfig.fromDefaultSpeakerOutput();
const synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
// Send text; playback starts as soon as the first audio chunks arrive
synthesizer.speakTextAsync("Hello! How can I assist you?",
    result => {
        if (result.reason === SpeechSDK.ResultReason.SynthesizingAudioCompleted) {
            console.log("Audio received and playing.");
        }
        synthesizer.close();
    },
    error => { console.error(error); synthesizer.close(); });
- Server-side (Python) – use a push audio output stream so that audio chunks are handed to your code as soon as they are generated, mimicking WebRTC's real-time behavior (a WebSocket forwarding sketch follows this snippet):
import azure.cognitiveservices.speech as speechsdk

class AudioChunkHandler(speechsdk.audio.PushAudioOutputStreamCallback):
    """Invoked by the SDK with each audio chunk as it is synthesized."""
    def write(self, audio_buffer: memoryview) -> int:
        # Forward the chunk to the browser here (e.g., over a WebSocket)
        return audio_buffer.nbytes
    def close(self) -> None:
        pass

def real_time_tts(text):
    speech_config = speechsdk.SpeechConfig(subscription="SUBSCRIPTION_KEY", region="REGION")
    # Push stream: the SDK calls the handler's write() for every chunk in real time
    stream = speechsdk.audio.PushAudioOutputStream(AudioChunkHandler())
    audio_output = speechsdk.audio.AudioOutputConfig(stream=stream)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_output)
    result = synthesizer.speak_text_async(text).get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Real-time audio streaming completed.")
    else:
        print(f"TTS failed: {result.reason}")
Why this is the best approach:
- This approach streams the synthesized audio in real-time, preventing the full-response delay issue.
- It uses the Push Audio Stream API, designed for real-time TTS playback.
- It uses standard TTS pricing without the avatar service, reducing costs.
- Direct SDK integration avoids extra infrastructure.
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.