Hello Saeed Zidan,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are running the Web Avatar code sample, but in audio-only mode.
While reviewing your scenario, I identified that the avatar endpoint does not support audio-only streaming, which explains why SDP negotiation fails when the avatar is disabled. As an alternative, I suggest using the Azure Speech SDK's standard TTS service to generate audio responses and stream them to the browser; this is a valid and cost-effective option. Below are client-side and server-side implementations that replace the WebRTC-based setup with WebSockets and the Web Audio API.
However, considering your latency-critical scenario, a naive approach that generates the full TTS response first and only then streams it would introduce higher latency than WebRTC. Instead of sending a fully synthesized response over WebSockets, use the Azure Speech SDK's push audio output stream, which delivers audio in real time as it is synthesized.
- Client-side (browser) – use the Azure Speech SDK for JavaScript to synthesize speech and start playback as soon as audio begins arriving:
const speechConfig = SpeechSDK.SpeechConfig.fromSubscription("SUBSCRIPTION_KEY", "REGION");
// Default speaker output: the SDK begins playback while audio is still streaming in
const audioConfig = SpeechSDK.AudioConfig.fromDefaultSpeakerOutput();
const synthesizer = new SpeechSDK.SpeechSynthesizer(speechConfig, audioConfig);
// Send text; playback starts as soon as the first audio chunks arrive
synthesizer.speakTextAsync("Hello! How can I assist you?",
    result => {
        if (result.reason === SpeechSDK.ResultReason.SynthesizingAudioCompleted) {
            console.log("Audio received and playing.");
        }
        synthesizer.close();
    },
    error => { console.error(error); synthesizer.close(); });
- Server-side (Python) – use a push audio output stream so that audio chunks are handed to your code as soon as they are generated, mimicking WebRTC's real-time behavior (a WebSocket forwarding sketch follows this snippet):
import azure.cognitiveservices.speech as speechsdk

class AudioChunkHandler(speechsdk.audio.PushAudioOutputStreamCallback):
    """Invoked by the SDK with each audio chunk as it is synthesized."""
    def write(self, audio_buffer: memoryview) -> int:
        # Forward the chunk to the browser here (e.g., over a WebSocket)
        return audio_buffer.nbytes
    def close(self) -> None:
        pass

def real_time_tts(text):
    speech_config = speechsdk.SpeechConfig(subscription="SUBSCRIPTION_KEY", region="REGION")
    # Push stream: the SDK calls the handler's write() for every chunk in real time
    stream = speechsdk.audio.PushAudioOutputStream(AudioChunkHandler())
    audio_output = speechsdk.audio.AudioOutputConfig(stream=stream)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_output)
    result = synthesizer.speak_text_async(text).get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Real-time audio streaming completed.")
    else:
        print(f"TTS failed: {result.reason}")
Why this is the best approach:
- This approach streams the synthesized audio in real-time, preventing the full-response delay issue.
- It uses the Push Audio Stream API, designed for real-time TTS playback.
- It uses standard TTS pricing without the avatar service, reducing costs.
- Direct SDK integration avoids extra infrastructure.
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.