Stream Audio Issue with Speech

Diomedes Kastanis 0 Reputation points Microsoft Employee
2025-02-05T15:39:56.7166667+00:00

We’re using a Python FastAPI server to stream audio from the browser over a WebSocket and pass it to Azure Speech. Our goal is to automatically recognize the input language, translate it to English, and stream both the translated text and audio back to the browser. The challenge seems to be sending the audio to Azure Speech through a push audio stream: when we use use_default_microphone=True everything works perfectly, but feeding the recognizer from the streamed audio input instead of the default microphone does not. Thanks. Here's the code:

import azure.cognitiveservices.speech as speechsdk
from fastapi import APIRouter, WebSocket

router = APIRouter()

@router.websocket("/translate/speech")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    await websocket.send_text("Connected to the translator service")

    # Create a PushAudioInputStream to act as a bucket for incoming audio data.
    audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000, bits_per_sample=16, channels=1)
    audio_stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)
    audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

    # Create a speech translation config with the subscription key and service region.
    translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=AZURE_SPEECH_SUBS_KEY, region=AZURE_SPEECH_REGION)

    # Replace with the languages of your choice, from the list found here: https://aka.ms/speech/sttt-languages
    from_language = "en-US"
    to_language = "es"
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    translation_config.voice_name = "en-US-JennyNeural"  # Optional: voice used for the translated speech output.

    # Create the TranslationRecognizer with the stream-backed audio configuration.
    recognizer = speechsdk.translation.TranslationRecognizer(translation_config=translation_config, audio_config=audio_config)

    # Configure speech synthesis (for translated speech output).
    speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_SUBS_KEY, region=AZURE_SPEECH_REGION)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

1 answer

  1. santoshkc 15,345 Reputation points Microsoft External Staff Moderator
    2025-02-06T13:09:29.5666667+00:00

    Hi @Diomedes Kastanis,

    Thank you for reaching out to the Microsoft Q&A forum!

    When you use the default microphone, the Speech SDK receives a continuous, real-time PCM stream in a format it controls. Streaming from a WebSocket is more involved because the incoming data may not arrive in the same format or at the same pace.

    One common issue is a mismatch in buffering or timing between the WebSocket stream and Azure Speech's PushAudioInputStream. The bytes you write to the push stream must be raw PCM that matches the AudioStreamFormat you declared (16 kHz, 16-bit, mono), and they should be pushed continuously as they arrive rather than accumulated into large batches, so that the service sees a steady real-time stream.
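
    For example, here is a minimal sketch of a receive loop, assuming the browser already sends raw 16 kHz / 16-bit mono PCM chunks over the WebSocket (the helper name pump_audio and the reuse of your audio_stream / recognizer / websocket variables are just for illustration). Note that continuous recognition also has to be started explicitly; the posted snippet stops before that step:

    import azure.cognitiveservices.speech as speechsdk
    from fastapi import WebSocket, WebSocketDisconnect

    async def pump_audio(websocket: WebSocket, audio_stream: speechsdk.audio.PushAudioInputStream) -> None:
        # Forward raw PCM chunks from the browser into the push stream as they arrive.
        try:
            while True:
                chunk = await websocket.receive_bytes()  # expected: 16 kHz / 16-bit / mono PCM
                audio_stream.write(chunk)
        except WebSocketDisconnect:
            # Signal end-of-stream so the recognizer can flush its last utterance.
            audio_stream.close()

    # Inside websocket_endpoint, after creating the recognizer:
    #   recognizer.start_continuous_recognition()
    #   await pump_audio(websocket, audio_stream)
    #   recognizer.stop_continuous_recognition()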

    Additionally, check that the WebSocket payload really is raw audio in that format before it is passed to the PushAudioInputStream: browsers typically capture audio either as compressed webm/opus via MediaRecorder or as Float32 samples via the Web Audio API, so it usually needs to be decoded or converted first. Also verify that the WebSocket connection maintains a consistent flow of data, as interruptions in the stream can affect recognition.
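
    As a rough conversion sketch, assuming the client sends Float32 mono samples captured at 48 kHz with the Web Audio API (to_pcm16_mono_16k is a hypothetical helper name; MediaRecorder's webm/opus output would first need decoding, e.g. with ffmpeg):

    import numpy as np

    def to_pcm16_mono_16k(chunk: bytes, source_rate: int = 48000) -> bytes:
        # Convert Float32 mono samples to the 16 kHz / 16-bit signed PCM declared in the AudioStreamFormat.
        samples = np.frombuffer(chunk, dtype=np.float32)
        if samples.size == 0:
            return b""
        # Naive linear-interpolation resampling; use a proper resampler in production.
        target_len = int(samples.size * 16000 / source_rate)
        resampled = np.interp(
            np.linspace(0, samples.size, target_len, endpoint=False),
            np.arange(samples.size),
            samples,
        )
        return (np.clip(resampled, -1.0, 1.0) * 32767).astype(np.int16).tobytes()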

    Thank you.

