Python Streamlit Real-time Speech-to-Text with Azure SDK Issues

Raffaele Aldrigo 20 Reputation points
2025-03-19T16:12:36.9+00:00

A Python Streamlit app is being developed to provide live transcription using streamlit_webrtc and the Azure Speech SDK. The current implementation can save and play back audio recorded in the browser, but live transcription is not working.

Here is a snippet of the code being used:

import queue
import tempfile
import time

import pydub
import streamlit as st
import azure.cognitiveservices.speech as speechsdk
from azure.cognitiveservices.speech.audio import PushAudioInputStream
from azure.cognitiveservices.speech.transcription import ConversationTranscriptionEventArgs
from streamlit_webrtc import WebRtcMode, webrtc_streamer

webrtc_ctx = webrtc_streamer(
    key="speech-to-text",
    mode=WebRtcMode.SENDONLY,
    media_stream_constraints={"video": False, "audio": True},
    audio_receiver_size=256,
)

while webrtc_ctx.state.playing:
    if not st.session_state["recording"]:
        st.session_state.r = []

        stream = PushAudioInputStream()
        ###
        audio_input = speechsdk.AudioConfig(stream=stream)
        speech_config = speechsdk.SpeechConfig(env["SPEECH_KEY"], env["SPEECH_REGION"])
        speech_config.speech_recognition_language = "it-IT"
        if "proxy_host" in env and "proxy_port" in env:
            speech_config.set_proxy(env["proxy_host"], int(env["proxy_port"]))
        conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config, audio_input)

        def addsentence(evt: ConversationTranscriptionEventArgs):
            if evt.result.speaker_id == "Unknown":
                logger.debug("Unknown speaker: " + str(evt))
                return
            logger.info(f"Detected **{evt.result.speaker_id}**: {evt.result.text}")
            st.session_state.r.append(f"**{evt.result.speaker_id}**: {evt.result.text}")

        conversation_transcriber.transcribed.connect(addsentence)
        ###

        st.session_state.fullwav = pydub.AudioSegment.empty()
        with st.chat_message("assistant"):
            with st.spinner("Trascrizione in corso..."):
                stream_placeholder = st.expander("Trascrizione", icon="📝").empty()

        conversation_transcriber.start_transcribing_async()
        logger.info("Transcribing started!")
        st.session_state["recording"] = True

    try:
        audio_frames = webrtc_ctx.audio_receiver.get_frames(timeout=1)
    except queue.Empty:
        time.sleep(0.1)
        logger.debug("No frame arrived.")
        continue

    stream_placeholder.markdown("## Trascrizione:\n\n" + "\\\n".join(st.session_state.r))

    #sound_chunk = pydub.AudioSegment.empty()
    for audio_frame in audio_frames:
        sound = pydub.AudioSegment(
            data=audio_frame.to_ndarray().tobytes(),
            sample_width=audio_frame.format.bytes,
            frame_rate=audio_frame.sample_rate,
            channels=len(audio_frame.layout.channels),
        )
        #sound_chunk += sound
        st.session_state.fullwav += sound

    #if len(sound_chunk) >0:
        #stream.write(sound_chunk.get_array_of_samples())

if st.session_state["recording"]:
    logger.info("stopped listening")
    wav_file_path = tempfile.NamedTemporaryFile(suffix='.wav', delete=False).name
    st.session_state.fullwav.export(wav_file_path, format="wav")

Any insights or suggestions on what might be causing the issue with live transcription?

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. JAYA SHANKAR G S 4,035 Reputation points Microsoft External Staff Moderator
    2025-03-28T12:01:44.8733333+00:00

    Hello @Raffaele Aldrigo ,

    Thanks for the update.

    Since you cannot accept your own answer, I am posting it as a solution. Please accept it so that it helps the community find the solution.

    Issue: Live transcription using the Azure Speech SDK with streamlit_webrtc is not working in a Python Streamlit app.

    Solution:

    The original loop never wrote the received audio into the PushAudioInputStream (the stream.write call was commented out), so the transcriber had no audio to recognize. The updated code below concatenates the received frames, downmixes them to mono, resamples to 16 kHz (the default format a push stream expects), and writes the raw bytes to the stream.

    Updated code:

    def transcribe_webrtc(self, webrtc_ctx: WebRtcStreamerContext) -> str:
        push_stream = PushAudioInputStream()
        audio_config = AudioConfig(stream=push_stream)
        transcriber = self.setup_transcriber(audio_config)
        transcriber.start_transcribing_async()
        logger.info("Started WebRTC transcription")

        try:
            while webrtc_ctx.state.playing:
                try:
                    audio_frames = webrtc_ctx.audio_receiver.get_frames(timeout=1)
                except queue.Empty:
                    logger.debug("No audio frames received")
                    continue

                # Concatenate the received frames into one pydub segment
                frame = pydub.AudioSegment.empty()
                for audio_frame in audio_frames:
                    sound = pydub.AudioSegment(
                        data=audio_frame.to_ndarray().tobytes(),
                        sample_width=audio_frame.format.bytes,
                        frame_rate=audio_frame.sample_rate,
                        channels=len(audio_frame.layout.channels),
                    )
                    frame += sound

                if len(frame) > 0:
                    logger.debug(f"Processing audio frame of length {len(frame.raw_data)} bytes")
                    # Downmix to mono and resample to 16 kHz, the default
                    # format expected by a PushAudioInputStream
                    frame = frame.set_channels(1).set_frame_rate(16000)
                    push_stream.write(frame.raw_data)

                if self.on_transcribed:
                    self.on_transcribed("\\\n".join(self.results))
                time.sleep(0.1)
        finally:
            # Stop the transcriber and release the stream when the session ends
            transcriber.stop_transcribing_async().get()
            push_stream.close()

        return "\\\n".join(self.results)
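
    The snippet calls self.setup_transcriber, which is not shown. A minimal sketch of such a helper, assuming the same SpeechConfig and ConversationTranscriber wiring as in the question, and assuming self.results is initialized to an empty list elsewhere in the class (names here are illustrative), could look like this:

    def setup_transcriber(self, audio_config: AudioConfig) -> speechsdk.transcription.ConversationTranscriber:
        # Build the speech configuration from the subscription key and region
        speech_config = speechsdk.SpeechConfig(env["SPEECH_KEY"], env["SPEECH_REGION"])
        speech_config.speech_recognition_language = "it-IT"
        transcriber = speechsdk.transcription.ConversationTranscriber(speech_config, audio_config)

        def on_transcribed_event(evt: ConversationTranscriptionEventArgs):
            # Ignore results without an identified speaker, as in the question
            if evt.result.speaker_id == "Unknown":
                return
            self.results.append(f"**{evt.result.speaker_id}**: {evt.result.text}")

        transcriber.transcribed.connect(on_transcribed_event)
        return transcriber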
    
    

    Thank you


1 additional answer

  1. Raffaele Aldrigo 20 Reputation points
    2025-03-28T11:07:25.4266667+00:00

    I've come up with a working solution:

    def transcribe_webrtc(self, webrtc_ctx: WebRtcStreamerContext) -> str:
        push_stream = PushAudioInputStream()
        audio_config = AudioConfig(stream=push_stream)
        transcriber = self.setup_transcriber(audio_config)
        transcriber.start_transcribing_async()
        logger.info("Started WebRTC transcription")

        try:
            while webrtc_ctx.state.playing:
                try:
                    audio_frames = webrtc_ctx.audio_receiver.get_frames(timeout=1)
                except queue.Empty:
                    logger.debug("No audio frames received")
                    continue

                # Concatenate the received frames into one pydub segment
                frame = pydub.AudioSegment.empty()
                for audio_frame in audio_frames:
                    sound = pydub.AudioSegment(
                        data=audio_frame.to_ndarray().tobytes(),
                        sample_width=audio_frame.format.bytes,
                        frame_rate=audio_frame.sample_rate,
                        channels=len(audio_frame.layout.channels),
                    )
                    frame += sound

                if len(frame) > 0:
                    logger.debug(f"Processing audio frame of length {len(frame.raw_data)} bytes")
                    # Downmix to mono and resample to 16 kHz, the default
                    # format expected by a PushAudioInputStream
                    frame = frame.set_channels(1).set_frame_rate(16000)
                    push_stream.write(frame.raw_data)

                if self.on_transcribed:
                    self.on_transcribed("\\\n".join(self.results))
                time.sleep(0.1)
        finally:
            # Stop the transcriber and release the stream when the session ends
            transcriber.stop_transcribing_async().get()
            push_stream.close()

        return "\\\n".join(self.results)
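
    For reference, a hypothetical way to wire this method into the Streamlit page, with on_transcribed pointed at a placeholder so the transcript refreshes live (WebRtcTranscriber is an illustrative name for the class holding this method, not part of the answer):

    webrtc_ctx = webrtc_streamer(
        key="speech-to-text",
        mode=WebRtcMode.SENDONLY,
        media_stream_constraints={"video": False, "audio": True},
        audio_receiver_size=256,
    )

    placeholder = st.empty()
    app = WebRtcTranscriber()  # hypothetical class containing transcribe_webrtc
    # Re-render the accumulated transcript each time a sentence is added
    app.on_transcribed = lambda text: placeholder.markdown(text)

    if webrtc_ctx.state.playing:
        transcript = app.transcribe_webrtc(webrtc_ctx)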
    
    
