How to stream audio output using the Python Speech SDK?

Gustavo Jakobi 0 Reputation points
2024-01-17T03:31:41.68+00:00

The GitHub repository for the SDK contains numerous samples, but they do not provide clear guidance on handling audio streams. Specifically, I am currently working with this Python sample, and while the examples print the buffer size, they do not demonstrate the proper usage. Here is an excerpt from the sample:

audio_buffer = bytes(32000)
total_size = 0
filled_size = pull_stream.read(audio_buffer)
while filled_size > 0:
    print("{} bytes received.".format(filled_size))
    total_size += filled_size
    filled_size = pull_stream.read(audio_buffer)
print("Totally {} bytes received.".format(total_size))

Ideally, I would like to receive my audio in chunks and play it as I receive them, rather than waiting for the entire audio to finish. Could you provide guidance on the correct approach for handling audio streams in real-time?
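For context, the loop in the sample can be adapted so that each chunk is handed to a player the moment it arrives, instead of only being counted. Below is a minimal sketch of that shape; `io.BytesIO` stands in for the SDK's pull stream (its `readinto` fills a caller-supplied buffer and returns the byte count, mirroring `pull_stream.read(audio_buffer)`), and `play_chunk` is a hypothetical placeholder for a real playback call:

```python
import io

playback_log = []

def play_chunk(chunk: bytes) -> None:
    # Hypothetical stand-in for a real playback call
    # (e.g. writing the chunk to a sound device).
    playback_log.append(len(chunk))

# io.BytesIO stands in for the SDK's pull stream here; its readinto()
# fills a caller-supplied buffer and returns the number of bytes written.
stream = io.BytesIO(b"\x00" * 70000)  # fake PCM audio
audio_buffer = bytearray(32000)       # reusable buffer, as in the sample

total_size = 0
filled_size = stream.readinto(audio_buffer)
while filled_size > 0:
    # Play (or enqueue) the chunk immediately instead of waiting for EOF.
    play_chunk(bytes(audio_buffer[:filled_size]))
    total_size += filled_size
    filled_size = stream.readinto(audio_buffer)

print("Totally {} bytes received.".format(total_size))
```

The buffer is reused across iterations, so only the first `filled_size` bytes are valid on each pass; copy that slice out before handing it to the player.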

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

Sort by: Most helpful
  1. navba-MSFT 27,540 Reputation points Microsoft Employee Moderator
    2024-01-23T05:16:49.4+00:00

    @Gustavo Jakobi Thanks for getting back. You can leverage the PushAudioInputStream.Write method. It writes the specified audio data by making an internal copy of it. Note: the data buffer must not contain an audio header.

    Sample code:

    import os
    import azure.cognitiveservices.speech as speechsdk
    import wave
    
    def recognize_from_wav_file(filename):
        # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
        speech_config = speechsdk.SpeechConfig(subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"])
        speech_config.speech_recognition_language="en-US"
    
        # Open the .wav file
        wf = wave.open(filename, 'rb')
    
        # Set up the audio stream
        push_stream = speechsdk.audio.PushAudioInputStream()
        audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
        speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    
        # Read the .wav file in chunks and feed them to the speech recognizer
        CHUNK = 1024
        data = wf.readframes(CHUNK)
        while len(data) > 0:
            push_stream.write(data)
            data = wf.readframes(CHUNK)
    
        # Close the stream to signal that all audio data has been written
        push_stream.close()
    
        # Recognize the speech from the .wav file
        speech_recognition_result = speech_recognizer.recognize_once_async().get()
    
        if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print("Recognized: {}".format(speech_recognition_result.text))
        elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
            print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
        elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = speech_recognition_result.cancellation_details
            print("Speech Recognition canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print("Error details: {}".format(cancellation_details.error_details))
                print("Did you set the speech resource key and region values?")
    
    recognize_from_wav_file('MyAudioFile.wav')
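    The sample above covers the input side (pushing audio into the recognizer). For the output side of the original question (playing synthesized audio as it arrives), the SDK also offers a push model: subclass speechsdk.audio.PushAudioOutputStreamCallback, and the synthesizer invokes your write() once per audio chunk. Here is a sketch of such a callback; the class body runs standalone, while the SDK wiring (which needs a live Speech resource) is shown in comments:

    ```python
    class ChunkPlayerCallback:
        """Mirrors the speechsdk.audio.PushAudioOutputStreamCallback
        interface: the synthesizer calls write() once per audio chunk and
        close() when synthesis ends. In real code, inherit from
        speechsdk.audio.PushAudioOutputStreamCallback instead."""

        def __init__(self):
            self.bytes_played = 0
            self.closed = False

        def write(self, audio_buffer: memoryview) -> int:
            # Hand the chunk to a player here instead of buffering the
            # whole utterance before playback starts.
            self.bytes_played += audio_buffer.nbytes
            return audio_buffer.nbytes  # report how many bytes were consumed

        def close(self) -> None:
            self.closed = True

    # SDK wiring (requires a live Speech resource, so shown as comments):
    #   callback = ChunkPlayerCallback()  # subclassing the SDK callback class
    #   push_stream = speechsdk.audio.PushAudioOutputStream(callback)
    #   audio_config = speechsdk.audio.AudioOutputConfig(stream=push_stream)
    #   synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
    #                                             audio_config=audio_config)
    #   synthesizer.speak_text_async("Hello").get()

    # Standalone exercise of the callback with fake chunks:
    cb = ChunkPlayerCallback()
    for chunk in (b"\x00" * 3200, b"\x00" * 3200, b"\x00" * 1600):
        cb.write(memoryview(chunk))
    cb.close()
    print("played {} bytes".format(cb.bytes_played))
    ```

    The chunks the synthesizer delivers are raw audio without a header, so the callback can forward them to a sound device directly.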
    
    
