Stream azure tts audio byte output directly to cloud storage?

Question

LeetGPT 95

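import azure.cognitiveservices.speech as speechsdk

# speech_key and service_region are assumed to be defined elsewhere.
def speech_synthesis_to_push_audio_output_stream():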
    """performs speech synthesis and push audio output to a stream"""
    class PushAudioOutputStreamSampleCallback(speechsdk.audio.PushAudioOutputStreamCallback):
        """
        Example class that implements the PushAudioOutputStreamCallback, which is used to show
        how to push output audio to a stream
        """
        def __init__(self) -> None:
            super().__init__()
            self._audio_data = bytes(0)
            self._closed = False
        def write(self, audio_buffer: memoryview) -> int:
            """
            The callback function which is invoked when the synthesizer has an output audio chunk
            to write out
            """
            self._audio_data += audio_buffer
            print("{} bytes received.".format(audio_buffer.nbytes))
            return audio_buffer.nbytes
        def close(self) -> None:
            """
            The callback function which is invoked when the synthesizer is about to close the
            stream.
            """
            self._closed = True
            print("Push audio output stream closed.")
        def get_audio_data(self) -> bytes:
            return self._audio_data
        def get_audio_size(self) -> int:
            return len(self._audio_data)
    # Creates an instance of a speech config with specified subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    # Creates customized instance of PushAudioOutputStreamCallback
    stream_callback = PushAudioOutputStreamSampleCallback()
    # Creates audio output stream from the callback
    push_stream = speechsdk.audio.PushAudioOutputStream(stream_callback)
    # Creates a speech synthesizer using push stream as audio output.
    stream_config = speechsdk.audio.AudioOutputConfig(stream=push_stream)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=stream_config)
    # Receives a text from console input and synthesizes it to stream output.
    while True:
        print("Enter some text that you want to synthesize, Ctrl-Z to exit")
        try:
            text = input()
        except EOFError:
            break
        result = speech_synthesizer.speak_text_async(text).get()
        # Check result
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesized for text [{}], and the audio was written to output stream.".format(text))
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print("Speech synthesis canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print("Error details: {}".format(cancellation_details.error_details))
        # Destroys result which is necessary for destroying speech synthesizer
        del result
    # Destroys the synthesizer in order to close the output stream.
    del speech_synthesizer
    print("Totally {} bytes received.".format(stream_callback.get_audio_size()))

Hi team, I was following this sample code for using an output stream. However, when I upload the audio bytes directly to cloud storage, the resulting audio cannot be played. I ended up saving the audio to a local file with the proper encoding first and then uploading that file to cloud storage, but this adds some I/O overhead and extra latency. Is there a way to stream the audio bytes to cloud storage without first saving them to a properly encoded local file? Thank you so much!!

Accepted answer

Answer 1

Hi @LeetGPT

Thank you for reaching out to the Microsoft Q&A forum and for providing your code snippet.

I understand that you're looking for a more streamlined approach to stream the Azure Text-to-Speech (TTS) audio output directly to cloud storage without saving it locally first.

While your current implementation effectively synthesizes speech and uploads it to Azure Blob Storage, it involves an intermediate step of saving the audio file locally before uploading it to the cloud storage, which introduces additional I/O overhead and latency.

To achieve a more direct streaming of audio data to cloud storage without saving it locally, you can leverage Azure Blob Storage's ability to accept byte data directly. Below, I've outlined a modified approach that worked for me.

import azure.cognitiveservices.speech as speechsdk
from azure.storage.blob import BlobServiceClient

# Azure Speech Service Configuration
speech_key = "YOUR_SPEECH_KEY"
service_region = "YOUR_SERVICE_REGION"
from_language = 'en-US'

# Azure Storage Configuration
storage_connection_string = "YOUR_STORAGE_CONNECTION_STRING"
container_name = 'YOUR_CONTAINER_NAME'
def synthesize_text_to_speech():
    # Azure Speech Service Configuration
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    # Text to be synthesized
    text = "Hello, how are you?"
    # Synthesize speech
    result = speech_synthesizer.speak_text_async(text).get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized successfully.")
        audio_data = result.audio_data
        upload_audio_to_storage(audio_data)
    else:
        print("Failed to synthesize speech:", result.reason)
def upload_audio_to_storage(audio_data):
    blob_service_client = BlobServiceClient.from_connection_string(storage_connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    blob_name = 'synthesized_audio.wav'
    blob_client = container_client.get_blob_client(blob_name)
    # Upload audio data to Azure Blob Storage
    blob_client.upload_blob(audio_data, overwrite=True)
    print("Audio uploaded to Azure Blob Storage.")
synthesize_text_to_speech()

This modified approach directly uploads the audio data obtained from the Azure Text-to-Speech service to Azure Blob Storage without the need for saving it locally first. By utilizing the upload_blob method directly with the audio data, you can effectively streamline the process and reduce unnecessary I/O operations and latency.

Please ensure you replace the placeholder values (e.g., "YOUR_SPEECH_KEY", "YOUR_SERVICE_REGION", "YOUR_STORAGE_CONNECTION_STRING") with your actual Azure Speech service key, service region, and storage connection string respectively.

Hope you understand. Thank you.

If this answers your query, do click Accept Answer and Yes for was this answer helpful.

LeetGPT 95 Reputation points

2024-02-14T21:30:40.51+00:00

Thank you for the reply. I tried this and it did indeed work. However, when I use the audio stream and upload bytes on the fly as they become available, the resulting audio on cloud storage won't play. Any ideas?

dupammi 8,615 Reputation points Microsoft External Staff

2024-02-15T04:02:49.76+00:00

Hi @LeetGPT

It's possible that the audio is being corrupted during the upload process. Azure Blob Storage stores whatever bytes it receives, so one thing to check is that the uploaded bytes form a valid file in the encoding your player expects. Also, check that the audio is being uploaded in binary mode. Uploading it in text mode can cause corruption.

Another thing to check is the format of the audio file. Because Blob Storage does not convert anything, the blob must already be a complete file in a format your player supports. For reference, the batch transcription API supports WAV, MP3, and OGG formats.
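One way to test the format concern is to look at the first bytes of what was uploaded. The sketch below is an illustration only: it assumes the streamed bytes are raw 16 kHz, 16-bit, mono PCM whenever no RIFF/WAVE header is present (verify this against the output format configured on your synthesizer), and the helper name ensure_wav is hypothetical. It wraps headerless PCM in a WAV container in memory, so no local file is needed.

```python
import io
import wave


def ensure_wav(audio: bytes, sample_rate: int = 16000,
               sample_width: int = 2, channels: int = 1) -> bytes:
    """Return audio unchanged if it already starts with a RIFF/WAVE header;
    otherwise treat it as raw PCM (assumed parameters) and wrap it as WAV."""
    if audio[:4] == b"RIFF" and audio[8:12] == b"WAVE":
        return audio

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)   # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(audio)
    # The wave module does not close a file object it did not open,
    # so the in-memory buffer is still readable here.
    return buffer.getvalue()
```

The returned bytes can be passed to the upload call in place of the raw data. If the check shows the header is already present, the playback problem lies elsewhere.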

As part of debugging, you can also modify your code to perform streaming playback with a library like simpleaudio to confirm that the stream itself plays back properly. If it does, the problem more likely lies in the upload or download step, for example a differing encoding or format.

You can also try to download the audio file from the Azure Blob Storage and play it locally to see if it's working properly. If the audio file is still corrupted, you can try to re-upload the audio file to Azure Blob Storage.

You can also try chunked transfer uploading while posting the audio data, which can significantly reduce the latency. You can find more information on how to enable streaming in the sample code provided in various programming languages in the following link: https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/PronunciationAssessment.
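As a rough sketch of the chunked idea, the class below collects small audio chunks and hands them on in fixed-size blocks. BlockBuffer and its stage callable are hypothetical names, not part of any SDK. With the azure-storage-blob package, stage could presumably wrap BlobClient.stage_block, and the returned block IDs could be committed with BlobClient.commit_block_list once synthesis finishes; that wiring is an assumption and is not exercised here.

```python
import base64
from typing import Callable, List


class BlockBuffer:
    """Collects small audio chunks and forwards them in fixed-size blocks."""

    def __init__(self, stage: Callable[[str, bytes], None],
                 block_size: int = 4 * 1024 * 1024) -> None:
        self._stage = stage            # hypothetical uploader: stage(block_id, data)
        self._block_size = block_size
        self._pending = bytearray()
        self.block_ids: List[str] = []

    def write(self, chunk) -> int:
        """Buffer one chunk and forward every full block accumulated so far."""
        self._pending += chunk
        while len(self._pending) >= self._block_size:
            self._flush(self._block_size)
        return len(chunk)

    def close(self) -> List[str]:
        """Forward any remaining bytes and return the block IDs in order."""
        if self._pending:
            self._flush(len(self._pending))
        return self.block_ids

    def _flush(self, size: int) -> None:
        # Block IDs are base64 strings of equal length, numbered in order.
        index = f"{len(self.block_ids):08d}".encode()
        block_id = base64.b64encode(index).decode()
        self._stage(block_id, bytes(self._pending[:size]))
        del self._pending[:size]
        self.block_ids.append(block_id)
```

A push stream callback could call write from its own write method and close from its close method, so that audio leaves the process while synthesis is still running.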

I hope you understand. Thank you.

If this answers your query, do click Accept Answer and Yes for was this answer helpful.

LeetGPT 95 Reputation points

2024-02-15T17:11:17.33+00:00

@dupammi Okay it's working now. Thank you so much for your answer. There was a minor issue though, during testing I found that each generated audio has a duplicate audio segment at the very end. I attached my initial code below:

synthesis_future = self.synthesizer.start_speaking_text_async(text)
result = synthesis_future.get()
audio_data_stream = speechsdk.AudioDataStream(result)
audio_buffer = bytes(16000)
filled_size = audio_data_stream.read_data(audio_buffer)
while filled_size > 0:
    self.on_audio(audio_buffer)
    filled_size = audio_data_stream.read_data(audio_buffer)
# Mark the end of audio.
self.on_audio(None)

However if I change audio_buffer size to 2000

audio_buffer = bytes(2000)

this problem goes away. Any ideas? It's not a major issue, but I'm curious why it behaves differently with a different buffer size.

dupammi 8,615 Reputation points Microsoft External Staff

2024-02-16T01:43:13.04+00:00

Hi @LeetGPT

Glad to hear that the solution worked for you!

Regarding the issue with the duplicate audio segment at the end of each generated audio, it's possible that when you use a buffer size of 16000, the last audio segment is not fully read into the buffer before the end of the audio stream is reached, resulting in a duplicate segment at the end. When you use a buffer size of 2000, the last audio segment is fully read into the buffer before the end of the audio stream is reached, avoiding the duplicate segment.

To confirm whether this is the issue, you can try increasing the buffer size to a larger value, such as 32000, and see if the duplicate segment issue persists. If it does, you may try adjusting the code to handle the end of the audio stream more gracefully, such as by checking for the end of the stream explicitly and stopping the audio playback when it's reached.
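Another possible cause, judging only from the snippet posted above, is that the whole buffer is passed to on_audio even when the final read fills only part of it. The unfilled tail would then still hold bytes from the previous read, which would sound like a repeated segment. With a 2000-byte buffer the stale tail is much shorter and may simply be inaudible. A minimal sketch of forwarding only the filled part follows. drain_stream and FakeStream are hypothetical names used for illustration, and FakeStream only imitates a reader that fills a caller-supplied buffer and returns the number of bytes written.

```python
from typing import Callable


def drain_stream(read_data: Callable, on_audio: Callable, buffer) -> int:
    """Forward only the bytes filled by each read; return the total forwarded."""
    total = 0
    filled = read_data(buffer)
    while filled > 0:
        # Slice to the filled size so stale bytes from an earlier read
        # are never forwarded.
        on_audio(bytes(buffer[:filled]))
        total += filled
        filled = read_data(buffer)
    on_audio(None)  # Mark the end of audio, as in the original snippet.
    return total


class FakeStream:
    """Stand-in reader for illustration; not part of the Speech SDK."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def read_data(self, buffer: bytearray) -> int:
        chunk = self._data[self._pos:self._pos + len(buffer)]
        buffer[:len(chunk)] = chunk   # the tail keeps old bytes on a short read
        self._pos += len(chunk)
        return len(chunk)
```

In the original loop, the equivalent change would be to pass audio_buffer[:filled_size] to self.on_audio instead of audio_buffer.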

I hope this helps! Thank you.

Please do not forget to click Accept Answer and Yes for was this answer helpful, wherever the information provided helps you. This can be beneficial to other community members.
