Hey team,
I've been using Azure TTS as shown in the code below, but I've found that the WAV file returned by the service is sometimes corrupted and not playable. Here is one of the audio files that I received from Azure TTS and held in memory:
https://storage.googleapis.com/leetgpt-audio/clrpzssxg00047kglu71ywayr/cls40f9o80003n4qd9ck4n09f/assistant_2024-02-02_05-27-38/0.wav
I can confirm this has nothing to do with the GCS upload I'm using, since the file is also not playable after I download it locally. I can't attach the original WAV file here because the format is not allowed.
I've also noticed an increasing number of client errors on my TTS service metrics dashboard; I'm not sure whether this is related to the issue.
This behavior is also quite flaky: the returned audio file is playable only 60% of the time.
Could someone help me take a look and let me know what the issue is?
import azure.cognitiveservices.speech as speechsdk
import os
import uuid


class AzureTTS:
    # https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts#voice-styles-and-roles
    # https://speech.microsoft.com/portal/1e44aaf2148347e5a53c696ab0175042/voicegallery
    # Default: en-US-BrianNeural
    # Indian: mr-IN-ManoharNeural
    # Chinese: zh-CN-YunxiNeural
    # HR-default: en-US-AvaNeural
    def __init__(self,
                 on_audio,
                 voice='en-US-BrianNeural',
                 on_completion=None):
        # Initialize the speech configuration using environment variables
        self.speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'),
                                                    region=os.environ.get('SPEECH_REGION'))
        self.speech_config.speech_synthesis_voice_name = voice
        # Create a synthesizer with no audio config to use for preconnecting
        self.synthesizer = speechsdk.SpeechSynthesizer(self.speech_config, audio_config=None)
        self.on_audio = on_audio
        self.on_completion = on_completion
        self.synthesizer.synthesis_completed.connect(self.__speech_synthesizer_synthesis_completed_cb)
        # Preconnect
        self.connection = speechsdk.Connection.from_speech_synthesizer(self.synthesizer)
        self.connection.open(True)

    def synthesize(self,
                   text: str):
        if not text or len(text) == 0:
            return
        # Start text-to-speech process
        synthesis_future = self.synthesizer.start_speaking_text_async(text)
        result = synthesis_future.get()
        audio_data_stream = speechsdk.AudioDataStream(result)
        id = uuid.uuid4()
        audio_data_stream.save_to_wav_file(f"{id}.wav")
        self.on_audio(f"{id}.wav")

    def __speech_synthesizer_synthesis_completed_cb(self, evt: speechsdk.SessionEventArgs):
        """
        Callback that signals the event: synthesis completed.
        It returns the audio duration of the synthesized speech.
        """
        if self.on_completion is not None:
            self.on_completion(evt.result.audio_duration)
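
To make the "playable vs. not playable" observation above reproducible, a check along the following lines could be run on each saved file before it is passed to `on_audio`. This is only a minimal sketch using Python's standard `wave` module; the helper name `is_playable_wav` is hypothetical and not part of the code above or of the Speech SDK, and it assumes the output is uncompressed PCM WAV. It detects malformed headers, empty audio, and data that is shorter than the header declares, but it cannot tell why a file is bad.

```python
import wave


def is_playable_wav(path: str) -> bool:
    """Return True if `path` parses as a PCM WAV file whose audio data is complete.

    Hypothetical helper for triage only; not part of the Azure Speech SDK.
    """
    try:
        with wave.open(path, "rb") as wav_file:
            frame_count = wav_file.getnframes()
            frame_size = wav_file.getsampwidth() * wav_file.getnchannels()
            # Read everything the header promises and compare lengths,
            # so a truncated data chunk is reported as not playable.
            data = wav_file.readframes(frame_count)
            return frame_count > 0 and len(data) == frame_count * frame_size
    except (wave.Error, EOFError, OSError):
        # Malformed header, unexpectedly short file, or unreadable path.
        return False
```

Logging the result of such a check next to the request time would show whether the unplayable files line up with the client errors on the metrics dashboard.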