This issue was previously raised in this thread: https://learn.microsoft.com/en-us/answers/questions/539476/text-to-speech-poor-quality-python.html, but it does not appear to have been resolved.
I have created the following two audio files of the same text (the Mandarin word "他们", rendered with the 'Yunye' voice) by two different means. The first was produced by a small Python CLI modeled on the examples in the documentation, lightly modified so that it calls the Speech service API and writes a .wav file to disk, named after the input text; the code is shown below. The second was produced through the 'Speech Studio' page and its export functionality.
Via API call: [audio attachment]
Via Speech Studio: [audio attachment]
You will notice that the second is of significantly higher audio quality. Nominally, one 'solution' is to just use the Speech Studio page, but that is an appreciably slower workflow when I am rendering many very small chunks of audio. My use case is generating samples for listening practice as I learn Mandarin, so I am often synthesizing individual vocabulary words, and I would like to be able to do the same for other languages when the time comes.
Below is the code for my Python script:
import os
import azure.cognitiveservices.speech as speechsdk

print("Enter the text to use for speech synthesis >")
text = input()

# The resource key and region are read from environment variables.
speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))

# Write the audio to a .wav file named after the input text rather than
# playing it through the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=False, filename="{}.wav".format(text))

speech_config.speech_synthesis_voice_name = 'zh-CN-YunyeNeural'

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
speech_synth_result = speech_synthesizer.speak_text_async(text).get()

if speech_synth_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized for text [{}]".format(text))
elif speech_synth_result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = speech_synth_result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        if cancellation_details.error_details:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
Currently this is a pretty blunt tool, but I can refactor it over time to better suit my needs. The question stands, though: why the vast gap in quality between the two methods? Am I doing something wrong in the code?