This issue was previously raised in this thread: https://learn.microsoft.com/en-us/answers/questions/539476/text-to-speech-poor-quality-python.html, but it does not appear to have been resolved.
I have created the following two audio files of the same text (the Mandarin word "他们", rendered with the 'Yunye' voice) by two different means. The first was produced by a small Python CLI modeled on the examples in the documentation, lightly modified so that it calls the Speech service API and writes a .wav file to disk, named after the input text; the code is shown below. The second was produced through the 'Speech Studio' page and its export functionality.
Via API call: [audio attachment]
Via Speech Studio: [audio attachment]
You will notice that the second is of significantly higher audio quality. Nominally, one 'solution' is to just use the Speech Studio page, but that is an appreciably slower workflow when I am rendering many very small chunks of audio. My use case is generating samples for listening practice as I learn Mandarin, so I am often synthesizing individual vocabulary words, and I would like to be able to do the same for other languages when the time comes.
Below is the code for my Python script:
import os
import azure.cognitiveservices.speech as speechsdk

print("Enter the text to use for speech synthesis >")
text = input()

# The resource key and region are read from environment variables.
speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))

# Write the audio to a .wav file named after the input text rather than
# playing it through the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=False, filename="{}.wav".format(text))

speech_config.speech_synthesis_voice_name = 'zh-CN-YunyeNeural'

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
speech_synth_result = speech_synthesizer.speak_text_async(text).get()

if speech_synth_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized for text [{}]".format(text))
elif speech_synth_result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = speech_synth_result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        if cancellation_details.error_details:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
Currently this is a pretty blunt tool, but I can refactor it over time to better suit my needs. The question stands, though: why the vast gap in quality between the two methods? Am I doing something wrong in the code?