Azure Text-to-Speech: API file output significantly lower quality than Speech Studio export

Lander 11 Reputation points
2023-01-06T03:27:07.993+00:00

This issue was previously raised in this thread: https://learn.microsoft.com/en-us/answers/questions/539476/text-to-speech-poor-quality-python.html, but there didn't appear to be any resolution.

I have created the following two audio files of the same text (the Mandarin word "他们" rendered in the 'Yunye' voice profile) by two different means. The first comes from a small Python CLI modeled on the examples in the documentation, slightly modified so that it calls the speech service API and writes a .wav file to disk, with the file name generated from the input text. The code is shown further down. The second comes from the 'Speech Studio' page and its export functionality.

Via API call
Via Speech Studio

You will notice that the second is of significantly higher audio quality. Nominally, one 'solution' is to just use the Speech Studio page, but this is an appreciably slower workflow when I am rendering lots of individual very small chunks of audio--my use case is generating audio samples for listening practice in my drive to learn Mandarin, so often I'm doing individual vocabulary words. I would like to be able to do the same for other languages when that time comes as well.

Below is the code for my Python script:

import os  
import azure.cognitiveservices.speech as speechsdk  
  
print("Enter the text to use for speech synthesis >")  
text = input()  
  
speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))  
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=False, filename="{}.wav".format(text))  
  
speech_config.speech_synthesis_voice_name='zh-CN-YunyeNeural'  
  
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)  
  
  
speech_synth_result = speech_synthesizer.speak_text_async(text).get()  
  
if speech_synth_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:  
  print("Speech synthesized for text [{}]".format(text))  
elif speech_synth_result.reason == speechsdk.ResultReason.Canceled:  
  cancellation_details = speech_synth_result.cancellation_details  
  print("Speech synthesis canceled: {}".format(cancellation_details.reason))  
  if cancellation_details.reason == speechsdk.CancellationReason.Error:  
    if cancellation_details.error_details:  
      print("Error details: {}".format(cancellation_details.error_details))  
      print("Did you set the speech resource key and region values?")  
  

Currently this is a pretty blunt tool, but at least I can refactor it over time to make it suit my needs more. But, why the vast gap in quality between the two methods? Am I doing something wrong in the code?


1 answer

  1. romungi-MSFT 48,406 Reputation points Microsoft Employee
    2023-01-06T11:33:56.943+00:00

    @Lander You could make a small change to this code: set the output format on the speech config object to the RIFF format Riff24Khz16BitMonoPcm (24 kHz sample rate, 16-bit, mono PCM).

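    # Set the output format on the config before the SpeechSynthesizer is created from it.
    speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm)
    
    # speech_synthesizer is the SpeechSynthesizer object from the script in the question.
    result = speech_synthesizer.speak_text_async(text).get()
    stream = speechsdk.AudioDataStream(result)
    stream.save_to_wav_file("path/to/write/file.wav")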
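
For rendering many short words in one run, the same idea can be sketched roughly as below. This is only an illustration, not the documented sample: `safe_filename` and `synthesize_words` are hypothetical names, the SDK is imported inside the function purely so the file name helper can be used on its own, and the voice and environment variable names are taken from the question.

```python
import os
import re


def safe_filename(text):
    """Hypothetical helper: build a .wav file name from the input text."""
    # Replace whitespace and characters that are invalid in file names on common file systems.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", text.strip())
    return (cleaned or "untitled") + ".wav"


def synthesize_words(words, voice="zh-CN-YunyeNeural", out_dir="."):
    """Hypothetical helper: write one WAV file per word, reusing a single synthesizer."""
    # Imported here (an assumption of this sketch) so safe_filename() works without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ.get("SPEECH_KEY"),
        region=os.environ.get("SPEECH_REGION"),
    )
    speech_config.speech_synthesis_voice_name = voice
    # The output format is set on the config before the synthesizer is created from it.
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm
    )

    # audio_config=None keeps the audio in the result object instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    for word in words:
        result = synthesizer.speak_text_async(word).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            path = os.path.join(out_dir, safe_filename(word))
            speechsdk.AudioDataStream(result).save_to_wav_file(path)
            print("Wrote {}".format(path))
        elif result.reason == speechsdk.ResultReason.Canceled:
            details = result.cancellation_details
            print("Canceled for [{}]: {}".format(word, details.reason))
            if details.error_details:
                print("Error details: {}".format(details.error_details))
```

Calling `synthesize_words(["他们", "我们"])` would then be expected to produce `他们.wav` and `我们.wav` in the current directory, provided the key and region are set in the environment.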
    

    In Speech Studio you can choose the format when you export the audio, but with the SDK you have to set it yourself on the config. The default is Raw24Khz16BitMonoPcm
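
To check which format a given file actually ended up in, one option is to read its WAV header with Python's standard `wave` module and compare the API output with the Speech Studio export. This is a small sketch; `describe_wav` is a hypothetical helper name, and it only handles PCM WAV files with a RIFF header (a headerless raw PCM file should be rejected by `wave.open`).

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) read from a WAV header."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() is in bytes per sample, so multiply by 8 for bits.
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()
```

If the two files report different sample rates, that difference alone may account for much of the audible gap.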


    1 person found this answer helpful.
