Voice Model Selection Issue with WebSocket API in Speech SDK

Question


Catherine Lee 0

I'm trying to use the voice 'en-US-AnaNeural' in the following code:

speech_config = speechsdk.SpeechConfig(
    endpoint=f"wss://{server_config.speech_region}.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
    subscription=server_config.speech_key,
)
speech_config.speech_synthesis_voice_name = "en-US-AnaNeural"

However, the output is a male voice instead. I tried other voice models, and most work, but a few always produce the same male voice, so it seems these specific voices are not supported. Could this be because I'm using wss instead of API requests? No errors are raised at runtime.

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2024-09-23T08:12:22.3433333+00:00
@Catherine Lee Part of your code snippet seems to be missing above, but I assume you are setting the voice as follows. See the quickstart in the GH repo for reference.

speech_config.speech_synthesis_voice_name='en-US-AvaMultilingualNeural'

You should also look up the supported voices for your region, either through the voice list API or on the language support page.

Voice availability does not depend on the protocol, i.e. HTTP or wss. You might first try the voice on the Speech Studio audio content creation page to check that it works for you, and then use it programmatically.
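One way to run the voice list check is to download the list for the region and filter it by short name. The sketch below is only an illustration under stated assumptions: `fetch_voices` and `find_voice` are hypothetical helper names, and it assumes the REST voice list endpoint `https://<region>.tts.speech.microsoft.com/cognitiveservices/voices/list` with the key sent in the `Ocp-Apim-Subscription-Key` header.

```python
import json
import urllib.request


def fetch_voices(region: str, key: str) -> list:
    """Download the voice list for a region (assumed REST endpoint; needs a valid key)."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
    request = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def find_voice(voices: list, short_name: str):
    """Return the entry whose ShortName matches (case-insensitive), or None."""
    wanted = short_name.lower()
    for voice in voices:
        if voice.get("ShortName", "").lower() == wanted:
            return voice
    return None


# Example call (placeholders, not real credentials):
# entry = find_voice(fetch_voices("YourServiceRegion", "YourSubscriptionKey"), "en-US-AnaNeural")
# print(entry["Gender"] if entry else "voice not listed in this region")
```

If `find_voice` returns an entry, the voice is listed for that region and its `Gender` field can be compared with what is actually heard.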

Catherine Lee 0 Reputation points

2024-09-23T12:24:43.66+00:00

speech_config = speechsdk.SpeechConfig(endpoint=

Sorry, I didn’t check my question carefully. I looked up the list of supported voices for my region (eastus); it does include en-US-AnaNeural, and I also found this voice in Speech Studio, so it should be available and usable without problems. However, the output is a male voice.

  {
    "Name": "Microsoft Server Speech Text to Speech Voice (en-US, AnaNeural)",
    "DisplayName": "Ana",
    "LocalName": "Ana",
    "ShortName": "en-US-AnaNeural",
    "Gender": "Female",
    "Locale": "en-US",
    "LocaleName": "English (United States)",
    "SampleRateHertz": "48000",
    "VoiceType": "Neural",
    "Status": "GA",
    "WordsPerMinute": "135"
  },

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2024-09-23T13:47:44.9833333+00:00
@Catherine Lee AFAIK, when you use the SDK, the wss protocol is already used to reach the service. For the setup, you can change it to the following and re-run.

speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
# Set the voice name, refer to https://aka.ms/speech/voices/neural for full list.
speech_config.speech_synthesis_voice_name = "
print("Enter some text that you want to speak >")
text = input()
speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

The above should default to the speaker for output.

If you are using stream output, please check this sample and implement the stream callback the same way. You can run the sample from the GH repo as is, but make sure to set the speech synthesis voice name.
Catherine Lee 0 Reputation points

2024-09-24T06:06:50.02+00:00
I used the tts-text-stream sample but made a small modification by adding the PushAudioOutputStreamSampleCallback to output the audio chunk:

stream_callback = PushAudioOutputStreamSampleCallback()
push_stream = speechsdk.audio.PushAudioOutputStream(stream_callback)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

I attempted to modify as you suggested:

speech_config = speechsdk.SpeechConfig(subscription=server_config.speech_key, region=server_config.speech_region)

without adding the wss endpoint, which led to the synthesis being canceled, generating a log entry like this:

SpeechSynthesisResult(result_id=bc45a724b2ea4950b6523953c409360c, reason=ResultReason.Canceled, audio_length=0).
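The log line above shows only that the result was canceled, not why. A minimal sketch for surfacing the cause follows. It assumes the result object exposes a `reason` enum and, when canceled, a `cancellation_details` object with `reason` and `error_details`, as the SDK quickstarts use; `describe_cancellation` is a hypothetical helper name, and it compares enum names so it can be read without the SDK installed.

```python
def describe_cancellation(result) -> str:
    """Summarize why a synthesis result was canceled.

    Assumes `result` looks like the SDK's SpeechSynthesisResult: a `reason`
    enum and, when canceled, `cancellation_details` carrying `reason` and
    `error_details`.
    """
    reason_name = getattr(result.reason, "name", str(result.reason))
    if reason_name != "Canceled":
        return f"not canceled (reason={reason_name})"
    details = result.cancellation_details
    detail_reason = getattr(details.reason, "name", str(details.reason))
    message = f"canceled: {detail_reason}"
    # Error details are only meaningful when the cancellation was an error.
    if detail_reason == "Error" and details.error_details:
        message += f" ({details.error_details})"
    return message


# Example call against a real synthesizer (not executed here):
# print(describe_cancellation(speech_synthesizer.speak_text_async(text).get()))
```

The error details string, if present, typically indicates whether the cancellation came from authentication, connectivity, or the request itself, which is useful to include when reporting the problem.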

The SSML version works without issue, but when trying the streaming version to reduce latency, several voice models, like Ana, Ashley, and Amber, output as male voices. Can you help me troubleshoot this? I appreciate your assistance, as I really have no idea about where the issue lies.
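For comparison, the SSML route mentioned above names the voice inside the request itself rather than relying on the config property. The sketch below shows one way such a payload could be built; `build_ssml` is a hypothetical helper, not code from the sample, and the thread does not show the SSML that was actually used.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str, lang: str = "en-US") -> str:
    """Wrap plain text in SSML that names the voice explicitly."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # escape() protects characters such as & and < in the spoken text.
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


# Example call against a real synthesizer (not executed here):
# result = speech_synthesizer.speak_ssml_async(build_ssml("Hello", "en-US-AnaNeural")).get()
```

Because the voice travels with each request here, comparing this path with the text streaming path may help narrow down whether the voice name set on the config is being dropped.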
romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2024-09-24T06:36:28.4+00:00

@Catherine Lee I would recommend raising an issue on the SDK repo against the sample you are using, since the issue may be in the SDK. I am not from the SDK team, and raising an issue will help the SDK team address this scenario appropriately, since you are testing the same with OpenAI.
