Custom Neural Voice Training Problem II

Hero Li(李柏緯) 40 Reputation points
2023-10-27T07:31:50.66+00:00

Hello, fellow community members,

I have a question regarding the training process for Custom Neural Voice (CNV).

Our goal is to train a model to speak both Mandarin and English. Our dataset consists primarily of recordings of Taiwanese speakers whose first language is Mandarin Chinese (Traditional), with a smaller amount of English data.

Here's my question:

I'm currently unsure about which model to select for our project: "Simplified Chinese" or "Simplified Chinese + English Bilingual." I'm also looking for information on the corresponding training data requirements.

  1. Which model should I choose?
  2. Should I convert the dataset text to simplified Chinese?
  3. If I opt for the "Simplified Chinese + English Bilingual" model, what are the training data requirements?
     a. A mix of Chinese and English samples?
     b. The same sentence recorded once in Chinese and again in English?

I greatly appreciate your insights and advice in advance.

Azure AI Speech

Accepted answer
    dupammi 8,035 Reputation points Microsoft Vendor
    2023-10-27T13:10:17.2066667+00:00

    Hi @Hero Li(李柏緯) ,

    Thank you for using Microsoft Q&A.

    I understand that you are trying to use Custom Neural Voice (CNV) to train a model that speaks both Mandarin and English. I can help you with that.

    Selecting the right model for your project depends on the specific requirements and goals of your speech synthesis system. Here are some considerations to help you make an informed decision.

    To your questions:

    1. Since your goal is a voice that speaks both Mandarin and English, choose the "Simplified Chinese + English Bilingual" model, which handles both languages effectively.
    2. Yes, convert the dataset text to Simplified Chinese. The "Simplified Chinese + English Bilingual" model is designed for Simplified Chinese, not Traditional, so the transcripts from your Taiwanese speakers need to be converted before training.
    3. Blend Chinese and English text samples in the training data; there is no need to record the same sentence in both languages. For custom voice model training with the CNV service, you'll need a minimum of 5 hours of high-quality speech data. More details are available in the Azure documentation.
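    To illustrate point 2, here is a minimal sketch of bulk-converting Traditional Chinese transcripts to Simplified Chinese before upload. It assumes the open-source OpenCC converter (pip install opencc-python-reimplemented) and a hypothetical tab-separated transcript file named transcripts.txt, with one "utterance ID<TAB>script" pair per line.

    from opencc import OpenCC

    # "t2s" is OpenCC's Traditional-to-Simplified conversion profile.
    converter = OpenCC("t2s")

    # transcripts.txt is a hypothetical name; substitute your own transcript file.
    with open("transcripts.txt", encoding="utf-8") as src, \
         open("transcripts_simplified.txt", "w", encoding="utf-8") as dst:
        for line in src:
            utterance_id, _, script = line.rstrip("\n").partition("\t")
            # English text passes through OpenCC unchanged, so the same
            # conversion is safe for a mixed Chinese/English dataset.
            dst.write(f"{utterance_id}\t{converter.convert(script)}\n")

    Only the transcript text needs converting; the audio recordings themselves are unaffected.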

    Out of the box, text-to-speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work very well in most text-to-speech scenarios if a unique voice isn't required.

    To demonstrate the prebuilt neural voices, I used the Azure Cognitive Services Speech SDK to perform speech synthesis in a simple use case based on your requirement.

    Here is a sample code snippet using the prebuilt neural voices provided by Azure Cognitive Services.

    import os
    import azure.cognitiveservices.speech as speechsdk

    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION".
    speech_key = os.environ["SPEECH_KEY"]
    service_region = os.environ["SPEECH_REGION"]

    # Create a SpeechConfig object with the subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    # Write the synthesized audio to a file; pass use_default_speaker=True instead
    # to play it through the default speaker.
    output_filename = "output_audio_voice1.wav"
    audio_config = speechsdk.audio.AudioOutputConfig(filename=output_filename)

    # Create a SpeechSynthesizer object.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

    # Synthesize SSML with both Chinese and English text, each spoken by a
    # prebuilt neural voice for its language.
    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="zh-CN">

      <voice name="zh-CN-XiaoxiaoNeural">
        <mstts:express-as style="cheerful">
          你好,我叫小萌,我会说中文。
        </mstts:express-as>
      </voice>

      <voice name="en-US-JennyNeural">
        Hello, my name is Jenny, and I can speak English.
      </voice>
    </speak>
    """
    result = speech_synthesizer.speak_ssml_async(ssml).get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Audio written to {output_filename}")
    elif result.reason == speechsdk.ResultReason.Canceled:
        print("Speech synthesis canceled:", result.cancellation_details.error_details)
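    To run the snippet, install the Speech SDK (pip install azure-cognitiveservices-speech) and export your Speech resource key and region as SPEECH_KEY and SPEECH_REGION. zh-CN-XiaoxiaoNeural and en-US-JennyNeural are prebuilt neural voices; once your custom neural voice is deployed, you can reference its voice name in the same SSML (deployed custom voices also require setting the deployment's endpoint ID on the SpeechConfig).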
    
    

    Hope this helps. Thank You!


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.

    1 person found this answer helpful.

0 additional answers
