Custom Neural Voice Training Problem II

Hero Li(李柏緯) 40 Reputation points
2023-10-27T07:31:50.66+00:00

Hello, fellow community members,

I have a question regarding the training process for Custom Neural Voice (CNV).

Our goal is to train a model to speak both Mandarin and English. Our dataset consists primarily of recordings of Taiwanese speakers whose first language is Mandarin Chinese (Traditional), with a smaller amount of English data.

Here's my question:

I'm currently unsure about which model to select for our project: "Simplified Chinese" or "Simplified Chinese + English Bilingual." I'm also looking for information on the corresponding training data requirements.

  1. Which model should I choose?
  2. Should I convert the dataset text to simplified Chinese?
  3. If I opt for the "Simplified Chinese + English Bilingual" model, what are the training data requirements?
     a. A mix of Chinese and English samples?
     b. The same sentence recorded once in Chinese and again in English?

I greatly appreciate your insights and advice in advance.

Azure AI Speech

Accepted answer
    dupammi 8,035 Reputation points Microsoft Vendor
    2023-10-27T13:10:17.2066667+00:00

    Hi @Hero Li(李柏緯) ,

    Thank you for using Microsoft Q&A.

    I understand that you are trying to use Custom Neural Voice (CNV) to train a model that speaks both Mandarin and English. I can help you with that.

    Selecting the right model for your project depends on the specific requirements and goals of your speech synthesis system. Here are some considerations to help you make an informed decision.

    To your questions:

    1. Since your goal is a voice that speaks both Mandarin and English, choose the "Simplified Chinese + English Bilingual" model, which handles both languages effectively.
    2. Yes, convert the dataset text to Simplified Chinese. The "Simplified Chinese + English Bilingual" model is designed for Simplified Chinese, not Traditional, so the transcripts from your Taiwanese speakers need to be converted before training.
    3. Blend Chinese and English text samples in the training data; there is no need to record the same sentence in both languages. For custom voice model training with the CNV service, you'll need a minimum of 5 hours of high-quality speech data. More details are available in the Azure documentation.
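    To illustrate point 2, here is a minimal sketch of bulk-converting Traditional Chinese transcripts to Simplified Chinese before upload. It assumes the open-source OpenCC converter (pip install opencc-python-reimplemented) and a hypothetical tab-separated transcript file named transcripts.txt, with one "utterance ID<TAB>script" pair per line.

    from opencc import OpenCC

    # "t2s" is OpenCC's Traditional-to-Simplified conversion profile.
    converter = OpenCC("t2s")

    # transcripts.txt is a hypothetical name; substitute your own transcript file.
    with open("transcripts.txt", encoding="utf-8") as src, \
         open("transcripts_simplified.txt", "w", encoding="utf-8") as dst:
        for line in src:
            utterance_id, _, script = line.rstrip("\n").partition("\t")
            # English text passes through OpenCC unchanged, so the same
            # conversion is safe for a mixed Chinese/English dataset.
            dst.write(f"{utterance_id}\t{converter.convert(script)}\n")

    Only the transcript text needs converting; the audio recordings themselves are unaffected.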

    Out of the box, text-to-speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work very well in most text-to-speech scenarios if a unique voice isn't required.

    To demonstrate the prebuilt neural voices, I used the Azure Cognitive Services Speech SDK to perform speech synthesis in a simple use case based on your requirement.

    Here is a sample code snippet using the prebuilt neural voices provided by Azure Cognitive Services.

    import os
    import azure.cognitiveservices.speech as speechsdk

    # This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION".
    speech_key = os.environ["SPEECH_KEY"]
    service_region = os.environ["SPEECH_REGION"]

    # Create a SpeechConfig object with the subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

    # Write the synthesized audio to a file; pass use_default_speaker=True instead
    # to play it through the default speaker.
    output_filename = "output_audio_voice1.wav"
    audio_config = speechsdk.audio.AudioOutputConfig(filename=output_filename)

    # Create a SpeechSynthesizer object.
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

    # Synthesize SSML with both Chinese and English text, each spoken by a
    # prebuilt neural voice for its language.
    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="zh-CN">

      <voice name="zh-CN-XiaoxiaoNeural">
        <mstts:express-as style="cheerful">
          你好,我叫小萌,我会说中文。
        </mstts:express-as>
      </voice>

      <voice name="en-US-JennyNeural">
        Hello, my name is Jenny, and I can speak English.
      </voice>
    </speak>
    """
    result = speech_synthesizer.speak_ssml_async(ssml).get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print(f"Audio written to {output_filename}")
    elif result.reason == speechsdk.ResultReason.Canceled:
        print("Speech synthesis canceled:", result.cancellation_details.error_details)
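    To run the snippet, install the Speech SDK (pip install azure-cognitiveservices-speech) and export your Speech resource key and region as SPEECH_KEY and SPEECH_REGION. zh-CN-XiaoxiaoNeural and en-US-JennyNeural are prebuilt neural voices; once your custom neural voice is deployed, you can reference its voice name in the same SSML (deployed custom voices also require setting the deployment's endpoint ID on the SpeechConfig).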
    
    

    Hope this helps. Thank You!


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.

    1 person found this answer helpful.

0 additional answers
