Hi @Hero Li (李柏緯),
Thank you for using Microsoft Q&A.
I understand that you are trying to implement Custom Neural Voice (CNV) to train a model to speak both Mandarin and English. I can help you with that.
Selecting the right model for your project depends on the specific requirements and goals of your speech synthesis system. Here are some considerations to help you make an informed decision.
If your goal is to train a model that speaks both Mandarin and English, choose the "Simplified Chinese + English Bilingual" model, which handles both languages effectively. A few points to keep in mind:

- The "Simplified Chinese + English Bilingual" model is designed for Simplified Chinese, not Traditional Chinese. Since your dataset primarily consists of Taiwanese speakers whose transcripts are in Traditional Chinese, convert the transcripts to Simplified Chinese before training.
- Blend Chinese and English text samples in your training data; there is no need to repeat samples.
- For custom voice model training with the CNV service, you'll need a minimum of 5 hours of high-quality speech data. More details are available in the Azure documentation.
Out of the box, text-to-speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work very well in most text-to-speech scenarios when a unique voice isn't required.
As a demo of the prebuilt neural voices, I used the Azure Cognitive Services Speech SDK to perform speech synthesis in a simple use case based on your requirement. Here is a sample code snippet using the standard built-in neural voices provided by Azure Cognitive Services:
import os
import azure.cognitiveservices.speech as speechsdk

# This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
speech_key = os.environ["SPEECH_KEY"]
service_region = os.environ["SPEECH_REGION"]

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Write the synthesized audio to a file; pass use_default_speaker=True instead
# to play it through the default speaker
audio_config = speechsdk.audio.AudioOutputConfig(filename="output_audio_voice1.wav")

# Create a SpeechSynthesizer object
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

# SSML with both Chinese and English text, each spoken by a matching neural voice.
# Note: the original zh-CN-HuihuiRUS standard voice has been retired, so a neural
# voice (zh-CN-XiaoxiaoNeural) is used here, with the cheerful style applied via
# the mstts:express-as element.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="zh-CN">
    <voice name="zh-CN-XiaoxiaoNeural">
        <mstts:express-as style="cheerful">
            你好,我叫小萌,我会说中文。
        </mstts:express-as>
    </voice>
    <voice name="en-US-JennyNeural">
        Hello, my name is Jenny, and I can speak English.
    </voice>
</speak>
"""

result = speech_synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Audio saved to output_audio_voice1.wav")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Synthesis canceled: {cancellation.reason}, {cancellation.error_details}")
Hope this helps. Thank you!
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.