TTS output is different for different SSML

MH 1

I am generating samples to test out the TTS SSML prosody functionality for ja-JP-NanamiNeural.

I expected rate="-50%" to produce slower speech, but in Sample 1 the text inside the prosody tag comes out faster than expected. Likewise, in Sample 2 I expected rate="+50%" to produce faster speech, but slower audio was generated.

Samples 3-6 are generated as expected for their rate: -50% is slower and +50% is faster.

Is this output audio expected for Samples 1 and 2?

Sample 1

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="ja-JP" xmlns:mstts="https://www.w3.org/2001/mstts">  
  <voice name="ja-JP-NanamiNeural">  
    こんにちは世界。<prosody rate="-50%" pitch="0%">これはテス</prosody>ト文 1 です。これはテスト文 2 です。  
  </voice>  
</speak>

Sample 2

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="ja-JP" xmlns:mstts="https://www.w3.org/2001/mstts">  
  <voice name="ja-JP-NanamiNeural">  
    こんにちは世界。<prosody rate="+50%" pitch="0%">これはテス</prosody>ト文 1 です。これはテスト文 2 です。  
  </voice>  
</speak>

Sample 3

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="ja-JP" xmlns:mstts="https://www.w3.org/2001/mstts">  
  <voice name="ja-JP-NanamiNeural">  
    こんにちは世界。<prosody rate="-50%" pitch="0%">これはテスト文 1 です。</prosody>これはテスト文 2 です。  
  </voice>  
</speak>

Sample 4

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="ja-JP" xmlns:mstts="https://www.w3.org/2001/mstts">  
  <voice name="ja-JP-NanamiNeural">  
    こんにちは世界。<prosody rate="+50%" pitch="0%">これはテスト文 1 です。</prosody>これはテスト文 2 です。  
  </voice>  
</speak>

Sample 5

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="ja-JP" xmlns:mstts="https://www.w3.org/2001/mstts">  
  <voice name="ja-JP-NanamiNeural">  
    <prosody rate="-50%" pitch="0%">  
        こんにちは世界。これはテスト文 1 です。これはテスト文 2 です。  
    </prosody>  
  </voice>  
</speak>

Sample 6

<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="ja-JP" xmlns:mstts="https://www.w3.org/2001/mstts">  
  <voice name="ja-JP-NanamiNeural">  
    <prosody rate="+50%" pitch="0%">  
        こんにちは世界。これはテスト文 1 です。これはテスト文 2 です。  
    </prosody>  
  </voice>  
</speak>
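
The six samples differ only in which span the prosody element wraps and in the rate value, so they could be generated from one template instead of being edited by hand. Below is a minimal Python sketch of that idea. `build_ssml` and `build_matrix` are hypothetical helper names introduced here, not part of any SDK. The sketch also assumes the `http://` form of the W3C synthesis namespace and leaves out the unused `mstts` namespace.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: the http:// form of the W3C SSML namespace.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
VOICE = "ja-JP-NanamiNeural"
SENTENCES = ["こんにちは世界。", "これはテスト文 1 です。", "これはテスト文 2 です。"]


def build_ssml(before, target, after, rate, voice=VOICE):
    """Wrap only `target` in <prosody>; `before` and `after` keep the default rate."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="ja-JP">'
        f'<voice name="{voice}">'
        f'{escape(before)}'
        f'<prosody rate="{rate}" pitch="0%">{escape(target)}</prosody>'
        f'{escape(after)}'
        f'</voice></speak>'
    )


def build_matrix():
    """Return {(scope, rate): ssml} for the phrase, sentence, and whole-text cases."""
    hello, s1, s2 = SENTENCES
    phrase = "これはテス"           # partial span wrapped in Samples 1 and 2
    rest = s1[len(phrase):]        # remainder of the second sentence
    spans = {
        "phrase": (hello, phrase, rest + s2),   # Samples 1 and 2
        "sentence": (hello, s1, s2),            # Samples 3 and 4
        "all": ("", hello + s1 + s2, ""),       # Samples 5 and 6
    }
    return {
        (scope, rate): build_ssml(*parts, rate)
        for scope, parts in spans.items()
        for rate in ("-50%", "+50%")
    }


if __name__ == "__main__":
    for key, ssml in build_matrix().items():
        ET.fromstring(ssml)  # raises ParseError if the SSML is not well-formed XML
        print(key, ssml)
```

Swapping `VOICE` for another voice name would produce the same matrix for a side-by-side comparison between voices.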

romungi-MSFT 42,206 Reputation points Microsoft Employee

2022-11-23T05:47:02.88+00:00
@MH I recently had a similar query from a different user about en-US voices. From what I found, the speaking rate can be applied at the word or sentence level, and the rate change should stay within 0.5 to 2 times the original audio. I am not a native speaker of Japanese, but the last two SSMLs seem to apply the rate to the entire text rather than to a word or phrase as in the other SSMLs. Do you think this could be the reason for this behavior? Did you apply the same scenario to the other SSMLs and check whether it works? I took Sample 4 and split it into sentences in the Speech Studio Audio Content Creation tool. The output of this SSML seems consistent with the settings.

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="ja-JP"><voice name="ja-JP-NanamiNeural"> こんにちは世界。 <prosody rate="+50%" pitch="0%">これはテスト文 1 です。</prosody> これはテスト文 2 です。 </voice></speak>
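
To check a rate value against the 0.5 to 2 times limit mentioned above, it can help to convert the SSML rate string into a multiplier first. The following is a small Python sketch, not an official API. It assumes a percentage is a relative change from the default rate (so "+50%" maps to 1.5 and "-50%" to 0.5) and that a bare number is already a multiplier. `rate_to_multiplier` and `within_limits` are hypothetical names.

```python
def rate_to_multiplier(rate: str) -> float:
    """Convert an SSML rate such as "+50%", "-50%", or "1.5" to a multiplier.

    Assumption: a percentage is a relative change from the default rate,
    and a bare number is already a multiplier of the default rate.
    """
    value = rate.strip()
    if value.endswith("%"):
        return 1.0 + float(value[:-1]) / 100.0
    return float(value)


def within_limits(rate: str, low: float = 0.5, high: float = 2.0) -> bool:
    """True if the rate falls inside the 0.5x to 2x range quoted in this thread."""
    return low <= rate_to_multiplier(rate) <= high


if __name__ == "__main__":
    for rate in ("-50%", "+50%", "+100%", "+200.00%"):
        print(rate, rate_to_multiplier(rate), within_limits(rate))
```

Under these assumptions, both -50% and +50% fall inside the range, while a value such as +200% would map to 3 times the default rate and fall outside it.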
MH 1 Reputation point

2022-11-23T10:03:29.497+00:00

@romungi-MSFT Thanks for looking into this.

Samples 3 to 6 generate audio output as expected: the speaking rate follows the input value, i.e. rate="-50%" makes the audio slower and rate="+50%" makes it faster.

For example, the text is made up of 3 sentences: "Hello world. This is test sentence 1. This is test sentence 2."

Samples 3 and 4 are generating either slower (-50%) or faster (+50%) audio, respectively, for the second sentence - "This is test sentence 1"

Samples 5 and 6 are generating either slower (-50%) or faster (+50%) audio, respectively, for the text - "Hello world. This is test sentence 1. This is test sentence 2."

I was expecting Samples 1 and 2 to generate slower (-50%) or faster (+50%) audio for the text inside the SSML tag at word level, "This is", from the sentence "This is test sentence 1". However, I am getting the opposite result: +50% generated slower audio and -50% generated faster audio.

I was wondering if this is expected for NanamiNeural, because when I tested the Sample 1 and 2 inputs at word level with other Japanese voices (Aoi, Keita), the audio output was consistent with Samples 3 to 6.

Hope this clarifies.
romungi-MSFT 42,206 Reputation points Microsoft Employee

2022-11-28T07:06:16.733+00:00

@MH I have tested this scenario with some English sentences using the neural voice Jenny Multilingual. I think the word-level rate setting works as long as the rate stays within 0.5 to 2 times the original audio. If you apply a word-level rate change and play only the affected words, they sound fast or slow according to the rate that is set. But when they are combined with the rest of the sentence, the output seems to adjust toward the default rate so that the entire sentence remains clearly audible.

For example, if you set a rate more than 2 times the default for part of a sentence, the audio for the entire sentence sounds odd, and the listener cannot make out the part that is too fast or too slow. If you set the rate close to the default, the audio still speeds up or slows down for that part of the sentence, but you might be unable to distinguish the change.

I would recommend lengthening the sentence, increasing the rate of the words you want to about 2 times, and then playing first only the words with the higher rate, then the entire sentence, and then the complete set of sentences. This should give you a clear idea of how the API adjusts the rate to keep the speech understandable. Here is an example of what I tried and how it might help you do the same for Japanese.

P.S.: I used Speech Studio to test this scenario.


Example SSML:

    <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-JennyMultilingualNeural">Hello world. <prosody rate="+200.00%">This is test</prosody> sentence 1, Welcome to Microsoft Cognitive Services Text-to-Speech API. This is test sentence 2  
    Hello world. This is test sentence 1. This is test sentence 2</voice></speak>
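
Comparing the clips by ear is subjective, so one option is to measure how long each synthesized clip lasts and derive the observed speed-up from that. The sketch below is a rough Python illustration using only the standard library. It assumes each clip was saved as a standard WAV file with a correct header, and that the same text was synthesized once at the default rate and once with the prosody rate applied. `wav_duration_seconds` and `observed_rate` are hypothetical helper names.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV clip, computed from its frame count and sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def observed_rate(baseline_seconds: float, adjusted_seconds: float) -> float:
    """Speed-up of the adjusted clip relative to the baseline (above 1 means faster)."""
    return baseline_seconds / adjusted_seconds


if __name__ == "__main__":
    # Hypothetical file names: the same phrase at the default rate and at +50%.
    with open("phrase_default.wav", "rb") as f:
        baseline = wav_duration_seconds(f.read())
    with open("phrase_plus50.wav", "rb") as f:
        adjusted = wav_duration_seconds(f.read())
    print(f"observed rate: {observed_rate(baseline, adjusted):.2f}x")
```

Running this for the words alone, the whole sentence, and the full set of sentences would show whether the measured change matches the requested rate at each level. Leading and trailing silence in the clips would skew the ratio, so the numbers are only indicative.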