How to use contour with rate in prosody via SSML in Text-to-Speech?

thekingofcity 25 Reputation points
2023-09-26T14:51:23.9033333+00:00

The doc provides an example that uses only contour:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody contour="(60%,-60%) (100%,+80%)" >
            Were you the only person in the room?
        </prosody>
    </voice>
</speak>

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice?source=recommendations#change-pitch-contour-example

This works as expected in my demo with microsoft-cognitiveservices-speech-sdk. However, when I add a rate attribute to it, the request fails with an SSML parsing error: 0x80045003 - The caller has spec websocket error code: 1007

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <prosody rate="+25.00%" contour="(60%,-60%) (100%,+80%)" >
            Were you the only person in the room?
        </prosody>
    </voice>
</speak>

Also, this SSML is valid in Speech Studio, and I can hear the difference between it and the example above.

Does anyone know what's going wrong here? Thanks in advance :)

Azure AI Speech

Accepted answer
  1. VasaviLankipalle-MSFT 17,021 Reputation points
    2023-09-28T02:06:43.19+00:00

    Hello @thekingofcity, thanks for sharing the detailed information.

    I understand the problem. I believe the issue is that the contour attribute of the prosody element is formatted incorrectly.

    As we know, contour represents the changes in pitch at different points in the utterance: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice?source=recommendations#adjust-prosody

    Contour represents changes in pitch. These changes are represented as an array of targets at specified time positions in the speech output. Sets of parameter pairs define each target. For example:
    <prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">
    The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch by using a relative value or an enumeration value for pitch (see pitch).

    Here, the second value in each contour target specifies the pitch. So when that value is expressed as a percentage, it must be a number preceded by "+" or "-" and followed by "%".
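
As a rough illustration of the rule above, here is a minimal Python sketch that scans a contour string and reports percentage pitch values that lack a leading "+" or "-". The function name `unsigned_percent_targets` and both regular expressions are hypothetical helpers written for this answer, not part of the Speech SDK, and they assume each target has the form "(position, value)". Values in other units (Hz, st) are ignored rather than validated.

```python
import re

# Assumption: one contour target looks like "(<position>,<value>)".
TARGET_RE = re.compile(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")
# A signed percentage such as "+0%", "-60%" or "+12.5%".
SIGNED_PERCENT_RE = re.compile(r"^[+-]\d+(\.\d+)?%$")


def unsigned_percent_targets(contour):
    """Return the pitch values that are percentages but have no leading sign."""
    bad = []
    for _position, value in TARGET_RE.findall(contour):
        # Only percentage values are checked; "+20Hz" or "-2st" are skipped.
        if value.endswith("%") and not SIGNED_PERCENT_RE.match(value):
            bad.append(value)
    return bad


if __name__ == "__main__":
    print(unsigned_percent_targets("(60%,0%) (100%,+25%)"))
    print(unsigned_percent_targets("(60%, +0%)(100%, +25%)"))
```

A check like this could run before the SSML is sent to the service, so a missing sign is caught locally instead of surfacing as a parsing error.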

    Your first example with voice name="en-US-JennyNeural" worked because its contour attribute was set correctly.

    I reproduced your query and got the same error; after correcting the attributes, it worked well on my end. In your query, the "+" is missing in "(60%,0%)", which should read "(60%,+0%)", and that caused this issue. As far as I know, there is no other issue with the voice.

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="zh-CN">
        <voice name="zh-CN-XiaoqiuNeural">
            <prosody pitch="50%" rate="+50.00%" contour="(60%, +0%)(100%, +25%)" >
                可换乘
            </prosody>
        </voice>
    </speak>
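
If the contour values are generated in code, one way to avoid dropping the sign is to format every percentage with an explicit "+" or "-". The sketch below is only an illustration under that assumption: `signed_percent`, `build_contour`, and `prosody_ssml` are hypothetical helper names, not Speech SDK functions, and the resulting fragment still has to be wrapped in the usual speak and voice elements.

```python
from xml.sax.saxutils import escape


def signed_percent(value):
    """Format a number as a percentage that always carries a sign, e.g. +0%."""
    return f"{value:+g}%"


def build_contour(targets):
    """Build a contour string from (position_percent, pitch_change_percent) pairs."""
    return " ".join(
        f"({position:g}%,{signed_percent(change)})" for position, change in targets
    )


def prosody_ssml(text, rate_percent, targets):
    """Return a prosody element with signed rate and contour values."""
    rate = signed_percent(rate_percent)
    contour = build_contour(targets)
    # escape() protects characters such as & and < inside the spoken text.
    return f'<prosody rate="{rate}" contour="{contour}">{escape(text)}</prosody>'


if __name__ == "__main__":
    print(prosody_ssml("可换乘", 50, [(60, 0), (100, 25)]))
```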

    Please try this and let me know if you have any issues.

    I hope this helps.

    Regards,
    Vasavi

    Please accept the answer and vote 'yes' if you found it helpful, to support the community. Thanks.

    1 person found this answer helpful.
