SSML prosody tag

Lucia Pozzan 1 Reputation point
2021-07-07T19:10:10.223+00:00

According to SSML 1.1 (https://www.w3.org/TR/speech-synthesis11/), the rate attribute of the prosody element should only accept a non-negative percentage or one of the predefined labels:

"rate: a change in the speaking rate for the contained text. Legal values are: a non-negative percentage or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When the value is a non-negative percentage it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well."

However, Microsoft TTS does not seem to handle the prosody rate this way: <prosody rate="30.00%"> plays faster than the default (100%) rate, as if it were interpreted as <prosody rate="+30.00%">.
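
To make the discrepancy concrete, here is a minimal Python sketch of the two readings. The function names are my own, and the "relative change" reading is only my guess at what the service does, based on what I hear; it is not taken from any service code.

```python
def parse_percent(rate: str) -> float:
    """Return the numeric part of a percentage string such as '30.00%' or '+30.00%'."""
    if not rate.endswith("%"):
        raise ValueError(f"not a percentage: {rate!r}")
    return float(rate[:-1])


def multiplier_per_spec(rate: str) -> float:
    """SSML 1.1 reading: a non-negative percentage multiplies the default rate."""
    value = parse_percent(rate)
    if value < 0:
        raise ValueError("SSML 1.1 only allows non-negative percentages")
    return value / 100.0


def multiplier_as_relative_change(rate: str) -> float:
    """Assumed observed reading: the percentage is added to the default rate."""
    return 1.0 + parse_percent(rate) / 100.0


# What I expect from the specification: 30% of the default rate.
print(multiplier_per_spec("30.00%"))
# What the output sounds like: 30% faster than the default rate.
print(multiplier_as_relative_change("30.00%"))
```

Under the first reading the speech should be much slower than the default; under the second it is faster, which matches what I observe.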

Is this a bug or a conscious decision to depart from the SSML standard? Is there a way to force the attribute to be interpreted as the specification intends?
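
If the service really does treat the percentage as a signed change relative to the default rate (an assumption on my part), one possible workaround would be to convert the SSML 1.1 multiplier value before building the tag. A rough Python sketch, with a hypothetical helper name:

```python
def absolute_to_relative(rate: str) -> str:
    """Convert an SSML 1.1 multiplier percentage into a signed relative change.

    Assumes the target service reads 'N%' as 'default rate plus N percent'.
    """
    if not rate.endswith("%"):
        raise ValueError(f"not a percentage: {rate!r}")
    value = float(rate[:-1])
    if value < 0:
        raise ValueError("SSML 1.1 only allows non-negative percentages")
    # 100% means "no change", so the relative change is the distance from 100.
    return f"{value - 100.0:+.2f}%"


print(absolute_to_relative("30.00%"))  # -70.00%
print(absolute_to_relative("200%"))    # +100.00%
print(f'<prosody rate="{absolute_to_relative("50%")}">some text</prosody>')
```

I would still prefer not to rewrite every tag this way if there is a setting that makes the service follow the specification.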

Azure AI Speech
An Azure service that integrates speech processing into apps and services.