The duration attribute is supported only for standard voices. Please make sure you are using a standard voice for your scenario. Thanks.
Is there a way to control the desired duration of a speech using Text to Speech from Azure Cognitive Services?
Hello there. I am building an application that recognizes speech, transcribes it to text, translates it, and then regenerates the speech in a different language, with human supervision throughout to improve accuracy.
However, I am finding it very difficult to control the audio output, especially the duration of each sentence or paragraph. Following this documentation, the application generates speech in other languages using <prosody duration="XXXXms"> for each sentence.
However, the output does not come out as desired. For example, I have a file containing 22 paragraphs that should take 03min43s to speak. Each paragraph is arranged like <p><prosody duration="4000ms">O objeto.</prosody></p>, but the application seems to ignore the duration and takes the natural time to generate the audio, which results in a 02min22s file.
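To make the setup easier to reproduce, here is a minimal Python sketch of how the SSML could be assembled and how the requested total could be compared against the generated audio. It uses only the standard library and does not call the Speech service. The helper names (build_ssml, requested_seconds, wav_seconds), the voice name placeholder, and the pt-BR default are assumptions made for illustration, not part of the original application.

```python
import wave
from xml.sax.saxutils import escape, quoteattr


def build_ssml(paragraphs, voice_name, lang="pt-BR"):
    """Wrap each (text, duration_ms) pair in <p><prosody duration="...ms">.

    voice_name is a placeholder supplied by the caller; it is an assumption,
    not a value taken from the original post.
    """
    parts = []
    for text, duration_ms in paragraphs:
        # Escape &, < and > so the paragraph text cannot break the XML.
        parts.append(
            '<p><prosody duration="%dms">%s</prosody></p>'
            % (duration_ms, escape(text))
        )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        "xml:lang=%s><voice name=%s>%s</voice></speak>"
        % (quoteattr(lang), quoteattr(voice_name), "".join(parts))
    )


def requested_seconds(paragraphs):
    """Sum of the durations requested in the SSML, in seconds."""
    return sum(duration_ms for _, duration_ms in paragraphs) / 1000.0


def wav_seconds(wav_file):
    """Actual length of a PCM WAV file (path or file-like object), in seconds."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / float(w.getframerate())
```

Comparing requested_seconds(paragraphs) with wav_seconds("output.wav") (a hypothetical file name) would show the gap directly. In the case above that is 223 s requested versus 142 s produced.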
Any ideas why this might be happening? And what should I do to avoid this outcome?
Thank you very much for any help that you guys may provide.