Is there a way to control the desired duration of a speech using Text to Speech from Azure Cognitive Services?

Question

Victor Kzam 1

Hello there. I am building an application that recognizes speech, transcribes it to text, translates it, and then regenerates the speech in a different language. A human supervises the process to increase accuracy.

However, I am finding it very difficult to control the audio output, especially the duration of each sentence or paragraph. Following the documentation, the application generates speech in other languages using <prosody duration="XXXXms"> for each sentence.

However, the output does not come out as desired. For example, I have a file containing 22 paragraphs whose audio should take 03min43s, but the application seems to ignore the fact that each paragraph is arranged like <p><prosody duration="4000ms">O objeto.</prosody></p> and speaks at the natural pace instead, which results in a 02min22s file.
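To double-check the target length, here is a minimal sketch (not part of the application itself) that sums the duration values requested in the SSML and formats them the same way as the figures above. The function names are hypothetical, and it assumes durations are written as plain numbers followed by ms or s.

```python
import re

def total_requested_ms(ssml: str) -> int:
    """Sum every <prosody duration="..."> value in the SSML text, in milliseconds."""
    total = 0.0
    # Assumption: durations look like "4000ms" or "4s" (number followed by a unit).
    for value, unit in re.findall(r'<prosody\s+duration="(\d+(?:\.\d+)?)(ms|s)"', ssml):
        total += float(value) * (1000 if unit == "s" else 1)
    return int(total)

def fmt(ms: int) -> str:
    """Format milliseconds as MMminSSs, matching the notation used in this post."""
    minutes, seconds = divmod(round(ms / 1000), 60)
    return f"{minutes:02d}min{seconds:02d}s"

sample = '<p><prosody duration="4000ms">O objeto.</prosody></p>'
print(total_requested_ms(sample), fmt(total_requested_ms(sample)))
```

Comparing this total with the length of the generated file shows whether the duration attributes are being honored at all.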

Any ideas why this might be happening? And what should I do to avoid this outcome?

Thank you very much for any help that you guys may provide.

GiftA-MSFT 11,176 Reputation points

2020-11-09T21:37:23.043+00:00

Hi, thanks for reaching out. Are you converting text-to-speech using standard, neural, or custom voices?

Answer 1

GiftA-MSFT 11,176

The duration attribute is supported only for standard voices. Please make sure you are using a standard voice for your scenario. Thanks.
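As a rough illustration of what that could look like, the sketch below builds an SSML document that wraps each paragraph in the same <p><prosody duration="...ms"> structure from the question, inside a <voice> element. The helper name build_ssml is hypothetical, and the voice name pt-BR-HeloisaRUS is an assumption used here as an example of a standard (non-neural) voice; please check the voice list for your region before relying on it.

```python
from xml.sax.saxutils import escape

def build_ssml(paragraphs, voice="pt-BR-HeloisaRUS", lang="pt-BR"):
    """Build an SSML string with one timed paragraph per (text, milliseconds) pair.

    Assumption: `voice` names a standard voice; the default is illustrative only.
    """
    body = "".join(
        # Escape the text so characters like & or < do not break the XML.
        f'<p><prosody duration="{int(ms)}ms">{escape(text)}</prosody></p>'
        for text, ms in paragraphs
    )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{body}</voice>'
        "</speak>"
    )

ssml = build_ssml([("O objeto.", 4000)])
print(ssml)
```

The resulting string can then be passed to whatever synthesis call the application already uses for SSML input.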
