Am I right that the Long Audio API is very clunky and limited when it comes to voice selection? Am I right to eschew it in favor of using the regular synthesizer to make short files and stitch them together?
Am I right that even if you configure a bunch of separate Speech resources, each in a different region, you still can't access all of the neural voices through Long Audio? In my tests so far the voices returned by get_voices() — I'm on Python — are (for each region that returns any) a seemingly random set. In a resource configured for the 'centralindia' region I get no Hindi or other Indian voices. At the moment I need Hindi, Mandarin Chinese, Norwegian Bokmål, and English in various flavors.
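In case it helps anyone reproduce this, here's roughly how I've been auditing what each region exposes. This queries the standard voices/list REST endpoint (not Long Audio's own voices call) per region and filters by locale; the endpoint URL and header are from the Azure TTS REST docs, and the region/key values are placeholders you'd swap in.

```python
import json
import urllib.request

def fetch_voices(region, key):
    """Return the JSON voice list for one region (makes a network call)."""
    req = urllib.request.Request(
        f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list",
        headers={"Ocp-Apim-Subscription-Key": key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def voices_for_locales(voices, prefixes):
    """Filter a voice list down to the locales you care about."""
    return sorted(
        v["ShortName"]
        for v in voices
        if any(v["Locale"].startswith(p) for p in prefixes)
    )

# Usage, e.g. for my locale list:
#   voices_for_locales(fetch_voices("centralindia", KEY),
#                      ["hi-IN", "zh-CN", "nb-NO", "en-"])
```

Comparing that output against what Long Audio's get_voices() returns for the same resource is what makes the gap obvious.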
I should forget Long Audio, right? Or is there a secret set of steps that gets you full access to all the voices?
I was very happy with my initial results with the Speech SDK's synthesizer and am Microsoft-leaning, so I haven't investigated the competition yet. Does anyone know whether Amazon Polly or Google Cloud Text-to-Speech offers a simpler path to voice/language options when converting long text files to speech?