Am I right that the Long Audio API is very clunky and limited when it comes to voice selection? Am I right to eschew it in favor of using the regular synthesizer to make short files and stitch them together?
Am I right that even if you configure a bunch of separate Speech resources, each in a different region, you still can't access all of the neural voices through Long Audio? In my tests so far the voices returned by get_voices() — I'm on Python — are (for each region that returns any) a seemingly random set. In a resource configured for the 'centralindia' region I get no Hindi or other Indian voices. At the moment I need Hindi, Mandarin Chinese, Norwegian Bokmål, and English in various flavors.
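In case it helps anyone reproduce this, here's roughly how I've been auditing what each region exposes. This queries the standard voices/list REST endpoint (not Long Audio's own voices call) per region and filters by locale; the endpoint URL and header are from the Azure TTS REST docs, and the region/key values are placeholders you'd swap in.

```python
import json
import urllib.request

def fetch_voices(region, key):
    """Return the JSON voice list for one region (makes a network call)."""
    req = urllib.request.Request(
        f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list",
        headers={"Ocp-Apim-Subscription-Key": key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def voices_for_locales(voices, prefixes):
    """Filter a voice list down to the locales you care about."""
    return sorted(
        v["ShortName"]
        for v in voices
        if any(v["Locale"].startswith(p) for p in prefixes)
    )

# Usage, e.g. for my locale list:
#   voices_for_locales(fetch_voices("centralindia", KEY),
#                      ["hi-IN", "zh-CN", "nb-NO", "en-"])
```

Comparing that output against what Long Audio's get_voices() returns for the same resource is what makes the gap obvious.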
I should forget Long Audio, right? Or is there a secret set of steps that gets you full access to all the voices?
I was very happy with my initial results with the Speech SDK's synthesizer and am Microsoft-leaning, so I haven't investigated the competition yet. Does anyone know whether Amazon Polly or Google Cloud Text-to-Speech offers a simpler path to voice/language options when converting long text files to speech?