File encoding requirements for Azure Speech test transcripts

Martin Müller
2022-05-02T11:54:23.44+00:00

Text files used to train a custom language model have to be UTF-8 with BOM, but there's no documentation on the format of transcripts used to create speech test datasets.
So naturally I assumed those files should also be UTF-8 with BOM, but while analyzing the WER (word error rate) calculation I found that synthetic audio generation doesn't produce accurate results for this encoding.

In the result, the very first sentence began with "i a", which I could not find in the text (the actual first sentence starts with just "A").

A cross-check (uploading the same file, but as UTF-8 without BOM) confirmed that automatic audio synthesis doesn't handle transcripts in UTF-8 with BOM correctly.
Omitting the BOM gave the correct first sentence.
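
For anyone who wants to reproduce the workaround, here is a minimal sketch of stripping the BOM before upload. It assumes the transcript is an ordinary UTF-8 file; the function names are hypothetical helpers for illustration and not part of any Azure SDK.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(src_path: str, dst_path: str) -> bool:
    """Copy src_path to dst_path without a leading BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(src_path, "rb") as f:
        raw = f.read()
    cleaned = strip_utf8_bom(raw)
    with open(dst_path, "wb") as f:
        f.write(cleaned)
    return cleaned != raw
```

Working on raw bytes keeps the rest of the file untouched, so only the three-byte marker at the start is dropped.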

Could you please file and fix this bug?

Thanks,
Martin
