File encoding requirements for Azure Speech test transcripts
Text files used to train a custom language model have to be encoded as UTF-8 with BOM, but the format of transcripts used to create test speech datasets is not documented.
So I assumed those files should also be UTF-8 with BOM. While analyzing the WER calculation, however, I found that synthetic audio generation doesn't produce accurate results for this encoding.
In the WER results, the very first sentence began with "i a", which does not appear in the transcript (there, the first sentence starts with just "A").
Performing a cross check (by uploading the same file, but as UTF-8 WITHOUT BOM) confirmed that automatic audio synthesis doesn't handle transcripts given in UTF-8 BOM correctly.
Omitting the BOM gave the correct first sentence.
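In case it helps anyone hitting the same issue, here is a minimal Python sketch of how a leading BOM could be stripped from a transcript before upload. The function name `strip_utf8_bom` and the sample text are only illustrations of mine, not part of any Azure SDK or tool; the only fact relied on is that the UTF-8 BOM is the byte sequence EF BB BF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    # Hypothetical helper for illustration; not an Azure API.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample transcript bytes as a "UTF-8 with BOM" editor setting would save them.
with_bom = codecs.BOM_UTF8 + "A first sentence.".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)
print(without_bom.decode("utf-8"))
```

For a real file, the same function could be applied to the bytes read with `open(path, "rb")` and the result written back out. Alternatively, reading the file with `encoding="utf-8-sig"` and writing it with `encoding="utf-8"` should also drop the BOM.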
Could you please file and fix this bug?
Thanks,
Martin