File encoding requirements for Azure Speech test transcripts

Martin Müller 1

Text files used to train a custom language model have to be UTF-8 with BOM, but there is no documentation on the required format of the transcripts used to create speech test datasets.
So I naturally assumed that those files should also be UTF-8 with BOM. However, while analyzing the WER calculation, I found that synthetic audio generation does not produce accurate results for this encoding.

The very first sentence began with an "i a" that I could not find in the transcript (there, the very first sentence starts with just "A").
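One possible explanation, which is only a guess and not confirmed by anything in the service's behavior: the UTF-8 BOM is the three bytes EF BB BF, and if those bytes are not stripped, or are decoded with a single-byte code page, they turn into extra characters in front of the first word. The Python sketch below only illustrates what the bytes look like under different decodings; the choice of cp1252 is an assumption made purely for illustration.

```python
import codecs

# The UTF-8 BOM is the byte sequence EF BB BF.
bom = codecs.BOM_UTF8

# "utf-8-sig" writes the BOM in front of the encoded text.
text_with_bom = "A".encode("utf-8-sig")

# Decoding as plain UTF-8 keeps the BOM as an invisible U+FEFF character.
as_utf8 = text_with_bom.decode("utf-8")

# Decoding as "utf-8-sig" strips the BOM again.
as_utf8_sig = text_with_bom.decode("utf-8-sig")

# Assumption for illustration only: a consumer that misreads the bytes as
# Windows-1252 sees three visible characters before the "A".
misread = text_with_bom.decode("cp1252")

print(repr(as_utf8), repr(as_utf8_sig), repr(misread))
```

If something like the last case happens before speech synthesis, a leading "ï" could plausibly be spoken as an extra "i", but again this is speculation.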

A cross-check (uploading the same file, but as UTF-8 without BOM) confirmed that automatic audio synthesis does not handle transcripts in UTF-8 with BOM correctly.
Omitting the BOM gave the correct first sentence.
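As a workaround in the meantime, the BOM can be removed from a transcript before uploading it. This is a minimal Python sketch using only the standard library; the function names and the sample text are hypothetical and chosen just for this example.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical transcript content, encoded with a BOM for demonstration.
sample = "A first sentence.\n".encode("utf-8-sig")
cleaned = strip_utf8_bom(sample)

print(has_utf8_bom(sample), has_utf8_bom(cleaned))
```

For a real file, read it in binary mode (`open(path, "rb")`), pass the bytes through `strip_utf8_bom`, and write the result back in binary mode so no other bytes are changed.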

Could you please file and fix this bug?

Thanks,
Martin

YutongTie-MSFT 46,991 Reputation points

2022-05-03T17:43:38.69+00:00

Hello @Martin Müller

Thanks for reaching out to us. Could you please provide more samples of this? I am trying to give the product group more details so that they can fix it.

Regards,
Yutong

YutongTie-MSFT 46,991 Reputation points

2022-05-23T22:26:35.11+00:00

Hello @Martin Müller

Is there any update on this case, so that we can resolve the issue and fix the bug? Thanks a lot.

Regards,
Yutong