MartinMller-5840 asked YutongTie-MSFT commented

File encoding requirements for Azure Speech test transcripts

Text files used to train a custom language model must be encoded as UTF-8 with BOM, but the required format for transcripts used to create test speech datasets is not documented.
I therefore assumed that those files should also be UTF-8 with BOM. While analyzing the WER calculation, however, I found that synthetic audio generation does not produce accurate results for this encoding.

The very first sentence began with an "i a" that does not appear in the text (the first sentence actually starts with just "A").

A cross-check (uploading the same file, but as UTF-8 without BOM) confirmed that automatic audio synthesis does not handle transcripts in UTF-8 with BOM correctly.
Omitting the BOM gave the correct first sentence.
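
Until this is fixed, one possible workaround is to strip the BOM from the transcript before uploading it. The following is only a minimal sketch of that idea: the helper names `strip_utf8_bom` and `rewrite_without_bom` are hypothetical and not part of any Azure SDK, and it assumes the transcript is otherwise valid UTF-8.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(src: Path, dst: Path) -> bool:
    """Copy src to dst without a leading BOM (hypothetical helper).

    Returns True if a BOM was found and removed, False otherwise.
    """
    raw = src.read_bytes()
    cleaned = strip_utf8_bom(raw)
    dst.write_bytes(cleaned)
    return cleaned != raw
```

Calling `rewrite_without_bom(Path("transcript.txt"), Path("transcript_nobom.txt"))` (file names are placeholders) would produce a copy that can be uploaded as plain UTF-8. Working on raw bytes avoids any re-encoding of the transcript text itself.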

Could you please file and fix this bug?

Thanks,
Martin


Hello @MartinMller-5840

Thanks for reaching out to us. Could you please provide more samples related to this issue? I would like to pass more details to the product group so that they can fix it.

Regards,
Yutong


Hello @MartinMller-5840

Is there any update on this case, so that we can investigate the issue and fix the bug? Thanks a lot.

Regards,
Yutong
