Hello @William
Thanks for using the Microsoft Q&A platform. I think human-labeled transcriptions in custom training can help in your case.
Human-labeled transcriptions are word-by-word transcriptions of an audio file. You use them to improve recognition accuracy, especially when words are deleted or incorrectly replaced. They can also help reduce misrecognition caused by accents.
A large sample of transcription data is required to improve recognition. We suggest providing between 1 and 20 hours of audio data. The Speech service will use up to 20 hours of audio for training. The guide linked below is broken up by locale, with sections for US English, Mandarin Chinese, and German.
Please check this guidance and give it a try: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-human-labeled-transcriptions#en-us
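If it helps, here is a minimal Python sketch for preparing the data before upload. It checks that the total audio length falls in the suggested 1 to 20 hour range and writes a single transcription file. It assumes the layout I recall from the guide above (one line per audio file, file name and transcript separated by a tab, saved as UTF-8 with a byte order mark), so please verify that against the guide for your locale. The function names and the file names in the usage note are hypothetical, for illustration only.

```python
import wave
from pathlib import Path


def total_audio_hours(wav_paths):
    """Sum the duration of the given WAV files, in hours."""
    seconds = 0.0
    for path in wav_paths:
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600.0


def write_transcription_file(transcripts, out_path):
    """Write one 'filename<TAB>transcript' line per audio file.

    `transcripts` maps an audio file name to its human-labeled text.
    Returns the number of lines written.
    """
    lines = [
        f"{name}\t{text.strip()}"
        for name, text in sorted(transcripts.items())
    ]
    # Assumption: the service expects UTF-8 with a BOM; "utf-8-sig" adds one.
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8-sig")
    return len(lines)


def within_suggested_range(hours, low=1.0, high=20.0):
    """True if the audio total is inside the suggested 1 to 20 hour range."""
    return low <= hours <= high
```

For example, you could call `write_transcription_file({"audio1.wav": "your transcript here"}, "trans.txt")` and then pass your list of WAV paths to `total_audio_hours` to see whether you have enough audio. Audio beyond 20 hours is not used for training, so there is no benefit to uploading more than that.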
Let me know if you have more questions; we are happy to help.
Regards,
Yutong
Please accept the answer if you find it helpful, as this supports the community. Thanks a lot.