Thanks for your patience, @Risako. Below is the product team's response:
Confidence is a standard speech-to-text metric: the speech engine reports it to indicate how well it believes it matched what was said to a known word. The Windows Speech API has similar documentation at https://learn.microsoft.com/en-us/previous-versions/office/developer/speech-technologies/dd186108%28v%3doffice.14%29 that does a pretty good job of describing confidence.
On https://learn.microsoft.com/en-us/azure/media-services/previous/media-services-index-content we do talk about Recognizability. This is more closely related to the quality of the audio stream. Background noise or music makes speech-to-text more difficult. The higher the rating, the clearer the human speech is. Consider this a baseline metric, with Confidence sitting on top of it. If Recognizability is high, we have a good starting point, and that can drive Confidence higher. If Recognizability is low, for example when we're trying to parse a heavy metal song for its lyrics, then Confidence will certainly be affected.
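To illustrate how the two metrics might be used together, here is a minimal Python sketch. It is not the indexer's actual output format or API: the `Word` class, the `words_to_review` function, the 0 to 1 scale, and both thresholds are assumptions made up for this example. The idea is simply that Recognizability acts as a gate on the whole recording, and Confidence is then checked word by word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container for one recognized word (not a real API type)."""
    text: str
    confidence: float  # assumed to be on a 0.0-1.0 scale


def words_to_review(words, recognizability,
                    min_recognizability=0.5, min_confidence=0.6):
    """Return the words a human should double-check.

    Both thresholds are illustrative assumptions, not documented values.
    """
    # Low Recognizability means the audio itself is poor (noise, music),
    # so no per-word Confidence can be trusted: flag everything.
    if recognizability < min_recognizability:
        return list(words)
    # Otherwise the audio is a good starting point; flag only weak words.
    return [w for w in words if w.confidence < min_confidence]


words = [Word("hello", 0.92), Word("whirled", 0.41)]
clean_audio = words_to_review(words, recognizability=0.8)
noisy_audio = words_to_review(words, recognizability=0.2)
```

With clean audio only the low-confidence word is flagged; with noisy audio the whole transcript is flagged, regardless of the individual scores.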
Hope that helps. Please let us know if you have further questions.
Thanks,
Grace