Explicit definition for AudioAnalyzer confidence

Question

What is the definition of confidence for the AudioAnalyzer? The description of the audio analyzer is here, but it doesn't mention confidence: https://learn.microsoft.com/en-us/azure/media-services/latest/analyze-video-audio-files-concept
I found a very short description here, but it is for the Cognitive Services speech-to-text API, not the Media Services audio analyzer: https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.detailedspeechrecognitionresult.confidence?view=azure-dotnet

When we apply AudioAnalyzerTranscriptAudio to a wav file, it gives us a file called transcript.vtt with content like the following:

WEBVTT
NOTE duration:"00:42:09"
NOTE recognizability:0.806
NOTE language:en-us
NOTE Confidence: 0.8830942
00:00:22.900 --> 00:00:25.550
[some utterance]
NOTE Confidence: 0.746228337142857
00:00:27.730 --> 00:00:28.936
[some utterance]

What does confidence mean here, and how is it different from recognizability? Also, do you know which factors affect the confidence score (pitch, volume, speed, etc.)?

Our current assumption is that the confidence is the probability that the transcript is correct given certain parameters. But the documents say recognizability scores how recognizable the speech in a sound file is, and it's not clear what the difference is. And when the confidence drops, we don't know what could be causing it: the content of what someone is saying, the volume, the accent, etc.

Answer

Thanks for your patience, @Risako. Below is the product team's response:

Confidence is a typical speech-to-text metric that the speech engine reports to indicate how accurately it thinks it matched what was said to a known word. There are similar docs for the Windows Speech API at https://learn.microsoft.com/en-us/previous-versions/office/developer/speech-technologies/dd186108%28v%3doffice.14%29 that do a pretty good job of describing confidence.

On https://learn.microsoft.com/en-us/azure/media-services/previous/media-services-index-content we do talk about Recognizability. This is more closely related to the quality of the audio stream. If there is background noise, or maybe music, it makes the task of doing speech to text more difficult. The higher the rating, the clearer the human speech is. Consider this a baseline metric, with Confidence sitting on top of it. If Recognizability is high, we have a good starting point, and that can drive Confidence higher. If Recognizability is low, such as when we're trying to parse a heavy metal song for its lyrics, then Confidence will absolutely be impacted.
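
If it helps to look at the two values side by side, here is a minimal Python sketch that pulls the file-level Recognizability and the per-utterance Confidence out of a transcript.vtt. It only assumes the layout shown in the excerpt in the question (a "NOTE Confidence:" line directly before each cue); the function name parse_transcript and the 0.8 cutoff are made up for illustration and are not part of any Media Services SDK.

```python
import re

# Sample text copied from the transcript.vtt excerpt in the question.
SAMPLE_VTT = """WEBVTT
NOTE duration:"00:42:09"
NOTE recognizability:0.806
NOTE language:en-us
NOTE Confidence: 0.8830942
00:00:22.900 --> 00:00:25.550
[some utterance]
NOTE Confidence: 0.746228337142857
00:00:27.730 --> 00:00:28.936
[some utterance]
"""

# "NOTE key: value" lines and "start --> end" cue timing lines.
NOTE_RE = re.compile(r"^NOTE\s+(\w+)\s*:\s*(.+)$")
CUE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}\.\d{3})"
)


def parse_transcript(text):
    """Hypothetical helper: return (header_notes, cues) from VTT text."""
    header = {}   # file-level notes such as recognizability and language
    cues = []     # one dict per utterance
    pending_confidence = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        note = NOTE_RE.match(line)
        cue = CUE_RE.match(line)
        if note:
            current = None
            key = note.group(1).lower()
            value = note.group(2).strip().strip('"')
            if key == "confidence":
                # Assumption: this note applies to the next cue.
                pending_confidence = float(value)
            else:
                header[key] = value
        elif cue:
            current = {
                "start": cue.group(1),
                "end": cue.group(2),
                "confidence": pending_confidence,
                "text": "",
            }
            cues.append(current)
            pending_confidence = None
        elif line and current is not None:
            # Any non-empty line after a timing line is utterance text.
            current["text"] = (current["text"] + " " + line).strip()
    return header, cues


header, cues = parse_transcript(SAMPLE_VTT)
recognizability = float(header["recognizability"])

# 0.8 is an arbitrary illustrative cutoff, not a documented threshold.
low_confidence = [c for c in cues if c["confidence"] < 0.8]
```

With both values extracted, you can check whether low-Confidence utterances cluster in files with low Recognizability (which would point to audio quality) or also show up in files with high Recognizability (which would point to something about the speech itself).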

Hope that helps. Please let us know if you have further questions.

Thanks,
Grace

