N-Best and Confidence for Speech to Text

영주 홍 21 Reputation points
2022-02-24T02:22:31.97+00:00

Hi, below is the STT result with the base model.

"nBest": [  
  {
    ... first ...
    "confidence": 0.84337946
  },
  {
    ... second ...
    "confidence": 0.85337946
  },
  ...
]
  

I think 'combinedRecognizedPhrases' chooses the first sentence out of the 5 n-best results.

But sometimes other sentences have a higher confidence value.
Why were sentences with lower confidence chosen?

Also, is there any way to remove words that have a low confidence value?

Also, regarding 'Test models' (ignore the red circle):

177335-image.png

The texts of model 1 and model 2 are from 'txt_lexical', which is slightly different from 'txt_display'. (I think 'txt_lexical' doesn't contain the first n-best data.)

When I recognize speech using the Speech SDK, it produces 'txt_display' data.
Then why does Speech Studio show 'txt_lexical' data?
If I want to calculate the WER of the display data, do I have to do it manually?

Thanks in advance!

Azure AI Speech

1 answer

  1. romungi-MSFT 45,731 Reputation points Microsoft Employee
    2022-02-24T11:32:30.84+00:00

    @영주 홍 The first result in the nBest array is the best result, regardless of its confidence value. This is by design, and when you use the detailed format you cannot limit the response to remove any of the results. You can refer to the FAQ of the Speech service, which documents this behavior.
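
Since the service itself does not trim the response, any re-ranking or filtering would have to happen on the client after the result arrives. The following is only a minimal sketch of that idea, not a documented Speech service feature. It assumes each nBest entry is a dict with a "confidence" value, a "display" string, and an optional "words" list whose items carry their own "word" and "confidence" fields; that layout, the helper names, and the sample data are all hypothetical, so check them against your actual output. Note also that picking the entry with the highest confidence is not guaranteed to be more accurate than the first entry the service returns.

```python
from typing import Any, Dict, List


def highest_confidence_entry(n_best: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the nBest entry with the largest 'confidence' value.

    This is a client-side choice; the service still ranks the first entry as best.
    """
    return max(n_best, key=lambda entry: entry.get("confidence", 0.0))


def drop_low_confidence_words(entry: Dict[str, Any], threshold: float) -> List[str]:
    """Keep only the words whose confidence is at or above the threshold.

    Words without a 'confidence' field are kept, since nothing is known about them.
    """
    return [
        w["word"]
        for w in entry.get("words", [])
        if w.get("confidence", 1.0) >= threshold
    ]


# Hypothetical sample shaped like the nBest array in the question.
sample_n_best = [
    {
        "confidence": 0.84337946,
        "display": "Hello world.",
        "words": [
            {"word": "hello", "confidence": 0.95},
            {"word": "world", "confidence": 0.40},
        ],
    },
    {
        "confidence": 0.85337946,
        "display": "Hello word.",
        "words": [
            {"word": "hello", "confidence": 0.95},
            {"word": "word", "confidence": 0.60},
        ],
    },
]

best = highest_confidence_entry(sample_n_best)
kept_words = drop_low_confidence_words(sample_n_best[0], threshold=0.5)
```

With this sample, `best` is the second entry and `kept_words` contains only "hello". The threshold of 0.5 is arbitrary and would need tuning on your own audio.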

    Regarding the behavior of Speech Studio, we will get back to you after checking internally, because this could be an implementation bug or a design choice to display the best result based on other settings.

    I think WER data is shown if audio + human-labeled transcript data is used, as per the documentation.
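
If the WER of the display text has to be computed by hand, one common approach is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below illustrates that standard formula only; it is not how Speech Studio computes its numbers, and the function names and the normalization step (lowercasing, stripping punctuation) are assumptions. Set `normalize_text=False` to score the display form as-is, so that casing and punctuation differences count as errors.

```python
import re
from typing import List


def tokenize(text: str, normalize_text: bool = True) -> List[str]:
    """Split text into words, optionally lowercasing and removing punctuation."""
    if normalize_text:
        # Replace everything except word characters, whitespace, and apostrophes.
        text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str, normalize_text: bool = True) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = tokenize(reference, normalize_text)
    hyp = tokenize(hypothesis, normalize_text)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the dog sat")` counts one substitution over three reference words. Because the result depends heavily on how the text is tokenized and normalized, values from this sketch may not match the ones shown in Speech Studio.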

