Hello Bhamin Patel,
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand that you are seeing inaccuracies in word-level and phoneme-level evaluation with Azure Speech Pronunciation Assessment.
First, enable miscue detection so that word insertions and omissions are identified; they are not captured by default. In the C# SDK this is done by passing enableMiscue: true when constructing the PronunciationAssessmentConfig. With the flag set, the service compares the spoken words against the reference text and reports an ErrorType such as Omission or Insertion for each affected word. Without it, the service focuses only on pronunciation quality and ignores extra or missing words. - https://learn.microsoft.com/azure/ai-services/speech-service/how-to-pronunciation-assessment
Second, pronunciation assessment must be treated separately from lexical correctness. Azure Speech evaluates how words sound, not whether the correct words were spoken. To detect substitutions like “they” instead of “there,” developers must compare the recognized text output from Speech-to-Text against the reference text using custom logic. - https://learn.microsoft.com/azure/ai-services/speech-service/speech-to-text
Third, phoneme-level scores should be interpreted as heuristic indicators, not absolute truth. Certain phoneme combinations (such as /th/ + vowel sounds) may be scored inconsistently due to internal alignment and acoustic modeling. For better inspection, developers should examine NBest phoneme alternatives rather than relying on a single phoneme score. - https://learn.microsoft.com/azure/ai-services/speech-service/how-to-pronunciation-assessment#phoneme-level-assessment
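To make the NBest inspection concrete, here is a minimal sketch that reads the detailed JSON result and lists phonemes where the best-scoring alternative differs from the expected phoneme. It assumes NBestPhonemeCount was set on the config, that the JSON was obtained from result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult), and that the payload follows the NBest > Words > Phonemes > PronunciationAssessment > NBestPhonemes layout shown in the documentation. PhonemeInspector and FindSuspectPhonemes are hypothetical names, and System.Text.Json may need to be added as a package on older .NET Framework projects.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class PhonemeInspector
{
    // Hypothetical helper: returns phonemes whose best-scoring NBest alternative
    // is not the phoneme the service expected at that position.
    public static List<(string Word, string Expected, string Heard, double Score)>
        FindSuspectPhonemes(string json)
    {
        var suspects = new List<(string Word, string Expected, string Heard, double Score)>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            // Assumed layout: the first NBest entry holds the word list.
            JsonElement best = doc.RootElement.GetProperty("NBest")[0];
            foreach (JsonElement word in best.GetProperty("Words").EnumerateArray())
            {
                // Words flagged as omitted may carry no phoneme list.
                if (!word.TryGetProperty("Phonemes", out JsonElement phonemes)) continue;
                string wordText = word.GetProperty("Word").GetString();

                foreach (JsonElement ph in phonemes.EnumerateArray())
                {
                    string expected = ph.GetProperty("Phoneme").GetString();
                    if (!ph.TryGetProperty("PronunciationAssessment", out JsonElement pa) ||
                        !pa.TryGetProperty("NBestPhonemes", out JsonElement nbest))
                        continue;

                    // Pick the alternative with the highest score; no ordering is assumed.
                    string heard = expected;
                    double top = double.MinValue;
                    foreach (JsonElement alt in nbest.EnumerateArray())
                    {
                        double score = alt.GetProperty("Score").GetDouble();
                        if (score > top)
                        {
                            top = score;
                            heard = alt.GetProperty("Phoneme").GetString();
                        }
                    }

                    if (heard != expected)
                        suspects.Add((wordText, expected, heard, top));
                }
            }
        }
        return suspects;
    }
}
```

A phoneme that shows up here is a candidate for manual review rather than a confirmed error, which is in line with treating the scores as heuristic indicators.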
Fourth, developers should design around known model limitations rather than expecting strict correctness. Phonetically similar word substitutions are often accepted by the model and scored highly by design. This behavior is intentional and reflects the service’s emphasis on spoken fluency and intelligibility, not semantic or grammatical validation.
A robust implementation combines configuration and post-processing logic, as shown below. The code enables miscue detection and applies pronunciation assessment; the recognized text can then be compared against the reference text downstream for exact word validation:
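// Requires: using Microsoft.CognitiveServices.Speech;
//           using Microsoft.CognitiveServices.Speech.PronunciationAssessment;
var config = new PronunciationAssessmentConfig(
    referenceText,
    GradingSystem.HundredMark,
    Granularity.Phoneme,
    enableMiscue: true   // report omitted and inserted words
);
// Attach the assessment settings to the recognizer before starting recognition
config.ApplyTo(speechRecognizer);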
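For the exact word validation step, one possible approach is a word-level edit-distance alignment between the reference text and the recognized text. The sketch below is not part of the Speech SDK: WordDiff, Tokenize, and Compare are hypothetical names, and the tokenization (lowercasing, stripping punctuation other than apostrophes) is an assumption you may need to adapt to your language and content.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordDiff
{
    // Lowercase the text and split it into words, dropping punctuation except apostrophes.
    public static string[] Tokenize(string text)
    {
        char[] cleaned = text.ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) || c == '\'' ? c : ' ')
            .ToArray();
        return new string(cleaned).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Aligns the two word sequences and labels each position as
    // Match, Substitution, Omission (missing from speech), or Insertion (extra in speech).
    // An empty string marks the side that has no word at that position.
    public static List<(string Kind, string Reference, string Recognized)> Compare(
        string referenceText, string recognizedText)
    {
        string[] r = Tokenize(referenceText);
        string[] h = Tokenize(recognizedText);

        // Standard edit-distance table over words instead of characters.
        var d = new int[r.Length + 1, h.Length + 1];
        for (int i = 0; i <= r.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= h.Length; j++) d[0, j] = j;
        for (int i = 1; i <= r.Length; i++)
            for (int j = 1; j <= h.Length; j++)
            {
                int cost = r[i - 1] == h[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(d[i - 1, j - 1] + cost,
                          Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1));
            }

        // Walk back from the bottom-right corner to recover the operations.
        var ops = new List<(string Kind, string Reference, string Recognized)>();
        int a = r.Length, b = h.Length;
        while (a > 0 || b > 0)
        {
            if (a > 0 && b > 0 && d[a, b] == d[a - 1, b - 1] + (r[a - 1] == h[b - 1] ? 0 : 1))
            {
                string kind = r[a - 1] == h[b - 1] ? "Match" : "Substitution";
                ops.Add((kind, r[a - 1], h[b - 1]));
                a--; b--;
            }
            else if (a > 0 && d[a, b] == d[a - 1, b] + 1)
            {
                ops.Add(("Omission", r[a - 1], ""));
                a--;
            }
            else
            {
                ops.Add(("Insertion", "", h[b - 1]));
                b--;
            }
        }
        ops.Reverse();
        return ops;
    }
}
```

Calling WordDiff.Compare(referenceText, result.Text) after recognition would, for example, surface a "there" to "they" substitution even when the pronunciation score for that word is high. Note that result.Text is the display form of the transcript, so numbers or abbreviations may be normalized differently from your reference text.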
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting this reply and accepting it as the answer if it is helpful.