Azure Speech Pronunciation Assessment: Word and Phoneme level evaluation inaccuracies

Bhamin Patel 20 Reputation points
2026-01-14T09:38:29.73+00:00

Testing Context
Pronunciation assessment scenario: Reading
Locale: en-US
Record mic audio (16 kHz, mono, PCM WAV)
gradingSystem: HundredMark
granularity: Phoneme
phonemeAlphabet: SAPI
nbestPhonemeCount: 5
Testing Environment Details: The issue reproduces both in our application's microsoft-cognitiveservices-speech-sdk v1.45.0 integration and in the Azure Speech Pronunciation Assessment Playground.

Issue
While implementing a student activity that uses Azure Speech for Pronunciation Assessment, we observed two major issues:

  1. Incorrect word substitution not flagged as Mispronunciation. When a user speaks a different word than the reference text, the API often returns a high overall pronunciation score and does not mark the word as “Mispronounced.” Example:
    • Reference text: "there"
    • Spoken word: "they"
    • Result: High pronunciation score, no mispronunciation error. This behavior is common for many words and undermines the reliability of word-level pronunciation evaluation.
    [Screenshot: result when saying 'they' in place of 'there']
    [Screenshot: result when saying 'apple' in place of 'pineapple']
  2. Low N-Best phoneme scores for /th/ in “thigh” despite correct pronunciation. For the word “thigh” (/th ay/), the API consistently returns low phoneme accuracy and N-Best scores for /th/ even when it is pronounced correctly.
    1. In some cases, the API even accepts just the vowel sound “ay” as correct pronunciation for 'thigh'.
    2. Other “th” words (e.g., “through”, “think”) behave normally, suggesting the issue is specific to the /th ay/ sequence.
Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Answer accepted by question author
  1. Sina Salam 27,786 Reputation points Volunteer Moderator
    2026-01-23T12:28:46.1666667+00:00

    Hello Bhamin Patel,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are seeing word- and phoneme-level evaluation inaccuracies with Azure Speech Pronunciation Assessment.

    First, enable miscue detection to correctly identify word insertions and omissions, which are not captured by default. This is done by setting EnableMiscue = true in the PronunciationAssessmentConfig. Without this flag, the service focuses only on pronunciation quality and ignores extra or missing words. - https://learn.microsoft.com/azure/ai-services/speech-service/how-to-pronunciation-assessment
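    For the JavaScript SDK you are using, the same settings can be expressed as a JSON configuration string and passed to PronunciationAssessmentConfig.fromJSON. A minimal sketch of that configuration object follows; key names are taken from the pronunciation assessment documentation, but verify the exact casing against your SDK version:

```javascript
// Sketch: build the JSON configuration string for
// PronunciationAssessmentConfig.fromJSON in the JS Speech SDK.
// Key names follow the pronunciation assessment docs; verify casing
// against your SDK version before relying on this.
const pronunciationConfig = {
  referenceText: "there",
  gradingSystem: "HundredMark",
  granularity: "Phoneme",
  phonemeAlphabet: "SAPI",
  nBestPhonemeCount: 5,
  enableMiscue: true // without this, extra or missing words are not flagged
};

const configJson = JSON.stringify(pronunciationConfig);
console.log(configJson);
```

    The resulting string can then be passed to PronunciationAssessmentConfig.fromJSON(configJson) and attached to the recognizer with applyTo.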

    Second, pronunciation assessment must be treated separately from lexical correctness. Azure Speech evaluates how words sound, not whether the correct words were spoken. To detect substitutions like “they” instead of “there,” developers must compare the recognized text output from Speech-to-Text against the reference text using custom logic. - https://learn.microsoft.com/azure/ai-services/speech-service/speech-to-text
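    As an illustration of that custom logic, here is a minimal sketch in plain Node.js (no SDK required; findSubstitutions and normalize are hypothetical helper names) that compares the recognizer's lexical output against the reference text word by word:

```javascript
// Lowercase, strip punctuation, and split a sentence into words.
function normalize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z'\s]/g, "")
    .split(/\s+/)
    .filter(Boolean);
}

// Return the positions where the recognized word differs from the reference.
function findSubstitutions(referenceText, recognizedText) {
  const ref = normalize(referenceText);
  const hyp = normalize(recognizedText);
  const issues = [];
  const len = Math.max(ref.length, hyp.length);
  for (let i = 0; i < len; i++) {
    if (ref[i] !== hyp[i]) {
      issues.push({ index: i, expected: ref[i] ?? null, actual: hyp[i] ?? null });
    }
  }
  return issues;
}

// One substitution reported: expected "there", actual "they".
console.log(findSubstitutions("there", "they"));
```

    Note the comparison here is purely positional; for longer passages where words may be inserted or omitted, an alignment (e.g. edit distance over the word sequences) would be more robust.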

    Third, phoneme-level scores should be interpreted as heuristic indicators, not absolute truth. Certain phoneme combinations (such as /th/ + vowel sounds) may be scored inconsistently due to internal alignment and acoustic modeling. For better inspection, developers should examine NBest phoneme alternatives rather than relying on a single phoneme score. - https://learn.microsoft.com/azure/ai-services/speech-service/how-to-pronunciation-assessment#phoneme-level-assessment
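    As a sketch of that inspection, the helper below (a hypothetical function name; the JSON shape is modeled on the documented detailed result when nbestPhonemeCount is set) checks whether the expected phoneme appears anywhere among the N-best candidates instead of trusting only the top-1 label:

```javascript
// Check whether the expected phoneme appears in the N-best candidate
// list of one phoneme entry from the detailed JSON result.
// The entry shape is an assumption based on the pronunciation
// assessment docs; verify against your actual service response.
function expectedPhonemeInNBest(phonemeEntry, expected) {
  const nbest = phonemeEntry.PronunciationAssessment?.NBestPhonemes ?? [];
  return nbest.some((p) => p.Phoneme === expected);
}

// Example entry with illustrative (made-up) scores for /th/ in "thigh".
const thEntry = {
  Phoneme: "th",
  PronunciationAssessment: {
    AccuracyScore: 42,
    NBestPhonemes: [
      { Phoneme: "f", Score: 51 },
      { Phoneme: "th", Score: 42 },
      { Phoneme: "s", Score: 3 }
    ]
  }
};

console.log(expectedPhonemeInNBest(thEntry, "th")); // true
```

    Even when the top-1 score for /th/ is low, finding /th/ among the N-best alternatives with a competitive score is a useful signal that the pronunciation was acceptable.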

    Fourth, developers should design around known model limitations rather than expecting strict correctness. Phonetically similar word substitutions are often accepted by the model and scored highly by design. This behavior is intentional and reflects the service’s emphasis on spoken fluency and intelligibility, not semantic or grammatical validation.

    A robust implementation combines configuration and post-processing logic, as shown below. The code enables miscue detection and applies the pronunciation assessment configuration to the recognizer; the recognized text can then be compared against the reference text downstream for exact word validation:

    // Enable miscue detection so word insertions and omissions are flagged (C#).
    var config = new PronunciationAssessmentConfig(
        referenceText,
        GradingSystem.HundredMark,
        Granularity.Phoneme,
        enableMiscue: true
    );

    // Attach the assessment configuration to an existing SpeechRecognizer.
    config.ApplyTo(speechRecognizer);
    

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.


1 additional answer

  1. SRILAKSHMI C 13,585 Reputation points Microsoft External Staff Moderator
    2026-01-20T16:38:56.25+00:00

    Hello Bhamin Patel,

    Welcome to Microsoft Q&A and thank you for reaching out.

    Based on the testing details and examples you’ve shared, what you’re observing aligns with known limitations of the current Azure Speech Pronunciation Assessment models, particularly at the word and phoneme granularity.

    1. Incorrect word substitution not flagged as mispronunciation

    When a spoken word is phonetically close to the reference word (for example, “they” instead of “there”, or “apple” instead of “pineapple”), the service may still return a high pronunciation score and not mark it as Mispronunciation.

    This happens because the pronunciation assessment model is optimized to evaluate acoustic similarity, not strict lexical correctness. If the substituted word aligns closely at the phoneme level, the model may treat it as an acceptable match, even though the word itself is incorrect.

    As a result:

    • Word substitution errors are not always reliably detected
    • Overall pronunciation scores may remain high
    • This is a model behavior, not a configuration or SDK issue
    2. Low or inconsistent phoneme scores for specific sounds (e.g., /th/ in “thigh”)

    Your observation with the word “thigh” is particularly useful. While other /th/ words such as “think” or “through” behave correctly, the /th ay/ phoneme sequence can sometimes be mis-evaluated.

    In some cases:

    • The /th/ phoneme receives a low score despite correct pronunciation
    • The vowel sound alone (“ay”) may be incorrectly accepted
    • This inconsistency appears in both SDK usage and the Azure Pronunciation Assessment Playground

    This again points to model-level phoneme alignment limitations, especially in specific phoneme transitions, rather than an issue with audio quality or API usage.

    3. What influences these results

    The behavior you’re seeing is primarily influenced by:

    • The pronunciation assessment model itself
    • How phoneme similarity and alignment are scored internally

    Factors like audio format, SDK version, and configuration are important, but in this case:

    • Your settings are correct
    • The issue reproduces across environments
    • This confirms it is not an implementation defect

    Avoid relying solely on word-level mispronunciation flags for strict correctness. If lexical accuracy is critical, consider:

    • Comparing recognized text vs reference text
    • Adding a secondary validation layer for substitutions
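    One possible shape for that secondary validation layer, sketched in plain JavaScript (editDistance and isSubstitution are illustrative helper names, not part of the SDK): a word-level edit distance lets the application treat “they” vs “there” as a lexical substitution even when the pronunciation score is high.

```javascript
// Classic Levenshtein edit distance between two strings.
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Flag a recognized word as a substitution if it is not an exact
// (case-insensitive) match for the reference word, regardless of score.
function isSubstitution(referenceWord, recognizedWord) {
  return editDistance(referenceWord.toLowerCase(), recognizedWord.toLowerCase()) > 0;
}

console.log(isSubstitution("there", "they")); // true
```

    The distance itself can also drive finer-grained feedback, for example distinguishing a near-miss (distance 1) from an entirely different word.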

    Set expectations appropriately in academic or learning scenarios:

    • Pronunciation Assessment is best suited for evaluating pronunciation quality, not absolute word correctness
    • Some phonetically close substitutions may not be penalized as expected

    In short, the observed behavior is expected with the current model:

    • Word substitutions may not always be flagged if they are phonetically similar
    • Certain phoneme sequences (such as /th ay/) can show inconsistent scoring
    • This is a model limitation, not an SDK, configuration, or audio issue

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.
