Pronunciation assessment in Speech Studio

Pronunciation assessment uses the Speech-to-Text capability to provide subjective and objective feedback for language learners. Practicing pronunciation and getting timely feedback are essential for improving language skills. Assessments driven by experienced teachers can take a lot of time and effort, which makes a high-quality assessment expensive for learners. Pronunciation assessment can help make language assessment more engaging and accessible to learners of all backgrounds.

Pronunciation assessment provides results at several levels of granularity, from individual phonemes to the entire text input.

  • At the full-text level, pronunciation assessment offers additional Fluency and Completeness scores: Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words, and Completeness indicates how many of the words in the reference text input are pronounced in the speech. An overall score aggregated from Accuracy, Fluency, and Completeness then indicates the overall pronunciation quality of the given speech.
  • At the word level, pronunciation assessment can automatically detect miscues and provide an accuracy score at the same time, giving more detailed information about omissions, repetitions, insertions, and mispronunciations in the given speech.
  • Syllable-level accuracy scores are currently only available via the JSON file or Speech SDK.
  • At the phoneme level, pronunciation assessment provides an accuracy score for each phoneme, helping learners better understand the pronunciation details of their speech.
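
To make the word-level miscue types and the Completeness score more concrete, the following minimal sketch aligns a recognized transcript with the reference text and labels each word. It is an illustration only, not the method the service uses: it relies on Python's standard `difflib` module, the function names `detect_miscues` and `completeness` are hypothetical, and a text-only comparison cannot detect mispronunciation, which requires the audio.

```python
from difflib import SequenceMatcher


def detect_miscues(reference: str, recognized: str) -> list[tuple[str, str]]:
    """Label each word as "None", "Omission", or "Insertion" (hypothetical helper)."""
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    events: list[tuple[str, str]] = []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Word appears in both the reference and the transcript.
            events += [(w, "None") for w in ref[i1:i2]]
        else:
            # Reference words missing from the transcript are omissions;
            # transcript words missing from the reference are insertions.
            events += [(w, "Omission") for w in ref[i1:i2]]
            events += [(w, "Insertion") for w in hyp[j1:j2]]
    return events


def completeness(reference: str, recognized: str) -> float:
    """Percentage of reference words found in the transcript (assumed definition)."""
    ref_count = len(reference.split())
    if ref_count == 0:
        return 0.0
    matched = sum(1 for _, label in detect_miscues(reference, recognized)
                  if label == "None")
    return 100.0 * matched / ref_count


if __name__ == "__main__":
    for word, label in detect_miscues("good morning to everyone",
                                      "good morning everyone please"):
        print(f"{word:10s} {label}")
```

In this example, "to" is reported as an omission and "please" as an insertion, so three of the four reference words count toward completeness.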

This article describes how to use the pronunciation assessment tool in Speech Studio. You can get immediate feedback on the accuracy and fluency of your speech without writing any code. For information about how to integrate pronunciation assessment in your speech applications, see How to use pronunciation assessment.


Usage of pronunciation assessment is charged at the same rate as standard Speech to Text.

For information about availability of pronunciation assessment, see supported languages and available regions.

Try out pronunciation assessment

You can explore and try out pronunciation assessment even without signing in.


To assess more than 5 seconds of speech with your own script, sign in with an Azure account and use your Speech resource.

Follow these steps to assess your pronunciation of the reference text:

  1. Go to Pronunciation Assessment in the Speech Studio.

  2. Choose a supported language in which you want to evaluate pronunciation.

  3. Choose from the provisioned text samples, or under the Enter your own script label, enter your own reference text.

    When reading the text, stay close to the microphone so that the recorded voice isn't too quiet.

    Screenshot of where to record audio with a microphone.

    Otherwise, you can upload recorded audio for pronunciation assessment. Once the upload succeeds, the system evaluates the audio automatically, as shown in the following screenshot.

    Screenshot of uploading recorded audio to be assessed.

Pronunciation assessment results

Once you've recorded the reference text or uploaded the recorded audio, the Assessment result is displayed. The result includes your spoken audio and feedback on its accuracy and fluency, produced by comparing a machine-generated transcript of the input audio with the reference text. You can listen to your spoken audio and download it if necessary.

You can also check the pronunciation assessment result in JSON. The word-level, syllable-level, and phoneme-level accuracy scores are included in the JSON file.
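
As a rough illustration of working with such a result, the sketch below reads the overall scores, the word-level scores, and any low-scoring phonemes from a JSON string. It assumes the file follows the layout the Speech SDK returns (an `NBest` list whose first entry carries `PronunciationAssessment`, `Words`, and `Phonemes` fields); the file from Speech Studio may differ. The sample values, the `summarize` function, and the threshold of 60 are invented for illustration.

```python
import json

# Invented sample; field names follow the layout assumed in the lead-in.
SAMPLE_RESULT = """
{
  "DisplayText": "Good morning.",
  "NBest": [
    {
      "PronunciationAssessment": {
        "AccuracyScore": 82.0, "FluencyScore": 90.0,
        "CompletenessScore": 100.0, "PronScore": 86.0
      },
      "Words": [
        {
          "Word": "good",
          "PronunciationAssessment": {"AccuracyScore": 95.0, "ErrorType": "None"},
          "Phonemes": [
            {"Phoneme": "g", "PronunciationAssessment": {"AccuracyScore": 98.0}},
            {"Phoneme": "uh", "PronunciationAssessment": {"AccuracyScore": 92.0}},
            {"Phoneme": "d", "PronunciationAssessment": {"AccuracyScore": 95.0}}
          ]
        },
        {
          "Word": "morning",
          "PronunciationAssessment": {"AccuracyScore": 55.0, "ErrorType": "Mispronunciation"},
          "Phonemes": [
            {"Phoneme": "m", "PronunciationAssessment": {"AccuracyScore": 90.0}},
            {"Phoneme": "ao", "PronunciationAssessment": {"AccuracyScore": 40.0}}
          ]
        }
      ]
    }
  ]
}
"""


def summarize(result_json: str, threshold: float = 60.0) -> dict:
    """Collect overall scores, per-word scores, and phonemes below the threshold."""
    best = json.loads(result_json)["NBest"][0]
    words, weak_phonemes = [], []
    for word in best.get("Words", []):
        assessment = word["PronunciationAssessment"]
        words.append((word["Word"], assessment["AccuracyScore"],
                      assessment.get("ErrorType", "None")))
        for phoneme in word.get("Phonemes", []):
            score = phoneme["PronunciationAssessment"]["AccuracyScore"]
            if score < threshold:
                weak_phonemes.append((word["Word"], phoneme["Phoneme"], score))
    return {"overall": best["PronunciationAssessment"],
            "words": words,
            "weak_phonemes": weak_phonemes}


if __name__ == "__main__":
    report = summarize(SAMPLE_RESULT)
    print(report["overall"])
    print(report["words"])
    print(report["weak_phonemes"])
```

A learner-facing report could use the `weak_phonemes` list to point out which sounds in which words need the most practice.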

Overall scores

Pronunciation Assessment evaluates three aspects of pronunciation: accuracy, fluency, and completeness. At the bottom of the Assessment result, you can see the Pronunciation score, Accuracy score, Fluency score, and Completeness score. The Pronunciation score is an overall score indicating the pronunciation quality of the given speech. It is a weighted aggregate of the Accuracy score, Fluency score, and Completeness score.

Screenshot of overall assessment scores.
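
The idea of a weighted aggregate can be sketched as follows. This article does not state the weights or the exact formula the service applies, so the weights below (0.6, 0.2, 0.2) and the function name `overall_score` are hypothetical, and the result is not expected to match the Pronunciation score shown in Speech Studio.

```python
def overall_score(accuracy: float, fluency: float, completeness: float,
                  weights: tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Weighted average of three scores on a 0-100 scale.

    The default weights are placeholders chosen for illustration only.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    scores = (accuracy, fluency, completeness)
    # Normalize by the weight total so the result stays on the 0-100 scale.
    return sum(w * s for w, s in zip(weights, scores)) / total


if __name__ == "__main__":
    print(round(overall_score(80.0, 90.0, 100.0), 1))
```

With these placeholder weights, a low Accuracy score pulls the overall score down more than an equally low Fluency or Completeness score would.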

Scores within words

The complete transcription is shown in the Display window. If a word is omitted, inserted, or mispronounced compared to the reference text, the word is highlighted according to the error type. Hover over a word to see accuracy scores for the whole word or for specific phonemes.

Screenshot of scores for a word and its phonemes.

Next steps