Characteristics and limitations of Pronunciation Assessment

2025-06-24

Important

Non-English translations are provided for convenience only. Please consult the EN-US version of this document for the binding version.

As part of the Azure AI Speech service, Pronunciation Assessment supports end-to-end education solutions for computer-assisted language learning. It uses multiple criteria to assess learners' performance at multiple levels of detail, and its judgments are similar to those of human judges.

How accurate is Pronunciation Assessment?

The Pronunciation Assessment feature provides objective scores, such as pronunciation accuracy and fluency, for language learners in computer-assisted language learning. Its performance depends on two factors: the accuracy of Azure AI Speech-To-Text transcription, which uses a submitted transcription as the reference, and the inter-rater agreement between the system and human judges. For a definition of Speech-To-Text accuracy, see Characteristics and limitations for using speech to text.

The following sections are designed to help you understand key concepts about accuracy as they apply to using Pronunciation Assessment.

The language of accuracy

The accuracy of Speech-To-Text affects Pronunciation Assessment. Word error rate (WER) is the industry-standard measure of Speech-To-Text accuracy. WER counts the number of incorrect words identified during recognition (substituted, deleted, and inserted words) and then divides by the total number of words in the correct transcript, which is often created by human labeling.
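As an illustration of the WER definition above, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the normalization (lowercasing and splitting on whitespace) are assumptions made for this example; production evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors / 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because inserted words count as errors, WER can exceed 1.0 when the recognized text is much longer than the reference.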

Comparing Pronunciation Assessment to human judges

The Pearson correlation coefficient is used to measure the correlation between scores generated by the Pronunciation Assessment API and scores assigned by human judges. It measures the linear correlation between two sequences of values and is widely used to compare automatically generated machine results with human-annotated labels. The coefficient takes a value between -1 and 1: 0 means no correlation, a negative value means the prediction moves opposite to the target, and a positive value indicates how closely the prediction is aligned with the target.
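To make the coefficient concrete, the following sketch computes it from two equally long score lists. The function name `pearson_correlation` and the sample scores are hypothetical and are not taken from any Microsoft evaluation.

```python
import math


def pearson_correlation(machine_scores, human_scores):
    """Return the Pearson correlation coefficient of two equally long sequences."""
    if len(machine_scores) != len(human_scores) or len(machine_scores) < 2:
        raise ValueError("need two sequences of the same length (at least 2 items)")

    mean_m = sum(machine_scores) / len(machine_scores)
    mean_h = sum(human_scores) / len(human_scores)

    # Sum of co-deviations and sums of squared deviations from each mean.
    covariance = sum(
        (m - mean_m) * (h - mean_h) for m, h in zip(machine_scores, human_scores)
    )
    spread_m = sum((m - mean_m) ** 2 for m in machine_scores)
    spread_h = sum((h - mean_h) ** 2 for h in human_scores)
    if spread_m == 0 or spread_h == 0:
        raise ValueError("correlation is undefined when one sequence is constant")

    return covariance / math.sqrt(spread_m * spread_h)


# Hypothetical scores for five utterances (illustrative values only).
machine = [82.0, 64.0, 91.0, 55.0, 73.0]
human = [80.0, 60.0, 95.0, 50.0, 70.0]
r = pearson_correlation(machine, human)
```

The coefficient only captures linear agreement, so it is worth plotting the score pairs as well when you evaluate on your own data.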

The proposed guidelines for interpreting a Pearson correlation coefficient are shown in the following table. The strength of association describes how closely two variables are related and reflects how consistently the machine results align with human labels. Values that are close to 1 indicate a stronger correlation.

Strength of Association	Coefficient Value	Detail
Low	0.1 to 0.3	The autogenerated scores from an automatic system aren't significantly aligned with the perception of humans.
Medium	0.3 to 0.5	The autogenerated scores from an automatic system are aligned with the perception of humans, but differences still exist, and people might not agree with the result.
High	0.5 to 1.0	The autogenerated scores from an automatic system are aligned with the perception of humans, and people are willing to agree with the system results.

In our evaluations, Microsoft Pronunciation Assessment achieved a Pearson correlation greater than 0.5 with human judges' results, which indicates that the autogenerated results are highly consistent with the judgment of human experts.
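The interpretation table above can be expressed as a small lookup. This is a sketch only: the table's ranges share their endpoints (0.3 and 0.5), so assigning a boundary value to the higher category is an assumption made here, and the function name `association_strength` is hypothetical. Values below 0.1, including negative values, are not covered by the table.

```python
def association_strength(coefficient: float) -> str:
    """Map a Pearson coefficient to the strength labels used in the table."""
    if not -1.0 <= coefficient <= 1.0:
        raise ValueError("a Pearson coefficient must lie between -1 and 1")
    # Assumption: boundary values (0.3, 0.5) fall into the higher category.
    if coefficient >= 0.5:
        return "High"
    if coefficient >= 0.3:
        return "Medium"
    if coefficient >= 0.1:
        return "Low"
    return "Not covered by the table"


label = association_strength(0.62)  # "High"
```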

System limitations and best practices to improve system accuracy

Pronunciation Assessment works better with higher-quality audio input. We recommend an input sample rate of 16 kHz or higher.
Pronunciation Assessment quality is also affected by the distance of the speaker from the microphone. Recordings should be made with the speaker close to the microphone, and not over a remote connection.
Pronunciation Assessment doesn't support mixed-language assessment scenarios. Each assessment should contain speech in a single language.
Pronunciation Assessment supports a broader range of languages.
Pronunciation Assessment doesn't support a multi-speaker assessment scenario. The audio should include only one speaker for each assessment.
Pronunciation Assessment compares the submitted audio to native speakers in general conditions. The speaker should maintain a normal speaking speed and volume, and avoid shouting or otherwise raising their voice.
Pronunciation Assessment performs better in an environment with little background noise. Current Speech-To-Text models accommodate noise in general conditions. Noisy environments or multiple people speaking at the same time might lower the confidence of the evaluation. To handle difficult cases better, you can suggest that the speaker repeat a pronunciation if they score below a certain threshold.
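As one way to act on the audio-quality recommendation in the list above, the following sketch checks the sample rate of a WAV recording before it is submitted for assessment. It uses only Python's standard `wave` module and does not call the Speech service; the function name `meets_recommended_sample_rate` and the file name in the usage comment are hypothetical, and the check assumes an uncompressed PCM WAV file.

```python
import wave

# Recommended minimum from the guidance above: 16 kHz.
MIN_SAMPLE_RATE_HZ = 16_000


def meets_recommended_sample_rate(
    path: str, minimum_hz: int = MIN_SAMPLE_RATE_HZ
) -> bool:
    """Return True if the WAV file at `path` is sampled at `minimum_hz` or higher."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() >= minimum_hz


# Example usage (hypothetical file name):
# if not meets_recommended_sample_rate("learner_recording.wav"):
#     print("Please record again with a sample rate of 16 kHz or higher.")
```

A sample-rate check can't detect background noise, distance from the microphone, or multiple speakers, so it complements the other recommendations rather than replacing them.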

Evaluating Pronunciation Assessment in your applications

Pronunciation Assessment's performance varies depending on the real-world uses that customers implement. To ensure optimal performance in their scenarios, customers should conduct their own evaluations of the solutions they implement using Pronunciation Assessment.

Before using Pronunciation Assessment in your applications, consider whether this product performs well in your scenario. Collect real-life data from the target scenario, test how Pronunciation Assessment performs, and make sure Speech-To-Text and Pronunciation Assessment can deliver the accuracy you need. For more information, see Evaluate and improve Azure AI services Custom Speech accuracy.
Select suitable thresholds for the target scenario. Pronunciation Assessment provides accuracy scores at different levels, and you might need to adjust the threshold you use in practice. For example, the grading method for children's learning might not be as strict as that for adult learning. Consider setting a higher mispronunciation detection threshold for adult learning.
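The threshold guidance above could be sketched as a per-audience lookup that decides whether to ask the learner to repeat a pronunciation. Everything specific here is an assumption for illustration: the threshold values, the audience names, the function name `needs_retry`, and a score scale of 0 to 100. Choose real thresholds from evaluations on your own data.

```python
# Hypothetical thresholds on an assumed 0-100 accuracy scale.
# Adult learners are graded more strictly than children.
MISPRONUNCIATION_THRESHOLDS = {
    "child": 50.0,
    "adult": 70.0,
}


def needs_retry(accuracy_score: float, audience: str) -> bool:
    """Return True if the score falls below the threshold for this audience."""
    if audience not in MISPRONUNCIATION_THRESHOLDS:
        raise ValueError(f"unknown audience: {audience!r}")
    return accuracy_score < MISPRONUNCIATION_THRESHOLDS[audience]


# The same score passes for a child but triggers a retry for an adult.
child_retry = needs_retry(60.0, "child")  # False
adult_retry = needs_retry(60.0, "adult")  # True
```

Keeping the thresholds in one table makes it straightforward to tune them per scenario without touching the decision logic.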