Azure Speech Recognition gets worse results when the pronunciation assessment feature is added

Question

Chuong Phung 70

Hello guys,

I'm using the Azure Speech Recognition service. When I use speech recognition on its own, the results are really good, but when I add the pronunciation assessment feature, the results get worse.

How can I fix this? I'm using the Python SDK.

I tested it in Azure AI Studio (https://ai.azure.com/explore/aiservices/speech/pronunciationassessment) and got the same issue.

I can't upload the audio file here, so you can download it from: https://1drv.ms/u/c/05f6d99de34ab7b6/EWxt35-IAZxFihV9YhbsVpYBicKiTB2-mT-GkYW243aBkA?e=BjnRLC

Output in real-time speech to text:

User's image

Output with pronunciation assessment:

image (1)

Answer 1

Hi @Chuong Phung,

Thank you for reaching out regarding the Azure Speech Recognition service. I was able to reproduce your scenario, and I found that the recognition results are quite strong when using the service alone. With the pronunciation assessment feature, I was also able to achieve meaningful insights.

However, I noticed that some pronunciation challenges in the audio may have affected the assessment results. Clear pronunciation is essential for optimal performance, and focusing on specific areas of improvement could enhance the effectiveness of the assessment.

Here are a couple of suggestions that might help improve the results:

Reference Text Accuracy: Ensure that the reference text used for the pronunciation assessment closely matches the spoken content in the audio. Any discrepancies can lead to lower assessment scores.
Audio Quality: Clearer audio with minimal background noise often leads to better performance in both recognition and assessment. If it's possible to provide a cleaner audio sample, it might enhance the outcome.
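
As a rough illustration of the reference-text point above, the sketch below compares a reference text with a transcript word by word and returns a similarity between 0 and 1. It uses only the Python standard library; the function names and the idea of checking similarity before running an assessment are assumptions made for this example, not part of the Azure SDK.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def reference_match_ratio(reference_text: str, transcript: str) -> float:
    """Return a 0..1 similarity between the two word sequences (hypothetical helper)."""
    ref_words = normalize(reference_text)
    hyp_words = normalize(transcript)
    if not ref_words and not hyp_words:
        return 1.0  # two empty texts are trivially identical
    return SequenceMatcher(None, ref_words, hyp_words).ratio()
```

A low ratio suggests the reference text and the spoken content have drifted apart, in which case lower assessment scores may reflect the mismatch rather than the speaker's pronunciation.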

Refer to the Python SDK here: cognitive-services-speech-sdk.

Screenshot for reference:

User's image

If you continue to face any issues, please let us know, and we will escalate this issue to the relevant team for further assistance.

Thank you.

Chuong Phung 70 Reputation points

2024-10-23T14:47:27.67+00:00

Hello @santoshkc ,

Thank you for your answer.

We are using Azure Speech to Text in our application to get transcripts from the user's voice, so we have no reference text on our side.

We are using Azure STT to get the transcript directly from user audio.

We are currently using the SpeechRecognition Python SDK with pronunciation assessment enabled. This configuration allows us to obtain both the transcript and the pronunciation assessment simultaneously. However, we have noticed that disabling pronunciation assessment (using only speech-to-text) results in higher quality transcripts.

Is there a way to first get the transcript from Azure STT and then perform the pronunciation assessment on that transcript afterward?

santoshkc 15,600 Reputation points Microsoft External Staff Moderator

2024-10-24T14:19:43.56+00:00

Hi @Chuong Phung,

I understand your use case better now. Since you're using Azure Speech-to-Text (STT) to get transcripts directly from user audio without having reference text beforehand, you can indeed separate the transcription step from the pronunciation assessment for improved control over both processes.

Here’s how you can approach this:

Step 1 - Transcription: You can first use Azure STT to transcribe the user’s speech into text without enabling pronunciation assessment. This will ensure you get the highest quality transcription results.

Step 2 - Pronunciation Assessment: Once you have the transcript, you can then use this transcribed text as the reference text for the pronunciation assessment feature. This will allow you to evaluate pronunciation based on the transcript generated by Azure STT.

This two-step approach should help you achieve the best of both: high-quality transcripts and meaningful pronunciation assessments.
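
The two steps above could be sketched with the Python Speech SDK roughly as follows. This is a minimal sketch, not a tested production setup: the function names, the `key`, `region`, and `audio_path` parameters, and the choice of the hundred-mark grading system are assumptions for illustration. It uses `recognize_once()`, which handles a single utterance, so longer recordings would need continuous recognition instead. The SDK is imported inside the functions so the module still loads where the package is not installed.

```python
def transcribe(audio_path: str, key: str, region: str, language: str = "en-US") -> str:
    """Step 1: plain speech-to-text with no pronunciation assessment attached."""
    import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=audio_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config, language=language
    )
    result = recognizer.recognize_once()  # single utterance only
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Recognition failed: {result.reason}")
    return result.text


def scores_to_dict(pa_result) -> dict:
    """Copy the overall scores from a PronunciationAssessmentResult into a dict."""
    return {
        "accuracy": pa_result.accuracy_score,
        "fluency": pa_result.fluency_score,
        "completeness": pa_result.completeness_score,
        "pronunciation": pa_result.pronunciation_score,
    }


def assess(audio_path: str, reference_text: str, key: str, region: str,
           language: str = "en-US") -> dict:
    """Step 2: assess the same audio, using the step 1 transcript as reference text."""
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=audio_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config, language=language
    )
    pa_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
    )
    pa_config.apply_to(recognizer)  # attach the assessment to this recognizer only
    result = recognizer.recognize_once()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Assessment failed: {result.reason}")
    return scores_to_dict(speechsdk.PronunciationAssessmentResult(result))
```

With real credentials, usage would look like `text = transcribe("sample.wav", key, region)` followed by `scores = assess("sample.wav", text, key, region)`, where `sample.wav` is a placeholder file name.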

See: Speech to text quickstart and Pronunciation assessment.

Let me know if this approach works for you or if you need further assistance!

Chuong Phung 70 Reputation points

2024-10-24T15:56:12.6966667+00:00

Hi @santoshkc ,

So that means we need to pay double for this feature, is that correct? Because the pronunciation assessment cannot be run on its own.

Chuong Phung 70 Reputation points

2024-10-24T15:58:58.4133333+00:00

Hello @santoshkc, so that means we will be billed twice for this feature. Is this correct? Because the pronunciation assessment feature cannot be run stand-alone?

santoshkc 15,600 Reputation points Microsoft External Staff Moderator

2024-10-25T12:10:31.65+00:00

Hi @Chuong Phung,

Thank you for your query regarding the cost implications.

Running transcription and pronunciation assessment separately would indeed result in two calls, potentially leading to additional costs. Unfortunately, as of now, the pronunciation assessment feature relies on a provided or transcribed reference text, so it cannot be run independently without some form of initial transcription.

If cost-efficiency is a key consideration, you might explore optimizing usage by selectively enabling pronunciation assessment only when specific feedback is needed. Alternatively, I’d be happy to check with the Azure team if any upcoming updates might support stand-alone assessments in the future.
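
One way to picture "selectively enabling" the assessment is a small gate in the application that decides, per utterance, whether to make the extra assessment call. The sketch below is purely illustrative: `should_assess`, its parameters, and the spot-check idea are hypothetical and not an Azure feature.

```python
def should_assess(feedback_requested: bool, utterance_index: int, sample_every: int = 0) -> bool:
    """Decide whether to make the extra pronunciation assessment call (hypothetical policy).

    feedback_requested: the user explicitly asked for pronunciation feedback.
    sample_every: if > 0, also assess every Nth utterance as a spot check.
    """
    if feedback_requested:
        return True
    return sample_every > 0 and utterance_index % sample_every == 0
```

Every utterance would still go through plain speech-to-text; only those for which the gate returns `True` would be sent for assessment as well.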

I hope you understand. Thank you.

Chuong Phung 70 Reputation points

2024-10-27T06:47:48.1733333+00:00

@santoshkc ,

Thank you for your reply. I understand. If you have any new updates on this, please let me know.

Thanks

santoshkc 15,600 Reputation points Microsoft External Staff Moderator

2024-10-28T06:14:10.3533333+00:00

Hi @Chuong Phung,

You're very welcome! We will keep you updated with any new information. Thank you.
