Azure Speech Recognition gives worse results when the pronunciation assessment feature is added

Chuong Phung 70 Reputation points
2024-10-23T10:37:12.4533333+00:00

Hello everyone,

I'm using the Azure Speech Recognition service. When I use speech recognition on its own, the results are really good, but when I add the pronunciation assessment feature, the results get worse.

How can I fix this? I'm using the Python SDK.

I tested it in Azure AI Studio (https://ai.azure.com/explore/aiservices/speech/pronunciationassessment) and saw the same problem.

I cannot upload the audio file here, so you can download it from: https://1drv.ms/u/c/05f6d99de34ab7b6/EWxt35-IAZxFihV9YhbsVpYBicKiTB2-mT-GkYW243aBkA?e=BjnRLC

Output in real-time speech to text:

User's image

Output in pronunciation assessment:

image (1)

Azure AI Speech

Answers
  1. santoshkc 15,325 Reputation points Microsoft External Staff Moderator
    2024-10-23T13:51:58.4766667+00:00

    Hi @Chuong Phung,

    Thank you for reaching out regarding the Azure Speech Recognition service. I was able to reproduce your scenario, and I found that the recognition results are quite strong when using the service alone. With the pronunciation assessment feature, I was also able to achieve meaningful insights.

    However, I noticed that some pronunciation challenges in the audio may have affected the assessment results. Clear pronunciation is essential for optimal performance, and focusing on specific areas of improvement could enhance the effectiveness of the assessment.

    Here are a couple of suggestions that might help improve the results:

    1. Reference Text Accuracy: Ensure that the reference text used for the pronunciation assessment closely matches the spoken content in the audio. Any discrepancies can lead to lower assessment scores.
    2. Audio Quality: Clearer audio with minimal background noise often leads to better performance in both recognition and assessment. If it's possible to provide a cleaner audio sample, it might enhance the outcome.
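
The first suggestion can be sketched roughly as follows, assuming the Speech SDK for Python (`azure-cognitiveservices-speech`) is installed and a valid key and region are available. The helper names `words`, `reference_match_ratio`, and `assess_with_reference` are hypothetical and exist only for this illustration. The similarity check is a plain word-level comparison from the standard library, not something the service provides; it is one possible way to spot a reference text that drifts from what was actually said before scoring against it.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def reference_match_ratio(reference_text, transcript):
    """Return a 0..1 similarity between the two word sequences (1.0 = identical)."""
    matcher = difflib.SequenceMatcher(None, words(reference_text), words(transcript))
    return matcher.ratio()


def assess_with_reference(wav_path, reference_text, key, region, language="en-US"):
    """Scripted pronunciation assessment of one utterance.

    Needs the Speech SDK and real credentials; `key` and `region` are placeholders.
    """
    # Imported here so the helpers above can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, language=language, audio_config=audio_config
    )

    # The reference text should be what the speaker was expected to say.
    pa_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
        enable_miscue=True,
    )
    pa_config.apply_to(recognizer)

    result = recognizer.recognize_once()  # handles a single utterance
    return result.text, speechsdk.PronunciationAssessmentResult(result)
```

For example, `reference_match_ratio("good morning", "good evening")` shares one word out of two on each side, so a low value like this would suggest fixing the reference text before trusting the assessment scores.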

    Refer to the Python SDK here: cognitive-services-speech-sdk.

    Screen-shot for reference:

    User's image

    If you continue to face any issues, please let us know, and we will escalate this issue to the relevant team for further assistance.

    Thank you.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".


  2. Chuong Phung 70 Reputation points
    2024-10-23T14:49:12.3833333+00:00

    Hello @santoshkc ,

    Thank you for your answer.

    We use Azure Speech to Text in our application to get transcripts directly from the user's audio, so we have no reference text on our side.

    We are currently using the SpeechRecognition Python SDK with pronunciation assessment enabled. This configuration allows us to obtain both the transcript and the pronunciation assessment simultaneously. However, we have noticed that disabling pronunciation assessment (using only speech-to-text) results in higher quality transcripts.

    Is there a way to first get the transcript from Azure STT and then perform the pronunciation assessment on that transcript afterward?
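
The two-pass idea in this question could look roughly like the sketch below, assuming the Speech SDK for Python (`azure-cognitiveservices-speech`) and a valid key and region. The function names `transcribe_plain`, `assess_against`, and `two_pass` are hypothetical, and this is not a confirmed recommendation from the Speech team. Pass 1 runs plain speech to text; pass 2 re-runs the same audio with the pass 1 transcript as the reference text. One caveat: the scores are then relative to what the recognizer heard, not to what the speaker intended to say.

```python
def transcribe_plain(wav_path, key, region, language="en-US"):
    """Pass 1: plain speech to text, with no pronunciation assessment attached."""
    # Imported lazily so two_pass() can be exercised without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, language=language, audio_config=audio_config
    )
    result = recognizer.recognize_once()  # single utterance only
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        raise RuntimeError(f"Recognition failed: {result.reason}")
    return result.text


def assess_against(wav_path, reference_text, key, region, language="en-US"):
    """Pass 2: score the same audio against a given reference text."""
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, language=language, audio_config=audio_config
    )
    pa_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
    )
    pa_config.apply_to(recognizer)
    pa = speechsdk.PronunciationAssessmentResult(recognizer.recognize_once())
    return {
        "accuracy": pa.accuracy_score,
        "fluency": pa.fluency_score,
        "completeness": pa.completeness_score,
        "pronunciation": pa.pronunciation_score,
    }


def two_pass(wav_path, transcribe, assess):
    """Get the transcript first, then assess the audio against that transcript."""
    transcript = transcribe(wav_path)
    scores = assess(wav_path, transcript)
    return transcript, scores
```

A call might look like `two_pass("audio.wav", lambda p: transcribe_plain(p, KEY, REGION), lambda p, ref: assess_against(p, ref, KEY, REGION))`, where `KEY` and `REGION` are placeholders for your own credentials. Since `recognize_once()` handles a single utterance, longer recordings would need continuous recognition instead.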
