Unable to Get Logical Results with Azure Pronunciation Assessment

Question

tzviya langenthal 0

I'm trying to use the pronunciation assessment feature in the Azure Speech SDK, but I cannot get a reasonable result.
I've tested this with the word "school" and other words as well, but I always get an accuracy score of 0, no matter whether the word was spoken correctly or not. I generated the audio files using this Text-to-Audio tool, so this should be easily reproducible.

Does anyone have any idea why the accuracy score is always 0, or what I might be missing?

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.PronunciationAssessment;

namespace PronunciationAssessmentDemo
{
    class Program
    {
        // Wraps the file bytes in a push stream created with the SDK's default audio format.
        public static AudioConfig CreateAudioConfigFromBytes(byte[] audioBytes)
        {
            var audioStream = new MemoryStream(audioBytes); // not used below
            var pushStream = AudioInputStream.CreatePushStream();
            pushStream.Write(audioBytes); // writes the whole file content as read from disk
            pushStream.Close();           // signals the end of the audio
            var audioConfig = AudioConfig.FromStreamInput(pushStream);
            return audioConfig;
        }

        // Runs a single recognition with pronunciation assessment and returns the accuracy score.
        public static async Task<float> AssessPronunciation(byte[] audioBytes, string referenceText)
        {
            string subscriptionKey = Environment.GetEnvironmentVariable("STT_API_KEY");
            string region = "eastus";
            var pronunciationAssessmentConfig = new PronunciationAssessmentConfig(
                referenceText: referenceText,
                gradingSystem: GradingSystem.HundredMark,
                granularity: Granularity.Phoneme,
                enableMiscue: false);
            var audioConfig = CreateAudioConfigFromBytes(audioBytes);
            var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
            using (var speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig))
            {
                pronunciationAssessmentConfig.ApplyTo(speechRecognizer);
                var speechRecognitionResult = await speechRecognizer.RecognizeOnceAsync();
                if (speechRecognitionResult.Reason == ResultReason.RecognizedSpeech)
                {
                    Console.WriteLine("Recognized: " + speechRecognitionResult.Text);
                    var pronunciationAssessmentResult = PronunciationAssessmentResult.FromResult(speechRecognitionResult);
                    Console.WriteLine($"Accuracy Score: {pronunciationAssessmentResult.AccuracyScore}");
                    return (float)pronunciationAssessmentResult.AccuracyScore;
                }
                else
                {
                    // NoMatch or Canceled: nothing to score
                    Console.WriteLine($"Recognition failed: {speechRecognitionResult.Reason}");
                    return 0;
                }
            }
        }

        static async Task Main(string[] args)
        {
            string audioFilePath = "...wwwroot\\audio\\school.wav";
            string referenceText = "school";
            byte[] audioBytes = File.ReadAllBytes(audioFilePath);
            float accuracyScore = await AssessPronunciation(audioBytes, referenceText);
            Console.WriteLine($"Final Accuracy Score: {accuracyScore}");
        }
    }
}



Answer 1

Saideep Anchuri 9,500 Moderator

Hi tzviya langenthal

Welcome to Microsoft Q&A Forum, thank you for posting your query here!

To get accurate results from pronunciation assessment, first check that the audio configuration is set up correctly. You can try using AudioConfig.FromStreamInput(audioStream) instead of AudioConfig.FromStreamInput(pushStream). It's also important that the reference text matches the spoken audio exactly, as any discrepancy can lead to inaccurate results. Finally, make sure you're using the latest version of the Azure Speech SDK, as updates often include bug fixes and improvements.
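As one way to rule out the stream setup, the sketch below lets the SDK read the WAV file directly with AudioConfig.FromWavFileInput, sets the recognition language explicitly, and prints the raw JSON response so the per-word scores can be inspected. The class name, the method name, and the "en-US" locale are assumptions made for illustration and are not taken from the question; the key and region are supplied by the caller.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.PronunciationAssessment;

// Hypothetical helper, not part of the SDK or of the original question.
public static class WavFileAssessmentSketch
{
    public static async Task<double?> AssessFromWavFileAsync(
        string wavPath, string referenceText, string subscriptionKey, string region)
    {
        var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
        // Assumption: the audio is US English; use the locale that matches the recording.
        speechConfig.SpeechRecognitionLanguage = "en-US";

        var pronunciationConfig = new PronunciationAssessmentConfig(
            referenceText,
            GradingSystem.HundredMark,
            Granularity.Phoneme,
            false);

        // The SDK opens the file itself, so no push stream is involved.
        using (var audioConfig = AudioConfig.FromWavFileInput(wavPath))
        using (var recognizer = new SpeechRecognizer(speechConfig, audioConfig))
        {
            pronunciationConfig.ApplyTo(recognizer);
            var result = await recognizer.RecognizeOnceAsync();

            // Raw service response, useful for checking word-level and phoneme-level scores.
            Console.WriteLine(
                result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult));

            if (result.Reason != ResultReason.RecognizedSpeech)
            {
                Console.WriteLine($"Recognition failed: {result.Reason}");
                return null; // distinguishes "no result" from a genuine score of 0
            }

            return PronunciationAssessmentResult.FromResult(result).AccuracyScore;
        }
    }
}
```

If the score from the file-based path looks reasonable while the push-stream path still returns 0, the difference most likely lies in how the bytes are fed to the stream rather than in the assessment settings.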

Kindly refer to the document below:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-csharp

Hope this helps. Do let us know if you have any further queries.

If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

Thank You.

tzviya langenthal 0 Reputation points

2024-11-09T16:09:38.6533333+00:00

I am using pushStream since it didn't work with audioStream.

The audio was generated from the reference text by the Text-to-Audio tool, so I would expect a perfect match.

I am using version 1.41.1, which looks like it's the latest one.

Saideep Anchuri 9,500 Reputation points Moderator

2024-11-11T06:34:01.3766667+00:00

Hi tzviya langenthal

It's good to hear that you're using the latest version of the Speech SDK. Regarding the reference text, even if it was created by a Text-to-Audio tool, there could still be slight differences between the reference text and the actual speech that is recognized. This could be due to factors such as background noise, accents, or variations in pronunciation.

If you're using a PushAudioInputStream, make sure that you're writing the audio data to the stream in the correct format and that the audio data is complete and not truncated. You can also try adjusting the recognition parameters, such as the language or the recognition mode, to see if that improves the accuracy of the recognition.
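One thing worth checking, though nothing in this thread confirms it is the cause: a push stream created with AudioInputStream.CreatePushStream() and no format argument expects raw 16 kHz, 16-bit, mono PCM, while File.ReadAllBytes returns the whole .wav file, RIFF header included, possibly at a different sample rate. The sketch below uses only the .NET base library to read the header fields and locate the sample data. WavInfo and WavHeader are hypothetical helper names, and the code assumes a little-endian machine and a plain RIFF/WAVE file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper type: the header fields needed to describe PCM audio.
public sealed class WavInfo
{
    public short FormatTag;     // 1 means uncompressed PCM
    public short Channels;
    public int SampleRate;
    public short BitsPerSample;
    public int DataOffset;      // index of the first sample byte
    public int DataLength;      // number of sample bytes
}

public static class WavHeader
{
    // Walks the RIFF chunks and returns the format plus the location of the samples.
    public static WavInfo Parse(byte[] wav)
    {
        if (wav == null || wav.Length < 12
            || Encoding.ASCII.GetString(wav, 0, 4) != "RIFF"
            || Encoding.ASCII.GetString(wav, 8, 4) != "WAVE")
        {
            throw new InvalidDataException("Not a RIFF/WAVE file.");
        }

        var info = new WavInfo();
        bool haveFormat = false;
        int pos = 12; // first chunk starts right after "RIFF", size, "WAVE"

        while (pos + 8 <= wav.Length)
        {
            string id = Encoding.ASCII.GetString(wav, pos, 4);
            int size = BitConverter.ToInt32(wav, pos + 4); // little-endian assumed
            int body = pos + 8;

            // Clamp sizes that are negative or run past the end of the buffer.
            if (size < 0 || size > wav.Length - body) size = wav.Length - body;

            if (id == "fmt " && size >= 16)
            {
                info.FormatTag = BitConverter.ToInt16(wav, body);
                info.Channels = BitConverter.ToInt16(wav, body + 2);
                info.SampleRate = BitConverter.ToInt32(wav, body + 4);
                info.BitsPerSample = BitConverter.ToInt16(wav, body + 14);
                haveFormat = true;
            }
            else if (id == "data" && haveFormat)
            {
                info.DataOffset = body;
                info.DataLength = size;
                return info;
            }

            pos = body + size + (size & 1); // chunks are padded to an even length
        }

        throw new InvalidDataException("No fmt/data chunk pair found.");
    }

    // Copies only the sample bytes, leaving the header and other chunks behind.
    public static byte[] ExtractSamples(byte[] wav, WavInfo info)
    {
        var pcm = new byte[info.DataLength];
        Buffer.BlockCopy(wav, info.DataOffset, pcm, 0, info.DataLength);
        return pcm;
    }
}
```

If the reported values differ from 16000 Hz, 16-bit, mono, the format can be declared with AudioStreamFormat.GetWaveFormatPCM((uint)info.SampleRate, (byte)info.BitsPerSample, (byte)info.Channels) and passed to AudioInputStream.CreatePushStream(format), writing only the bytes returned by ExtractSamples. A FormatTag other than 1 suggests the file is not plain PCM and would need converting first.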

If you're still having issues with the recognition accuracy, you can try using the Speech Studio tool to test your audio and see if it can recognize the speech accurately. This can help you identify any issues with the audio or the recognition settings.

Thank You.
