Is there information regarding how the alignment algorithm in 'PronunciationAssessment' mode works when there are insertions caused by the repetition of whole phrases?

Question

Alex Del Giudice 25

I'm trying to understand the alignment behavior in pronunciation assessment.

When the target text is something like:

He wanted to repeat what he'd heard but couldn't remember it all.

And the voice reading says something like:

He wanted to rep what he'd learned reep what he'd repeat what he'd lear heard but couldn't remember it all.

I'm seeing some inconsistent behavior and am wondering whether it's defaulting to a 'longest correct ordered sequence' or whether it's looking at the pronunciation accuracy to find the correctly ordered sequence to label 'correct' and calling everything else an insertion.

VasaviLankipalle-MSFT 18,676 Reputation points Moderator

2024-02-20T23:09:18.9733333+00:00

Hello @Alex Del Giudice, thanks for using Microsoft Q&A Platform.

Do you have a sample response that illustrates this issue? If possible, please also share the Pronunciation Assessment feature output so we can understand it better.

Alex Del Giudice 25

Let me clarify with an example. Below is a portion of the returned transcription JSON. The reference target was "why did you name me matt?" The reader said "why do you name why did you name matt".

What seems to have happened is that the child said the phrase "why do you name", recognized that they had misread "did", and self-corrected by repeating the whole phrase. But then they omitted "me".

The problem is that the omission of "me" (as you can see in the JSON below) appears BEFORE the corrected phrase, so it looks like they omitted "me" before they correctly pronounced "why did you name". It seems more intuitive to me to keep the order of the reference text intact in cases like this one, so that the "me" omission element would appear after the second "name" (the one tagged correct/mispronounced) rather than after the first, "inserted" "name". We've encountered a few similar cases.

                {
                    "Confidence": 0.0,
                    "Duration": 3300000,
                    "Offset": 525600000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 80.0,
                        "ErrorType": "Insertion"
                    },
                    "Word": "why"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 2400000,
                    "Offset": 529000000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 5.0,
                        "ErrorType": "Insertion"
                    },
                    "Word": "did"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 2200000,
                    "Offset": 531500000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 45.0,
                        "ErrorType": "Insertion"
                    },
                    "Word": "you"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 5500000,
                    "Offset": 533800000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 66.0,
                        "ErrorType": "Insertion"
                    },
                    "Word": "name"
                },
                {
                    "Confidence": 0.0,
                    "PronunciationAssessment": {
                        "ErrorType": "Omission"
                    },
                    "Word": "me"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 2700000,
                    "Offset": 543800000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 80.0,
                        "ErrorType": "None"
                    },
                    "Word": "why"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 3100000,
                    "Offset": 546600000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 50.0,
                        "ErrorType": "Mispronunciation"
                    },
                    "Word": "did"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 2000000,
                    "Offset": 549800000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 92.0,
                        "ErrorType": "None"
                    },
                    "Word": "you"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 4400000,
                    "Offset": 551900000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 49.0,
                        "ErrorType": "Mispronunciation"
                    },
                    "Word": "name"
                },
                {
                    "Confidence": 0.0,
                    "Duration": 3700000,
                    "Offset": 556400000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 46.0,
                        "ErrorType": "Mispronunciation"
                    },
                    "Word": "matt"
                },
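
To make the ordering easier to see, here is a minimal Python sketch that flattens word entries like the ones above into (word, error type, start time) rows. The helper name `summarize_words` is hypothetical, and the conversion assumes `Offset` is expressed in 100-nanosecond ticks. Omission entries carry no `Offset` or `Duration` in the output above, so their position in the list is the only ordering information they have.

```python
# Assumption: Offset and Duration are in 100-nanosecond ticks.
TICKS_PER_SECOND = 10_000_000


def summarize_words(words):
    """Return (word, error_type, start_seconds) rows in list order.

    start_seconds is None for entries without an Offset (omissions).
    """
    rows = []
    for entry in words:
        error_type = entry["PronunciationAssessment"]["ErrorType"]
        offset = entry.get("Offset")  # missing on omission entries
        start = None if offset is None else offset / TICKS_PER_SECOND
        rows.append((entry["Word"], error_type, start))
    return rows


# Three entries copied from the JSON above (first "name", "me", second "why").
sample = [
    {"Word": "name", "Offset": 533800000, "Duration": 5500000,
     "PronunciationAssessment": {"AccuracyScore": 66.0, "ErrorType": "Insertion"}},
    {"Word": "me",
     "PronunciationAssessment": {"ErrorType": "Omission"}},
    {"Word": "why", "Offset": 543800000, "Duration": 2700000,
     "PronunciationAssessment": {"AccuracyScore": 80.0, "ErrorType": "None"}},
]

for word, error_type, start in summarize_words(sample):
    when = "no timestamp" if start is None else f"{start:.2f}s"
    print(f"{word:>5}  {error_type:<10} {when}")
```

Printed this way, the "me" omission sits between the inserted "name" and the matched "why", which is the ordering I'm asking about.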

1 answer

Answer 1

Hello @Alex Del Giudice ,

In pronunciation assessment, we usually have a "reference/target text". After recognition, an alignment algorithm (edit distance, along with other steps) is applied to compute the insertion and deletion errors.

Only the words that the sequence-matching algorithm tags as matched are then assigned a "Mispronunciation" or "None" tag.
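
As an illustration of the general idea only, and not the service's actual implementation, here is a minimal Python sketch of a word-level edit-distance alignment that allows just insertions and omissions. The function name `align_words` and the tag strings are hypothetical. It compares words by exact text, so it cannot tell which of two identical repeated phrases was pronounced better; a real system may use acoustic scores and other steps to decide.

```python
def align_words(reference, recognized):
    """Align two word lists with insert/omit edits only.

    Returns (word, tag) pairs where tag is "Matched", "Insertion"
    (word spoken but not in the reference) or "Omission"
    (reference word not spoken).
    """
    n, m = len(reference), len(recognized)
    # cost[i][j] = fewest edits between reference[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + 1
            if reference[i - 1] == recognized[j - 1]:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = best

    # Walk back from the end. The order of these checks is the
    # tie-breaking rule when several alignments have the same cost.
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and reference[i - 1] == recognized[j - 1]
                and cost[i][j] == cost[i - 1][j - 1]):
            aligned.append((recognized[j - 1], "Matched"))
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            aligned.append((recognized[j - 1], "Insertion"))
            j -= 1
        else:
            aligned.append((reference[i - 1], "Omission"))
            i -= 1
    aligned.reverse()
    return aligned


reference = "why did you name me matt".split()
recognized = "why did you name why did you name matt".split()
for word, tag in align_words(reference, recognized):
    print(f"{word:>5}  {tag}")
```

For the example in the question, any minimum-cost alignment of this kind has five matched words, four insertions and one omission ("me"). Which copy of "why did you name" counts as matched, and where the omission lands relative to the insertions, depends only on the tie-breaking in the backtrace, which is one possible source of the ordering described in the question.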

I hope this helps. We appreciate your time and patience throughout this issue. Regards,

Vasavi

Please accept the answer and vote 'yes' if you found it helpful, to support the community. Thanks.
