Azure Speech to text service does not work for some audio files.

Jigar Shah 0

Azure Speech to text service does not separate the speakers in a multi-speaker conversation (tested with two speakers).

It mixes statements from both speakers into the same sentence.

It then duplicates each phrase under both speakers, as shown below:

Phrase: Hello, Amy, I'm doctor Jones. How are you doing today?

Speaker: 0

Sentiment: neutral

Phrase: Hello, Amy, I'm doctor Jones. How are you doing today?

Speaker: 1

Sentiment: neutral

Phrase: I'm OK, but it hurts when I go to the bathroom when I pee.

Speaker: 0

Sentiment: negative

Phrase: I'm OK, but it hurts when I go to the bathroom but I pee.

Speaker: 1

Sentiment: negative

Phrase: That's called this year and it's pretty common. When did this start? Two days ago.

Speaker: 0

Sentiment: neutral

Phrase: That's called this year and it's pretty common. When did this start? Two days ago.

Speaker: 1

Sentiment: neutral

Phrase: Have you had this before? Yes. I had this several years ago, before you were my doctor.

Speaker: 0

Sentiment: neutral

Phrase: Have you had this before? Yes. I had this several years ago, before you were my doctor.

Speaker: 1

Sentiment: neutral

...

So it joins the statements from both speakers together instead of recognizing them separately.
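To quantify the duplication described above, a minimal Python sketch can list every phrase that is attributed to more than one speaker. The `phrases` list is a hand-copied subset of the output above, and the function name `phrases_on_every_speaker` is a hypothetical helper, not part of any Azure SDK or of the sample's output format.

```python
from collections import defaultdict

# Hand-copied (phrase, speaker) pairs from the output above (assumed in-memory form).
phrases = [
    ("Hello, Amy, I'm doctor Jones. How are you doing today?", 0),
    ("Hello, Amy, I'm doctor Jones. How are you doing today?", 1),
    ("That's called this year and it's pretty common. When did this start? Two days ago.", 0),
    ("That's called this year and it's pretty common. When did this start? Two days ago.", 1),
]


def phrases_on_every_speaker(items):
    """Return the normalized phrases whose text appears under more than one speaker."""
    speakers_by_text = defaultdict(set)
    for text, speaker in items:
        # Normalize case and surrounding whitespace so trivial differences do not hide a match.
        speakers_by_text[text.strip().lower()].add(speaker)
    return [text for text, speakers in speakers_by_text.items() if len(speakers) > 1]


duplicated = phrases_on_every_speaker(phrases)
print(f"{len(duplicated)} phrase(s) appear under more than one speaker")
```

This only catches exact matches after normalization. The "when I pee" / "but I pee" pair above differs by one word, so it would need a fuzzy comparison to be counted.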

I used the following command-line arguments:

--languageKey <<LanguageKey>> --languageEndpoint <<languageEndPoint>> --speechKey <<Speechkey>> --speechRegion eastus --input <<Audio File path>> --stereo  --output summary.json

romungi-MSFT 43,696 Reputation points Microsoft Employee

2023-03-17T14:46:53.0933333+00:00

Jigar Shah, it would help if you could attach the audio files you are having trouble with. Since you see this only with some files, it could be an issue with the audio format. It would also help to open an issue on the SDK page, since you are using the CLI.

Do you also see the same result using Azure speech studio?
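One way to check the audio format locally is to inspect the WAV header and compare the two channels. As far as I understand the sample's documentation, `--stereo` assumes one speaker per channel, so a file whose two channels carry the same mixed audio would plausibly produce the same text for both speakers; that is an assumption, not a confirmed diagnosis. The sketch below uses only the Python standard library, and `describe_wav` is a hypothetical helper name. The channel comparison is limited to 16-bit PCM stereo files.

```python
import array
import wave


def describe_wav(path):
    """Report basic format facts and whether the two stereo channels are identical."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        info = {
            "channels": channels,
            "sample_width_bytes": sample_width,
            "frame_rate_hz": wav.getframerate(),
            "identical_channels": None,  # stays None unless the file is 16-bit stereo
        }
        if channels == 2 and sample_width == 2:
            samples = array.array("h")
            samples.frombytes(wav.readframes(wav.getnframes()))
            # Interleaved PCM: even indexes belong to channel 0, odd indexes to channel 1.
            info["identical_channels"] = samples[0::2] == samples[1::2]
    return info
```

Calling `describe_wav` on a local copy of the audio file returns a dictionary. If `channels` is 1, or `identical_channels` is `True`, the file does not carry one speaker per channel.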

Jigar Shah 0

I am not able to upload the WAV file, but you can download it from this URL:

https://github.com/jkshah7/AzureServiceDataFiles/blob/main/doctorpatientconversation1.wav?raw=true

It transcribes some phrases by combining the statements of the two speakers. For example, person 1 said: "That's called this year and it's pretty common. When did this start"

Person 2 replied: "two days ago"

The service returned both as a single phrase: "that's called this year and it's pretty common when did this start two days ago"

{
      "recognitionStatus": "Success",
      "channel": 0,
      "offset": "PT10.13S",
      "duration": "PT5.22S",
      "offsetInTicks": 101300000.0,
      "durationInTicks": 52200000.0,
      "nBest": [
        {
          "confidence": 0.8262271,
          "lexical": "that's called this year and it's pretty common when did this start two days ago",
          "itn": "that's called this year and it's pretty common when did this start two days ago",
          "maskedITN": "that's called this year and it's pretty common when did this start two days ago",
          "display": "That's called this year and it's pretty common. When did this start? Two days ago."
        },
        {
          "confidence": 0.77173805,
          "lexical": "that's called dysuria and it's pretty common when did this start two days ago",
          "itn": "that's called dysuria and it's pretty common when did this start two days ago",
          "maskedITN": "that's called dysuria and it's pretty common when did this start two days ago",
          "display": "that's called dysuria and it's pretty common when did this start two days ago"
        },
        {
          "confidence": 0.81431377,
          "lexical": "that's called this area and it's pretty common when did this start two days ago",
          "itn": "that's called this area and it's pretty common when did this start two days ago",
          "maskedITN": "that's called this area and it's pretty common when did this start two days ago",
          "display": "that's called this area and it's pretty common when did this start two days ago"
        },
        {
          "confidence": 0.81597435,
          "lexical": "that's called this year and it's pretty common when did this start up two days ago",
          "itn": "that's called this year and it's pretty common when did this start up two days ago",
          "maskedITN": "that's called this year and it's pretty common when did this start up two days ago",
          "display": "that's called this year and it's pretty common when did this start up two days ago"
        },
        {
          "confidence": 0.764943,
          "lexical": "that's called dysuria and it's pretty common when did this start up two days ago",
          "itn": "that's called dysuria and it's pretty common when did this start up two days ago",
          "maskedITN": "that's called dysuria and it's pretty common when did this start up two days ago",
          "display": "that's called dysuria and it's pretty common when did this start up two days ago"
        }
      ]
    },
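The `nBest` list above also shows that "dysuria", presumably the word the doctor actually said, was recognized as an alternative but scored below "this year". A minimal Python sketch for pulling such alternatives out of a phrase object follows. The JSON is a trimmed copy of the excerpt above (only `confidence` and `lexical` are kept), and `alternatives_containing` is a hypothetical helper name.

```python
import json

# Trimmed copy of the phrase above: only "confidence" and "lexical" are kept.
phrase = json.loads("""
{
  "recognitionStatus": "Success",
  "channel": 0,
  "nBest": [
    {"confidence": 0.8262271,
     "lexical": "that's called this year and it's pretty common when did this start two days ago"},
    {"confidence": 0.77173805,
     "lexical": "that's called dysuria and it's pretty common when did this start two days ago"},
    {"confidence": 0.81431377,
     "lexical": "that's called this area and it's pretty common when did this start two days ago"},
    {"confidence": 0.81597435,
     "lexical": "that's called this year and it's pretty common when did this start up two days ago"},
    {"confidence": 0.764943,
     "lexical": "that's called dysuria and it's pretty common when did this start up two days ago"}
  ]
}
""")


def alternatives_containing(phrase_obj, term):
    """Return (confidence, lexical) pairs whose text contains term, best first."""
    matches = [
        (candidate["confidence"], candidate["lexical"])
        for candidate in phrase_obj["nBest"]
        if term in candidate["lexical"]
    ]
    return sorted(matches, reverse=True)


for confidence, text in alternatives_containing(phrase, "dysuria"):
    print(f"{confidence:.3f}  {text}")
```

Note that every alternative, including the "dysuria" ones, still ends with "two days ago", so the merging of the two speakers is present in all candidates and not only in the top one.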

I also checked in Speech Studio. It gives the same result.

Jigar Shah 0 Reputation points

2023-03-20T13:41:24.3733333+00:00

DoctorPatientConv.txt

This is the transcription generated for this audio file.
You can see that the same statements are shown for channel 0 as well as channel 1. @romungi-MSFT
