Diarization is not picking up the number of speakers correctly when generated from speech to text

Mayur Patel 0 Reputation points
2025-03-21T09:54:26.18+00:00

Hello,

We are using the speech to text service to generate transcripts for our videos. There are two issues we are facing with this:

  1. The transcript itself is coming through fine, and we are using the diarization feature to get the number of speakers in the call - "properties": {"diarizationEnabled": True}. When there are 2 people in the call it gets it right, giving speaker1 and speaker2, but when there are 3 or more people in the call it still reports only 2 speakers.
  2. After the transcript is generated, the format given by Azure is something like this (we have modified it and added some extra parameters): [{"end": "2025-03-21 04:10:15", "text": "Dummy Text", "name": "Speaker_1", "duration": 64000000.0, "Recognized": "true", "Language": "en-IN"}, {"end": "2025-03-21 04:10:22", "text": "Dummy Text.", "name": "Speaker_1", "duration": 41600000.0, "Recognized": "true", "Language": "en-IN"}, {"end": "2025-03-21 04:10:26", "text": "Dummy Text", "name": "Speaker_1", "duration": 35600000.0, "Recognized": "true", "Language": "en-IN"}]. The problem is with the end time and duration: duration is how long Speaker_1 was speaking, and end is the timestamp at which Speaker_1 started speaking, so if we add the duration to that time it gives us the timestamp for the next sentence. However, our video is 34 minutes and 38 seconds long, while the generated transcript runs to 36 minutes and 2 seconds. Our assumption is that the duration values are not generated correctly, so the error adds up throughout the whole transcript and gives an overall delay of 1-2 minutes.
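For reference, this is roughly how we rebuild the timeline today (a rough sketch of our own post-processing; we are assuming the duration values are 100-nanosecond ticks, so 64000000.0 would be 6.4 seconds, and the field names are from our modified format, not Azure's raw output):

    from datetime import datetime, timedelta

    TICKS_PER_SECOND = 10_000_000  # assumption: duration values are 100-nanosecond ticks

    def rebuild_timeline(segments):
        """Derive each segment's end by adding its duration to its start
        (the "end" field in our modified format is actually the start timestamp)."""
        timeline = []
        for seg in segments:
            start = datetime.strptime(seg["end"], "%Y-%m-%d %H:%M:%S")
            length = timedelta(seconds=seg["duration"] / TICKS_PER_SECOND)
            timeline.append({"speaker": seg["name"], "start": start, "end": start + length})
        return timeline

    segments = [
        {"end": "2025-03-21 04:10:15", "text": "Dummy Text", "name": "Speaker_1",
         "duration": 64000000.0, "Recognized": "true", "Language": "en-IN"},
    ]
    print(rebuild_timeline(segments))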

Thanks,


1 answer

  1. Amira Bedhiafi 33,631 Reputation points Volunteer Moderator
    2025-03-21T11:39:09.3433333+00:00

    Hi Mayur,

    Thanks for sharing your issue on Microsoft Learn.

    I think you're running into two main problems:

    Issue 1: Diarization not detecting more than 2 speakers

    Azure real-time and batch transcription APIs have known limitations with diarization, especially in correctly separating more than 2 speakers. By default, the service is not guaranteed to reliably distinguish more than 2 speakers.

    So you may need to explicitly specify the expected number of speakers through the diarization properties (for example, diarization.speakers.maxCount in the v3.1 batch transcription API, if available in the API version you use).

    Keep in mind that this feature is available only in some locales and with specific models, so please try to check Azure Speech API diarization documentation for details.
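
    For example, with the v3.1 batch transcription REST API the expected speaker range can be passed under diarization.speakers. This is only a minimal sketch: the key, region and content URL below are placeholders, and you should verify the property names against the API version you are on.

        import requests

        SPEECH_KEY = "<your-speech-key>"    # placeholder
        SPEECH_REGION = "<your-region>"     # placeholder, e.g. "centralindia"

        endpoint = f"https://{SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"

        body = {
            "displayName": "Call transcription with diarization",
            "locale": "en-IN",
            "contentUrls": ["https://example.com/call-recording.wav"],  # placeholder URL
            "properties": {
                "diarizationEnabled": True,
                # Hint the expected speaker range so more than 2 speakers can be separated.
                "diarization": {"speakers": {"minCount": 1, "maxCount": 5}},
                "wordLevelTimestampsEnabled": True,
            },
        }

        response = requests.post(
            endpoint,
            headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY, "Content-Type": "application/json"},
            json=body,
        )
        response.raise_for_status()
        print(response.json()["self"])  # URL of the created transcription job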

    I also recommend that you consider using the Conversation Transcription API (designed for meetings and group calls):

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=windows&pivots=programming-language-csharp
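
    If you go the SDK route, here is a minimal Python sketch along the lines of that quickstart (the key, region and audio file name are placeholders, so adapt them to your setup):

        import time
        import azure.cognitiveservices.speech as speechsdk

        speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
        speech_config.speech_recognition_language = "en-IN"
        audio_config = speechsdk.audio.AudioConfig(filename="call-recording.wav")  # placeholder file

        transcriber = speechsdk.transcription.ConversationTranscriber(
            speech_config=speech_config, audio_config=audio_config
        )

        done = False

        def on_transcribed(evt):
            # Each result carries the recognized text plus a speaker id such as "Guest-1".
            print(f"{evt.result.speaker_id}: {evt.result.text}")

        def on_stopped(evt):
            global done
            done = True

        transcriber.transcribed.connect(on_transcribed)
        transcriber.session_stopped.connect(on_stopped)
        transcriber.canceled.connect(on_stopped)

        transcriber.start_transcribing_async().get()
        while not done:
            time.sleep(0.5)
        transcriber.stop_transcribing_async().get()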

    Issue 2: Timestamp and duration drift

    Azure returns timestamps and durations based on audio segmentation, which may include silence or speaker pauses; if you add those durations up one after another, the gaps lead to cumulative drift.

    These durations are not always tightly synchronized with the actual video/audio timeline.

    Instead of summing durations, use the actual offset of each segment from Azure's NBest / word-level timestamp output (depending on the format you consume), and compute each segment's end as its own offset plus its duration.

    You may want to align speaker segments based on timestamps from the original video/audio if available.
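
    As a rough sketch, assuming you read the batch transcription result file, each entry in recognizedPhrases carries offsetInTicks and durationInTicks (100-nanosecond units), so absolute times can be derived per phrase rather than accumulated:

        from datetime import datetime, timedelta

        TICKS_PER_SECOND = 10_000_000  # offsets and durations are 100-nanosecond ticks

        def phrase_times(recognized_phrases, audio_start: datetime):
            """Derive each phrase's absolute start/end from its own offset
            instead of summing durations, which accumulates drift."""
            rows = []
            for phrase in recognized_phrases:
                start = audio_start + timedelta(seconds=phrase["offsetInTicks"] / TICKS_PER_SECOND)
                end = start + timedelta(seconds=phrase["durationInTicks"] / TICKS_PER_SECOND)
                rows.append({
                    "speaker": phrase.get("speaker"),
                    "start": start,
                    "end": end,
                    "text": phrase["nBest"][0]["display"] if phrase.get("nBest") else "",
                })
            return rows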

    If you're not doing so already, I recommend enabling wordLevelTimestampsEnabled: true; it will help you reconstruct the true timing at a finer granularity.

    "wordLevelTimestampsEnabled": true
    
