Diarization is not picking up the number of speakers correctly when generated from speech to text

Mayur Patel 0 Reputation points
2025-03-21T09:54:26.18+00:00

Hello,

We are using the speech to text service to generate transcripts for our videos. There are two issues we are facing with this:

  1. The transcript itself is coming through fine, and we are using the diarization feature to get the number of speakers in the call - "properties": {"diarizationEnabled": True}. When there are 2 people in the call it gets it right, giving speaker1 and speaker2, but when there are 3 or more people in the call it still reports only 2 speakers.
  2. After the transcript is generated, the format given by Azure is something like this (we have modified it and added some extra parameters): [{"end": "2025-03-21 04:10:15", "text": "Dummy Text", "name": "Speaker_1", "duration": 64000000.0, "Recognized": "true", "Language": "en-IN"}, {"end": "2025-03-21 04:10:22", "text": "Dummy Text.", "name": "Speaker_1", "duration": 41600000.0, "Recognized": "true", "Language": "en-IN"}, {"end": "2025-03-21 04:10:26", "text": "Dummy Text", "name": "Speaker_1", "duration": 35600000.0, "Recognized": "true", "Language": "en-IN"}]. The problem is with the end time and duration: duration is how long Speaker_1 was speaking, and end is the timestamp at which Speaker_1 started speaking, so if we add the duration to that time it gives us the timestamp for the next sentence. However, our video is 34 minutes and 38 seconds long, while the generated transcript runs to 36 minutes and 2 seconds. Our assumption is that the duration values are not generated correctly, so the error adds up throughout the whole transcript and gives an overall delay of 1-2 minutes.
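For reference, this is roughly how we rebuild the timeline today (a rough sketch of our own post-processing; we are assuming the duration values are 100-nanosecond ticks, so 64000000.0 would be 6.4 seconds, and the field names are from our modified format, not Azure's raw output):

    from datetime import datetime, timedelta

    TICKS_PER_SECOND = 10_000_000  # assumption: duration values are 100-nanosecond ticks

    def rebuild_timeline(segments):
        """Derive each segment's end by adding its duration to its start
        (the "end" field in our modified format is actually the start timestamp)."""
        timeline = []
        for seg in segments:
            start = datetime.strptime(seg["end"], "%Y-%m-%d %H:%M:%S")
            length = timedelta(seconds=seg["duration"] / TICKS_PER_SECOND)
            timeline.append({"speaker": seg["name"], "start": start, "end": start + length})
        return timeline

    segments = [
        {"end": "2025-03-21 04:10:15", "text": "Dummy Text", "name": "Speaker_1",
         "duration": 64000000.0, "Recognized": "true", "Language": "en-IN"},
    ]
    print(rebuild_timeline(segments))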

Thanks,


1 answer

  1. Amira Bedhiafi 33,631 Reputation points Volunteer Moderator
    2025-03-21T11:39:09.3433333+00:00

    Hi Mayur,

    Thanks for sharing your issue on Microsoft Learn.

    I think you're running into two main problems:

    Issue 1: Diarization not detecting more than 2 speakers

    Azure real-time and batch transcription APIs have known limitations with diarization, especially in correctly separating more than 2 speakers. By default, the service is not guaranteed to reliably distinguish more than 2 speakers.

    So you may need to explicitly specify the expected number of speakers through the diarization properties (for example, diarization.speakers.maxCount in the v3.1 batch transcription API, if available in the API version you use).

    Keep in mind that this feature is available only in some locales and with specific models, so please try to check Azure Speech API diarization documentation for details.
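
    For example, with the v3.1 batch transcription REST API the expected speaker range can be passed under diarization.speakers. This is only a minimal sketch: the key, region and content URL below are placeholders, and you should verify the property names against the API version you are on.

        import requests

        SPEECH_KEY = "<your-speech-key>"    # placeholder
        SPEECH_REGION = "<your-region>"     # placeholder, e.g. "centralindia"

        endpoint = f"https://{SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"

        body = {
            "displayName": "Call transcription with diarization",
            "locale": "en-IN",
            "contentUrls": ["https://example.com/call-recording.wav"],  # placeholder URL
            "properties": {
                "diarizationEnabled": True,
                # Hint the expected speaker range so more than 2 speakers can be separated.
                "diarization": {"speakers": {"minCount": 1, "maxCount": 5}},
                "wordLevelTimestampsEnabled": True,
            },
        }

        response = requests.post(
            endpoint,
            headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY, "Content-Type": "application/json"},
            json=body,
        )
        response.raise_for_status()
        print(response.json()["self"])  # URL of the created transcription job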

    I also recommend that you consider using the Conversation Transcription API (designed for meetings and group calls):

    https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=windows&pivots=programming-language-csharp
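
    If you go the SDK route, here is a minimal Python sketch along the lines of that quickstart (the key, region and audio file name are placeholders, so adapt them to your setup):

        import time
        import azure.cognitiveservices.speech as speechsdk

        speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
        speech_config.speech_recognition_language = "en-IN"
        audio_config = speechsdk.audio.AudioConfig(filename="call-recording.wav")  # placeholder file

        transcriber = speechsdk.transcription.ConversationTranscriber(
            speech_config=speech_config, audio_config=audio_config
        )

        done = False

        def on_transcribed(evt):
            # Each result carries the recognized text plus a speaker id such as "Guest-1".
            print(f"{evt.result.speaker_id}: {evt.result.text}")

        def on_stopped(evt):
            global done
            done = True

        transcriber.transcribed.connect(on_transcribed)
        transcriber.session_stopped.connect(on_stopped)
        transcriber.canceled.connect(on_stopped)

        transcriber.start_transcribing_async().get()
        while not done:
            time.sleep(0.5)
        transcriber.stop_transcribing_async().get()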

    Issue 2: Timestamp and duration drift

    Azure returns timestamps and durations based on audio segmentation, which may include silence or speaker pauses; if you add those durations up one after another, the gaps lead to cumulative drift.

    These durations are not always tightly synchronized with the actual video/audio timeline.

    Instead of summing durations, use the actual offset of each segment from Azure's NBest / word-level timestamp output (depending on the format you consume), and compute each segment's end as its own offset plus its duration.

    You may want to align speaker segments based on timestamps from the original video/audio if available.
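
    As a rough sketch, assuming you read the batch transcription result file, each entry in recognizedPhrases carries offsetInTicks and durationInTicks (100-nanosecond units), so absolute times can be derived per phrase rather than accumulated:

        from datetime import datetime, timedelta

        TICKS_PER_SECOND = 10_000_000  # offsets and durations are 100-nanosecond ticks

        def phrase_times(recognized_phrases, audio_start: datetime):
            """Derive each phrase's absolute start/end from its own offset
            instead of summing durations, which accumulates drift."""
            rows = []
            for phrase in recognized_phrases:
                start = audio_start + timedelta(seconds=phrase["offsetInTicks"] / TICKS_PER_SECOND)
                end = start + timedelta(seconds=phrase["durationInTicks"] / TICKS_PER_SECOND)
                rows.append({
                    "speaker": phrase.get("speaker"),
                    "start": start,
                    "end": end,
                    "text": phrase["nBest"][0]["display"] if phrase.get("nBest") else "",
                })
            return rows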

    If you're not doing so already, I recommend enabling wordLevelTimestampsEnabled: true; it will help you reconstruct the true timing at a finer granularity.

    "wordLevelTimestampsEnabled": true
    
