Hi Mayur,
Thanks for sharing your issue on Microsoft Learn.
I think you’re running into two main problems:
Issue 1: Diarization not detecting more than 2 speakers
Azure real-time and batch transcription APIs have known limitations with diarization, especially when it comes to correctly separating more than 2 speakers. By default, speaker-count estimation is not guaranteed to reliably go beyond 2.
So you may need to explicitly specify the expected number of speakers via the diarization settings, for example the minimum/maximum speaker count properties of the batch transcription request (if available in the API version you use).
Keep in mind that this feature is available only in some locales and with specific models, so please check the Azure Speech API diarization documentation for details.
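As a rough sketch of what that request could look like, assuming the Speech-to-text batch transcription v3.1 REST endpoint and its diarization.speakers.minCount/maxCount properties (the key, region, and audio URL below are placeholders, so please verify the exact schema against the docs for your API version):

```python
import requests

# Placeholders: replace with your own Speech resource key, region, and audio URL.
SPEECH_KEY = "<your-speech-key>"
REGION = "<your-region>"

# Batch transcription request with diarization enabled and an explicit
# speaker-count range (property names follow the v3.1 schema; please verify
# them against the API version you are actually using).
body = {
    "displayName": "multi-speaker-transcription",
    "locale": "en-US",
    "contentUrls": ["https://example.com/audio/meeting.wav"],
    "properties": {
        "diarizationEnabled": True,
        "diarization": {"speakers": {"minCount": 1, "maxCount": 6}},
    },
}

resp = requests.post(
    f"https://{REGION}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions",
    headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY},
    json=body,
)
resp.raise_for_status()
print("Created transcription job:", resp.json()["self"])
```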
I also recommend that you consider using the Conversation Transcription API (for meetings and group calls).
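If you try that route, a minimal sketch with the Speech SDK for Python could look like this (key, region, and file name are placeholders, and error handling is trimmed for brevity):

```python
import time
import azure.cognitiveservices.speech as speechsdk

# Placeholders: replace with your own Speech resource key, region, and audio file.
speech_config = speechsdk.SpeechConfig(subscription="<your-speech-key>", region="<your-region>")
speech_config.speech_recognition_language = "en-US"
audio_config = speechsdk.audio.AudioConfig(filename="meeting.wav")

transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config
)

done = False

def on_transcribed(evt):
    # Each final result carries a diarized speaker id alongside the text.
    print(f"{evt.result.speaker_id}: {evt.result.text}")

def on_stopped(evt):
    global done
    done = True

transcriber.transcribed.connect(on_transcribed)
transcriber.session_stopped.connect(on_stopped)
transcriber.canceled.connect(on_stopped)

transcriber.start_transcribing_async().get()
while not done:
    time.sleep(0.5)
transcriber.stop_transcribing_async().get()
```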
Issue 2: Timestamp and duration drift
Azure returns timestamps and durations based on audio segmentation, which may include silence or speaker pauses, so summing segment durations leads to cumulative drift.
These durations are not always tightly synchronized with the actual video/audio timeline.
Instead of summing duration values, use the actual offset of each segment and its end time (offset + duration) from Azure's NBest or word-level timestamp output (depending on the format).
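For example, assuming the batch transcription output JSON, where each recognizedPhrases entry carries offsetInTicks and durationInTicks (100-nanosecond ticks; adjust the field names to your output format), you could compute absolute start/end times like this instead of keeping a running sum:

```python
import json

TICKS_PER_SECOND = 10_000_000  # offsets/durations are reported in 100-nanosecond ticks

def phrase_spans(path):
    """Yield (speaker, start_sec, end_sec, text) using absolute offsets,
    not a running sum of durations."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for phrase in data.get("recognizedPhrases", []):
        start = phrase["offsetInTicks"] / TICKS_PER_SECOND
        end = start + phrase["durationInTicks"] / TICKS_PER_SECOND
        text = phrase["nBest"][0]["display"] if phrase.get("nBest") else ""
        yield phrase.get("speaker"), start, end, text

for speaker, start, end, text in phrase_spans("transcription.json"):
    print(f"[{start:8.2f}s - {end:8.2f}s] speaker {speaker}: {text}")
```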
You may want to align speaker segments based on timestamps from the original video/audio if available.
If you're not already, I recommend that you enable word-level timestamps in the transcription properties:
"wordLevelTimestampsEnabled": true
It will help you reconstruct the true timing more granularly.