Thank you for using the Microsoft Q&A forum.
You can use the speaker ID field to distinguish between the different speakers in the conversation. The speaker ID is a generic identifier that the service assigns to each conversation participant during recognition, as it identifies the different speakers in the provided audio. This speaker information is returned in the speaker ID field of the result.
The service performs best with at least 7 seconds of continuous audio from a single speaker. This allows the system to differentiate the speakers properly. Otherwise, the Speaker ID is returned as Unknown.
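As a rough illustration of how you might use the speaker ID once you have the results, here is a minimal Python sketch that groups transcribed text by speaker and keeps the Unknown results separate. It does not call the Speech service. The sample data, the `Guest-1`/`Guest-2` style IDs, and the `group_by_speaker` helper are all hypothetical, and I am assuming each result gives you a speaker ID and a text value (in the Python Speech SDK these should be available on the transcription result as `speaker_id` and `text`, but please verify against your SDK version).

```python
from collections import defaultdict

# Value returned when the service cannot tell the speaker apart (per the answer above).
UNKNOWN_SPEAKER = "Unknown"


def group_by_speaker(results):
    """Group (speaker_id, text) pairs by speaker, keeping utterance order."""
    by_speaker = defaultdict(list)
    for speaker_id, text in results:
        # Assumption: treat an empty or missing ID the same as "Unknown".
        key = speaker_id or UNKNOWN_SPEAKER
        by_speaker[key].append(text)
    return dict(by_speaker)


# Hypothetical sample results; real IDs and text come from the service.
sample = [
    ("Guest-1", "Good morning, everyone."),
    ("Guest-2", "Morning. Shall we start?"),
    ("Unknown", "Yes."),
    ("Guest-1", "Let's begin with the agenda."),
]

grouped = group_by_speaker(sample)
for speaker, lines in grouped.items():
    print(f"{speaker}: {' '.join(lines)}")
```

Short utterances such as the "Yes." above are the kind of audio that tends to fall under the 7-second guidance, so it is worth deciding up front how your application should handle text attributed to Unknown.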
The above is taken from the official documentation on conversation transcription, which supports multi-speaker diarization: it determines who said what by synthesizing the audio stream with each speaker identifier. Although conversation transcription doesn't put a limit on the number of speakers in the room, it's optimized for 2-10 speakers per session.
I hope this helps. Thank you.
If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".