Has Diarization in the Speech SDK been implemented for overlapping audio of multiple speakers speaking simultaneously?

Shyamal Goel 0 Reputation points
2024-10-01T09:16:56.08+00:00

To the Microsoft Support Team,

We have been using the ConversationTranscriber of the Azure Speech SDK to implement diarization in our project, and have encountered an issue with which we need your assistance.

In our project, the Transcriber works well when 2 or more speakers speak separately, i.e., their audio does not overlap. In this scenario, the separate speakers and their spoken audio are recognized correctly.
But when 2 or more speakers speak simultaneously, i.e., their audio overlaps, the Transcriber does not identify the speakers separately. Instead, it merges their speech and attributes it to a single speaker. Sometimes it detects only parts of the different utterances, returning erroneous results.

Our project setup is as follows (a simplified code sketch follows the list):

  1. We have a GStreamer C++ project, in which we are implementing the Azure Speech SDK.
  2. The project receives an OPUS audio stream containing audio of speakers speaking in real time.
  3. The OPUS audio stream is decoded into a raw audio stream (format: S16LE, rate: 16000 Hz, channels: mono).
  4. Samples from this raw audio stream are pushed to a push stream whenever they become available. The push stream is configured as the Transcriber's audio input.
  5. The asynchronous transcription process runs in the background and transcribes audio from the push stream.
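
For reference, below is a simplified sketch of how we wire the push stream to the transcriber, roughly following the ConversationTranscriptionWithPushAudioStream() sample linked below. The key, region, and language are placeholders, and the GStreamer decoding and callback plumbing are omitted:

    #include <iostream>
    #include <speechapi_cxx.h>

    using namespace Microsoft::CognitiveServices::Speech;
    using namespace Microsoft::CognitiveServices::Speech::Audio;
    using namespace Microsoft::CognitiveServices::Speech::Transcription;

    int main()
    {
        // Placeholders: subscription key, region, and recognition language.
        auto speechConfig = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        speechConfig->SetSpeechRecognitionLanguage("en-US");

        // Push stream matching the raw audio produced by the GStreamer pipeline:
        // PCM, 16 kHz, 16 bits per sample, mono (S16LE).
        auto pushStream = AudioInputStream::CreatePushStream(AudioStreamFormat::GetWaveFormatPCM(16000, 16, 1));
        auto audioConfig = AudioConfig::FromStreamInput(pushStream);

        auto transcriber = ConversationTranscriber::FromConfig(speechConfig, audioConfig);

        // Final results carry the speaker attribution we rely on.
        transcriber->Transcribed.Connect([](const ConversationTranscriptionEventArgs& e)
        {
            if (e.Result->Reason == ResultReason::RecognizedSpeech)
            {
                std::cout << "Speaker: " << e.Result->SpeakerId << " Text: " << e.Result->Text << std::endl;
            }
        });

        transcriber->StartTranscribingAsync().get();

        // In the GStreamer appsink callback, decoded S16LE samples are pushed as they arrive:
        //     pushStream->Write(sampleData, sampleSizeInBytes);
        // and when the input ends:
        //     pushStream->Close();
        //     transcriber->StopTranscribingAsync().get();

        return 0;
    }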

We have been using the following documentation as reference:

  1. https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/windows/console/samples/conversation_transcriber_samples.cpp (ConversationTranscriptionWithPushAudioStream())
  2. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=linux&pivots=programming-language-cpp

As mentioned above, we get correct results when speakers speak separately. But when they speak simultaneously, we get erroneous results.

We wanted to know whether diarization using the ConversationTranscriber has been implemented for overlapping speakers. If so, could you kindly assist us in identifying what might be going wrong with our project setup or our approach to implementing the Transcriber? Are we using the correct functions from the Speech SDK to implement diarization of overlapping audio? Could you also provide us with relevant documentation or working examples to help us further?

Thanks and regards,
Shyamal Goel (edited) 


1 answer

  1. romungi-MSFT 45,971 Reputation points Microsoft Employee
    2024-10-01T11:14:26.6166667+00:00

    @Shyamal Goel Based on my experience of using the Speech service, I think you are using the right sample and settings as mentioned in the SDK sample. There is another feature that uses multichannel audio for conversation transcription, but it is in preview and it was recently announced that it is being retired. It uses speaker profiles to recognize speakers by their voice signatures.

    Conversation transcription multichannel diarization (preview) is retiring on March 28, 2025. For more information about migrating to other speech to text features, see Migrate away from conversation transcription multichannel diarization.

    There is also a new feature in preview, the fast transcription API, but it works on audio files and is currently available only through the REST API.

    With respect to your overlap scenario, I think you can raise an issue in the same Speech SDK repo and ask the SDK team for guidance on any properties that are available to set.

    In the current sample, for the issue with Unknown speakers, the recommendation is to set a property that is not documented in the reference. So, if there is a property for this scenario, the SDK team would be best placed to advise.

    You might see Speaker ID=Unknown in some of the early intermediate results when the speaker is not yet identified. Without intermediate diarization results (if you don't set the PropertyId.SpeechServiceResponse_DiarizeIntermediateResults property to "true"), the speaker ID is always "Unknown".
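
    For illustration, a minimal C++ fragment showing how this property could be set, building on the setup sketched in the question (speechConfig and transcriber defined there). This assumes an SDK version that exposes the PropertyId; since the property is not yet documented in the reference, it can also be set by its string name:

        // Enable intermediate diarization results (assumes an SDK version that exposes this PropertyId).
        speechConfig->SetProperty(PropertyId::SpeechServiceResponse_DiarizeIntermediateResults, "true");

        // If the enum value is not available in your SDK version, the same property
        // can be set by name:
        // speechConfig->SetProperty("SpeechServiceResponse_DiarizeIntermediateResults", "true");

        // Intermediate results then surface through the Transcribing event, with a SpeakerId
        // that may be "Unknown" until the speaker is identified.
        transcriber->Transcribing.Connect([](const ConversationTranscriptionEventArgs& e)
        {
            std::cout << "TRANSCRIBING Speaker=" << e.Result->SpeakerId << " Text=" << e.Result->Text << std::endl;
        });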

    If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And if you have any further query, do let us know.

