Two-Channel Audio with live transcription and diarization through channels

Anonymous
2024-03-25T16:31:23.44+00:00

I have an audio file with two people speaking: Person 1 on Channel 1 and Person 2 on Channel 2.

The automatic diarization with the ConversationTranscriber sometimes does not recognize a person and returns Unknown. Since the speakers are already split by channel, is it possible to just use the channel info?

The JSON that gets returned always contains "Channel": 0. I already tried the standard way of just passing a filename string, and here I am passing an audio stream with 2 channels.

[...]
    # Fires for each finalized utterance; collect the speaker ID and the text.
    def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        transcription_results.append({"speaker_id": evt.result.speaker_id, "text": evt.result.text})
        print(f"Recognized {evt.result.json}")

    def stop_cb(evt: speechsdk.SessionEventArgs):
        print(f'CLOSING on {evt}')
        nonlocal transcribing_stop
        transcribing_stop = True

    speech_key = os.environ.get('SPEECH_KEY')
    speech_region = os.environ.get('SPEECH_REGION')

    if not speech_key or not speech_region:
        raise EnvironmentError("SPEECH_KEY and/or SPEECH_REGION environment variables are not set.")

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
    speech_config.speech_recognition_language = language
    speech_config.output_format = speechsdk.OutputFormat.Detailed
    speech_config.request_word_level_timestamps()


    # Read the WAV header to build a matching stream format (expects 2 channels).
    with wave.open(audio_filename, 'rb') as audio_file:
        samples_per_second = audio_file.getframerate()
        bits_per_sample = audio_file.getsampwidth() * 8
        channels = audio_file.getnchannels()

    stream_format = speechsdk.audio.AudioStreamFormat(samples_per_second=samples_per_second, bits_per_sample=bits_per_sample, channels=channels)
    print("Channels set to ", channels)
    # Push only the PCM frames (not the RIFF header) and close the stream so the
    # service knows the audio has ended.
    with wave.open(audio_filename, 'rb') as audio_file:
        audio_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
        audio_stream.write(audio_file.readframes(audio_file.getnframes()))
        audio_stream.close()

    audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
[...]
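
For reference, a minimal sketch of checking the Channel field on each result (print_channel is just an illustrative helper that parses evt.result.json):

import json

def print_channel(evt: speechsdk.SpeechRecognitionEventArgs):
    detailed = json.loads(evt.result.json)
    # The detailed JSON contains a "Channel" field, but it is reported as 0
    # no matter which channel the speaker is on.
    print("Channel:", detailed.get("Channel"), "Speaker:", evt.result.speaker_id, "Text:", evt.result.text)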

3 answers

  1. dupammi 8,615 Reputation points Microsoft External Staff
    2024-03-26T08:27:18.2+00:00

    Hi @Sebastian Bodza

    Thank you for using the Microsoft Q&A forum.

    I suggest exploring the speaker ID field to distinguish between the different speakers who participate in the conversation. The speaker ID is a generic identifier that the service assigns to each conversation participant during recognition, as different speakers are identified from the provided audio content. The speaker information is included in the result in the speaker ID field.

    The service performs best with at least 7 seconds of continuous audio from a single speaker. This allows the system to differentiate the speakers properly. Otherwise, the Speaker ID is returned as Unknown.

    The above is what's mentioned in the official documentation on conversation transcription, which supports multi-speaker diarization: determine who said what by synthesizing the audio stream with each speaker identifier. Although conversation transcription doesn't put a limit on the number of speakers in the room, it's optimized for 2 to 10 speakers per session.
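
    As a minimal sketch (assuming the transcribed callback from the question, which collects speaker_id and text into transcription_results; group_by_speaker is just an illustrative name), the speaker ID can be used to group utterances once the session stops:

    from collections import defaultdict

    def group_by_speaker(transcription_results):
        # Group recognized utterances by the speaker ID assigned by the service;
        # segments the service could not attribute carry the ID "Unknown".
        grouped = defaultdict(list)
        for result in transcription_results:
            grouped[result["speaker_id"] or "Unknown"].append(result["text"])
        return dict(grouped)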

    I hope you understand. Thank you.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".


  2. Anonymous
    2024-03-26T13:44:43.3133333+00:00

    The documentation is not clear, but looking at the output of the ConversationTranscriber and the SpeechRecognizer, the JSON always reports Channel 0.

    Github issues also confirm this:

    https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1485

    https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1748

    A hint in the docs would be great, especially considering that the Channel field is returned in the JSON.

    So the answer is to either use batch transcription (which, according to the docs, can take up to 30 minutes) or do post-processing with the help of the audio RMS per channel and assign the speakers yourself.
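
    For the RMS route, a rough sketch (assuming numpy is available and a 16-bit stereo WAV; channel_for_segment is just an illustrative name, and the offsets come from the detailed JSON, which reports them in 100-nanosecond ticks):

    import wave

    import numpy as np  # assumption: numpy is used here only for the RMS math

    def channel_for_segment(wav_path, offset_ticks, duration_ticks):
        # Compare per-channel RMS energy over the segment's time window and
        # return the louder channel (1 or 2) as the speaker.
        with wave.open(wav_path, 'rb') as wf:
            rate = wf.getframerate()
            wf.setpos(int(offset_ticks / 10_000_000 * rate))                # offsets are 100-ns ticks
            frames = wf.readframes(int(duration_ticks / 10_000_000 * rate))
        samples = np.frombuffer(frames, dtype=np.int16).reshape(-1, 2)      # 16-bit stereo PCM
        rms = np.sqrt((samples.astype(np.float64) ** 2).mean(axis=0))
        return 1 if rms[0] >= rms[1] else 2

    The Offset and Duration fields in each result's detailed JSON (evt.result.json) can feed this helper.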


  3. dupammi 8,615 Reputation points Microsoft External Staff
    2024-03-26T14:04:14.9066667+00:00

    Hi @Sebastian Bodza

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    The answer is to either use batch transcription (which, according to the docs, can take up to 30 minutes) or do post-processing with the help of the audio RMS and adjust it yourself.
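
    For the batch route, a minimal sketch of creating a job against the v3.1 batch transcription REST API (assuming the requests package, a placeholder audio URL, and the channels property to request per-channel transcription of the stereo file):

    import os

    import requests  # assumption: the requests package is available for the REST call

    region = os.environ["SPEECH_REGION"]
    endpoint = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"

    body = {
        "displayName": "Two-channel transcription",
        "locale": "en-US",
        "contentUrls": ["https://example.com/stereo-call.wav"],  # placeholder URL
        "properties": {
            "wordLevelTimestampsEnabled": True,
            # Stereo files are transcribed channel by channel; the channel number
            # is reported in the result files, so no diarization is needed.
            "channels": [0, 1],
        },
    }

    response = requests.post(
        endpoint,
        headers={"Ocp-Apim-Subscription-Key": os.environ["SPEECH_KEY"]},
        json=body,
    )
    response.raise_for_status()
    print("Created transcription:", response.json()["self"])  # poll this URL for job status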

    Hope this helps.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

