Two-Channel Audio with live transcription and diarization through channels

Anonymous
2024-03-25T16:31:23.44+00:00

I have an audio file with two people speaking: Person 1 on Channel 1 and Person 2 on Channel 2.

The automatic diarization with the ConversationTranscriber sometimes does not recognize a person and returns Unknown. Since the speakers are already split by channel, is it possible to just use the channel info?

The JSON that gets returned always contains "Channel": 0. I already tried the standard way of just passing a filename string, and here I am passing an audio stream with 2 channels.

[...]
    # Fires for each finalized utterance; collect the speaker ID and the text.
    def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        transcription_results.append({"speaker_id": evt.result.speaker_id, "text": evt.result.text})
        print(f"Recognized {evt.result.json}")

    def stop_cb(evt: speechsdk.SessionEventArgs):
        print(f'CLOSING on {evt}')
        nonlocal transcribing_stop
        transcribing_stop = True

    speech_key = os.environ.get('SPEECH_KEY')
    speech_region = os.environ.get('SPEECH_REGION')

    if not speech_key or not speech_region:
        raise EnvironmentError("SPEECH_KEY and/or SPEECH_REGION environment variables are not set.")

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
    speech_config.speech_recognition_language = language
    speech_config.output_format = speechsdk.OutputFormat.Detailed
    speech_config.request_word_level_timestamps()


    # Read the WAV header to build a matching stream format (expects 2 channels).
    with wave.open(audio_filename, 'rb') as audio_file:
        samples_per_second = audio_file.getframerate()
        bits_per_sample = audio_file.getsampwidth() * 8
        channels = audio_file.getnchannels()

    stream_format = speechsdk.audio.AudioStreamFormat(samples_per_second=samples_per_second, bits_per_sample=bits_per_sample, channels=channels)
    print("Channels set to ", channels)
    # Push only the PCM frames (not the RIFF header) and close the stream so the
    # service knows the audio has ended.
    with wave.open(audio_filename, 'rb') as audio_file:
        audio_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
        audio_stream.write(audio_file.readframes(audio_file.getnframes()))
        audio_stream.close()

    audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
[...]
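
For reference, a minimal sketch of checking the Channel field on each result (print_channel is just an illustrative helper that parses evt.result.json):

import json

def print_channel(evt: speechsdk.SpeechRecognitionEventArgs):
    detailed = json.loads(evt.result.json)
    # The detailed JSON contains a "Channel" field, but it is reported as 0
    # no matter which channel the speaker is on.
    print("Channel:", detailed.get("Channel"), "Speaker:", evt.result.speaker_id, "Text:", evt.result.text)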

3 answers

  1. dupammi 8,615 Reputation points Microsoft External Staff
    2024-03-26T08:27:18.2+00:00

    Hi @Sebastian Bodza

    Thank you for using the Microsoft Q&A forum.

    I suggest exploring the speaker ID field to distinguish between the different speakers who participate in the conversation. The speaker ID is a generic identifier that the service assigns to each conversation participant during recognition, as different speakers are identified from the provided audio content. The speaker information is included in the result in the speaker ID field.

    The service performs best with at least 7 seconds of continuous audio from a single speaker. This allows the system to differentiate the speakers properly. Otherwise, the Speaker ID is returned as Unknown.

    The above is what's mentioned in the official documentation on conversation transcription, which supports multi-speaker diarization: determine who said what by synthesizing the audio stream with each speaker identifier. Although conversation transcription doesn't put a limit on the number of speakers in the room, it's optimized for 2 to 10 speakers per session.
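
    As a minimal sketch (assuming the transcribed callback from the question, which collects speaker_id and text into transcription_results; group_by_speaker is just an illustrative name), the speaker ID can be used to group utterances once the session stops:

    from collections import defaultdict

    def group_by_speaker(transcription_results):
        # Group recognized utterances by the speaker ID assigned by the service;
        # segments the service could not attribute carry the ID "Unknown".
        grouped = defaultdict(list)
        for result in transcription_results:
            grouped[result["speaker_id"] or "Unknown"].append(result["text"])
        return dict(grouped)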

    I hope you understand. Thank you.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".


  2. Anonymous
    2024-03-26T13:44:43.3133333+00:00

    The documentation is not clear, but looking at the output of the ConversationTranscriber and the SpeechRecognizer, the JSON always reports Channel 0.

    Github issues also confirm this:

    https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1485

    https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1748

    A hint in the docs would be great, especially considering that the Channel field is returned in the JSON.

    So the answer is to either use batch transcription (which, according to the docs, can take up to 30 minutes) or do post-processing with the help of the audio RMS per channel and assign the speakers yourself.
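
    For the RMS route, a rough sketch (assuming numpy is available and a 16-bit stereo WAV; channel_for_segment is just an illustrative name, and the offsets come from the detailed JSON, which reports them in 100-nanosecond ticks):

    import wave

    import numpy as np  # assumption: numpy is used here only for the RMS math

    def channel_for_segment(wav_path, offset_ticks, duration_ticks):
        # Compare per-channel RMS energy over the segment's time window and
        # return the louder channel (1 or 2) as the speaker.
        with wave.open(wav_path, 'rb') as wf:
            rate = wf.getframerate()
            wf.setpos(int(offset_ticks / 10_000_000 * rate))                # offsets are 100-ns ticks
            frames = wf.readframes(int(duration_ticks / 10_000_000 * rate))
        samples = np.frombuffer(frames, dtype=np.int16).reshape(-1, 2)      # 16-bit stereo PCM
        rms = np.sqrt((samples.astype(np.float64) ** 2).mean(axis=0))
        return 1 if rms[0] >= rms[1] else 2

    The Offset and Duration fields in each result's detailed JSON (evt.result.json) can feed this helper.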


  3. dupammi 8,615 Reputation points Microsoft External Staff
    2024-03-26T14:04:14.9066667+00:00

    Hi @Sebastian Bodza

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    The answer is to either use batch transcription (which, according to the docs, can take up to 30 minutes) or do post-processing with the help of the audio RMS and adjust it yourself.
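
    For the batch route, a minimal sketch of creating a job against the v3.1 batch transcription REST API (assuming the requests package, a placeholder audio URL, and the channels property to request per-channel transcription of the stereo file):

    import os

    import requests  # assumption: the requests package is available for the REST call

    region = os.environ["SPEECH_REGION"]
    endpoint = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"

    body = {
        "displayName": "Two-channel transcription",
        "locale": "en-US",
        "contentUrls": ["https://example.com/stereo-call.wav"],  # placeholder URL
        "properties": {
            "wordLevelTimestampsEnabled": True,
            # Stereo files are transcribed channel by channel; the channel number
            # is reported in the result files, so no diarization is needed.
            "channels": [0, 1],
        },
    }

    response = requests.post(
        endpoint,
        headers={"Ocp-Apim-Subscription-Key": os.environ["SPEECH_KEY"]},
        json=body,
    )
    response.raise_for_status()
    print("Created transcription:", response.json()["self"])  # poll this URL for job status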

    Hope this helps.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

