Quickstart: Real-time conversation transcription multichannel diarization (preview)

Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

With conversation transcription multichannel diarization, you can transcribe meetings by streaming audio to the Speech service, and add, remove, and identify multiple participants. You first create voice signatures for each participant by using the REST API, and then use those voice signatures with the Speech SDK to transcribe meetings. For more information, see the conversation transcription overview.

Important

Conversation transcription multichannel diarization (preview) is retiring on March 28, 2025. For more information about migrating to other speech to text features, see Migrate away from conversation transcription multichannel diarization.

Limitations

  • Only available in the following subscription regions: centralus, eastasia, eastus, westeurope
  • Requires a 7-mic circular multi-microphone array. The microphone array should meet our specification.

Note

For the conversation transcription multichannel diarization feature, use MeetingTranscriber instead of ConversationTranscriber, and use CreateMeetingAsync instead of CreateConversationAsync.

Prerequisites

  • An Azure subscription. You can create one for free.
  • Create a Speech resource in the Azure portal.
  • Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.

Set up the environment

Before you can do anything, you need to install the Speech SDK for JavaScript. If you just want the package name to install, run npm install microsoft-cognitiveservices-speech-sdk. For guided installation instructions, see the SDK installation guide.

Create voice signatures

If you want to enroll user profiles, the first step is to create voice signatures for the meeting participants so that they can be identified as unique speakers. This isn't required if you don't want to use pre-enrolled user profiles to identify specific participants.

The input .wav audio file for creating voice signatures must be 16-bit, 16-kHz sample rate, in single channel (mono) format. The recommended length for each audio sample is between 30 seconds and two minutes. An audio sample that is too short results in reduced accuracy when recognizing the speaker. The .wav file should be a sample of one person's voice so that a unique voice profile is created.
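If you're not sure whether an audio sample meets these requirements, you can check the WAV header before you upload it. The following is a minimal sketch that assumes a standard 44-byte PCM WAV header; it isn't a general-purpose WAV parser, and the isValidEnrollmentWav helper and file path are placeholders.

const fs = require('fs');

// Minimal sketch: read the channel count, sample rate, and bit depth from a
// standard 44-byte PCM WAV header.
function isValidEnrollmentWav(path) {
    const header = fs.readFileSync(path);
    const channels = header.readUInt16LE(22);
    const sampleRate = header.readUInt32LE(24);
    const bitsPerSample = header.readUInt16LE(34);
    console.log(`${path}: ${channels} channel(s), ${sampleRate} Hz, ${bitsPerSample}-bit`);
    return channels === 1 && sampleRate === 16000 && bitsPerSample === 16;
}

// Example usage with a placeholder path:
// isValidEnrollmentWav('path-to-voice-sample.wav');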

The following example shows how to create a voice signature by using the REST API in JavaScript. You must insert your subscriptionKey, region, and the path to a sample .wav file.

const fs = require('fs');
const axios = require('axios');
const formData = require('form-data');
 
const subscriptionKey = 'your-subscription-key';
const region = 'your-region';
 
async function createProfile() {
    let form = new formData();
    form.append('file', fs.createReadStream('path-to-voice-sample.wav'));
    let headers = form.getHeaders();
    headers['Ocp-Apim-Subscription-Key'] = subscriptionKey;
 
    let url = `https://signature.${region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromFormData`;
    let response = await axios.post(url, form, { headers: headers });
    
    // get signature from response, serialize to json string
    return JSON.stringify(response.data.Signature);
}
 
async function main() {
    // use this voiceSignature string with meeting transcription calls below
    let voiceSignatureString = await createProfile();
    console.log(voiceSignatureString);
}
main();

Running this script returns a voice signature string in the variable voiceSignatureString. Run the function twice so you have two strings to use as input to the variables voiceSignatureStringUser1 and voiceSignatureStringUser2 below.
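For example, if you adapt the script so that the enrollment file path is passed in as a parameter (a hypothetical createProfileFromFile helper that isn't part of the SDK, reusing the imports and constants from the script above), you can enroll both speakers in one run:

// Hypothetical helper: the same REST call as createProfile(), with the
// enrollment file path passed in so each speaker can be enrolled separately.
async function createProfileFromFile(wavPath) {
    let form = new formData();
    form.append('file', fs.createReadStream(wavPath));
    let headers = form.getHeaders();
    headers['Ocp-Apim-Subscription-Key'] = subscriptionKey;

    let url = `https://signature.${region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromFormData`;
    let response = await axios.post(url, form, { headers: headers });
    return JSON.stringify(response.data.Signature);
}

async function enrollSpeakers() {
    // Placeholder file names; use one mono recording per speaker.
    let voiceSignatureStringUser1 = await createProfileFromFile('speaker1-sample.wav');
    let voiceSignatureStringUser2 = await createProfileFromFile('speaker2-sample.wav');
    console.log(voiceSignatureStringUser1);
    console.log(voiceSignatureStringUser2);
}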

Note

Voice signatures can only be created using the REST API.

Transcribe meetings

The following sample code demonstrates how to transcribe meetings in real time for two speakers. It assumes that you created voice signature strings for each speaker as shown above. Substitute real values for subscriptionKey, region, and filepath (the path to the audio file that you want to transcribe).

If you don't use pre-enrolled user profiles, the first recognition of unknown users takes a few more seconds to complete. Unknown users are identified as speaker1, speaker2, and so on.

Note

Make sure you use the same subscriptionKey for signature creation and for transcription across your application, or you'll encounter errors.

This sample code does the following:

  • Creates a push stream to use for transcription, and writes the sample .wav file to it.
  • Creates a Meeting using createMeetingAsync().
  • Creates a MeetingTranscriber using the constructor.
  • Adds participants to the meeting. The strings voiceSignatureStringUser1 and voiceSignatureStringUser2 should come as output from the steps above.
  • Registers to events and begins transcription.
  • If you want to differentiate speakers without providing voice samples, enable the DifferentiateGuestSpeakers feature as described in the Meeting Transcription overview.

If speaker identification or differentiation is enabled, then even if you have already received transcribed results, the service is still evaluating them against the accumulated audio information. If the service finds that a previous result was assigned an incorrect speakerId, it sends a nearly identical Transcribed result again, in which only the speakerId and UtteranceId differ. Because the UtteranceId format is {index}_{speakerId}_{Offset}, when you receive a transcribed result you can use UtteranceId to determine whether it corrects a previous result. Your client or UI logic can then decide how to respond, for example by overwriting the previous output or by ignoring the latest result.
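Before the full sample, here's a minimal sketch of one way a client might handle such corrections. It assumes you extract the utterance ID, speaker ID, and recognized text from each transcribed event, and that the utterance ID follows the {index}_{speakerId}_{Offset} format described above; the handleTranscribed helper is hypothetical and not part of the SDK.

// Sketch: keep the latest text per utterance so a corrected result
// overwrites the original entry.
var transcript = {};

function handleTranscribed(utteranceId, speakerId, text) {
    var parts = utteranceId.split("_");
    // Key on index and offset: the same utterance keeps the same key
    // even if its speakerId is corrected later.
    var key = parts[0] + "_" + parts[2];

    if (transcript[key] && transcript[key].speakerId !== speakerId) {
        console.log("Correction: utterance " + key + " reassigned to " + speakerId);
    }
    transcript[key] = { speakerId: speakerId, text: text };
}

The full sample for this quickstart follows.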

(function() {
    "use strict";
    var sdk = require("microsoft-cognitiveservices-speech-sdk");
    var fs = require("fs");
    
    var subscriptionKey = "your-subscription-key";
    var region = "your-region";
    var filepath = "audio-file-to-transcribe.wav"; // 8-channel audio

    // Voice signature strings returned by createProfile() in the previous step.
    var voiceSignatureStringUser1 = "your-voice-signature-for-user1";
    var voiceSignatureStringUser2 = "your-voice-signature-for-user2";
    
    var speechTranslationConfig = sdk.SpeechTranslationConfig.fromSubscription(subscriptionKey, region);
    var audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync(filepath));
    speechTranslationConfig.setProperty("ConversationTranscriptionInRoomAndOnline", "true");

    // The default recognition language is en-US. Change this value to specify another language, such as zh-CN.
    speechTranslationConfig.speechRecognitionLanguage = "en-US";
    
    // create meeting and transcriber
    var meeting = sdk.Meeting.createMeetingAsync(speechTranslationConfig, "myMeeting");
    var transcriber = new sdk.MeetingTranscriber(audioConfig);
    
    // attach the transcriber to the meeting
    transcriber.joinMeetingAsync(meeting,
        function () {
            // add first participant using the voice signature created in the enrollment step
            var user1 = sdk.Participant.From("user1@example.com", "en-us", voiceSignatureStringUser1);
            meeting.addParticipantAsync(user1,
                function () {
                    // add second participant using the voice signature created in the enrollment step
                    var user2 = sdk.Participant.From("user2@example.com", "en-us", voiceSignatureStringUser2);
                    meeting.addParticipantAsync(user2,
                        function () {
                            transcriber.sessionStarted = function(s, e) {
                                console.log("(sessionStarted)");
                            };
                            transcriber.sessionStopped = function(s, e) {
                                console.log("(sessionStopped)");
                            };
                            transcriber.canceled = function(s, e) {
                                console.log("(canceled)");
                            };
                            transcriber.transcribed = function(s, e) {
                                console.log("(transcribed) text: " + e.result.text);
                                console.log("(transcribed) speakerId: " + e.result.speakerId);
                            };

                            // begin meeting transcription
                            transcriber.startTranscribingAsync(
                                function () { },
                                function (err) {
                                    console.trace("err - starting transcription: " + err);
                                });
                        },
                        function (err) {
                            console.trace("err - adding user2: " + err);
                        });
                },
                function (err) {
                    console.trace("err - adding user1: " + err);
                });
        },
        function (err) {
            console.trace("err - joining meeting: " + err);
        });
}()); 

Prerequisites

  • An Azure subscription. You can create one for free.
  • Create a Speech resource in the Azure portal.
  • Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.

Set up the environment

The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the platform-specific installation instructions for any more requirements.

Create voice signatures

If you want to enroll user profiles, the first step is to create voice signatures for the meeting participants so that they can be identified as unique speakers. This isn't required if you don't want to use pre-enrolled user profiles to identify specific participants.

The input .wav audio file for creating voice signatures must be 16-bit, 16-kHz sample rate, in single channel (mono) format. The recommended length for each audio sample is between 30 seconds and two minutes. An audio sample that is too short results in reduced accuracy when recognizing the speaker. The .wav file should be a sample of one person's voice so that a unique voice profile is created.
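If you're not sure whether an audio sample meets these requirements, you can check the WAV header before you upload it. The following is a minimal sketch that assumes a standard 44-byte PCM WAV header; it isn't a general-purpose WAV parser, and the IsValidEnrollmentWav helper is a placeholder you can adapt.

using System;
using System.IO;

internal static class EnrollmentAudioCheck
{
    // Minimal sketch: read the channel count, sample rate, and bit depth from a
    // standard 44-byte PCM WAV header.
    public static bool IsValidEnrollmentWav(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            reader.BaseStream.Seek(22, SeekOrigin.Begin);
            ushort channels = reader.ReadUInt16();
            uint sampleRate = reader.ReadUInt32();
            reader.BaseStream.Seek(34, SeekOrigin.Begin);
            ushort bitsPerSample = reader.ReadUInt16();

            Console.WriteLine($"{path}: {channels} channel(s), {sampleRate} Hz, {bitsPerSample}-bit");
            return channels == 1 && sampleRate == 16000 && bitsPerSample == 16;
        }
    }
}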

The following example shows how to create a voice signature by using the REST API in C#. You must insert your subscriptionKey, region, and the path to a sample .wav file.

using System;
using System.IO;
using System.Net.Http;
using System.Runtime.Serialization;
using System.Threading.Tasks;
using Newtonsoft.Json;

[DataContract]
internal class VoiceSignature
{
    [DataMember]
    public string Status { get; private set; }

    [DataMember]
    public VoiceSignatureData Signature { get; private set; }

    [DataMember]
    public string Transcription { get; private set; }
}

[DataContract]
internal class VoiceSignatureData
{
    internal VoiceSignatureData()
    { }

    internal VoiceSignatureData(int version, string tag, string data)
    {
        this.Version = version;
        this.Tag = tag;
        this.Data = data;
    }

    [DataMember]
    public int Version { get; private set; }

    [DataMember]
    public string Tag { get; private set; }

    [DataMember]
    public string Data { get; private set; }
}

private static async Task<string> GetVoiceSignatureString()
{
    var subscriptionKey = "your-subscription-key";
    var region = "your-region";

    byte[] fileBytes = File.ReadAllBytes("path-to-voice-sample.wav");
    var content = new ByteArrayContent(fileBytes);
    var client = new HttpClient();
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
    var response = await client.PostAsync($"https://signature.{region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromByteArray", content);
    
    var jsonData = await response.Content.ReadAsStringAsync();
    var result = JsonConvert.DeserializeObject<VoiceSignature>(jsonData);
    return JsonConvert.SerializeObject(result.Signature);
}

Running the function GetVoiceSignatureString() returns a voice signature string in the correct format. Run the function twice so you have two strings to use as input to the variables voiceSignatureStringUser1 and voiceSignatureStringUser2 below.

Note

Voice signatures can only be created using the REST API.

Transcribe meetings

The following sample code demonstrates how to transcribe meetings in real time for two speakers. It assumes that you created voice signature strings for each speaker as shown above. Substitute real values for subscriptionKey, region, and filepath (the path to the audio file that you want to transcribe).

If you don't use pre-enrolled user profiles, the first recognition of unknown users takes a few more seconds to complete. Unknown users are identified as speaker1, speaker2, and so on.

Note

Make sure you use the same subscriptionKey for signature creation and for transcription across your application, or you'll encounter errors.

This sample code does the following:

  • Creates an AudioConfig from the sample .wav file to transcribe.
  • Creates a Meeting using CreateMeetingAsync().
  • Creates a MeetingTranscriber using the constructor, and subscribes to the necessary events.
  • Adds participants to the meeting. The strings voiceSignatureStringUser1 and voiceSignatureStringUser2 should come from the output of the GetVoiceSignatureString() function in the previous step.
  • Joins the meeting and begins transcription.
  • If you want to differentiate speakers without providing voice samples, enable the DifferentiateGuestSpeakers feature as described in the Meeting Transcription overview.

Note

AudioStreamReader is a helper class you can get on GitHub.

If speaker identification or differentiation is enabled, then even if you have already received Transcribed results, the service is still evaluating them against the accumulated audio information. If the service finds that a previous result was assigned an incorrect UserId, it sends a nearly identical Transcribed result again, in which only the UserId and UtteranceId differ. Because the UtteranceId format is {index}_{UserId}_{Offset}, when you receive a Transcribed result you can use UtteranceId to determine whether it corrects a previous result. Your client or UI logic can then decide how to respond, for example by overwriting the previous output or by ignoring the latest result.
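Before the full sample, here's a minimal sketch of one way a client might handle such corrections. It assumes you extract the utterance ID, user ID, and recognized text from each Transcribed event, and that the utterance ID follows the {index}_{UserId}_{Offset} format described above; the HandleTranscribed helper is hypothetical and not part of the SDK.

using System;
using System.Collections.Generic;

internal static class TranscriptStore
{
    // Sketch: keep the latest text per utterance so a corrected result
    // overwrites the original entry.
    private static readonly Dictionary<string, (string UserId, string Text)> transcript =
        new Dictionary<string, (string UserId, string Text)>();

    public static void HandleTranscribed(string utteranceId, string userId, string text)
    {
        var parts = utteranceId.Split('_');
        // Key on index and offset: the same utterance keeps the same key
        // even if its UserId is corrected later.
        var key = $"{parts[0]}_{parts[2]}";

        if (transcript.TryGetValue(key, out var previous) && previous.UserId != userId)
        {
            Console.WriteLine($"Correction: utterance {key} reassigned to {userId}");
        }
        transcript[key] = (userId, text);
    }
}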

Call the function TranscribeMeetingsAsync() to start meeting transcription.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;

class TranscribeMeeting
{
    // all your other code

    public static async Task TranscribeMeetingsAsync(string voiceSignatureStringUser1, string voiceSignatureStringUser2)
    {
        var subscriptionKey = "your-subscription-key";
        var region = "your-region";
        var filepath = "audio-file-to-transcribe.wav";

        var config = SpeechConfig.FromSubscription(subscriptionKey, region);
        config.SetProperty("ConversationTranscriptionInRoomAndOnline", "true");

        // The default recognition language is en-US. Uncomment the following line to specify another language, such as zh-CN.
        // config.SpeechRecognitionLanguage = "zh-cn";
        var stopRecognition = new TaskCompletionSource<int>();

        using (var audioInput = AudioConfig.FromWavFileInput(filepath))
        {
            var meetingID = Guid.NewGuid().ToString();
            using (var meeting = await Meeting.CreateMeetingAsync(config, meetingID))
            {
                // create a meeting transcriber using audio stream input
                using (var meetingTranscriber = new MeetingTranscriber(audioInput))
                {
                    meetingTranscriber.Transcribing += (s, e) =>
                    {
                        Console.WriteLine($"TRANSCRIBING: Text={e.Result.Text} SpeakerId={e.Result.UserId}");
                    };

                    meetingTranscriber.Transcribed += (s, e) =>
                    {
                        if (e.Result.Reason == ResultReason.RecognizedSpeech)
                        {
                            Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} SpeakerId={e.Result.UserId}");
                        }
                        else if (e.Result.Reason == ResultReason.NoMatch)
                        {
                            Console.WriteLine($"NOMATCH: Speech could not be recognized.");
                        }
                    };

                    meetingTranscriber.Canceled += (s, e) =>
                    {
                        Console.WriteLine($"CANCELED: Reason={e.Reason}");

                        if (e.Reason == CancellationReason.Error)
                        {
                            Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
                            Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
                            Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
                            stopRecognition.TrySetResult(0);
                        }
                    };

                    meetingTranscriber.SessionStarted += (s, e) =>
                    {
                        Console.WriteLine($"\nSession started event. SessionId={e.SessionId}");
                    };

                    meetingTranscriber.SessionStopped += (s, e) =>
                    {
                        Console.WriteLine($"\nSession stopped event. SessionId={e.SessionId}");
                        Console.WriteLine("\nStop recognition.");
                        stopRecognition.TrySetResult(0);
                    };

                    // Add participants to the meeting.
                    var speaker1 = Participant.From("User1", "en-US", voiceSignatureStringUser1);
                    var speaker2 = Participant.From("User2", "en-US", voiceSignatureStringUser2);
                    await meeting.AddParticipantAsync(speaker1);
                    await meeting.AddParticipantAsync(speaker2);

                    // Join the meeting and start transcribing
                    await meetingTranscriber.JoinMeetingAsync(meeting);
                    await meetingTranscriber.StartTranscribingAsync().ConfigureAwait(false);

                    // Wait for completion, then stop transcription
                    Task.WaitAny(new[] { stopRecognition.Task });
                    await meetingTranscriber.StopTranscribingAsync().ConfigureAwait(false);
                }
            }
        }
    }
}
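For example, a minimal entry point could enroll both speakers and then start transcription. This sketch assumes you adapt GetVoiceSignatureString() to accept the enrollment .wav path as a parameter (a hypothetical change to the helper shown earlier) and add the method to the TranscribeMeeting class; the file names are placeholders.

// Minimal sketch of an entry point that ties the steps together.
public static async Task Main()
{
    // Hypothetical variant of GetVoiceSignatureString that accepts the enrollment .wav path.
    var voiceSignatureStringUser1 = await GetVoiceSignatureString("speaker1-sample.wav");
    var voiceSignatureStringUser2 = await GetVoiceSignatureString("speaker2-sample.wav");

    await TranscribeMeetingsAsync(voiceSignatureStringUser1, voiceSignatureStringUser2);
}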

Prerequisites

  • An Azure subscription. You can create one for free.
  • Create a Speech resource in the Azure portal.
  • Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.

Set up the environment

Before you can do anything, install the Speech SDK for Python. You can install the Speech SDK from PyPI by running pip install azure-cognitiveservices-speech.

Create voice signatures

If you want to enroll user profiles, the first step is to create voice signatures for the meeting participants so that they can be identified as unique speakers. This isn't required if you don't want to use pre-enrolled user profiles to identify specific participants.

The input .wav audio file for creating voice signatures must be 16-bit, 16-kHz sample rate, in single channel (mono) format. The recommended length for each audio sample is between 30 seconds and two minutes. An audio sample that is too short results in reduced accuracy when recognizing the speaker. The .wav file should be a sample of one person's voice so that a unique voice profile is created.
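If you're not sure whether an audio sample meets these requirements, you can check it with the standard-library wave module. This is a minimal sketch for standard PCM .wav files; the is_valid_enrollment_wav helper is a placeholder you can adapt.

import wave

# Minimal sketch: check that an enrollment sample is 16-bit, 16-kHz mono.
def is_valid_enrollment_wav(path):
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_width_bytes = wav_file.getsampwidth()
        sample_rate = wav_file.getframerate()
    print(f"{path}: {channels} channel(s), {sample_rate} Hz, {sample_width_bytes * 8}-bit")
    return channels == 1 and sample_rate == 16000 and sample_width_bytes == 2

# Example usage with a placeholder path:
# is_valid_enrollment_wav("enrollment-audio-speaker1.wav")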

The following example shows how to create a voice signature by using the REST API in Python. You must insert your subscriptionKey, region, and the path to a sample .wav file.

import requests
import json

speech_key, service_region = "your-subscription-key", "your-region"
endpoint = f"https://signature.{service_region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromByteArray"

# Enrollment audio for each speaker. In this example, two speaker enrollment audio files are added.
enrollment_audio_speaker1 = "enrollment-audio-speaker1.wav"
enrollment_audio_speaker2 = "enrollment-audio-speaker2.wav"

def voice_data_converter(enrollment_audio):
    # Read the raw bytes of the enrollment .wav file.
    with open(enrollment_audio, "rb") as wav_file:
        input_wav = wav_file.read()
    return input_wav

def voice_signature_creator(endpoint, speech_key, enrollment_audio):
    # Post the audio bytes to the signature REST API and serialize the signature to a JSON string.
    data = voice_data_converter(enrollment_audio)
    headers = {"Ocp-Apim-Subscription-Key": speech_key}
    r = requests.post(url=endpoint, headers=headers, data=data)
    voice_signature_string = json.dumps(r.json()['Signature'])
    return voice_signature_string

voice_signature_user1 = voice_signature_creator(endpoint, speech_key, enrollment_audio_speaker1)
voice_signature_user2 = voice_signature_creator(endpoint, speech_key, enrollment_audio_speaker2)

The script stores the two voice signature strings in the variables voice_signature_user1 and voice_signature_user2, which are used later in the sample code.

Note

Voice signatures can only be created using the REST API.

Transcribe meetings

The following sample code demonstrates how to transcribe meetings in real time for two speakers. It assumes that you created voice signature strings for each speaker as shown previously. Substitute real values for speech_key, service_region, and meetingfilename (the path to the audio file that you want to transcribe).

If you don't use pre-enrolled user profiles, the first recognition of unknown users takes a few more seconds to complete. Unknown users are identified as speaker1, speaker2, and so on.

Note

Make sure you use the same speech_key for signature creation and for transcription across your application, or you'll encounter errors.

Here's what the sample does:

  • Creates a speech configuration with your subscription information.
  • Creates an audio configuration that uses a push stream.
  • Creates a MeetingTranscriber and subscribes to the events that it fires.
  • Creates a meeting by using a unique meeting identifier.
  • Adds participants to the meeting. The strings voice_signature_user1 and voice_signature_user2 should come as output from the previous steps.
  • Reads the whole .wav file at once, streams it to the SDK, and begins transcription.
  • If you want to differentiate speakers without providing voice samples, enable the DifferentiateGuestSpeakers feature as described in the Meeting Transcription overview.

If speaker identification or differentiation is enabled, then even if you have already received transcribed results, the service is still evaluating them against the accumulated audio information. If the service finds that a previous result was assigned an incorrect speakerId, it sends a nearly identical Transcribed result again, in which only the speakerId and UtteranceId differ. Because the UtteranceId format is {index}_{speakerId}_{Offset}, when you receive a transcribed result you can use UtteranceId to determine whether it corrects a previous result. Your client or UI logic can then decide how to respond, for example by overwriting the previous output or by ignoring the latest result.
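Before the full sample, here's a minimal sketch of one way a client might handle such corrections. It assumes you extract the utterance ID, speaker ID, and recognized text from each transcribed event, and that the utterance ID follows the {index}_{speakerId}_{Offset} format described above; the handle_transcribed helper is hypothetical and not part of the SDK.

# Sketch: keep the latest text per utterance so a corrected result
# overwrites the original entry.
transcript = {}

def handle_transcribed(utterance_id, speaker_id, text):
    index, _, offset = utterance_id.split("_")
    # Key on index and offset: the same utterance keeps the same key
    # even if its speakerId is corrected later.
    key = (index, offset)
    previous = transcript.get(key)
    if previous and previous["speaker_id"] != speaker_id:
        print(f"Correction: utterance {key} reassigned to {speaker_id}")
    transcript[key] = {"speaker_id": speaker_id, "text": text}

The full sample for this quickstart follows.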

import azure.cognitiveservices.speech as speechsdk
import time
import uuid
from scipy.io import wavfile

speech_key, service_region = "your-subscription-key", "your-region"
meetingfilename = "audio-file-to-transcribe.wav"  # 8 channel, 16 bits, 16 kHz audio

def meeting_transcription():
    
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.set_property_by_name("ConversationTranscriptionInRoomAndOnline", "true")
    # If you want to differentiate speakers without providing voice samples, uncomment the following line.
    # speech_config.set_property_by_name("DifferentiateGuestSpeakers", "true")

    channels = 8
    bits_per_sample = 16
    samples_per_second = 16000
    
    wave_format = speechsdk.audio.AudioStreamFormat(samples_per_second, bits_per_sample, channels)
    stream = speechsdk.audio.PushAudioInputStream(stream_format=wave_format)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    transcriber = speechsdk.transcription.MeetingTranscriber(audio_config)

    meeting_id = str(uuid.uuid4())
    meeting = speechsdk.transcription.Meeting(speech_config, meeting_id)
    done = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """callback that signals to stop continuous transcription upon receiving an event `evt`"""
        print('CLOSING {}'.format(evt))
        nonlocal done
        done = True
        
    transcriber.transcribed.connect(lambda evt: print('TRANSCRIBED: {}'.format(evt)))
    transcriber.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    transcriber.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    transcriber.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # stop continuous transcription on either session stopped or canceled events
    transcriber.session_stopped.connect(stop_cb)
    transcriber.canceled.connect(stop_cb)

    # Note user voice signatures are not required for speaker differentiation.
    # Use voice signatures when adding participants when more enhanced speaker identification is required.
    user1 = speechsdk.transcription.Participant("user1@example.com", "en-us", voice_signature_user1)
    user2 = speechsdk.transcription.Participant("user2@example.com", "en-us", voice_signature_user2)

    meeting.add_participant_async(user1).get()
    meeting.add_participant_async(user2).get()
    transcriber.join_meeting_async(meeting).get()
    transcriber.start_transcribing_async()
    
    sample_rate, wav_data = wavfile.read(meetingfilename)
    stream.write(wav_data.tobytes())
    stream.close()
    while not done:
        time.sleep(.5)

    transcriber.stop_transcribing_async()
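Finally, call the function to run the sample. This assumes that voice_signature_user1 and voice_signature_user2 are defined in the same script, for example by the enrollment code earlier in this guide.

# Run the sample.
meeting_transcription()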