Quickstart: Real-time conversation transcription multichannel diarization (preview)
Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
With conversation transcription multichannel diarization, you can transcribe meetings with the ability to add, remove, and identify multiple participants by streaming audio to the Speech service. You first create voice signatures for each participant using the REST API, and then use the voice signatures with the Speech SDK to transcribe meetings. See the conversation transcription overview for more information.
Important
Conversation transcription multichannel diarization (preview) is retiring on March 28, 2025. For more information about migrating to other speech to text features, see Migrate away from conversation transcription multichannel diarization.
Limitations
- Only available in the following subscription regions: centralus, eastasia, eastus, and westeurope.
- Requires a 7-mic circular multi-microphone array. The microphone array should meet our specification.
Note
For the conversation transcription multichannel diarization feature, use MeetingTranscriber instead of ConversationTranscriber, and use CreateMeetingAsync instead of CreateConversationAsync.
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
Before you can do anything, you need to install the Speech SDK for JavaScript. If you just want the package name to install, run npm install microsoft-cognitiveservices-speech-sdk. For guided installation instructions, see the SDK installation guide.
Create voice signatures
If you want to enroll user profiles, the first step is to create voice signatures for the meeting participants so that they can be identified as unique speakers. This isn't required if you don't want to use pre-enrolled user profiles to identify specific participants.
The input .wav audio file for creating voice signatures must be 16-bit, 16-kHz sample rate, in single channel (mono) format. The recommended length for each audio sample is between 30 seconds and two minutes. An audio sample that is too short results in reduced accuracy when recognizing the speaker. The .wav file should be a sample of one person's voice so that a unique voice profile is created.
The following example shows how to create a voice signature by using the REST API in JavaScript. You must insert your subscriptionKey, region, and the path to a sample .wav file.
const fs = require('fs');
const axios = require('axios');
const formData = require('form-data');
const subscriptionKey = 'your-subscription-key';
const region = 'your-region';
async function createProfile() {
let form = new formData();
form.append('file', fs.createReadStream('path-to-voice-sample.wav'));
let headers = form.getHeaders();
headers['Ocp-Apim-Subscription-Key'] = subscriptionKey;
let url = `https://signature.${region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromFormData`;
let response = await axios.post(url, form, { headers: headers });
// get signature from response, serialize to json string
return JSON.stringify(response.data.Signature);
}
async function main() {
// use this voiceSignature string with meeting transcription calls below
let voiceSignatureString = await createProfile();
console.log(voiceSignatureString);
}
main();
Running this script returns a voice signature string in the variable voiceSignatureString. Run the function twice so you have two strings to use as input to the variables voiceSignatureStringUser1 and voiceSignatureStringUser2 below.
Note
Voice signatures can only be created using the REST API.
Transcribe meetings
The following sample code demonstrates how to transcribe meetings in real-time for two speakers. It assumes that you created voice signature strings for each speaker as shown above. Substitute real information for subscriptionKey, region, and the path filepath for the audio you want to transcribe.
If you don't use pre-enrolled user profiles, the first recognition of unknown users as speaker1, speaker2, and so on takes a few more seconds to complete.
Note
Make sure the same subscriptionKey is used across your application for signature creation, or you will encounter errors.
This sample code does the following:
- Creates an AudioConfig from the sample .wav file to transcribe.
- Creates a Meeting using createMeetingAsync().
- Creates a MeetingTranscriber using the constructor.
- Adds participants to the meeting. The strings voiceSignatureStringUser1 and voiceSignatureStringUser2 should come as output from the steps above.
- Registers for events and begins transcription.
- If you want to differentiate speakers without providing voice samples, enable the DifferentiateGuestSpeakers feature as described in the Meeting Transcription overview.
If speaker identification or differentiation is enabled, then even if you have already received transcribed results, the service is still evaluating them against the accumulated audio information. If the service finds that any previous result was assigned an incorrect speakerId, then a nearly identical Transcribed result is sent again, where only the speakerId and UtteranceId are different. Because the UtteranceId format is {index}_{speakerId}_{Offset}, when you receive a transcribed result, you can use UtteranceId to determine whether the current transcribed result corrects a previous one. Your client or UI logic can then decide how to handle it, for example by overwriting the previous output or ignoring the latest result.
(function() {
"use strict";
var sdk = require("microsoft-cognitiveservices-speech-sdk");
var fs = require("fs");
var subscriptionKey = "your-subscription-key";
var region = "your-region";
var filepath = "audio-file-to-transcribe.wav"; // 8-channel audio
var speechTranslationConfig = sdk.SpeechTranslationConfig.fromSubscription(subscriptionKey, region);
var audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync(filepath));
speechTranslationConfig.setProperty("ConversationTranscriptionInRoomAndOnline", "true");
// The recognition language is en-US by default. Change this line to specify a different language, like zh-cn.
speechTranslationConfig.speechRecognitionLanguage = "en-US";
// create meeting and transcriber
var meeting = sdk.Meeting.createMeetingAsync(speechTranslationConfig, "myMeeting");
var transcriber = new sdk.MeetingTranscriber(audioConfig);
// attach the transcriber to the meeting
transcriber.joinMeetingAsync(meeting,
function () {
// add first participant using voiceSignature created in enrollment step
var user1 = sdk.Participant.From("user1@example.com", "en-us", voiceSignatureStringUser1);
meeting.addParticipantAsync(user1,
function () {
// add second participant using voiceSignature created in enrollment step
var user2 = sdk.Participant.From("user2@example.com", "en-us", voiceSignatureStringUser2);
meeting.addParticipantAsync(user2,
function () {
transcriber.sessionStarted = function(s, e) {
console.log("(sessionStarted)");
};
transcriber.sessionStopped = function(s, e) {
console.log("(sessionStopped)");
};
transcriber.canceled = function(s, e) {
console.log("(canceled)");
};
transcriber.transcribed = function(s, e) {
console.log("(transcribed) text: " + e.result.text);
console.log("(transcribed) speakerId: " + e.result.speakerId);
};
// begin meeting transcription
transcriber.startTranscribingAsync(
function () { },
function (err) {
console.trace("err - starting transcription: " + err);
});
},
function (err) {
console.trace("err - adding user1: " + err);
});
},
function (err) {
console.trace("err - adding user2: " + err);
});
},
function (err) {
console.trace("err - " + err);
});
}());
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
The Speech SDK is available as a NuGet package and implements .NET Standard 2.0. You install the Speech SDK later in this guide, but first check the platform-specific installation instructions for any more requirements.
Create voice signatures
If you want to enroll user profiles, the first step is to create voice signatures for the meeting participants so that they can be identified as unique speakers. This isn't required if you don't want to use pre-enrolled user profiles to identify specific participants.
The input .wav audio file for creating voice signatures must be 16-bit, 16-kHz sample rate, in single channel (mono) format. The recommended length for each audio sample is between 30 seconds and two minutes. An audio sample that is too short results in reduced accuracy when recognizing the speaker. The .wav file should be a sample of one person's voice so that a unique voice profile is created.
The following example shows how to create a voice signature by using the REST API in C#. You must insert your subscriptionKey, region, and the path to a sample .wav file.
using System;
using System.IO;
using System.Net.Http;
using System.Runtime.Serialization;
using System.Threading.Tasks;
using Newtonsoft.Json;
[DataContract]
internal class VoiceSignature
{
[DataMember]
public string Status { get; private set; }
[DataMember]
public VoiceSignatureData Signature { get; private set; }
[DataMember]
public string Transcription { get; private set; }
}
[DataContract]
internal class VoiceSignatureData
{
internal VoiceSignatureData()
{ }
internal VoiceSignatureData(int version, string tag, string data)
{
this.Version = version;
this.Tag = tag;
this.Data = data;
}
[DataMember]
public int Version { get; private set; }
[DataMember]
public string Tag { get; private set; }
[DataMember]
public string Data { get; private set; }
}
private static async Task<string> GetVoiceSignatureString()
{
var subscriptionKey = "your-subscription-key";
var region = "your-region";
byte[] fileBytes = File.ReadAllBytes("path-to-voice-sample.wav");
var content = new ByteArrayContent(fileBytes);
var client = new HttpClient();
client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
var response = await client.PostAsync($"https://signature.{region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromByteArray", content);
var jsonData = await response.Content.ReadAsStringAsync();
var result = JsonConvert.DeserializeObject<VoiceSignature>(jsonData);
return JsonConvert.SerializeObject(result.Signature);
}
Running the function GetVoiceSignatureString() returns a voice signature string in the correct format. Run the function twice so you have two strings to use as input to the variables voiceSignatureStringUser1 and voiceSignatureStringUser2 below.
Note
Voice signatures can only be created using the REST API.
Transcribe meetings
The following sample code demonstrates how to transcribe meetings in real-time for two speakers. It assumes that you created voice signature strings for each speaker as shown above. Substitute real information for subscriptionKey, region, and the path filepath for the audio you want to transcribe.
If you don't use pre-enrolled user profiles, the first recognition of unknown users as speaker1, speaker2, and so on takes a few more seconds to complete.
Note
Make sure the same subscriptionKey is used across your application for signature creation, or you will encounter errors.
This sample code does the following:
- Creates an AudioConfig from the sample .wav file to transcribe.
- Creates a Meeting using CreateMeetingAsync().
- Creates a MeetingTranscriber using the constructor, and subscribes to the necessary events.
- Adds participants to the meeting. The strings voiceSignatureStringUser1 and voiceSignatureStringUser2 should come as output from the GetVoiceSignatureString() function in the steps above.
- Joins the meeting and begins transcription.
- If you want to differentiate speakers without providing voice samples, enable the DifferentiateGuestSpeakers feature as described in the Meeting Transcription overview.
Note
AudioStreamReader is a helper class you can get on GitHub.
If speaker identification or differentiation is enabled, then even if you have already received Transcribed results, the service is still evaluating them against the accumulated audio information. If the service finds that any previous result was assigned an incorrect UserId, then a nearly identical Transcribed result is sent again, where only the UserId and UtteranceId are different. Because the UtteranceId format is {index}_{UserId}_{Offset}, when you receive a Transcribed result, you can use UtteranceId to determine whether the current Transcribed result corrects a previous one. Your client or UI logic can then decide how to handle it, for example by overwriting the previous output or ignoring the latest result.
Call the function TranscribeMeetingsAsync() to start meeting transcription.
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Transcription;
class TranscribeMeeting
{
// all your other code
public static async Task TranscribeMeetingsAsync(string voiceSignatureStringUser1, string voiceSignatureStringUser2)
{
var subscriptionKey = "your-subscription-key";
var region = "your-region";
var filepath = "audio-file-to-transcribe.wav";
var config = SpeechConfig.FromSubscription(subscriptionKey, region);
config.SetProperty("ConversationTranscriptionInRoomAndOnline", "true");
// The recognition language is en-US by default. Uncomment the following line to specify a different language, like zh-cn.
// config.SpeechRecognitionLanguage = "zh-cn";
var stopRecognition = new TaskCompletionSource<int>();
using (var audioInput = AudioConfig.FromWavFileInput(filepath))
{
var meetingID = Guid.NewGuid().ToString();
using (var meeting = await Meeting.CreateMeetingAsync(config, meetingID))
{
// create a meeting transcriber using audio stream input
using (var meetingTranscriber = new MeetingTranscriber(audioInput))
{
meetingTranscriber.Transcribing += (s, e) =>
{
Console.WriteLine($"TRANSCRIBING: Text={e.Result.Text} SpeakerId={e.Result.UserId}");
};
meetingTranscriber.Transcribed += (s, e) =>
{
if (e.Result.Reason == ResultReason.RecognizedSpeech)
{
Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} SpeakerId={e.Result.UserId}");
}
else if (e.Result.Reason == ResultReason.NoMatch)
{
Console.WriteLine($"NOMATCH: Speech could not be recognized.");
}
};
meetingTranscriber.Canceled += (s, e) =>
{
Console.WriteLine($"CANCELED: Reason={e.Reason}");
if (e.Reason == CancellationReason.Error)
{
Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
stopRecognition.TrySetResult(0);
}
};
meetingTranscriber.SessionStarted += (s, e) =>
{
Console.WriteLine($"\nSession started event. SessionId={e.SessionId}");
};
meetingTranscriber.SessionStopped += (s, e) =>
{
Console.WriteLine($"\nSession stopped event. SessionId={e.SessionId}");
Console.WriteLine("\nStop recognition.");
stopRecognition.TrySetResult(0);
};
// Add participants to the meeting.
var speaker1 = Participant.From("User1", "en-US", voiceSignatureStringUser1);
var speaker2 = Participant.From("User2", "en-US", voiceSignatureStringUser2);
await meeting.AddParticipantAsync(speaker1);
await meeting.AddParticipantAsync(speaker2);
// Join the meeting and start transcribing
await meetingTranscriber.JoinMeetingAsync(meeting);
await meetingTranscriber.StartTranscribingAsync().ConfigureAwait(false);
// Wait for completion, then stop transcription
Task.WaitAny(new[] { stopRecognition.Task });
await meetingTranscriber.StopTranscribingAsync().ConfigureAwait(false);
}
}
}
}
}
Prerequisites
- An Azure subscription. You can create one for free.
- Create a Speech resource in the Azure portal.
- Get the Speech resource key and region. After your Speech resource is deployed, select Go to resource to view and manage keys.
Set up the environment
Before you can do anything, install the Speech SDK for Python. You can install the Speech SDK from PyPI by running pip install azure-cognitiveservices-speech.
Create voice signatures
If you want to enroll user profiles, the first step is to create voice signatures for the meeting participants so that they can be identified as unique speakers. This isn't required if you don't want to use pre-enrolled user profiles to identify specific participants.
The input .wav audio file for creating voice signatures must be 16-bit, 16-kHz sample rate, in single channel (mono) format. The recommended length for each audio sample is between 30 seconds and two minutes. An audio sample that is too short results in reduced accuracy when recognizing the speaker. The .wav file should be a sample of one person's voice so that a unique voice profile is created.
The following example shows how to create a voice signature by using the REST API in Python. You must insert your speech_key, service_region, and the path to a sample .wav file.
import requests
import json
speech_key, service_region = "your-subscription-key", "your-region"
endpoint = f"https://signature.{service_region}.cts.speech.microsoft.com/api/v1/Signature/GenerateVoiceSignatureFromByteArray"
# Enrollment audio for each speaker. In this example, two speaker enrollment audio files are added.
enrollment_audio_speaker1 = "enrollment-audio-speaker1.wav"
enrollment_audio_speaker2 = "enrollment-audio-speaker2.wav"
def voice_data_converter(enrollment_audio):
    with open(enrollment_audio, "rb") as wav_file:
        input_wav = wav_file.read()
    return input_wav

def voice_signature_creator(endpoint, speech_key, enrollment_audio):
    data = voice_data_converter(enrollment_audio)
    headers = {"Ocp-Apim-Subscription-Key": speech_key}
    r = requests.post(url=endpoint, headers=headers, data=data)
    voice_signature_string = json.dumps(r.json()['Signature'])
    return voice_signature_string
voice_signature_user1 = voice_signature_creator(endpoint, speech_key, enrollment_audio_speaker1)
voice_signature_user2 = voice_signature_creator(endpoint, speech_key, enrollment_audio_speaker2)
You can use these two voice signature strings as input to the variables voice_signature_user1 and voice_signature_user2 later in the sample code.
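Before you upload an enrollment sample, you can optionally confirm that it meets the format requirements described earlier. Here's a minimal sketch that uses Python's built-in wave module; the check runs locally and isn't part of the REST call.
import wave

def check_enrollment_format(path):
    # Confirm the enrollment audio is single channel (mono), 16-bit PCM, with a 16-kHz sample rate.
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1, "enrollment audio must be single channel (mono)"
        assert wav.getsampwidth() == 2, "enrollment audio must be 16-bit PCM"
        assert wav.getframerate() == 16000, "enrollment audio must use a 16-kHz sample rate"
        duration = wav.getnframes() / wav.getframerate()
        # The recommended sample length is between 30 seconds and two minutes.
        if not 30 <= duration <= 120:
            print(f"Warning: sample is {duration:.0f} seconds; 30 seconds to two minutes is recommended.")

check_enrollment_format(enrollment_audio_speaker1)
check_enrollment_format(enrollment_audio_speaker2)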
Note
Voice signatures can only be created using the REST API.
Transcribe meetings
The following sample code demonstrates how to transcribe meetings in real-time for two speakers. It assumes that you created voice signature strings for each speaker as shown previously. Substitute real information for speech_key, service_region, and the path meetingfilename for the audio you want to transcribe.
If you don't use pre-enrolled user profiles, the first recognition of unknown users as speaker1, speaker2, and so on takes a few more seconds to complete.
Note
Make sure the same speech_key is used across your application for signature creation, or you will encounter errors.
Here's what the sample does:
- Creates a speech configuration with your subscription information.
- Creates an audio configuration that uses a push stream.
- Creates a MeetingTranscriber and subscribes to the events it fires.
- Generates a meeting identifier and creates the meeting.
- Adds participants to the meeting. The strings voice_signature_user1 and voice_signature_user2 should come as output from the previous steps.
- Reads the whole .wav file at once, streams it to the SDK, and begins transcription (a sketch of chunked, real-time-paced streaming follows the sample).
- If you want to differentiate speakers without providing voice samples, enable the DifferentiateGuestSpeakers feature as described in the Meeting Transcription overview.
If speaker identification or differentiation is enabled, then even if you have already received transcribed results, the service is still evaluating them against the accumulated audio information. If the service finds that any previous result was assigned an incorrect speakerId, then a nearly identical Transcribed result is sent again, where only the speakerId and UtteranceId are different. Because the UtteranceId format is {index}_{speakerId}_{Offset}, when you receive a transcribed result, you can use UtteranceId to determine whether the current transcribed result corrects a previous one. Your client or UI logic can then decide how to handle it, for example by overwriting the previous output or ignoring the latest result.
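For example, one way to handle corrections is to key your displayed results by the index portion of UtteranceId, so that a later result with the same index replaces the earlier one. The following minimal sketch is separate from the sample below; it assumes you've already pulled the UtteranceId, speaker ID, and text out of the transcribed event, so they're passed in here as plain strings.
# Track results by the index field of UtteranceId ({index}_{speakerId}_{Offset})
# so that a corrected result overwrites the one it supersedes.
transcript = {}  # index -> (speaker_id, text)

def apply_transcribed_result(utterance_id, speaker_id, text):
    index = utterance_id.split("_")[0]  # the first field is the utterance index
    if index in transcript and transcript[index] != (speaker_id, text):
        print("Correction for utterance {}: now attributed to {}".format(index, speaker_id))
    transcript[index] = (speaker_id, text)
The full sample code follows.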
import azure.cognitiveservices.speech as speechsdk
import time
import uuid
from scipy.io import wavfile
speech_key, service_region = "your-subscription-key", "your-region"
meetingfilename = "audio-file-to-transcribe.wav"  # 8-channel, 16-bit, 16-kHz audio
def meeting_transcription():
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.set_property_by_name("ConversationTranscriptionInRoomAndOnline", "true")
    # If you want to differentiate speakers without providing voice samples, uncomment the following line.
    # speech_config.set_property_by_name("DifferentiateGuestSpeakers", "true")
    channels = 8
    bits_per_sample = 16
    samples_per_second = 16000
    wave_format = speechsdk.audio.AudioStreamFormat(samples_per_second, bits_per_sample, channels)
    stream = speechsdk.audio.PushAudioInputStream(stream_format=wave_format)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    transcriber = speechsdk.transcription.MeetingTranscriber(audio_config)
    meeting_id = str(uuid.uuid4())
    meeting = speechsdk.transcription.Meeting(speech_config, meeting_id)
    done = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """callback that signals to stop continuous transcription upon receiving an event `evt`"""
        print('CLOSING {}'.format(evt))
        nonlocal done
        done = True

    transcriber.transcribed.connect(lambda evt: print('TRANSCRIBED: {}'.format(evt)))
    transcriber.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    transcriber.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    transcriber.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    # stop continuous transcription on either session stopped or canceled events
    transcriber.session_stopped.connect(stop_cb)
    transcriber.canceled.connect(stop_cb)

    # Note user voice signatures are not required for speaker differentiation.
    # Use voice signatures when adding participants when more enhanced speaker identification is required.
    user1 = speechsdk.transcription.Participant("user1@example.com", "en-us", voice_signature_user1)
    user2 = speechsdk.transcription.Participant("user2@example.com", "en-us", voice_signature_user2)
    meeting.add_participant_async(user1).get()
    meeting.add_participant_async(user2).get()
    transcriber.join_meeting_async(meeting).get()
    transcriber.start_transcribing_async()

    sample_rate, wav_data = wavfile.read(meetingfilename)
    stream.write(wav_data.tobytes())
    stream.close()
    while not done:
        time.sleep(.5)

    transcriber.stop_transcribing_async()
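If you want to approximate real-time pacing instead of pushing the entire file to the stream at once, you could write the audio in timed chunks. The following helper is only a sketch of that design choice; stream and wav_data refer to the push stream and array from the sample above, and the chunk size and sleep interval are illustrative.
import time

def stream_in_realtime(stream, wav_data, samples_per_second=16000, chunk_ms=100):
    # Push the meeting audio in chunk_ms-sized slices, sleeping between writes,
    # instead of calling stream.write(wav_data.tobytes()) once for the whole file.
    frames_per_chunk = samples_per_second * chunk_ms // 1000
    for start in range(0, wav_data.shape[0], frames_per_chunk):
        stream.write(wav_data[start:start + frames_per_chunk].tobytes())
        time.sleep(chunk_ms / 1000)
    stream.close()
Inside meeting_transcription(), you could call stream_in_realtime(stream, wav_data) in place of the stream.write(wav_data.tobytes()) and stream.close() lines.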