How to split sdk.SpeechRecognizer result by speakers in "Azure Speech to Text" using NodeJS SDK ?

Question

Vovkotrub Bohdan 30

I want to transcribe speech to text in this format:

Speaker #1: Hello, I am speaker 1.
Speaker #2: Hello, I am speaker 2.

I use the "microsoft-cognitiveservices-speech-sdk" SDK in NodeJS.

The audio is in "example.wav".

I enabled differentiation of guest speakers:

speechConfig.setProperty("DifferentiateGuestSpeakers", true);

I get a result with text, but "privSpeakerId" is undefined.

What do I need to enable to get the speaker ID?

Accepted answer

romungi-MSFT 48,911 Microsoft Employee Moderator

Vovkotrub Bohdan, I think you are setting this property without configuring conversation transcription. Here is a similar issue about using this property when you do not have voice profiles for users but still want speaker differentiation.

There is a quickstart on setting up conversation transcription with this config, so you can recognize speakers with or without enrolling them. The quickstart snippet uses SpeechTranslationConfig(), which is incorrect; it should be SpeechConfig().

I hope this helps. Thanks!!

    var speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, region);
    var audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync(filepath));
    speechConfig.setProperty("ConversationTranscriptionInRoomAndOnline", "true");

    // en-us by default. Adding this code to specify other languages, like zh-cn.
    speechConfig.speechRecognitionLanguage = "en-US";
    speechConfig.setProperty("DifferentiateGuestSpeakers", true);
    
    // create conversation and transcriber
    var conversation = sdk.Conversation.createConversationAsync(speechConfig, "myConversation");
    var transcriber = new sdk.ConversationTranscriber(audioConfig);
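Once the transcriber reports results together with a speaker ID, turning them into the "Speaker #1: ..." layout from the question is a separate step that needs no SDK calls. Below is a minimal sketch of that step only. It assumes each transcribed result has already been reduced to a plain object with `speakerId` and `text` fields; that object shape, the `formatBySpeaker` helper name, and the sample speaker IDs are illustrative and not part of the SDK.

```javascript
// Hypothetical helper: labels speakers in order of first appearance and
// merges consecutive utterances from the same speaker into one line.
// Assumes each item looks like { speakerId: string, text: string }.
function formatBySpeaker(utterances) {
  const labels = new Map(); // service speaker ID -> "Speaker #N"
  const lines = [];
  let lastId = null;
  for (const { speakerId, text } of utterances) {
    if (!text) continue; // skip empty results
    const id = speakerId || "Unknown"; // results without a speaker ID share one label
    if (!labels.has(id)) labels.set(id, `Speaker #${labels.size + 1}`);
    if (id === lastId) {
      lines[lines.length - 1] += " " + text; // same speaker keeps talking
    } else {
      lines.push(`${labels.get(id)}: ${text}`);
      lastId = id;
    }
  }
  return lines.join("\n");
}

// Made-up sample data standing in for collected transcription results.
const sample = [
  { speakerId: "Guest_0", text: "Hello, I am speaker 1." },
  { speakerId: "Guest_1", text: "Hello, I am speaker 2." },
];
console.log(formatBySpeaker(sample));
```

In a real script, the objects would be collected inside the `transcribed` event handler and formatted after transcription stops.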

Vovkotrub Bohdan 30

Should I use a transcriber instead of a recognizer?

I have an example.wav file (PCM, 16 kHz, mono). The audio contains the speech of 4 people. I need to transcribe this .wav and split the text by speaker. I don't need to identify the speakers by name or verify them.

I tried this in NodeJS:

    const sdk = require("microsoft-cognitiveservices-speech-sdk");
    const fs = require("fs");

    const audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("example.wav"));
    const speechConfig = sdk.SpeechConfig.fromSubscription(<my_

    speechConfig.setProperty("DifferentiateGuestSpeakers", true);
    speechConfig.speechRecognitionLanguage = "uk-UA";

    // create conversation and transcriber
    const conversation = sdk.Conversation.createConversationAsync(speechConfig, "myConversation");
    const transcriber = new sdk.ConversationTranscriber(audioConfig);
    // join a conversation
    transcriber.joinConversationAsync(conversation);

    // Add the event listener for the realtime events
    transcriber.transcribed = (o, e) => {
        console.log(o);
        console.dir(e);
    };

    transcriber.startTranscribingAsync();

The script runs for about 5 minutes (example.wav is also 5 minutes long), but it produces no output. As far as I understand, the transcriber.transcribed event is never triggered. How can I transcribe my "example.wav" with 4 speakers in NodeJS?

romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2023-03-01T15:50:28.11+00:00

I understand you are using a mono-channel audio file, which could be why this is not working, since you are trying to identify up to 4 speakers.

Conversation transcription should work for all speech-to-text languages in the following regions: centralus, eastasia, eastus, westeurope. Is your resource created in one of these regions?

If you are using single-channel audio, you could use the batch transcription API, but that only supports diarization for 2 speakers. I would also recommend trying Speech Studio to test this feature and check whether it works with single-channel audio for your file.
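For the batch transcription route, a request with diarization turned on might be assembled as in the sketch below. This is only an illustration under stated assumptions: the v3.1 REST path, the `diarizationEnabled` property, and the requirement that the audio be reachable at a URL should all be checked against the current batch transcription documentation. The `buildBatchRequest` helper, the placeholder key, and the example audio URL are hypothetical.

```javascript
// Builds (but does not send) a batch transcription request with diarization.
// Assumptions: v3.1 REST endpoint, `diarizationEnabled` property, and an
// audio file the service can fetch from a URL. Verify against current docs.
function buildBatchRequest(region, subscriptionKey, audioUrl, locale) {
  return {
    url: `https://${region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions`,
    options: {
      method: "POST",
      headers: {
        "Ocp-Apim-Subscription-Key": subscriptionKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        contentUrls: [audioUrl], // placeholder URL, not a real file
        locale,
        displayName: "example.wav diarization",
        properties: { diarizationEnabled: true },
      }),
    },
  };
}

const req = buildBatchRequest(
  "westeurope",
  "<your-key>", // placeholder, replace with a real subscription key
  "https://example.com/example.wav",
  "uk-UA"
);
console.log(req.url);
```

The request could then be submitted with `fetch(req.url, req.options)` on Node 18 or later; the transcription itself runs asynchronously, so the result has to be retrieved in a later call.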
Vovkotrub Bohdan 30 Reputation points

2023-03-01T21:25:39.5366667+00:00

Yes, I use the westeurope region.

Captioning does not show me which text belongs to "Human #1" or "Human #2". This is not quite what I need. If I understand you correctly, I cannot split the recognized speech by speaker, as I can, for example, in Google Speech-to-Text? In Google, I only need to set the minimum and maximum speaker count, and then I receive something like: "human#1: Hi!; human#2: Hello too!".
Vovkotrub Bohdan 30 Reputation points

2023-03-01T21:40:48.5233333+00:00

Does Captioning need an Internet connection? How does the price differ from the usual "sdk.SpeechRecognizer(speechConfig, audioConfig);"?
romungi-MSFT 48,911 Reputation points Microsoft Employee Moderator

2023-03-02T05:06:39.5866667+00:00

Apologies for the confusion; I meant for you to try out the call transcription and analytics feature. This should list the speakers in the transcript, but there is a limit on the number of speakers it can identify on a single channel.

Yes, the service does need a connection to the cloud API. If you want to do this offline, you can use disconnected containers for Cognitive Services, which include a speech-to-text API container. This will require you to move to a commitment tier for pricing, which is charged annually.
Vovkotrub Bohdan 30 Reputation points

2023-03-04T12:11:10.49+00:00

Thank you!
