How to get speaker identification in speech translation code (using MS Cognitive Services)?

Mitch Clark 20 Reputation points
2024-04-10T22:40:40.86+00:00

I want to perform speaker identification in speech translation code (using MS Cognitive Services), similar to the speech transcription code below (i.e., by accessing the SpeakerId property):


    conversationTranscriber.Transcribed += (s, e) =>
    {
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"TRANSCRIBED: Text={e.Result.Text} Speaker ID={e.Result.SpeakerId}");
        }
        else if (e.Result.Reason == ResultReason.NoMatch)
        {
            Console.WriteLine($"NOMATCH: Speech could not be transcribed.");
        }
    };


https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=windows&pivots=programming-language-csharp

 

My current code does speech translation in a way similar to:

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speech-translation?tabs=windows%2Cterminal&pivots=programming-language-csharp

Is there a way of modifying the code in the speech translation quickstart above to get the SpeakerId property (with only one call to MS Azure)? Or, is there an alternative way of achieving this with only one call to MS Azure?

 

NOTE: I would prefer to avoid making two calls from my code to MS Azure, i.e., first transcribing the speech (and getting the data in the SpeakerId property) and then making a second call to MS Azure to machine translate the transcribed speech. I'm developing a real-time app, so making two calls to MS Azure would likely be inefficient (i.e., I want to translate speech and identify the speaker with a single call to MS Azure).

NOTE: I did ask the virtual assistant (Q&A Assist) and got the following answer, but would like to check with a human professional:

"Unfortunately, it is not possible to get speaker identification in speech translation code using only one call to MS Cognitive Services. Speaker identification is only available in the speech-to-text transcription service, which is a separate service from the speech translation service. Therefore, you would need to make two separate calls to the MS Cognitive Services API to achieve both speaker identification and speech translation."

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Accepted answer
  1. santoshkc 4,435 Reputation points Microsoft Vendor
    2024-04-11T05:43:24.6333333+00:00

    Hi @Mitch Clark,

    Thank you for reaching out to the Microsoft Q&A forum!

    Speaker identification is only available in the speech-to-text transcription service, not in the speech translation service. Therefore, you would need to make two separate calls to the MS Cognitive Services API to achieve both speaker identification and speech translation.

    One approach you could consider is to use the speech-to-text transcription service to get the SpeakerId property, and then pass the transcribed text to the translation service. This still requires two separate calls to the API, but it lets you identify the speaker in real time without having to transcribe the speech a second time.
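    For reference, the two-call pipeline described above could be sketched roughly as follows. This is a minimal sketch, not a verified implementation: the key names, region, and target language ("fr") are placeholders, the Translator v3.0 response is returned as raw JSON rather than parsed, and error handling is omitted.

    ```csharp
    // Sketch: call 1 = diarized transcription (ConversationTranscriber, gives SpeakerId);
    // call 2 = text translation (Translator text REST API).
    // "YOUR_SPEECH_KEY", "YOUR_TRANSLATOR_KEY", and "YOUR_REGION" are placeholders.
    using System;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.CognitiveServices.Speech;
    using Microsoft.CognitiveServices.Speech.Audio;
    using Microsoft.CognitiveServices.Speech.Transcription;

    class Program
    {
        static readonly HttpClient http = new HttpClient();

        static async Task Main()
        {
            var speechConfig = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "YOUR_REGION");
            speechConfig.SpeechRecognitionLanguage = "en-US";

            using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
            using var transcriber = new ConversationTranscriber(speechConfig, audioConfig);

            transcriber.Transcribed += async (s, e) =>
            {
                if (e.Result.Reason == ResultReason.RecognizedSpeech && e.Result.Text.Length > 0)
                {
                    // Call 1 produced the text plus SpeakerId; call 2 translates the text.
                    string translation = await TranslateAsync(e.Result.Text, "fr");
                    Console.WriteLine($"Speaker {e.Result.SpeakerId}: {translation}");
                }
            };

            await transcriber.StartTranscribingAsync();
            Console.ReadLine();   // transcribe until Enter is pressed
            await transcriber.StopTranscribingAsync();
        }

        // Minimal call to the Translator text REST API (v3.0); returns the raw JSON
        // response, e.g. [{"translations":[{"text":"...","to":"fr"}]}].
        static async Task<string> TranslateAsync(string text, string toLanguage)
        {
            var uri = $"https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to={toLanguage}";
            using var request = new HttpRequestMessage(HttpMethod.Post, uri);
            request.Headers.Add("Ocp-Apim-Subscription-Key", "YOUR_TRANSLATOR_KEY");
            request.Headers.Add("Ocp-Apim-Subscription-Region", "YOUR_REGION");
            request.Content = new StringContent(
                "[{\"Text\":\"" + text.Replace("\"", "\\\"") + "\"}]",
                Encoding.UTF8, "application/json");
            var response = await http.SendAsync(request);
            return await response.Content.ReadAsStringAsync();
        }
    }
    ```

    Since the Transcribed handler fires per utterance, each utterance incurs one translation request; in a latency-sensitive app you may want to batch or pipeline these requests.
    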

    Hope this helps. If you have any further queries, please let us know.


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.


1 additional answer
  1. Mitch Clark 20 Reputation points
    2024-04-12T18:30:21.6033333+00:00

    Hi @santoshkc ,

    Thank you very much for your detailed and prompt answer to my question!

    Okay, I understand that I'll need to make one call to the MS Cognitive Services API to transcribe each utterance (to get the SpeakerId property), and then make a second call to machine translate the results.
