Azure Cognitive Services Speech: How to speed up the display of intermediate speech translation results when using system audio?

Hirai, Tetu 40 Reputation points
2024-09-06T18:24:05.4233333+00:00

I was able to get C# code working that does speech translation for both microphone audio and system audio via Azure Cognitive Services. (By “system audio,” I mean, for example, the voices of remote Web meeting participants or the audio output from audio files played locally on the PC).

 

For microphone audio, intermediate speech translation results appear quickly (within a few seconds after the utterance in the source language starts), and I was able to set this up very easily with event handlers (using code similar to here). For system audio, however, the intermediate speech translation results appear much more slowly: they start only after the source utterance is finished.

 

Questions:

1.      Is it possible to set up Azure API event handlers to do speech translation of system audio (NOT microphone audio) without needing to save the audio to WAV files (or a stream) and then running speech translation on those WAV files (or that stream)?

The documentation here seems to indicate that this is not possible at present, but I would like to confirm.

 

I’m currently doing speech translation of system audio via code similar to the following:


    using (var audioInput = AudioConfig.FromWavFileInput(curAudioFileForSpeechRecognitionProcessing))
    {
        using (var recognizerFromSystemAudio = new TranslationRecognizer(config, autoDetectSourceLanguageConfig, audioInput))
        {
            ...
            recognizerFromSystemAudio.Recognizing += (s, e) =>
            {
                var lidResult = e.Result.Properties.GetProperty(PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);
                ...

 

I’d like to be able to set up speech translation of system audio more simply, via code similar to the following (which I’m already using for microphone audio), and have the intermediate speech translation results displayed quickly, even before the utterance is finished.


    using (var recognizerFromMicrophone = new TranslationRecognizer(config, autoDetectSourceLanguageConfig))
    {
        ...
        recognizerFromMicrophone.Recognizing += (s, e) =>
        {
            var lidResult = e.Result.Properties.GetProperty(PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);
            ...
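For reference, the microphone path above can be sketched end to end as follows. This is a sketch only, not runnable without an Azure subscription: it assumes `config` is a valid `SpeechTranslationConfig` (key, region, and target languages already set) and `autoDetectSourceLanguageConfig` is an `AutoDetectSourceLanguageConfig`; the printing logic is illustrative.

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Translation;

using (var recognizerFromMicrophone = new TranslationRecognizer(config, autoDetectSourceLanguageConfig))
{
    recognizerFromMicrophone.Recognizing += (s, e) =>
    {
        // Fires repeatedly while the utterance is still in progress —
        // this is what makes the intermediate results appear quickly.
        var lidResult = e.Result.Properties.GetProperty(
            PropertyId.SpeechServiceConnection_AutoDetectSourceLanguageResult);
        foreach (var translation in e.Result.Translations)
        {
            Console.WriteLine($"[partial, {lidResult}] {translation.Key}: {translation.Value}");
        }
    };

    recognizerFromMicrophone.Recognized += (s, e) =>
    {
        // Fires once per utterance with the final result.
        if (e.Result.Reason == ResultReason.TranslatedSpeech)
        {
            foreach (var translation in e.Result.Translations)
            {
                Console.WriteLine($"[final] {translation.Key}: {translation.Value}");
            }
        }
    };

    await recognizerFromMicrophone.StartContinuousRecognitionAsync();
    // ... keep the recognizer alive while speech arrives, then:
    await recognizerFromMicrophone.StopContinuousRecognitionAsync();
}
```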

 

2.      If setting up speech translation of system audio via event handlers (no reading from WAV files or stream) is not possible, do you have any ideas on how I can speed up the display of intermediate speech translation results for system audio?

 

Ideally, I would NOT need to do the following, but the Azure API would handle this automatically (just like for microphone audio):

a.      Record the first half of the speaker's sentence.

b.      Run speech translation on this audio fragment and display the intermediate speech translation result.

c.      Record the remainder of the sentence.

d.      Run speech translation on the audio of the complete sentence and display the final result.

 

NOTE: To reduce latency, I need to do both the speech-to-text and machine translation in one call to the Azure server (i.e., NOT do speech-to-text and machine translation in two separate calls).
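This single-call behavior is what `SpeechTranslationConfig` provides: adding target languages to the translation config makes each recognition result carry both the recognized source text and its translations, so speech-to-text and machine translation are not two separate service calls. A minimal configuration sketch, with the key and region as placeholders:

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Translation;

// Placeholders — substitute your own Speech resource key and region.
var config = SpeechTranslationConfig.FromSubscription("<your-key>", "<your-region>");
config.SpeechRecognitionLanguage = "en-US";  // or use auto language detection instead
config.AddTargetLanguage("ja");              // translations arrive in the same result
config.AddTargetLanguage("de");
// Each Recognizing/Recognized event then carries e.Result.Text (source text)
// and e.Result.Translations["ja"], e.Result.Translations["de"] together.
```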

 

Environment I’m using:

·       Windows 10 (Version 22H2 (OS Build 19045.4529))

·       Microsoft .NET Framework (Version 4.8.04084)

·       Microsoft Visual Studio Professional 2019 (Version 16.11.35)


Accepted answer
  1. Sina Salam 11,206 Reputation points
    2024-09-06T19:17:11.45+00:00

    Hello Hirai, Tetu,

Welcome to Microsoft Q&A, and thank you for posting your questions here. You have done a really great job. Well done.

    Regarding your question:

    Is it possible to set up Azure API event handlers to do speech translation of system audio (NOT microphone audio) without needing to save audio to WAV files (or to a stream) and then doing speech translation on those WAV files (or stream)?

No — the Azure Speech SDK requires audio input to be provided through one of its supported sources, such as the default microphone, a WAV file, or a stream, when it comes to speech translation; there is no built-in input source for system (loopback) audio.

    The documentation here seems to indicate that this is not possible at the present time but would like to confirm.

    The documentation and current API capabilities suggest that you cannot directly handle system audio or perform real-time speech translation without first capturing the audio into a compatible format.

    Meanwhile, I like difficult tasks. What if we make the impossible possible by developing our own layer on top of the Azure API, or integrating additional components to handle audio capture and format conversion before passing the audio to the Azure Speech API? This could be a long-term project, and I am not promising it will be done soon, but it is achievable. I will start working on it.
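One practical shape such "additional components" could take today, without writing intermediate WAV files, is to capture system audio from a WASAPI loopback device and feed it to the Speech SDK through a `PushAudioInputStream`; the SDK then consumes the audio continuously, so `Recognizing` events fire while the utterance is still in progress. The sketch below assumes the third-party NAudio package for loopback capture and deliberately glosses over resampling: loopback audio is typically 44.1/48 kHz stereo float, while the push stream here is declared as 16 kHz, 16-bit mono PCM, so a real implementation needs a format conversion step inside `DataAvailable`. `config` and `autoDetectSourceLanguageConfig` are assumed to be set up as in the question.

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;
using NAudio.Wave; // third-party package, assumed, for WASAPI loopback capture

// Declare the format we will push: 16 kHz, 16-bit, mono PCM.
var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
using var pushStream = AudioInputStream.CreatePushStream(format);
using var audioInput = AudioConfig.FromStreamInput(pushStream);

// Capture whatever the PC is playing (Web meeting voices, local audio files, ...).
using var capture = new WasapiLoopbackCapture();
capture.DataAvailable += (s, e) =>
{
    // NOTE: real code must first convert capture.WaveFormat (usually
    // 48 kHz stereo IEEE float) to 16 kHz 16-bit mono PCM before pushing.
    byte[] buffer = new byte[e.BytesRecorded];
    Array.Copy(e.Buffer, buffer, e.BytesRecorded);
    pushStream.Write(buffer);
};
capture.StartRecording();

using var recognizerFromSystemAudio =
    new TranslationRecognizer(config, autoDetectSourceLanguageConfig, audioInput);
recognizerFromSystemAudio.Recognizing += (s, e) =>
{
    // Fires with partial translations while audio is still being pushed,
    // giving the fast intermediate display the question asks about.
};
await recognizerFromSystemAudio.StartContinuousRecognitionAsync();
```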

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread here by upvoting and accepting this as the answer if it is helpful.


1 additional answer

  1. Hirai, Tetu 40 Reputation points
    2024-09-06T20:10:45.62+00:00

    @Sina Salam Thank you very much for your quick response and your answers! I very much appreciate your generous offer to develop your own layer, or to integrate additional components to handle audio capture and format conversion, before passing the audio to the Azure Speech API. And yes, I understand this might be a long-term project and that you're not promising it will be done soon.

