Cognitive Services Speech to Text with Teams Unmixed Audio Buffer

Ali B 0 Reputation points
2024-01-20T23:19:52.0166667+00:00

Hi,

So I'm trying to create an STT solution that's part of a Teams bot. The bot takes part in the meeting and listens to the participants' audio. The audio is set to Unmixed = true so that each speaker gets their own channel. The solution is in C#.
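
For context, the unmixed audio is enabled on the bot's audio socket when the media session is created; a minimal sketch (simplified, using the AudioSocketSettings type from Microsoft.Skype.Bots.Media as in the Teams recording-bot samples):

// Sketch only: enabling unmixed, per-speaker audio on the bot's audio socket.
// AudioSocketSettings / ReceiveUnmixedMeetingAudio come from Microsoft.Skype.Bots.Media.
var audioSocketSettings = new AudioSocketSettings
{
    StreamDirections = StreamDirection.Recvonly,
    SupportedAudioFormat = AudioFormat.Pcm16K,
    ReceiveUnmixedMeetingAudio = true, // each active speaker arrives in its own UnmixedAudioBuffer
};
// These settings are then passed to CreateMediaSession when the bot joins the call.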

I receive the separate audio buffers in real time and send them to a Cognitive Services class to recognize the speech. Every participant gets their own Cognitive Services recognizer class instance.

My problem is that the audio buffers get to the recognizer, and I can see the Recognizing event fire consistently. However, the final Recognized event only fires sporadically, even after long periods of silence. I'm not sure where the issue is. Is there a way to set a silence threshold (noise level, not time-based)? Any help is greatly appreciated.

Here's where the audio is sent to langServices:

if (audioFrame.UnmixedAudioBuffers != null)
{
    var tasks = new List<Task>();
    foreach (var buffer in audioFrame.UnmixedAudioBuffers)
    {
        var length = buffer.Length;
        var data = new byte[length];
        Marshal.Copy(buffer.Data, data, 0, (int)length);

        var participant = CallHandler.GetParticipantFromMSI(this.callHandler.Call, buffer.ActiveSpeakerId);
        var identity = CallHandler.TryGetParticipantIdentity(participant);
        if (identity != null)
        {
            buffers.Add(identity.Id, (new AudioBuffer(data, audioFormat), audioFrameTimestamp));

            // send to Cognitive Services for transcription
            if (!langServices.ContainsKey(identity))
            {
                langServices.Add(identity, new CognitiveServicesService(identity, this.callHandler.botConfiguration, logger, this.callHandler.Call.Id));
            }

            tasks.Add(langServices[identity].AppendAudioBuffer(data));

            // try a new instance every time!
            //var c = new CognitiveServicesService(identity, this.callHandler.botConfiguration, logger, this.callHandler.Call.Id);
            //tasks.Add(c.AppendAudioBuffer(data));
        }
        else
        {
            this.logger.Warn($"Couldn't find participant for ActiveSpeakerId: {buffer.ActiveSpeakerId}");
        }
    }
    await Task.WhenAll(tasks);
}

Here are the relevant snippets from the CognitiveServicesService class:

public CognitiveServicesService(Identity identity, BotConfiguration settings, IGraphLogger logger, string callId)
{
    _logger = logger;
    _callId = callId;
    _identity = identity;

    _speechConfig = SpeechConfig.FromSubscription(settings.SpeechConfigKey, settings.SpeechConfigRegion);
    _speechConfig.SpeechSynthesisLanguage = settings.BotLanguage;
    _speechConfig.SpeechRecognitionLanguage = settings.BotLanguage;

    //_speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "1000");
    //_speechConfig.SetProperty(PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs, "1000");

    var audioConfig = AudioConfig.FromStreamOutput(_audioOutputStream);
}
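
Not shown above: _audioInputStream is the push stream the recognizer reads from. A minimal sketch of how it is created, assuming the Teams audio socket delivers 16 kHz, 16-bit, mono PCM:

// Sketch only: push audio input stream the recognizer pulls from.
// The wave format (16 kHz, 16-bit, mono PCM) is an assumption based on the Teams audio socket.
_audioInputStream = AudioInputStream.CreatePushStream(
    AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1));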

public async Task AppendAudioBuffer(byte[] audioBuffer)
{
    //RealtimeTranscriptionHelper.TranscribeAsync(audioBuffer, _speechConfig, _logger);
    if (!_isRunning)
    {
        Start();
        await ProcessSpeech();
    }

    try
    {
        _audioInputStream.Write(audioBuffer);
    }
    catch (Exception e)
    {
        _logger.Log(System.Diagnostics.TraceLevel.Info, e, "Exception happened writing to input stream");
    }
}

private async Task ProcessSpeech()
{
    try
    {
        var stopRecognition = new TaskCompletionSource<int>();

        using (var audioInput = AudioConfig.FromStreamInput(_audioInputStream))
        {
            if (_recognizer == null)
            {
                _logger.Log(System.Diagnostics.TraceLevel.Info, "init recognizer");
                _recognizer = new SpeechRecognizer(_speechConfig, audioInput);
            }
        }

        _recognizer.SpeechStartDetected += async (s, e) =>
        {
            Console.WriteLine($"Speech Start Detected. Offset: {e.Offset}");
        };

        _recognizer.SpeechEndDetected += async (s, e) =>
        {
            Console.WriteLine($"Speech End Detected. Offset: {e.Offset}");
        };

        _recognizer.Recognizing += (s, e) =>
        {
            string msg = $"RECOGNIZING: Text={e.Result.Text}";
            _logger.Log(System.Diagnostics.TraceLevel.Info, msg);
            Console.WriteLine(msg);
        };

        _recognizer.Recognized += async (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                if (string.IsNullOrEmpty(e.Result.Text))
                    return;

                // We recognized the speech
                var msg = $"'timestamp': '{DateTime.Now}', 'speaker': '{_identity.DisplayName}', 'text': '{e.Result.Text}'";

                Console.WriteLine(msg);
                _logger.Log(System.Diagnostics.TraceLevel.Info, $"***Recognized***: {msg}");
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                _logger.Log(System.Diagnostics.TraceLevel.Info, $"NOMATCH: Speech could not be recognized.");
            }
        };

        _recognizer.Canceled += (s, e) =>
        {
            _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: Reason={e.Reason}");

            if (e.Reason == CancellationReason.Error)
            {
                _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: ErrorCode={e.ErrorCode}");
                _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: ErrorDetails={e.ErrorDetails}");
                _logger.Log(System.Diagnostics.TraceLevel.Info, $"CANCELED: Did you update the subscription info?");
            }

            stopRecognition.TrySetResult(0);
        };

        _recognizer.SessionStarted += async (s, e) =>
        {
            _logger.Log(System.Diagnostics.TraceLevel.Info, "\nSession started event.");
        };

        _recognizer.SessionStopped += (s, e) =>
        {
            _logger.Log(System.Diagnostics.TraceLevel.Info, "\nSession stopped event.");
            _logger.Log(System.Diagnostics.TraceLevel.Info, "\nStop recognition.");
            stopRecognition.TrySetResult(0);
        };

        // Starts continuous recognition. Use StopContinuousRecognitionAsync() to stop recognition.
        await _recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

        // Waits for completion.
        // Use Task.WaitAny to keep the task rooted.
        Task.WaitAny(new[] { stopRecognition.Task });

        // Stops recognition.
        await _recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);
    }
    catch (ObjectDisposedException ex)
    {
        _logger.Log(System.Diagnostics.TraceLevel.Error, ex, "The queue processing task object has been disposed.");
    }
    catch (Exception ex)
    {
        // Catch all other exceptions and log
        _logger.Log(System.Diagnostics.TraceLevel.Error, ex, "Caught Exception");
    }
}



1 answer

Amira Bedhiafi 20,176 Reputation points
2024-01-21T21:37:41.7733333+00:00

The Azure Speech SDK does not offer noise reduction: it takes the audio from the chosen source (file, microphone, or stream) in unmodified form and passes it to the service for processing.

However, one workaround is to replace the device microphone with an external one that does not pick up as much ambient sound, or to use the Speech Devices SDK, which targets hardware with built-in noise suppression and echo cancellation.

https://learn.microsoft.com/en-us/azure/ai-services/Speech-Service/speech-devices

1 person found this answer helpful.