Mix channels in speech-to-text transcription?

Eric Schoen 21 Reputation points
2022-10-12T16:56:15.64+00:00

I'm trying to transcribe videos containing single speakers but recorded in two-channel format (using https://eastus.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions).
In the results, I'm seeing slightly different (±10 ms) timestamps for the same utterances, and the recognized phrases for corresponding utterances aren't precisely the same between one channel and the other; neither is 100% correct. I hate to throw away one channel arbitrarily. Is there a way to make the transcription process "mix" the channels to produce a single-channel result that's better than either individual channel, to simplify producing the highest-accuracy, lowest-duplication transcripts possible from the videos?

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. Ramr-msft 17,621 Reputation points
    2022-10-13T10:56:37.173+00:00

    @Eric Schoen Thanks for the question. You can get a result from single-channel audio by adding the three lines below to the current SDK sample code in the public documentation; for multi-channel audio, leave them out.

    // Add these lines to support single-channel audio
    var connection = Connection.FromRecognizer(conversationTranscriber);
    connection.SetMessageProperty("speech.config", "DisableReferenceChannel", $"\"True\"");
    connection.SetMessageProperty("speech.config", "MicSpec", $"\"1_0_0\"");
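
    Alternatively, if you'd rather combine the channels yourself before submitting the audio, you can average the left and right samples into a single mono file client-side and transcribe that. A minimal sketch in Python using only the standard library (this is an illustrative workaround, not part of the Speech service; it assumes 16-bit PCM stereo WAV input, and the file names are placeholders):

    ```python
    import struct
    import wave

    def downmix_to_mono(stereo_path: str, mono_path: str) -> None:
        """Average the two channels of a 16-bit PCM stereo WAV into one mono WAV."""
        with wave.open(stereo_path, "rb") as src:
            if src.getnchannels() != 2 or src.getsampwidth() != 2:
                raise ValueError("expected 16-bit PCM stereo input")
            rate = src.getframerate()
            frames = src.readframes(src.getnframes())

        # Interleaved samples: L0, R0, L1, R1, ...
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
        # Average each left/right pair; the mean of two int16 values
        # always fits in int16, so the mix cannot clip.
        mixed = [(samples[i] + samples[i + 1]) // 2
                 for i in range(0, len(samples), 2)]

        with wave.open(mono_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(struct.pack("<%dh" % len(mixed), *mixed))
    ```

    Note that averaging is a blind mix: if one channel is noisier, it drags the mix down rather than improving it, so it's worth comparing the mixed-file transcript against the better single-channel transcript before standardizing on this.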
    