Mix channels in speech-to-text transcription?

Eric Schoen 21 Reputation points
2022-10-12T16:56:15.64+00:00

I'm trying to transcribe videos containing single speakers but recorded in two-channel format (using https://eastus.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions).
In the results, I'm seeing slightly different (±10 ms) timestamps for the same utterances, and the recognized phrases for corresponding utterances aren't precisely the same between one channel and the other; neither is 100% correct. I hate to throw away one channel arbitrarily. Is there a way to make the transcription process "mix" the channels to produce a single-channel result that's better than either individual channel, to simplify producing the highest-accuracy, lowest-duplication transcripts possible from the videos?

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

1 answer

  1. Ramr-msft 17,621 Reputation points
    2022-10-13T10:56:37.173+00:00

    @Eric Schoen Thanks for the question. You can get a result from single-channel audio by adding the three lines below to the current SDK sample code in the public documentation; for multi-channel audio, leave them out.

    // Add these lines to support single-channel audio
    var connection = Connection.FromRecognizer(conversationTranscriber);
    connection.SetMessageProperty("speech.config", "DisableReferenceChannel", $"\"True\"");
    connection.SetMessageProperty("speech.config", "MicSpec", $"\"1_0_0\"");
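
    Alternatively, if you'd rather combine the channels yourself before submitting the audio, you can average the left and right samples into a single mono file client-side and transcribe that. A minimal sketch in Python using only the standard library (this is an illustrative workaround, not part of the Speech service; it assumes 16-bit PCM stereo WAV input, and the file names are placeholders):

    ```python
    import struct
    import wave

    def downmix_to_mono(stereo_path: str, mono_path: str) -> None:
        """Average the two channels of a 16-bit PCM stereo WAV into one mono WAV."""
        with wave.open(stereo_path, "rb") as src:
            if src.getnchannels() != 2 or src.getsampwidth() != 2:
                raise ValueError("expected 16-bit PCM stereo input")
            rate = src.getframerate()
            frames = src.readframes(src.getnframes())

        # Interleaved samples: L0, R0, L1, R1, ...
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
        # Average each left/right pair; the mean of two int16 values
        # always fits in int16, so the mix cannot clip.
        mixed = [(samples[i] + samples[i + 1]) // 2
                 for i in range(0, len(samples), 2)]

        with wave.open(mono_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(struct.pack("<%dh" % len(mixed), *mixed))
    ```

    Note that averaging is a blind mix: if one channel is noisier, it drags the mix down rather than improving it, so it's worth comparing the mixed-file transcript against the better single-channel transcript before standardizing on this.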
    