Using WAV File Input with SR Engines (SAPI 5.3)

Article
05/21/2012

Microsoft Speech API 5.3

Using WAV File Input with SR Engines

Overview

This document is intended to help developers of speech recognition (SR) applications use the Microsoft Speech API's speech recognition and audio APIs to connect a wav file with an SR engine. The topics covered include:

Typical file input scenario
In-process (InProc) versus shared engines
Relevant to setting up and using wav files as input to the speech recognizer
Sample source code (written in both C++ and Visual Basic 6.0) to help guide developers.

Typical file input scenario

There are many different types of audio input configurations used by SR applications, which include:

A microphone shared by all desktop applications
A telephony card communicating with one or more SR engines
Sending audio from a persisted wav file to an SR engine

The shared desktop microphone scenario uses the default SR engine and the default audio input. The user selects each in Speech properties in Control Panel, and each is hosted in the shared speech server.

The telephony scenario can use either the SAPI 5 standard multimedia audio input object or a custom audio object combined with an InProc SR engine.

The wav file input scenario is special because it uses controlled, reproducible audio input and requires a dedicated SR engine, without interference from other applications (e.g., a shared desktop microphone). The file input scenario should use a generic SAPI audio stream connected to the input wav file and an InProc SR engine.

Typical scenarios that would use the wav file audio input configuration include:

Offline transcription applications (e.g., convert voice mail to email)
SR engine testing (e.g., measure and improve engine accuracy with reproducible audio input data)
SR application testing (e.g., verify and improve application behavior when responding to reproducible voice commands)

Follow these basic steps to perform SR on a wav file:

Create and configure basic SAPI audio stream object for wav file input
Create an InProc SR engine using the code samples in this document
Set the audio stream object from step 1 as the SR engine's input
Activate grammars and begin SR
Respond to recognition events until end of audio stream is reached

Relevant wav audio file input APIs for COM/C/C++ Developers:

SpStream object, ISpStream interface: Basic SAPI audio stream
ISpStream::BindToFile: Set up audio stream for wav file input
SpBindToFile: Helper function to setup stream with a wav file
SpInprocRecognizer, ISpRecognizer: InProc SR engine
ISpRecognizer::SetInput: Set stream object as engine's input
SPEI_START_SR_STREAM, SPEI_END_SR_STREAM: Event signaling engine has reached the beginning or the end of the wav file, respectively

Relevant wav audio file input APIs for Automation/Visual Basic/Scripting Developers:

SpFileStream object: Basic file-based SAPI audio stream
SpInprocRecognizer, ISpeechRecognizer: InProc SR engine
SpInprocRecoContext, ISpeechRecoContext: InProc SR context
ISpeechRecognizer::AudioInputStream property: Set file stream object as engine's input
ISpeechRecoContext::EndStream/StartStream events

Wav audio file input outcome specific to SAPI

Finite-length audio input stream

Unlike microphone input which has no predetermined stream length, a finite-length audio input stream is a file which has a specific length that is known before recognition begins. Similarly, applications that use microphone input will toggle between actively listening and not listening states until the speech application is closed. However, transcription applications are typically designed to listen to one continuous audio stream, and then close when the stream ends. Consequently, the application must specifically acknowledge the end audio stream event (SPEI_SR_END_STREAM for C/C++, ISpeechRecoContext::EndStream event for Automation). Transcription applications can potentially record multiple recognitions on a single audio stream, if the speaker pauses or breaks between sections of audio. If the transcription application exits after the first recognition event is received, it will miss any further recognizable audio that remains.

Non-real-time audio input

Microphone input and networked audio streams are typically real-time audio objects. This means that the audio object is designed to support audio buffering and dynamic state manipulation (e.g. stop->play->pause->play->stop) to handle delays and latency in the audio source and/or the SR engine's processing.

Sample wav audio file input source code

COM/C++ Developers

C-style is very similar to C++ and COM

  {
   CComPtr<ISpStream> cpInputStream;
   CComPtr<ISpRecognizer> cpRecognizer;
   CComPtr<ISpRecoContext> cpRecoContext;
   CComPtr<ISpRecoGrammar> cpRecoGrammar;

   // Create basic SAPI stream object
   // NOTE: The helper SpBindToFile can be used to perform the following operations
   hr = cpInputStream.CoCreateInstance(CLSID_SpStream);
   // Check hr
   CSpStreamFormat sInputFormat;
   // generate WaveFormatEx structure, assuming the wav format is 22kHz, 16-bit, Stereo
   hr = sInputFormat.AssignFormat(SPSF_22kHz16BitStereo);
   // Check hr

   // setup stream object with wav file MY_WAVE_AUDIO_FILENAME
   //   for read-only access, since it will only be access by the SR engine
   hr = cpInputStream->BindToFile(MY_WAVE_AUDIO_FILENAME,
      SPFM_OPEN_READONLY,
      sInputFormat.FormatId(),
      sInputFormat.WaveFormatExPtr(),
      SPFEI_ALL_EVENTS);

   // Check hr

   // Create in-process speech recognition engine
   hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
   // Check hr

   // connect wav input to recognizer
   // SAPI will negotiate mismatched engine/input audio formats using system audio codecs, so second parameter is not important - use default of TRUE
   hr = cpRecognizer->SetInput(cpInputStream, TRUE);
   // Check hr

   // Create recognition context to receive events
   hr = cpRecognizer->CreateRecoContext(&cpRecoContext;);
   // Check hr

   // Create grammar, and load dictation
   // ignore grammar ID for simplicity's sake
   // NOTE: Voice command apps would load CFG here
   hr = cpRecognizer->CreateGrammar(NULL, &cpRecoGrammar;);
   // Check hr
   hr = cpRecoGrammar->LoadDictation(NULL,SPLO_STATIC);	
   // Check hr

   // check for recognitions and end of stream event
   hr = cpRecoContext->SetInterest(SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_SR_END_STREAM), SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_SR_END_STREAM));

   // use Win32 events for command-line style application
   hr = cpRecoContext->SetNotifyWin32Event();
   // Check hr

   // activate dictation, and begin recognition
   hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
   // Check hr

   // while events occur, continue processing
   // timeout should be greater than the audio stream length, or a reasonable amount of time expected to pass before no more recognitions are expected in an audio stream
   BOOL fEndStreamReached = FALSE;
   while (!fEndStreamReached && S_OK == cpRecoContext->WaitForNotifyEvent(MY_REASONABLE_TIMEOUT))
   {
      CSpEvent spEvent;
      // pull all queued events from the reco context's event queue

      while (!fEndStreamReached && S_OK == spEvent.GetFrom(cpRecoContext))
      {
         // Check event type
         switch (spEvent.eEventId)
         {
            // speech recognition engine recognized some audio
            case SPEI_RECOGNITION:
            // TODO: log/report recognized text
            break;

            // end of the wav file was reached by the speech recognition engine
            case SPEI_SR_END_STREAM:
               fEndStreamReached = TRUE;
               break;
         }

         // clear any event data/object references
         spEvent.Clear();
         }// END event pulling loop - break on empty event queue OR end stream
      }// END event polling loop - break on event timeout OR end stream

   // deactivate dictation
   hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
   // Check hr

   // unload dictation topic
   hr = cpRecoGrammar->UnloadDictation();
   // Check hr

   // close the input stream, since we're done with it
   // NOTE: smart pointer will call SpStream's destructor, and consequently ::Close, but code may want to check for errors on ::Close operation
   hr = cpInputStream->Close();
   // Check hr
}

Automation/Visual Basic 6.0 Developers

Scripting code is similar to Visual Basic.

  Option Explicit
Dim WithEvents RecoContext as ISpeechRecoContext ' context for receiving SR events
Dim Grammar as ISpeechRecoGrammar	' grammar
Dim InputFile as SpeechLib.SpFileStream	' wav audio input file stream

' Setup InProc reco context and wav audio input file
Private Sub MyForm_Load()
   ' Create new recognizer
   Dim Recognizer as New SpInprocRecognizer

   ' create input file stream
   Set InputFile as New SpFileStream
   ' Defaults to open for read-only, and DoEvents false
   InputFile.Open MY_WAVE_AUDIO_FILENAME

   ' connect wav audio input to speech recognition engine
   Set Recognizer.AudioInputStream = InputFile

   ' create recognition context
   Set RecoContext = Recognizer.CreateRecoContext

   ' create grammar
   Set Grammar = RecoContext.CreateGrammar
   ' ... and load dictation
   Grammar.DictationLoad

   ' start dictating
   Grammar.DictationSetState SGDSActive
End Sub

' Event fired on app shutdown
Private Sub MyForm_Unload(Cancel as Boolean)
   InputFile.Close ' close audio input file
End Sub

' Event fired when speech recognition engine recognizes audio
Private Sub RecoContext_Recognition(StreamNumber as Long, StreamPosition as Variant, RecognitionType As SpeechRecognitionType, Result As ISpeechRecoResult)
   ' Log/Report recognized phrase/information
End Sub

' End of wav Input Stream reached by speech recognition engine
Private Sub RecoContext_EndStream(StreamNumber as Long, StreamPosition as Variant)
   ' Disable dictation and unload grammars on app close
   Grammar.DictationSetState SGDSInactive
   Grammar.DictationUnload

   Unload Me ' shutdown app on end of stream
End Sub

Share via