Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

How to enable continuous dictation

Learn how to capture and recognize long-form, continuous dictation speech input.

Note  Voice commands and speech recognition are not supported by Windows Store apps in Windows 8 and Windows 8.1.

 

In [How to dictate short speech responses], you learned how to capture and recognize relatively short speech input, such as a short message service (SMS) message or a question, using the RecognizeAsync or RecognizeWithUIAsync methods of a SpeechRecognizer object.

For longer, continuous speech recognition sessions, such as dictation or email, use the ContinuousRecognitionSession property of a SpeechRecognizer to obtain a SpeechContinuousRecognitionSession object.

What you need to know

Technologies

Prerequisites

This topic builds on Quickstart: Speech recognition and references the "Continuous Dictation Scenario" of the [Speech and TTS sample]. You don’t need the sample to understand the key points and code snippets explained here, but it does let you experiment freely with the code.

To complete this tutorial, have a look through these topics to get familiar with the technologies discussed here.

Instructions

Set up

Your app needs a few objects to manage a continuous dictation session:

  • An instance of a SpeechRecognizer object.
  • A reference to a UI dispatcher to update the UI during dictation.
  • A way to track the accumulated words spoken by the user.

Here, we declare a SpeechRecognizer instance as a private field of the code-behind class. Your app needs to store a reference elsewhere if you want continuous dictation to persist beyond a single Extensible Application Markup Language (XAML) page.

private SpeechRecognizer speechRecognizer;

During dictation, the recognizer raises events from a background thread. Because a background thread cannot directly update the UI in XAML, your app must use a dispatcher to update the UI in response to recognition events.

Here, we declare a private field that will be initialized later with the UI dispatcher.

// Speech events may originate from a thread other than the UI thread.
// Keep track of the UI thread dispatcher so that we can update the
// UI in a thread-safe manner.
private CoreDispatcher dispatcher;

To track what the user is saying, you need to handle recognition events raised by the speech recognizer. These events provide the recognition results for chunks of user utterances.

Here, we use a StringBuilder object to hold all the recognition results obtained during the session. New results are appended to the StringBuilder as they are processed.

private StringBuilder dictatedTextBuilder;
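
The field above is only declared, so it must be initialized before any recognition handler appends to it. A minimal sketch of one way to do that, assuming the code-behind class is a XAML Page; the class name ContinuousDictationPage is hypothetical and stands in for your own page:

```csharp
using System.Text;
using Windows.UI.Xaml.Controls;

// Hypothetical page name; the dictatedTextBuilder field is the one
// declared above in the same code-behind class.
public sealed partial class ContinuousDictationPage : Page
{
    public ContinuousDictationPage()
    {
        this.InitializeComponent();

        // Start each page instance with an empty dictation buffer so the
        // recognition handlers can append to it right away.
        dictatedTextBuilder = new StringBuilder();
    }
}
```

Creating the buffer in the constructor (rather than in a recognition handler) keeps the handlers free of initialization logic.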

Initialization

During initialization of continuous speech recognition, you must:

  • Fetch the dispatcher for the UI thread if you update the UI of your app in the continuous recognition event handlers.
  • Initialize the speech recognizer.
  • Compile the built-in dictation grammar. Note  Speech recognition requires at least one constraint to define a recognizable vocabulary. If no constraint is specified, a predefined dictation grammar is used. See Quickstart: Speech recognition.
  • Set up the event listeners for recognition events.

We initialize speech recognition in the OnNavigatedTo page event.

  1. Because events raised by the speech recognizer occur on a background thread, create a reference to the dispatcher for updates to the UI thread. OnNavigatedTo is always invoked on the UI thread.

    this.dispatcher = CoreWindow.GetForCurrentThread().Dispatcher;
    
  2. We then initialize the SpeechRecognizer instance.

    this.speechRecognizer = new SpeechRecognizer();
    
  3. We then add and compile the grammar that defines all of the words and phrases that can be recognized by the SpeechRecognizer.

    If you don't specify a grammar explicitly, a predefined dictation grammar is used by default. Typically, the default grammar is best for general dictation.

    Here, we call CompileConstraintsAsync immediately without adding a grammar.

    SpeechRecognitionCompilationResult result =
      await speechRecognizer.CompileConstraintsAsync();
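
Compilation can fail, so it is worth checking the result before starting a session. The sketch below does that, and also shows the explicit alternative to relying on the default grammar: adding a SpeechRecognitionTopicConstraint for the dictation scenario. The class name DictationSetup, the method name InitializeRecognizerAsync, and the "dictation" topic hint are illustrative assumptions, not part of the sample.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

public static class DictationSetup
{
    // Returns a recognizer ready for continuous dictation, or null if
    // the grammar could not be compiled.
    public static async Task<SpeechRecognizer> InitializeRecognizerAsync()
    {
        var recognizer = new SpeechRecognizer();

        // Optional: state the dictation grammar explicitly instead of
        // relying on the default.
        recognizer.Constraints.Add(
            new SpeechRecognitionTopicConstraint(
                SpeechRecognitionScenario.Dictation, "dictation"));

        SpeechRecognitionCompilationResult result =
            await recognizer.CompileConstraintsAsync();

        if (result.Status != SpeechRecognitionResultStatus.Success)
        {
            // Do not start a session with an uncompiled grammar.
            recognizer.Dispose();
            return null;
        }

        return recognizer;
    }
}
```

A caller can treat a null return value as a signal to disable the dictation UI.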
    

Handle recognition events

At this point, you could capture a single, brief utterance or phrase by calling RecognizeAsync or RecognizeWithUIAsync. However, we want to capture a longer, continuous recognition session.

To do this, we specify event listeners to run in the background as the user speaks and define handlers to build the dictation string.

We then use the ContinuousRecognitionSession property of our recognizer to obtain a SpeechContinuousRecognitionSession object that provides methods and events for managing a continuous recognition session.

Two events in particular are critical:

  • ResultGenerated, which occurs when the recognizer has generated some results.
  • Completed, which occurs when the continuous recognition session has ended.

The ResultGenerated event is raised as the user speaks. The recognizer continuously listens to the user and periodically raises an event that passes a chunk of speech input. You must examine the speech input, using the Result property of the event argument, and take appropriate action in the event handler, such as appending the text to a StringBuilder object.

As an instance of SpeechRecognitionResult, the Result property is useful for determining whether you want to accept the speech input:

  • Status indicates whether the recognition was successful. Recognition can fail for a variety of reasons.
  • Confidence indicates the relative confidence that the recognizer understood the correct words.

  1. Here, we register the handler for the ResultGenerated continuous recognition event in the OnNavigatedTo page event.

    speechRecognizer.ContinuousRecognitionSession.ResultGenerated +=
        ContinuousRecognitionSession_ResultGenerated;
    
  2. We then check the Confidence property. If the value of Confidence is Medium or better, we append the text to the StringBuilder. We also update the UI as we collect input.

    Note  The ResultGenerated event is raised on a background thread that cannot update the UI directly. If a handler needs to update the UI (as the [Speech and TTS sample] does), you must dispatch the updates to the UI thread through the RunAsync method of the dispatcher.

    private async void ContinuousRecognitionSession_ResultGenerated(
      SpeechContinuousRecognitionSession sender,
      SpeechContinuousRecognitionResultGeneratedEventArgs args)
    {
      // Accept only results the recognizer is reasonably sure about.
      if (args.Result.Confidence == SpeechRecognitionConfidence.Medium ||
          args.Result.Confidence == SpeechRecognitionConfidence.High)
      {
        dictatedTextBuilder.Append(args.Result.Text + " ");

        // This handler runs on a background thread, so marshal the
        // UI update to the UI thread.
        await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
        {
          dictationTextBox.Text = dictatedTextBuilder.ToString();
          btnClearText.IsEnabled = true;
        });
      }
      else
      {
        // Discard the low-confidence result and reset the text box to
        // the last accepted text.
        await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
        {
          dictationTextBox.Text = dictatedTextBuilder.ToString();
        });
      }
    }
    
  3. We then handle the Completed event, which indicates the end of continuous dictation.

    The session ends when you call the StopAsync or CancelAsync methods (described in Start and stop recognition). The session can also end when an error occurs or when the user has stopped speaking. Check the Status property of the event argument, a SpeechRecognitionResultStatus value, to determine why the session ended.

    Here, we register the handler for the Completed continuous recognition event in the OnNavigatedTo page event.

    speechRecognizer.ContinuousRecognitionSession.Completed +=
      ContinuousRecognitionSession_Completed;
    
  4. The event handler checks the Status property to determine whether the recognition was successful. It also handles the case where the user has stopped speaking. A TimeoutExceeded status is often treated as successful recognition because it means the user has finished speaking. Handle this case explicitly to provide a good experience.

    Note  Like ResultGenerated, the Completed event is raised on a background thread that cannot update the UI directly. If a handler needs to update the UI (as the [Speech and TTS sample] does), you must dispatch the updates to the UI thread through the RunAsync method of the dispatcher.

    private async void ContinuousRecognitionSession_Completed(
      SpeechContinuousRecognitionSession sender,
      SpeechContinuousRecognitionCompletedEventArgs args)
      {
        if (args.Status != SpeechRecognitionResultStatus.Success)
        {
          if (args.Status == SpeechRecognitionResultStatus.TimeoutExceeded)
          {
            await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
            {
              rootPage.NotifyUser(
                "Automatic Time Out of Dictation",
                NotifyType.StatusMessage);
    
              DictationButtonText.Text = " Continuous Recognition";
              dictationTextBox.Text = dictatedTextBuilder.ToString();
            });
          }
          else
          {
            await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
            {
              rootPage.NotifyUser(
                "Continuous Recognition Completed: " + args.Status.ToString(),
                NotifyType.StatusMessage);
    
              DictationButtonText.Text = " Continuous Recognition";
            });
          }
        }
      }
    

Provide ongoing recognition feedback

When people converse, they often rely on context to fully understand what is being said. Similarly, the speech recognizer often needs context to provide high-confidence recognition results. For example, by themselves, the words "weight" and "wait" are indistinguishable until more context can be gleaned from surrounding words. Until the recognizer has some confidence that a word, or words, have been recognized correctly, it will not raise the ResultGenerated event.

This can make the app seem unresponsive: the user keeps speaking, but no results appear until the recognizer has enough confidence to raise the ResultGenerated event.

Handle the HypothesisGenerated event to make the app feel more responsive while the user is speaking. This event is raised whenever the recognizer generates a new set of potential matches for the word being processed. The event argument provides a Hypothesis property that contains the current matches. Show these to the user as they continue speaking to reassure them that processing is still active. Once confidence is high and a recognition result has been determined, replace the interim Hypothesis results with the final Result provided in the ResultGenerated event.

Here, we append the hypothetical text and an ellipsis ("…") to the current value of the output TextBox. The contents of the text box are updated as new hypotheses are generated and until the final results are obtained from the ResultGenerated event.

private async void SpeechRecognizer_HypothesisGenerated(
  SpeechRecognizer sender,
  SpeechRecognitionHypothesisGeneratedEventArgs args)
  {

    string hypothesis = args.Hypothesis.Text;
    string textboxContent = dictatedTextBuilder.ToString() + " " + hypothesis + " ...";

    await dispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
    {
      dictationTextBox.Text = textboxContent;
      btnClearText.IsEnabled = true;
    });
  }
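
The handler above only runs once it is registered on the recognizer. Note that HypothesisGenerated is an event of the SpeechRecognizer itself, not of its ContinuousRecognitionSession. A minimal sketch, assuming the same code-behind class as the other snippets; the class name ContinuousDictationPage and the two helper method names are hypothetical:

```csharp
using Windows.Media.SpeechRecognition;

public sealed partial class ContinuousDictationPage
{
    // Call after the recognizer is created, for example in OnNavigatedTo.
    private void AttachHypothesisHandler()
    {
        speechRecognizer.HypothesisGenerated +=
            SpeechRecognizer_HypothesisGenerated;
    }

    // Call when leaving the page, for example in OnNavigatedFrom, so the
    // recognizer no longer calls into a page that is not displayed.
    private void DetachHypothesisHandler()
    {
        speechRecognizer.HypothesisGenerated -=
            SpeechRecognizer_HypothesisGenerated;
    }
}
```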

Start and stop recognition

Before starting a recognition session, check the value of the speech recognizer State property. The speech recognizer must be in an Idle state.

After checking the state of the speech recognizer, we start the session by calling the StartAsync method of the speech recognizer's ContinuousRecognitionSession property.

if (speechRecognizer.State == SpeechRecognizerState.Idle)
{
  await speechRecognizer.ContinuousRecognitionSession.StartAsync();
}

Recognition can be stopped in two ways:

  • StopAsync lets any pending recognition events complete (ResultGenerated continues to be raised until all pending recognition operations are complete).

  • CancelAsync terminates the recognition session immediately and discards any pending results.

After checking the state of the speech recognizer, we stop the session by calling the CancelAsync method of the speech recognizer's ContinuousRecognitionSession property.

if (speechRecognizer.State != SpeechRecognizerState.Idle)
{
  await speechRecognizer.ContinuousRecognitionSession.CancelAsync();
}
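
If you would rather keep the results that are still in flight, call StopAsync instead. The following is a sketch under the same assumptions as the snippets above (the speechRecognizer, dictatedTextBuilder, and dictationTextBox members already exist); the class name ContinuousDictationPage and the method name StopDictationAsync are hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

public sealed partial class ContinuousDictationPage
{
    // Call from the UI thread, for example from a button click handler,
    // because the method updates a text box directly.
    private async Task StopDictationAsync()
    {
        if (speechRecognizer.State != SpeechRecognizerState.Idle)
        {
            // Unlike CancelAsync, StopAsync lets pending results arrive
            // through ResultGenerated before the session completes.
            await speechRecognizer.ContinuousRecognitionSession.StopAsync();

            // Replace any leftover hypothesis text with the accepted text.
            dictationTextBox.Text = dictatedTextBuilder.ToString();
        }
    }
}
```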

Note  

A ResultGenerated event can occur after a call to CancelAsync.

Because of multithreading, a ResultGenerated event might still remain on the stack when CancelAsync is called. If so, the ResultGenerated event still fires.

If you set any private fields when canceling the recognition session, always confirm their values in the ResultGenerated handler. For example, don't assume a field is initialized in your handler if you set it to null when you cancel the session.
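
One way to follow this advice is to copy the fields into locals at the top of the handler and return early if any of them is null. The sketch below is a defensive variation of the ResultGenerated handler shown earlier, not the sample's implementation; it assumes the same fields and controls, and the class name ContinuousDictationPage is hypothetical.

```csharp
using System;
using System.Text;
using Windows.Media.SpeechRecognition;
using Windows.UI.Core;

public sealed partial class ContinuousDictationPage
{
    private async void ContinuousRecognitionSession_ResultGenerated(
        SpeechContinuousRecognitionSession sender,
        SpeechContinuousRecognitionResultGeneratedEventArgs args)
    {
        // Copy the fields to locals first: they may be set to null on
        // another thread when the session is canceled.
        StringBuilder builder = dictatedTextBuilder;
        CoreDispatcher uiDispatcher = dispatcher;

        if (builder == null || uiDispatcher == null)
        {
            // The session was already torn down; drop the late result.
            return;
        }

        if (args.Result.Confidence == SpeechRecognitionConfidence.Medium ||
            args.Result.Confidence == SpeechRecognitionConfidence.High)
        {
            builder.Append(args.Result.Text + " ");
        }

        // Capture the text now so the UI callback does not touch the
        // fields again.
        string text = builder.ToString();

        await uiDispatcher.RunAsync(CoreDispatcherPriority.Normal, () =>
        {
            dictationTextBox.Text = text;
        });
    }
}
```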

 

Summary and next steps

Here, you learned how to handle long-form, unconstrained speech dictation, which is useful for authoring emails or documents.

Next, you might like to know how to listen for a continuous series of verbal commands, such as those in a video game. See [How to listen for continuous phrases from a list].

Responding to speech interactions

Designers

Speech design guidelines