Speech Recognition

03/01/2012

Microsoft Robotics Developer Studio

Speech recognition (SR) converts spoken words to written text, so it can be used to build user interfaces that accept spoken input. The Speech Recognizer service enables you to add speech recognition support to your application. Speech recognition requires a special type of software, called an SR engine. The SR engine may be installed with the operating system or later with other software. Speech-enabled packages, such as word processors and web browsers, may install their own engines or use existing ones. Additional engines are also available from third-party manufacturers. These engines are typically designed to support only a specific language and may also target a certain vocabulary, for example, one specializing in medical or legal terminology.

Note that speech recognition is not available on all versions of Windows. Before you attempt to use the Speech Recognizer service, use the Windows Control Panel on your PC to confirm that a compatible (SAPI) speech recognition engine is installed. Then use the Help or user documentation provided with the engine to make sure that it is properly configured and working.

You will also need a microphone or some other sound input device to receive the sound. In general, the microphone should be a high-quality device with built-in noise filtering. Recognition accuracy is directly related to the quality of the input: with a poor microphone, the recognition rate will be significantly lower, perhaps even unacceptably so.

This service also requires the .NET Framework 3.0 (or later) runtime, which is available from the Microsoft website.

Operations

The Speech Recognizer service supports the following requests and notifications.

Operation	Description
Get	Returns the entire state of the Speech Recognizer service.
InsertGrammarEntry	Inserts the specified entry (or entries) of the supplied grammar into the current grammar dictionary. If any of the supplied entries already exists, a Fault is returned and the whole operation fails without modifying the current dictionary.
UpdateGrammarEntry	Updates entries that already exist in the current grammar dictionary with the supplied grammar entries. If some entries in the supplied grammar do not exist in the current dictionary, no Fault is returned; only the existing entries are updated.
UpsertGrammarEntry	Inserts entries from the supplied grammar into the current dictionary if they do not exist yet, and updates entries that already exist with the entries from the supplied grammar.
DeleteGrammarEntry	Deletes the entries from the current grammar dictionary whose keys match one of the supplied grammar entries. If a key from the supplied grammar entries does not exist in the current dictionary, no Fault is returned, but any matching entries are still deleted.
SetSrgsGrammarFile	Sets the grammar type to SRGS and tries to load the specified file, which must reside inside your application's \Store folder (directory). If loading the file fails, a Fault is returned and the speech recognizer returns to the state it was in before it processed this request. SRGS grammars require Windows 7 and will not work with Windows Server 2003.
EmulateRecognize	Makes the SR engine emulate speech input by using the supplied text (string) instead of audio. This is mostly used for testing and debugging.
Replace	Configures the speech recognizer service, or indicates that the service's configuration has been changed.
SpeechDetected	Indicates that speech (audio) has been detected and is being processed.
SpeechRecognized	Indicates that speech has been recognized.
SpeechRecognitionRejected	Indicates that speech was detected, but not recognized as one of the words or phrases in the current grammar dictionary. The duration of the speech is available as DurationInTicks.
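The four grammar-entry operations differ mainly in how they treat keys that are already present or missing. The following Python sketch models those documented rules on a plain dictionary that maps phrases to semantic tags. It is only an illustration: the function names and `GrammarFault` are hypothetical, and this is not the service's actual (.NET) API.

```python
from typing import Dict, Iterable


class GrammarFault(Exception):
    """Hypothetical stand-in for the Fault that the service returns."""


def insert_entries(current: Dict[str, str], supplied: Dict[str, str]) -> None:
    # All-or-nothing: check every key first so a failure leaves `current` untouched.
    existing = [key for key in supplied if key in current]
    if existing:
        raise GrammarFault(f"entries already exist: {existing}")
    current.update(supplied)


def update_entries(current: Dict[str, str], supplied: Dict[str, str]) -> None:
    # Only keys that are already present change; unknown keys are skipped silently.
    for key, tag in supplied.items():
        if key in current:
            current[key] = tag


def upsert_entries(current: Dict[str, str], supplied: Dict[str, str]) -> None:
    # Insert missing keys and overwrite existing ones.
    current.update(supplied)


def delete_entries(current: Dict[str, str], supplied: Iterable[str]) -> None:
    # Remove matching keys; keys not in the dictionary are ignored without a fault.
    for key in supplied:
        current.pop(key, None)
```

For example, if the grammar already contains Forward, inserting `{"Left": "Left", "Forward": "Ahead"}` raises `GrammarFault` and Left is not added either.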

To support SR, you define a grammar (the words and phrases to be recognized) and then use the notifications provided by the service to determine what the SR engine recognized as the spoken input. The Speech Recognizer service supports both simple dictionary-style grammars and W3C SRGS grammars.

Note that you cannot send the service's grammar operations as requests from VPL, because they require a special data structure that VPL cannot easily support. You can, however, receive them as notifications.

To define which type of grammar you want the Speech Recognizer service to use, you set the state of this service in one of two ways. You can set its initial configuration in the Properties window (set Configuration to Set initial configuration), or you can send a Replace request or a SetSrgsGrammarFile request.

The service's initial state includes the following properties:

Name	Type	Description
IgnoreAudioInput	Boolean	Specifies whether the speech service ignores audio (spoken) input; when set to false, the service listens for audio input. This may be useful for temporarily turning off the SR engine, or when using emulated recognition.
GrammarType	GrammarType	Specifies the type of grammar the SR engine will use, either a simple dictionary-style grammar (DictionaryStyle) or an SRGS grammar (Srgs).
SrgsFileLocation	string	Specifies the SRGS grammar file to be loaded (only used if you set GrammarType to Srgs).
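SetSrgsGrammarFile either succeeds or leaves the state exactly as it was. The following Python sketch models that rollback over the three documented properties plus a dictionary grammar. The field names, the `load` callback, and `LoadFault` are hypothetical illustrations, not the service's real types.

```python
import copy
from dataclasses import dataclass, field, fields
from enum import Enum
from pathlib import Path
from typing import Callable, Dict, Optional


class GrammarType(Enum):
    DICTIONARY_STYLE = "DictionaryStyle"
    SRGS = "Srgs"


@dataclass
class RecognizerState:
    # Illustrative mirror of the documented state properties.
    ignore_audio_input: bool = False
    grammar_type: GrammarType = GrammarType.DICTIONARY_STYLE
    srgs_file_location: Optional[str] = None
    dictionary_grammar: Dict[str, str] = field(default_factory=dict)


class LoadFault(Exception):
    """Hypothetical stand-in for the Fault returned when loading fails."""


def set_srgs_grammar_file(
    state: RecognizerState,
    file_name: str,
    store_dir: Path,
    load: Callable[[Path], object],
) -> None:
    """Switch to an SRGS grammar, restoring the previous state if loading fails."""
    snapshot = copy.deepcopy(state)
    state.grammar_type = GrammarType.SRGS
    state.srgs_file_location = file_name
    try:
        load(store_dir / file_name)  # hypothetical loader; raises on failure
    except Exception as exc:
        # Roll every field back to its pre-request value.
        for f in fields(state):
            setattr(state, f.name, getattr(snapshot, f.name))
        raise LoadFault(f"could not load {file_name}") from exc
```

The sketch takes a snapshot before changing anything, so a failed load cannot leave GrammarType set to Srgs with no usable grammar behind it.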

Setting GrammarType to DictionaryStyle configures the service to use a simple dictionary-style grammar. A dictionary-style grammar is a list of entries, each consisting of the words for the speech engine to listen for and an optional corresponding semantic tag that represents that recognition. For example, you might define an entry for the phrase "Tell me the time" and give it the semantic tag TimeQuery.

To create a simple dictionary-style grammar, you define the grammar in the service's configuration XML file (SpeechRecognizer.config.xml). This file is automatically created for your project if you choose the Set initial configuration option. After you save your project, open this file in any XML or text editor (including Windows Notepad), add the entries for your grammar, and save the file back to the same location. The file must remain in this location, because it is loaded when your project runs.

For each entry, add an Elem element that contains two string elements: the first holds the words the SR engine should listen for, and the second holds the optional semantic tag. You can also use the SpeechRecognizerGui service to generate a Web page that lets you enter and save a simple dictionary grammar file. For further details, see the information on the SpeechRecognizerGui service.

The following is an example of a simple dictionary-style grammar file:

<?xml version="1.0" encoding="utf-8"?>
<SpeechRecognizerState xmlns="https://schemas.microsoft.com/robotics/2008/02/speechrecognizer.html">
  <DictionaryGrammar>
    <Elem>
      <string>Backward</string>
      <string>Backward</string>
    </Elem>
    <Elem>
      <string>Follow me</string>
      <string>FollowMe</string>
    </Elem>
    <Elem>
      <string>Forward</string>
      <string>Forward</string>
    </Elem>
    <Elem>
      <string>Left</string>
      <string>Left</string>
    </Elem>
    <Elem>
      <string>Right</string>
      <string>Right</string>
    </Elem>
    <Elem>
      <string>Stop moving</string>
      <string>Stop</string>
    </Elem>
  </DictionaryGrammar>
  <IgnoreAudioInput>false</IgnoreAudioInput>
  <GrammarType>DictionaryStyle</GrammarType>
</SpeechRecognizerState>
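To check a hand-edited configuration file outside of VPL, you can read its phrase-to-tag pairs with Python's standard ElementTree module. This is a minimal sketch that assumes the Elem/string layout shown above. `load_dictionary_grammar` is a hypothetical helper, and it reads the namespace from the root element rather than hard-coding the schema URI.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def load_dictionary_grammar(xml_text: str) -> Dict[str, str]:
    """Return {phrase: semantic tag} from a SpeechRecognizer.config.xml document."""
    # Encode first so an encoding="utf-8" declaration is always accepted.
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Read the namespace from the root tag instead of hard-coding the schema URI.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def qualified(name: str) -> str:
        return f"{{{ns}}}{name}" if ns else name

    grammar: Dict[str, str] = {}
    for elem in root.iter(qualified("Elem")):
        strings = [(s.text or "").strip() for s in elem.findall(qualified("string"))]
        if not strings or not strings[0]:
            continue  # skip entries that have no phrase to listen for
        # The second <string> is the optional semantic tag.
        grammar[strings[0]] = strings[1] if len(strings) > 1 else ""
    return grammar
```

Applied to the example file, the result maps "Stop moving" to "Stop" and "Follow me" to "FollowMe".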

To use an SRGS grammar, set GrammarType to Srgs and set SrgsFileLocation to the file name. You can do this either in the initial configuration properties or by using a Replace request. The SetSrgsGrammarFile request automatically sets GrammarType and tries to load the specified SRGS grammar file (which must be located in your application's \Store folder).

SRGS grammars are also XML files, which you can create with a simple text editor or with speech tools that generate this format. The format is specified at http://www.w3.org/TR/speech-grammar/.
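The SRGS XML format is specified by the W3C, so you can also generate a small grammar programmatically. The following Python sketch builds a grammar whose public root rule matches any one of a list of phrases. It uses the SRGS namespace, version, xml:lang, and root attributes from the W3C specification. The helper name, the rule id, and the en-US default are assumptions. Whether a particular SR engine accepts the result is not guaranteed, so test the file with your engine before copying it to \Store.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_srgs_grammar(rule_id: str, phrases: Iterable[str], lang: str = "en-US") -> str:
    """Build an SRGS XML grammar whose public root rule matches any one phrase."""
    # Serialize SRGS elements without a prefix (note: this registry is module-global).
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang, "root": rule_id},
    )
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": rule_id, "scope": "public"})
    # <one-of> means "exactly one of the following items".
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    for phrase in phrases:
        ET.SubElement(one_of, f"{{{SRGS_NS}}}item").text = phrase
    return ET.tostring(grammar, encoding="unicode")
```

For example, `build_srgs_grammar("commands", ["forward", "stop moving"])` returns a `<grammar>` element with a single `<rule id="commands">` listing both phrases as items. You can write that string to a file in \Store and pass its name to SetSrgsGrammarFile.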

Service State

You can use a Get request to return the general state of the Speech Recognizer service. However, the recognition state is provided by the SpeechDetected, SpeechRecognized, and SpeechRecognitionRejected notifications.

SpeechDetected returns StartTime (DateTime), which is the time when the SR engine detects audio input.

When the SR engine recognizes the input, a SpeechRecognized notification returns the following state:

Name	Type	Description
Confidence	float	Returns a value between 0 and 1 indicating the SR engine's rating of the certainty of correct recognition for the phrase information returned (higher is better). However, it is a relative measure of the certainty and therefore may vary for each recognition engine. If -1 is returned the speech engine does not provide confidence information.
Text	string	Returns the words recognized.
Semantics	RecognizedSemanticValue	Returns the semantic value object(s), if any, of the recognized words.
DurationInTicks	long integer	Returns the duration of the utterance recognized. There are 10,000,000 ticks per second.
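Two values in this table need interpretation: DurationInTicks, at 10,000,000 ticks per second, and Confidence, which may be -1. The following Python sketch shows both. The 0.7 threshold and the choice to accept results that carry no confidence are hypothetical policies. Because confidence is relative to each engine, any threshold has to be tuned for the engine you use.

```python
TICKS_PER_SECOND = 10_000_000  # from the DurationInTicks description
NO_CONFIDENCE = -1.0  # value returned when the engine reports no confidence


def ticks_to_seconds(ticks: int) -> float:
    """Convert a DurationInTicks value to seconds."""
    return ticks / TICKS_PER_SECOND


def should_act_on(confidence: float, threshold: float = 0.7) -> bool:
    """Hypothetical policy for acting on a SpeechRecognized result.

    Confidence is relative to each engine, so the threshold must be tuned
    for the engine in use; 0.7 is only a placeholder.
    """
    if confidence == NO_CONFIDENCE:
        return True  # engine gives no rating, so there is nothing to filter on
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return confidence >= threshold
```

For instance, an utterance reported with DurationInTicks = 15,000,000 lasted 1.5 seconds.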

If you load an SRGS grammar, you can use Semantics to access the semantic information that applies to the recognized utterance. This information may also include a collection of child semantic values for the recognized utterance. A semantic value provides the following properties:

Name	Type	Description
Children	DssDictionary	Returns the collection of (child) semantic value objects.
Confidence	float	Returns a value between 0 and 1 indicating the SR engine's rating of the certainty of correct recognition for the phrase information returned (higher is better). However, it is a relative measure of the certainty and therefore may vary for each recognition engine. If -1 is returned the speech engine does not provide confidence information.
KeyName	string	Returns the key string by which this semantic value can be referenced.
TypeOfValue	RecognizedValueType	Returns the type of this semantic value.
ValueBool	Boolean	Returns the Boolean value of the semantic value.
ValueFloat	float	Returns the float value of the semantic value.
ValueInt	int	Returns the int value of the semantic value.
ValueString	string	Returns the string value of the semantic value.

To access child semantic values, use dot notation. For example, you can get the number of child semantic values by using Semantics.Children.Count. To access the values of the children, you can use their grammar rule name. For example, suppose you have a number of cities listed under a rule called "Destination" and the recognition matched a destination. You could then access the confidence rating of the destination match by using Semantics.Children["Destination"].Confidence and its string value by using Semantics.Children["Destination"].ValueString.
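The following Python sketch models a semantic value as a tree keyed by rule name. It shows what an expression such as Semantics.Children["Destination"].Confidence walks through. `SemanticValue` mirrors only a few of the documented properties, and `find_child` is a hypothetical helper. The helper returns None when a rule did not take part in the match instead of assuming the child exists.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SemanticValue:
    # Illustrative mirror of a few RecognizedSemanticValue properties.
    key_name: str
    confidence: float = -1.0
    value_string: Optional[str] = None
    children: Dict[str, "SemanticValue"] = field(default_factory=dict)


def find_child(value: SemanticValue, *rule_names: str) -> Optional[SemanticValue]:
    """Follow Children by rule name, returning None if a rule did not match."""
    node = value
    for name in rule_names:
        node = node.children.get(name)
        if node is None:
            return None
    return node
```

With `semantics = SemanticValue("root", 0.85, children={"Destination": SemanticValue("Destination", 0.92, "Seattle")})`, the call `find_child(semantics, "Destination").confidence` plays the role of Semantics.Children["Destination"].Confidence. The city name "Seattle" is made-up sample data.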

If the SR engine fails to recognize the input as matching anything in its grammar, then it instead generates a SpeechRecognitionRejected notification, returning StartTime (DateTime) and DurationInTicks (long integer).
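An application typically reacts differently to each of the three recognition notifications. The following Python sketch dispatches on them using only the fields described in this section. The class names mirror the notification names, but the classes, field spellings, and message texts are hypothetical and are not the service's generated types.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SpeechDetected:
    start_time: datetime


@dataclass
class SpeechRecognized:
    text: str
    confidence: float
    duration_in_ticks: int


@dataclass
class SpeechRecognitionRejected:
    start_time: datetime
    duration_in_ticks: int


def describe(notification) -> str:
    """Return a log line for one recognition notification."""
    if isinstance(notification, SpeechDetected):
        return f"listening since {notification.start_time:%H:%M:%S}"
    if isinstance(notification, SpeechRecognized):
        return f"heard {notification.text!r} ({notification.confidence:.2f})"
    if isinstance(notification, SpeechRecognitionRejected):
        seconds = notification.duration_in_ticks / 10_000_000  # 10,000,000 ticks per second
        return f"rejected {seconds:.1f}s of speech not in the grammar"
    raise TypeError(f"unexpected notification: {notification!r}")
```

A SpeechDetected notification is normally followed by either SpeechRecognized or SpeechRecognitionRejected. Logging all three makes it easy to see which utterances fell outside the grammar.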

Speech Recognizer Gui

The Speech Recognizer Gui service is a companion service that you can use with the Speech Recognizer service. Including the Speech Recognizer Gui service in your project enables you to enter a simple dictionary-style grammar or to upload SRGS (Speech Recognition Grammar Specification) grammar files through an HTML page.

To access the Speech Recognizer Gui service in VPL, drag a copy of the service block into your diagram. It does not require any connections, and it will start up when you run the diagram. You can also optionally start an instance of the Speech Recognizer Gui once you have a DSS node running by using a web browser and going to the Control Panel page. Starting the service will automatically attempt to load the default SR engine.

Once the Speech Recognizer Gui service is running, browse to the page for the service. To do this, start your browser and enter http://localhost:50000 (50000 is the default port; if you run services on a different port, use that port instead). Then click Service Directory in the left column to display the list of running services. You should find the entry /speechrecognizergui in the list. Click it, and you should see a page like the figure below.

Figure: Speech Recognizer Gui - Service page

At the bottom of the Speech Recognizer Gui page, you can either create a simple dictionary-style grammar or load an SRGS grammar. Click Save to use the grammar. This creates a copy of the grammar file in the \Store folder. If you chose to create a dictionary grammar, Save creates a SpeechRecognizer.config.xml file in the \Store folder. If you chose to load an SRGS grammar file, the file you browse to is copied to \Store.

Note that if you create a grammar using the Speech Recognizer Gui service and do not explicitly configure the Speech Recognizer service to load a grammar file, the Speech Recognizer service will automatically attempt to load and use the grammar file you created.

The Speech Recognizer Gui service page also displays the notifications generated by the SR engine, such as speech detected or speech recognized, in a scrolling area that can be cleared. Note that the SR engine only recognizes words and phrases that are in its grammar; if the grammar is empty, nothing will be recognized.

The Speech Recognizer Gui service is not designed to be sent requests or to issue notifications. It is only intended to be used via a Web browser to build dictionary-style grammars and test speech recognition.

See Also

VPL User Interface Services: Text To Speech
