Grammars: Purpose and Structure

This content is no longer actively maintained. It is provided as is, for anyone who may still be using these technologies, with no warranties or claims of accuracy with regard to the most recent product version or service release.

Grammars are structures that contain single words, complex phrases, or lists of words or complex phrases. These grammar structures use Extensible Markup Language (XML) elements and plain text to attempt to match human patterns of speech. Use grammar structures to process command and control situations in which the user speaks orders, commands, responses, or requests to a speech application.

Grammars form the guidelines that the application must use to recognize orders that a user might issue to it. A grammar contains an ordered list of words or phrases that the application uses to recognize what a user says. Unless the words or phrases are defined in the grammar structure, the application cannot recognize the user's spoken commands.

A very simple application can limit spoken commands to single words, such as Open or Print. In this case, a grammar is not much more than a list of words. However, many applications require more complex commands or sentences. Users expect to address a computer in normal, natural-sounding sentences. For example, a ticket ordering application must accept "I want to order two tickets for the 10 P.M. show." This application must also recognize and respond to variations of the same phrase: "I want to buy," "I'd like to buy," or even the more impolite "gimme two tickets."

Voice commands require flexibility in accepting statements, but at the same time, grammars must impose limits on the application. For example, although the statement "my mother is sick" might imply an urgent need to buy an airline ticket, it is unreasonable for the ticket ordering application to process it as a purchase request.

To learn more about grammars, see the following sections:

  • Purpose of Grammars
  • XML Format
  • Rules
  • Elements

Purpose of Grammars

A grammar does the following:

  • Limits Vocabulary: The grammar contains only the exact words or phrases an application needs to match for successful recognition of a user response. An application might need to recognize only the few words that appear in a grammar structure; therefore, the speech recognition engine does not need to search the entire dictionary. Explicitly providing words in a grammar also improves recognition accuracy, because the speech recognition engine must process speech only to the extent of confirming a match.
    Grammars are often referred to as context-free grammars (CFG). The words or phrases do not need a context in which to assist recognition. Providing context helps, but is not required. The speech recognition engine is less likely to recognize a nonsensical command such as "horn swaggle" than the command "open the file." A good grammar interface allows for common or naturally spoken commands.
  • Filters Response Recognition: The speech recognition engine processes all audio signals it receives, regardless of what is contained in the grammars. The engine determines what the word is and matches the word or phrase with the word or phrase defined in the grammar. The advantage of a grammar is that the speech recognition engine returns a successful recognition event only if the grammar is matched. The grammar filters the results passed to the application. Without this filtering, the application would receive many additional recognition results, few of which have meaning to the application.
  • Matches Speech: The grammar matches the speech input for a particular application. Although grammar structures need to be flexible and accommodate a multitude of phrases and phrasing, grammar structures also need to restrict the user's speech to a specific situation or task. Each application has its own natural language. A coffee ordering system, for example, concentrates on language used to order coffee, not language used to order airline tickets. Developers need to tailor grammar structures to serve the application's specific purpose or objective.
  • Identifies Rules: Grammar structures use rules or entities to define and order the component words of potential user utterances. Rules defining commonly used utterances can be referenced repeatedly by other rules within the containing grammar or by rules contained within other grammars. Another type of grammar structure is a grammar library, which is a predefined grammar file that contains a number of simple rules, complex sets of interrelated rules, or a combination of both that an application can use to recognize specific types of information. For example, the Speech Server grammar library contains a Date ruleset that developers can use when implementing a speech application that requires the ability to recognize calendar dates spoken by a user.
    As previously noted, a grammar can be composed of many rules. A voice interface for an application generally contains one rule for each menu, menu item, or dialog box that is accessed directly through spoken commands or responses. The combination of those rules forms the grammar. However, a statement can only match one rule at a time. Each rule is given an ID. When a successful recognition occurs, the speech recognition engine processes the rule ID as part of the recognition result. The speech recognition engine uses the SemanticItem or listen values defined using Speech Control Editor to process rule IDs and pass this information back to the speech application. For example, the command "open a file" matches only one rule, presumably the "file open" command rule. If the application must branch on the results, for example with a series of case or switch statements, it can match the rule ID instead of each spoken word. Although the application can match each spoken word, it most likely branches on the rule ID.
    Grammars are tools for content identification. For example, a customer can say any of the following:
    • "I would like a coffee"
    • "I'd like coffee"
    • "Get me a coffee"
    • "Coffee please"
      In all four cases, the phrase is different, but the intent is the same; the customer wants coffee. Grammars can define all combinations of this intent in a single rule. The rule is identified by a unique name. It makes no difference which phrase in the rule is actually spoken. If the spoken phrase is defined within that rule, the rule is considered successfully matched by the application. The speech recognition engine returns the recognition back to the application with a single rule name. The application uses that name to process the coffee order. Instead of requiring the application to detect all words in each variation of the phrase, the speech recognition engine and the grammar determine that ahead of time and return only what the application expects: the rule name. For more information about implementing rules in grammar structures, see How to: Design Grammar Rules.
  • Provides Semantic Markup Language: Grammars provide the basis of the Semantic Markup Language (SML). SML is used inside the recognition results and allows the application to identify and parse the returned text. An SML output is XML-formatted output that contains an SML element. The SML element can have zero, one, or more child elements, depending on whether the input grammar contains markup for semantic interpretation. Script expressions contained in tag elements generate semantic values for items and referenced rules contained in a parent rule.
    For every utterance that activates a grammar, the SML output always contains the recognized text and a confidence score for the full utterance. Using semantic interpretation can increase the granularity of the SML output, so that confidence scores and semantic values are also available at the rule level. For more information, see SML Output Overview.
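
As a rough sketch of the coffee-ordering case described in the list above, the four phrasings could be grouped under a single rule so that any of them produces the same rule match. The rule name ruleOrderCoffee is an assumption made for illustration and does not come from the original text.

```xml
<grammar root="ruleOrderCoffee" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
 xml:lang="en-US" tag-format="semantics-ms/1.0">
    <!-- ruleOrderCoffee is a hypothetical rule name used only for this sketch. -->
    <rule id="ruleOrderCoffee" scope="public">
        <!-- Any one of these phrasings matches the rule. -->
        <one-of>
            <item>I would like a coffee</item>
            <item>I'd like coffee</item>
            <item>Get me a coffee</item>
            <item>Coffee please</item>
        </one-of>
    </rule>
</grammar>
```

Under this sketch, the application would only need to act on the match of ruleOrderCoffee, not on the individual words of whichever phrasing was spoken.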

XML Format

Grammars are based on the W3C Speech Recognition Grammar Specification (SRGS) version 1.0 format, which defines the structure of grammars and grammar rules that use XML markup. The grammar compiler transforms the XML elements that define grammar elements into a binary format used by speech recognition engines. This compiling process is performed before or during application run time. For specific information about XML, see the Extensible Markup Language (XML) specification.

XML provides a flexible structure for describing the list of words or phrases defined in grammars. Developers can use XML attributes, XML elements, and plain text to further identify and define text elements, so the grammar file is easy to maintain and organize. Text elements identify a grammar, and developers can organize the text elements into lists, strings, and numbers. Organizing the text element structures makes them reusable in other grammars.

Rules

The basic unit of a grammar is the rule. Grammars must contain at least one rule. A rule defines a pattern or sequence of words or phrases. If the user's statement matches that pattern, the rule is matched by the application. A rule is defined by the content of a rule element. A rule element can contain other elements including references to other rule elements. The following grammar defines a single rule by using a rule element that contains a single item element. The rule element's id attribute holds the unique identifying rule name, which in this case is ruleColors.

<grammar root="ruleColors" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
 xml:lang="en-US" tag-format="semantics-ms/1.0">
    <rule id="ruleColors" scope="public">
        <item>red</item>
    </rule>
</grammar>

In the previous example, the rule element (identified as ruleColors) contains a single item element that contains the text "red." If the user says "red," the speech recognition engine matches the utterance to the grammar and returns a successful recognition to the application. Any other utterance spoken by the user does not match the grammar and thus returns a false recognition.

Elements

A rule must contain at least one text element. A text element represents an utterance made by the user. By sequencing text elements, grammar designers can create the patterns or sequences needed for the command. The sequence can be simple, as in the previous ruleColors example, or the sequence can be complex as demonstrated in the Solitaire card game shown in Grammar Example: Solitaire.

Developers place elements, such as item elements, variations of item elements, and references to other rules (including those from other grammars) in a particular sequential order, so that grammars can offer rich selections and possibilities of word combinations. For more information about elements, see Grammar XML.
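
To make the idea of sequencing concrete, here is a minimal sketch of how the ticket-ordering phrases from the introduction might be expressed. The rule names ruleBuyTickets and ruleQuantity are hypothetical, and the sketch covers only an opening phrase, a quantity, and the word "tickets"; it does not handle show times. The one-of and ruleref elements it uses are described later in this section.

```xml
<grammar root="ruleBuyTickets" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
 xml:lang="en-US" tag-format="semantics-ms/1.0">
    <!-- Hypothetical rule: an opening phrase, then a quantity, then "tickets". -->
    <rule id="ruleBuyTickets" scope="public">
        <one-of>
            <item>I want to order</item>
            <item>I want to buy</item>
            <item>I'd like to buy</item>
            <item>gimme</item>
        </one-of>
        <ruleref uri="#ruleQuantity"/>
        <item>tickets</item>
    </rule>

    <!-- Hypothetical rule: a small set of quantities, kept short for illustration. -->
    <rule id="ruleQuantity" scope="private">
        <one-of>
            <item>one</item>
            <item>two</item>
            <item>three</item>
            <item>four</item>
        </one-of>
    </rule>
</grammar>
```

Because the elements are matched in order, "gimme two tickets" fits this pattern, whereas the same words spoken in a different order would not.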

The following information describes some commonly used grammar elements.

  • item: Contains any legal rule expansion. A legal rule expansion can consist of a word or other entity that can be spoken, a ruleref element, a tag element, or any logical combination of these. In the previous example, the item element contains a single rule expansion consisting of the single word "red."
    When an item element contains a combination of rule expansions (for example, a combination of words), the sequence of the words in that item element must match the sequence of the words spoken by the user for recognition to be successful. For example, given the following grammar, the input spoken by the user must contain the phrase "metallic red" for recognition to be successful.

    <grammar root="ruleColors" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
     xml:lang="en-US" tag-format="semantics-ms/1.0">
        <rule id="ruleColors" scope="public">
            <item>metallic red</item>
        </rule>
    </grammar>
    
  • one-of: Contains a set of alternative rule expansions and increases the flexibility of the grammar by requiring that the input match only one of the alternatives. For example, in the following grammar, for recognition to be successful, the input must contain the initial phrase "I would like the car in." But the full input can be completed by any of the three color words: "red," "white," or "green."

    <grammar root="ruleColors" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
     xml:lang="en-US" tag-format="semantics-ms/1.0">
        <rule id="ruleColors" scope="public">
            <item>I would like the car in</item>
            <one-of>
                <item>red</item>
                <item>white</item>
                <item>green</item>
            </one-of>
        </rule>
    </grammar>
    
  • ruleref: Specifies a pointer to another rule, containing one or more elements, that must also be matched for the current rule to be recognized.
    Rules are referenced in a grammar using ruleref elements. In SRGS, the special attribute of the ruleref element accepts three values (NULL, VOID, and GARBAGE) that reference predefined rules that are, respectively:

    • Automatically matched without the user speaking.
    • Never spoken.
    • Matched until the next rule is matched or until the end of spoken input.

    The following example defines a rule element identified as ruleColors for a color selection. Another rule then uses the ruleref element to reference ruleColors, twice.

    <grammar root="buyShirt" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
     xml:lang="en-US" tag-format="semantics-ms/1.0">
        <!-- buyShirt is the root rule; it references ruleColors twice. -->
        <rule id="buyShirt" scope="public">
            <item>
               Get me a <ruleref uri="#ruleColors" />
               shirt and a <ruleref uri="#ruleColors"/>
               tie</item>
        </rule>
    
        <!-- ruleColors is defined once and reused by buyShirt. -->
        <rule id="ruleColors" scope="public">
            <one-of>
                <item>red</item>
                <item>white</item>
                <item>green</item>
            </one-of>
        </rule>
    </grammar>
    

    The customer requests a color item twice, but the grammar need only define ruleColors once.
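
The special rules mentioned in the ruleref description are referenced, in SRGS 1.0, through the special attribute of the ruleref element. The following is a minimal sketch, assuming a hypothetical rule named ruleFindColor, in which GARBAGE absorbs whatever the user says before a color word. How much speech GARBAGE actually absorbs depends on the speech recognition engine.

```xml
<grammar root="ruleFindColor" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
 xml:lang="en-US" tag-format="semantics-ms/1.0">
    <!-- ruleFindColor is a hypothetical rule name used only for this sketch. -->
    <rule id="ruleFindColor" scope="public">
        <!-- GARBAGE matches leading speech until the color alternatives can match. -->
        <ruleref special="GARBAGE"/>
        <one-of>
            <item>red</item>
            <item>white</item>
            <item>green</item>
        </one-of>
    </rule>
</grammar>
```

Under this sketch, an utterance with leading filler followed by "red" would be expected to match the rule, with the filler absorbed by the GARBAGE reference.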