Text grammar format overview (SAPI 5.3)

04/17/2012

Microsoft Speech API 5.3

The Extensible Markup Language (XML) format inside a GRAMMAR XML element (block) is an "expert-only-readable" declaration of a grammar that a speech application uses to accomplish the following:

Improve recognition accuracy by restricting and indicating to an engine what words it should expect.
Improve maintainability of textual grammars by providing constructs for reusable text components (internal and external rule references), phrase lists, and string and numeric identifiers.
Improve translation of recognized speech into application actions. This is made easier by attaching "semantic tags" (property name and value associations) to words and phrases declared inside the grammar.

A GRAMMAR XML element (block) appears in an XML source code file. The XML source is compiled into a binary grammar format, which is the format SAPI uses at application run time.

The following section covers:

Extensible Markup Language (XML)
Attributes
Contents
Comments
How SAPI utilizes XML information
Frequently used definitions
Non–empty concatenated recognition contents

Extensible Markup Language

The textual grammar format is an application of XML. Every XML element consists of a start tag (<SOME_TAG>) and an end tag (</SOME_TAG>) with a case-sensitive tag name and contents between these tags. If the element is empty, the start tag and the end tag can be combined into a single tag, for example <SOME_TAG/>. For more information about tags in XML grammars, see Grammar Format Tags. Additionally, more information about XML and the XML specification is available at: http://www.w3.org/TR/REC-xml.

For example, all grammars contain the opening tag <GRAMMAR> as follows:

   <GRAMMAR>
      ... grammar content
   </GRAMMAR>

Note that the contents of the grammar are contained between an opening tag and a closing tag.
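The two tag forms described above can be checked with any generic XML parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree module purely for illustration. It is not the SAPI grammar compiler, and SOME_TAG is the placeholder name used in the text above.

```python
import xml.etree.ElementTree as ET

# A start tag, the contents, and a matching end tag.
grammar = ET.fromstring("<GRAMMAR>... grammar content</GRAMMAR>")
print(grammar.tag, "->", grammar.text)

# An empty element written as a start/end pair and as a single combined tag.
long_form = ET.fromstring("<SOME_TAG></SOME_TAG>")
short_form = ET.fromstring("<SOME_TAG/>")

# Both spellings produce the same parsed element.
same_element = ET.tostring(long_form) == ET.tostring(short_form)
print(same_element)
```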

Attributes

Attributes of an XML element appear inside the start tag. Each attribute is in the form of a name followed by an equal sign followed by a string which must be surrounded by either single or double quotation marks. An attribute of a given name may only appear once in a start tag.

In summary, the literal string cannot contain <. It also cannot contain ' if the string is surrounded by single quotation marks, or " if the string is surrounded by double quotation marks. Furthermore, ampersand (&) characters may be used only in entity references such as &amp; and &gt;. When a literal string is parsed, the resulting replacement text has every entity reference, such as &gt;, resolved into its corresponding text, such as >. In this specification, only the resulting replacement text needs to be defined for attribute value strings. More information about XML and the XML specification is available at: http://www.w3.org/TR/REC-xml.
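The attribute rules above are general XML rules, so they can be tried out with a generic parser. The sketch below again assumes Python's xml.etree.ElementTree rather than SAPI's own parser. The attribute names A, B, and C and the helper is_well_formed are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# A single-quoted value may contain a double quote, and vice versa.
# Entity references in a value are replaced by their corresponding text.
elem = ET.fromstring("""<SOME_TAG A='say "hi"' B="it's" C="x &amp; y &gt; z"/>""")
quoted_a = elem.get("A")
quoted_b = elem.get("B")
resolved = elem.get("C")
print(resolved)

def is_well_formed(xml_text):
    """Return True if a generic XML parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Each of these violates one of the rules and is rejected by the parser.
raw_less_than = is_well_formed('<SOME_TAG A="x < y"/>')     # literal < in a value
bare_ampersand = is_well_formed('<SOME_TAG A="x & y"/>')    # & outside an entity reference
duplicate_name = is_well_formed('<SOME_TAG A="1" A="2"/>')  # same attribute name twice
print(raw_less_than, bare_ampersand, duplicate_name)
```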

For example, the grammar author can specify the language (id) of the grammar as follows.

   <GRAMMAR LANGID="409">
      ... grammar content
   </GRAMMAR>

The grammar element (<GRAMMAR>) has an attribute called LANGID, which must be a numeric value. The grammar author specifies the language attribute by placing the attribute inside the angle brackets of the opening tag and enclosing the attribute value (for example, 409) in quotation marks.
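As a small worked example, the sketch below converts the attribute value to a number. It assumes the LANGID value is read as hexadecimal. The text above does not state this, but it is consistent with 0x0409 being the Windows language identifier for English (United States).

```python
# Assumption: the LANGID attribute value is a hexadecimal number.
langid_text = "409"
langid = int(langid_text, 16)

print(langid)       # decimal form of the same identifier
print(hex(langid))  # back to hexadecimal notation
```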

Contents

The contents of an element consist of text or subelements. Formal definitions of valid contents in this specification are provided as regular and "multi-set" expressions. The pseudo-element name "Text" indicates untagged text. With these definitions, the XML specification defines the exact file syntax details.

For example, the grammar author can place either text or sub-elements inside a phrase tag as follows.

   <PHRASE>
      hello
   </PHRASE>

   <PHRASE>
      <OPT>world</OPT>
   </PHRASE>

For more information about tags in XML grammars, see Grammar Format Tags.
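To make the distinction between untagged text and subelements concrete, the following sketch parses the two PHRASE examples with Python's xml.etree.ElementTree (a generic parser used here only for illustration, not the SAPI compiler). The helper name describe is hypothetical.

```python
import xml.etree.ElementTree as ET

text_only = ET.fromstring("<PHRASE>hello</PHRASE>")
with_child = ET.fromstring("<PHRASE><OPT>world</OPT></PHRASE>")

def describe(element):
    """Return (untagged text, list of subelement tag names) for an element."""
    text = (element.text or "").strip()
    return text, [child.tag for child in element]

text_result = describe(text_only)    # text, no subelements
child_result = describe(with_child)  # no text, one OPT subelement
print(text_result)
print(child_result)
```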

Comments

The SAPI 5 XML parser treats HTML comment tags as unknown XML tag elements. The engine should provide support for comments and other unknown XML elements.

It is recommended that grammar authors place comments in their XML files (for example, mygrammar.xml), similar to commenting source code, since the XML parser will safely parse the comments without affecting the grammar itself. Similarly, there is no increase in the size of the binary form of the grammar (for example, mygrammar.cfg), since the SAPI 5 grammar compiler strips out the comments.

An example of a comment in an XML grammar is as follows.

   <!-- the 'travel' rule is the main voice command for our app, so it is active by default -->
   <RULE ID="RID_Travel" TOPLEVEL="ACTIVE">
      <PHRASE>travel from</PHRASE>

      <!-- include location grammar component, so we can change the location list at runtime -->
      <RULEREF REFID="RID_Location" PROPID="PID_FromDestination"/>
      <PHRASE>to</PHRASE>

      <!-- include location grammar component, so we can change the location list at runtime -->
      <RULEREF REFID="RID_Location" PROPID="PID_ToDestination"/>
   </RULE>

Note that the comment blocks always begin with <!-- and end with -->.
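The claim that comments do not affect the grammar can be illustrated with a generic parser. The sketch below assumes Python's xml.etree.ElementTree, whose default parser discards comments. It stands in for, and is not, the SAPI grammar compiler. The comments in the sample are shortened versions of the ones in the example above.

```python
import xml.etree.ElementTree as ET

rule_xml = """
<!-- the 'travel' rule is the main voice command for our app -->
<RULE ID="RID_Travel" TOPLEVEL="ACTIVE">
   <PHRASE>travel from</PHRASE>
   <!-- include location grammar component -->
   <RULEREF REFID="RID_Location" PROPID="PID_FromDestination"/>
   <PHRASE>to</PHRASE>
   <!-- include location grammar component -->
   <RULEREF REFID="RID_Location" PROPID="PID_ToDestination"/>
</RULE>
"""

rule = ET.fromstring(rule_xml)

# Only real elements remain after parsing; the comments are gone.
tags = [element.tag for element in rule.iter()]
print(tags)

# The attributes of the surviving elements are unchanged.
prop_ids = [ref.get("PROPID") for ref in rule.iter("RULEREF")]
print(prop_ids)
```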

How SAPI utilizes XML information

SAPI uses XML content in the following two ways.

The SAPI context-free grammar compiler compiles the XML grammar into a binary grammar format. The compiled binary grammar is loaded into the SAPI run-time environment from a file, memory, or object (.DLL) resource.
The speech recognition (SR) engine queries the run-time environment for available grammar information.

Frequently used definitions

Text: Untagged text declaring a sequence of words that the recognition engine will recognize. Tentatively, this text is only the not-necessarily-phonetic representation of words that is used for reading words whose pronunciation is unknown to the user (for example, for Japanese, kana rather than kanji); this form is called the spelling form. In further definitions in this section, Text is referenced as though it were a pseudo-element.

Non–empty concatenated recognition contents

The contents of a number of XML elements in this specification, such as the P element, contain a sequence of grammar constructs that are concatenated together (one grammar construct after another). These grammar elements must be recognized in order for the defined contents to be recognized.

The contents must be one of the following (and not both):

Text and any number of L, P, O, or RULEREF elements in any order with at least one L, P, or RULEREF.
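The alternative listed above can be expressed as a simple check on an element's subelements. The following is a minimal sketch under stated assumptions: it covers only this one listed alternative, it uses Python's xml.etree.ElementTree rather than the SAPI compiler, and the function name matches_listed_alternative is hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED = {"L", "P", "O", "RULEREF"}  # subelements that may appear, in any order
REQUIRED = {"L", "P", "RULEREF"}      # at least one of these must be present

def matches_listed_alternative(element):
    """Check the subelements of element against the rule stated above."""
    tags = [child.tag for child in element]
    only_allowed = all(tag in ALLOWED for tag in tags)
    has_required = any(tag in REQUIRED for tag in tags)
    return only_allowed and has_required

# Text, an optional part, and a rule reference: satisfies the rule.
mixed = matches_listed_alternative(
    ET.fromstring("<P>travel <O>please</O><RULEREF REFID='RID_Location'/></P>"))

# Only an O subelement: no L, P, or RULEREF, so the rule is not satisfied.
only_optional = matches_listed_alternative(
    ET.fromstring("<P><O>please</O></P>"))

print(mixed, only_optional)
```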