For Interactive Voice Response (IVR) applications and broader speech recognition tasks in voice applications, constrained-list or "grammar"-based recognition offers significant advantages. It far outperforms the traditional semantic-based speech recognition used in modern speech-to-text (STT) AI engines in accuracy, performance, and cost, because grammar-based recognition constrains the recognition output to a predefined set of rules.
Grammars adhere to the Speech Recognition Grammar Specification (SRGS), as detailed in the W3C specification. When a request comes into the engine, it converts spoken audio ("utterances") into text. The engine then compares the recognized text against the grammar and any associated artifacts, such as pronunciation lexicons. This process provides either a literal transcription or an interpretation that the grammar constrains to the information provided within the grammar. Extra logic, such as ECMAScript built into the grammar, can further refine the interpretation.
Constrained speech recognition is ideal for:
- Recognizing constrained lists (addresses, stock tickers, zip codes, department names, and such).
- Alphanumeric string recognition (tracking numbers, account numbers, confirmation codes, and such).
  - Including positional constraints. For example, the first two characters of a member ID must be AN, FD, or NT, as sketched in the grammar example after this list. A vehicle identification number is another example of positional constraints.
- Alphanumeric or digit recognition with checksums or similar constraints. For example, credit card numbers, where there's a Luhn checksum.
- Directed dialog applications where specific words or phrases should be uttered.
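As a concrete illustration of positional constraints, the following GrXML sketch accepts a member ID whose first two spelled-out characters must be AN, FD, or NT. The rule names and the assumption of a six-digit numeric suffix are hypothetical; GrXML authoring is covered in the sections that follow.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="MemberID">

  <rule id="MemberID" scope="public">
    <!-- Positional constraint: the first two characters must be AN, FD, or NT -->
    <one-of>
      <item>a n</item>
      <item>f d</item>
      <item>n t</item>
    </one-of>
    <!-- Hypothetical assumption: the ID ends with exactly six spoken digits -->
    <item repeat="6">
      <ruleref uri="#Digit"/>
    </item>
  </rule>

  <rule id="Digit">
    <one-of>
      <item>zero</item> <item>one</item> <item>two</item> <item>three</item> <item>four</item>
      <item>five</item> <item>six</item> <item>seven</item> <item>eight</item> <item>nine</item>
    </one-of>
  </rule>

</grammar>
```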
Authoring speech grammars
Write constrained speech grammars by using Grammar XML (GrXML). Like any XML document, a grammar file must begin with a header that specifies certain characteristics of the grammar. The main body of a grammar file consists of grammar rules that define the spoken words recognized by the grammar, and the corresponding variable values that the recognized items return.
Grammar file header
The header in a grammar file consists of the XML declaration, and the <grammar> element specifying the document language, root, and namespace.
<?xml version="1.0" encoding="UTF-8" ?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
version="1.0" xml:lang="en-US" root="YesNo"
tag-format="swi-semantics/1.0">
XML declaration and encoding type
The first element in the header is always the XML declaration. This element specifies the version of XML used in the document (1.0 or 1.1). It also specifies the encoding that applies to the document, which determines the languages that can or can't be used.
The version and encoding are required attributes. Use whichever encoding suits your environment (for example, your computer setup or text-processing application). The constrained speech recognition engine doesn't care which encoding you use.
A short list of typical encodings for various languages is shown in the following table:
| Encoding | Description |
|---|---|
| ISO-8859-1 | Latin-1. Used for English, French, German, and Spanish. |
| UTF-8 | Used for all languages. |
| UTF-16 | Used for all languages. |
| Big5 | Used for Cantonese (zh-HK). |
| GB | Used for Mandarin (zh-CN). |
| Shift-JIS and EUC-JP | Used for Japanese. |
| KSC and EUC-KR | Used for Korean. |
Most languages can be represented in more than one encoding.
At runtime, the system automatically converts the grammar file encoding into UTF-16 format by using the International Components for Unicode (ICU) libraries. For more information about ICU, see http://site.icu-project.org/.
Language, namespace, and semantic tag format
The second element in the header is the <grammar> element, whose attributes specify default information for the document. The required attributes are:
- xml:lang: Specifies the identifier for the default human language to use, as defined in the Request For Comments (RFC) document RFC 3066 on the IETF web site. Microsoft supports a wide range of languages. The language you choose must be compatible with the grammar encoding type.
- version: Specifies the version of GrXML (1.0).
- xmlns: Designates the grammar namespace. For GrXML grammars, this designation is always http://www.w3.org/2001/06/grammar.
- tag-format: Defines the format used for scripts within <tag> elements in the main body of the grammar to assign values.
The tag format must be one of these strings:
| Value | Format of semantic tags |
|---|---|
| swi-semantics/1.0 | Tag syntax (used if tag-format isn't defined). This syntax is known as swi syntax (for SpeechWorks International). |
| semantics/1.0 | W3C script tag syntax. |
| semantics/1.0-literals | W3C string literals tag syntax. |
Note
Strictly speaking, the tag-format attribute isn't required if your grammar doesn't use the <tag> element. However, most grammars use <tag> to assign values, so Microsoft strongly recommends that you specify the tag-format. Always point GrXML attributes and elements, such as xmlns, to the namespace http://www.w3.org/2001/06/grammar.
Dictionaries
In some cases, the grammar might need to include words or phrases that the constrained speech recognition engine can't parse normally. For example, a name might be said and spelled differently, like the city "Worcester," which might be pronounced "wih-sta."
Use the <lexicon> element to import dictionaries that map utterances to matching text in the grammar file.
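For example, a grammar hosted alongside a pronunciation dictionary might reference it as follows. The file name and location are hypothetical, and the dictionary format must be one the constrained speech recognition engine supports:

```xml
<!-- Inside the <grammar> element, before the rule definitions -->
<lexicon uri="https://{resourceName}.blob.core.windows.net/$web/city-names-lexicon.xml"/>
```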
Grammar file main body
The main section of a grammar file contains the rules that actually define the grammar: the spoken words and phrases to recognize, and the values to return to the main application for each recognized item.
Rules
The main body of a grammar file consists of rules defined by using the GrXML <rule> element. Each rule has a unique identifier. Each rule lists the words and phrases it recognizes as text within an <item> element or <token> element. These elements might be nested within other GrXML elements:
- The <one-of> element presents a list of acceptable alternatives; only one alternative is required to activate the rule.
- The <ruleref> element refers to another rule, much like a subroutine call.
- The <tag> element specifies actions to carry out or values to assign to a variable. It might include a script written in the tag-format language.
When the user utters a word or phrase that the rule covers, the rule executes the actions, value assignments, or other code defined for that utterance.
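The following sketch shows these elements working together. The rule names, phrases, and returned key (destination) are hypothetical, and the <tag> syntax shown assumes tag-format="semantics/1.0"; the equivalent assignments look different under swi-semantics/1.0.

```xml
<rule id="RouteCalls" scope="public">
  <!-- Optional politeness prefix -->
  <item repeat="0-1">please</item>
  <one-of>
    <item>direct</item>
    <item>send</item>
  </one-of>
  my calls
  <!-- Delegate the destination phrase to another rule, like a subroutine -->
  <ruleref uri="#Destination"/>
  <!-- Copy the value returned by the referenced rule into this rule's result -->
  <tag>out.destination = rules.Destination.place;</tag>
</rule>

<rule id="Destination">
  <one-of>
    <item>home <tag>out.place = "home";</tag></item>
    <item>to the office <tag>out.place = "work";</tag></item>
  </one-of>
</rule>
```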
Root rule
The root rule is the first rule in the file, unless the header specifies otherwise. It serves as the default top-level rule. When the grammar is referenced without specifying the rule to look up, this root rule is the first one consulted.
Rule scope
Assign each rule within the main body of a grammar file a scope. The scope indicates whether you can reference the rule independently from external files (public) or only by another rule within the same grammar (private). All rules are private by default, unless you define them as public.
When the rule is public, you can use its ID attribute as an anchor for references from other documents. For example, consider the following syntax:
<grammar src="../grammars/universals.grxml#YesNo"/>
When you invoke this grammar element, it refers directly to the public "YesNo" rule within the universals.grxml file, regardless of whether it's the file’s root rule.
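For reference, a minimal sketch of what the public YesNo rule inside universals.grxml might look like. The phrases and the returned values are hypothetical, and the tag syntax again assumes tag-format="semantics/1.0":

```xml
<rule id="YesNo" scope="public">
  <one-of>
    <!-- Affirmative variants return "yes", negative variants return "no" -->
    <item>yes <tag>out = "yes";</tag></item>
    <item>yeah <tag>out = "yes";</tag></item>
    <item>no <tag>out = "no";</tag></item>
    <item>nope <tag>out = "no";</tag></item>
  </one-of>
</rule>
```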
Note
The root rule of a grammar file might be private. This rule can't be referenced independently. However, it's used by default as the entry point to the grammar when you invoke the grammar file itself.
Extract meaning and return results
Note
The SWI_meaning key should contain the information returned to the voice-enabled agent operating within Copilot Studio.
The SWI_meaning key contains the semantic meaning of a recognized phrase. You can set it only for the root rule. This key is included in the swirec_extra_nbest_keys list by default, so it appears in the XML result if your grammar sets this key.
SWI_meaning filters out redundant answers so that entries on the n-best list are truly distinct. Eliminating redundancy improves confidence scores and the usefulness of the n-best list.
When one recognized phrase is similar to another in the grammar, it often has a low confidence score, because the constrained speech recognition engine is unsure which phrase is correct. When you use SWI_meaning properly, the constrained speech recognition engine groups redundant interpretations into the same slot on the n-best list. In the following example, SWI_meaning is set to "direct calls home" whether the recognized phrase is "direct my calls home" or "please direct my calls home."
Without SWI_meaning, the grammar might produce the following n-best list:
| N | Text |
|---|---|
| 1 | direct my calls to my car phone |
| 2 | direct calls to my car |
| 3 | send calls home |
| 4 | please send my calls to the office |
| 5 | send my calls to the office |
| 6 | direct calls to my home |
When you use SWI_meaning, the constrained speech recognition engine arranges the n-best list by the meaning of the interpretation rather than the exact phrase spoken, so that entries on the n-best list are truly distinct:
| N | Text | Top-level SWI_meaning key |
|---|---|---|
| 1 | direct my calls to my car phone | direct calls car |
| | direct calls to my car | direct calls car |
| 2 | send calls home | direct calls home |
| | direct calls to my home | direct calls home |
| 3 | please send my calls to the office | direct calls work |
| | send my calls to the office | direct calls work |
The constrained speech recognition engine sets SWI_meaning automatically, even if you don't explicitly set it in a script within the grammar.
If you don't explicitly define SWI_meaning on the root, it's constructed by concatenating all the keys defined in the root and their values. However, this construction doesn't apply to any keys beginning with SWI_, such as SWI_literal. The key/value pairs are first sorted alphabetically. The reasoning is that as far as the application is concerned, the set of keys returned is the sentence’s meaning.
If there are no keys, the results depend on whether you're using SISR or SWI semantics. With SISR, the SWI_meaning key isn't set if there are no keys. In contrast, with SWI semantics, SWI_meaning is set to the following:
{SWI_literal:<literal>}
If SWI_meaning is an object, it's converted to a string representation.
While the application can access SWI_meaning, it more often uses other key/value pairs defined specifically for it.
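As a sketch of how a grammar can set SWI_meaning explicitly, the following root-rule fragment collapses the phrasing variants from the earlier example into one canonical meaning. The rule name and phrases are hypothetical, and the tag syntax assumes tag-format="semantics/1.0":

```xml
<rule id="RouteCallsHome" scope="public">
  <item repeat="0-1">please</item>
  <one-of>
    <item>direct</item>
    <item>send</item>
  </one-of>
  my calls home
  <!-- Phrases such as "direct my calls home" and "please send my calls home"
       all return the same meaning, so the n-best entries stay distinct -->
  <tag>out.SWI_meaning = "direct calls home";</tag>
</rule>
```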
Host speech grammars via Azure Storage
Copilot Studio supports constrained speech recognition through speech grammars. However, it doesn't support directly authoring, testing, or hosting these grammars. For hosting grammars, use Microsoft Azure Storage to create a trusted and secure connection between the voice-enabled agent and grammar storage.
Set up an Azure storage account
Create an Azure storage account. Ensure that the subscription, resource group, region, and resource name of the new storage account follow your organization's policies. Use the following settings:
- For Primary service, select Azure Blob Storage or Azure Data Lake Storage Gen2.
- For Performance, select Premium.
Learn more in Create an Azure storage account.
Set up the storage container
Use the static website ($web) container as the storage container for your uploaded grammar files. The static website provides the primary endpoint and secondary endpoint for the site.
After uploading your grammar file, select the file from the directory to view the properties and details of the file. Save the URL for the file, which should be in the following format:
https://{resourceName}.blob.core.windows.net/$web/{grammarFileName}
Learn more in Create a container.
Authenticate the constrained speech recognition engine
For constrained speech recognition to work within a voice-enabled agent, the system needs to authenticate by using the storage account created in the previous step as a trusted location. This authentication requires the Storage Blob Data Reader role.
Learn more in Assign Azure roles using the Azure portal.
Sign in to the Azure portal, open an Azure Cloud Shell session, and run the following command to create the constrained speech recognition engine service principal in your tenant.
az ad sp create --id e0e7bef0-777c-40ef-86aa-79d83ba643c7
Note
When you search for the service principal, you see it contains "NRaaS" in the name.
Employ constrained speech in Copilot Studio
Create an external entity
You can think of an entity within Copilot Studio as a unit of information that represents a certain type of real-world subject. For example, a phone number, postal code, city, or even a person's name. By using entities, an agent can recognize the relevant information from a user input and save it for later use. In this scenario, a constrained speech grammar performs the recognition.
Use external entities to reference speech grammars. To create an external entity, open your voice-enabled agent and go to Settings > Entities > Add an entity > Register an external entity.
Note
When constructing your entity, use either a global or a system variable, and not an environment variable. If you have to use an environment variable, create a global variable and assign it the value of the environment variable. You can then reference this global variable in the grammar URL.
Enter the following information:
Name: The grammar URL, in the form https://{resourceName}.blob.core.windows.net/$web/{grammarFileName}?constrainedrequired=true

Note

The URL is case-sensitive.
The default recognition mode is Speech Only. For alternative grammar configurations, see the following table:
| Type | Query Parameter | Example |
|---|---|---|
| Speech Only | None | https://{resourceName}.blob.core.windows.net/$web/{grammarFileName}?constrainedrequired=true |
| DTMF | &mode=dtmf | https://{resourceName}.blob.core.windows.net/$web/{grammarFileName}?constrainedrequired=true&mode=dtmf |
| Speech or DTMF (Same Grammar File) | &mode=speechdtmf | https://{resourceName}.blob.core.windows.net/$web/{grammarFileName}?constrainedrequired=true&mode=speechdtmf |
| Speech or DTMF (Different Grammar Files) | &mode=speechdtmf&dtmfgrammar={grammarURL} | https://{resourceName}.blob.core.windows.net/$web/{grammarFileName}?constrainedrequired=true&mode=speechdtmf&dtmfgrammar=https://{resourceName}.blob.core.windows.net/$web/{DTMFgrammarFileName} |
Note
URLs might also include direct variable names within Copilot Studio.
For example, suppose there are two identical sets of grammars, one for English and one for Spanish, each stored in a different subdirectory. The multilingual agent should use the English grammars when conversing in English and the Spanish grammars when conversing in Spanish. In this case, the grammar URL would be: {Env.BaseURL}/common/{System.User.Language}/{grammarFileName}?

where System.User.Language is en_US or es_US and changes when the language switches in your agent.
Description: A simple description of the grammar, which the selector on the canvas references as the entity name. For example, "credit card number."
Data type: Choose the Record data type, and define the schema of the expected tags in the response. For example, if the grammar returns SWI_meaning and city, the Record schema looks like this:

kind: Record
properties:
  city: String
  SWI_meaning: String
After you select Save, the entity appears in the list. In the authoring canvas, go to a Question node. Just like with traditional entities, select the external entity (attached to a grammar) that you want the agent to recognize as a result of the user's response to the prompt.
Runtime behavior
When a voice-enabled agent runs and encounters the logic that uses an external grammar, the voice-enabled agent reaches out to the Azure storage account and retrieves the grammar for interpretation. The agent then matches what the user said against the constraint applied within the grammar. If a match succeeds, the system returns the response in the Record variable, according to the schema defined in the external entity.
Canvas logic
The result saved in the node’s variable is always a Record type as defined in the schema of the external entity. Authors can use this Record variable to access keys as defined in the schema, like variableName.SWI_meaning or variableName.city through dot notation.
Debugging
Error codes
| Error | Definition |
|---|---|
| 400 | Bad request |
| 401 | Unauthenticated |
| 403 | Forbidden |
| 404 | No Speech |
| 408 | No Input Timeout |
| 418 | Session Time out |
| 419 | No active resources – missing a grammar |
| 500 | Internal Error – report to Microsoft in a support ticket |
SWI_literal
For many grammars that work with the constrained speech recognition engine, the SWI_literal key returns the literal statement the user uttered, not the interpreted result. Log this value as one of the outputs in Copilot Studio to help with debugging.
Known limitations
The solution has the following limitations:
- The maximum size of an individual grammar file is currently 100 MB.
- Variable passing isn't supported.
- The storage account must be located within the same tenant as the agent.
- Size of the URL can't exceed 500 characters.
- Only Azure storage account endpoints are allowed.
- Subgrammars can only be hosted in the same storage account (using same FPA for authorization).
- The XML parser is secure: DTDs aren't allowed, and grammars must validate against the SRGS/SISR schema.
- Only NLSML output format is supported internally.
- The legacy swirec_simple_result_key parameter has no effect; all tags are returned.
Legal
Constrained speech recognition is processed by the Microsoft Dynamics service. By using this experience, you agree to the Dynamics Terms.