Speech Prompt Editor Tips

This topic contains tips to consider when using the Speech Prompt Editor. For more tips, see issues of the Microsoft Speech Technologies Newsletter on the Microsoft Speech web site.

Validating Prompt Coverage for a Speech Application Before Recording Audio

Open the project properties dialog box by right-clicking on the prompt project in Solution Explorer and clicking Properties. In the properties dialog box, select the Use filler for missing waveforms check box. When you validate prompt coverage for your solution, the prompt editor will insert a few samples of silence for every missing waveform, allowing the project to build (a necessary step for validation).

Making Recordings Made at Different Volumes Sound Natural

Use volume normalization to make recordings sound more natural when they are made at different volumes and then concatenated. Open the project properties dialog box by right-clicking on the prompt project in Solution Explorer and clicking Properties. The available options are either to maximize the peak for each wave or to adjust the average energy in each wave to match a specific recording.

To match a recording, select Match to promptdb recording, browse to a .promptdb file, and enter the name of a .wav audio file within the prompt file. Note that .wav file names are visible in the Speech Prompt Editor. The next time the project is built, the volume of the audio in the output .prompts file will be normalized based on the .wav file selected.

Forming Alignments with Imported .wav Files

When 8 kHz .wav files are imported, alignment may fail. This is because alignment in Speech Prompt Editor is designed to work with higher quality recordings (22.05 kHz or 44.1 kHz), not 8 kHz recordings. However, the English Telephony engine can align to 8 kHz .wav files. Accordingly, a workaround is available if Speech Server is installed on the same computer as the Speech Application SDK. The workaround is described in the following procedure.

  1. Click Start, and then click Control Panel.
  2. In Control Panel, double-click Speech.
  3. In the Speech Properties dialog box, in the Language panel, select the English Telephony engine as the default engine.
  4. Click OK.

Using Spectrogram View for Smooth Prompt Concatenation

When a waveform audio file is imported, the speech recognizer automatically aligns the text to the audio data. Generally, the alignments are quite accurate, but sometimes they need to be tuned. The spectrogram view shows concentrations of energy at different frequencies, so it may be easier to find word boundaries in this view. To view a spectrogram for a recording, open a .wav file in Speech Prompt Editor, and then, on the Wave menu, click Show Spectrum. Look for the edges of the black areas.

Adding Prompts to a Multimodal Application

Multimodal applications often do not use prompts. QA controls are designed to be used for spoken dialog (voice-only interaction), when the prompting strategy is closely tied to a dialog flow. For multimodal browsers, QA controls degenerate to the equivalent of a SALT <listen> tag, enabling the tap-and-talk authoring paradigm.

To enable spoken dialog in a multimodal application, add speech synthesis (or waveform concatenation) capabilities to the application by using the SALT <prompt> tag directly.

Imagine a simple multimodal application that uses a text box to capture a destination city, as shown in the following code snippet.

<HTML>
  <body>
    <form id="Form1" method="post" runat="server">
      <asp:Label id="Label1" runat="server">
        Destination city:</asp:Label>
      <asp:TextBox id="DestinationTextBox"
        runat="server"></asp:TextBox>
    </form>
  </body>
</HTML>

In order to speech-enable the text box, add a SemanticMap control and a QA control, and set the appropriate events to use tap-and-talk. For illustration purposes, this QA control has an inline grammar that accepts "New York," "Los Angeles," and "Seattle" only. The following code sample shows what the HTML looks like after adding these controls using the visual designer. Note that this code is generated automatically.

<%@ Register TagPrefix="speech"
    Namespace="Microsoft.Web.UI.SpeechControls" 
    Assembly="Microsoft.Web.UI.SpeechControls, Version=1.1.3200.0, 
    Culture=neutral, PublicKeyToken=31bf3856ad364e35" %>
<HTML>
  <body>
    <form id="Form1" method="post" runat="server">
      <asp:Label id="Label1" runat="server">
        Destination city:</asp:Label>
      <asp:TextBox id="DestinationTextBox" runat="server">
        </asp:TextBox>
      <speech:SemanticMap id="Sm1" runat="server">
          <speech:SemanticItem ID="Destination"
                TargetElement="DestinationTextBox"
                TargetAttribute="value">
          </speech:SemanticItem>
      </speech:SemanticMap>

      <speech:QA id="QA1" runat="server">
        <Prompt InlinePrompt="Where are you flying to?">
           </Prompt>
        <Reco StartElement="DestinationTextBox" 
          StartEvent="onclick">
          <Grammars>
            <speech:Grammar id="CityGrammar">
              <grammar lang="en-US" 
                tag-format="semantics-ms/1.0" 
                version="1.0" mode="voice" 
                root="Rule1" 
                xmlns="http://www.w3.org/2001/06/grammar">
                <rule id="Rule1">
                  <one-of>
                    <item>New York</item>
                    <item>Los Angeles</item>
                    <item>Seattle</item>
                  </one-of>
                </rule>
              </grammar>
            </speech:Grammar>
          </Grammars>
        </Reco>
        <Answers>
          <speech:Answer SemanticItem="Destination" 
            XpathTrigger="/SML"></speech:Answer>
        </Answers>
      </speech:QA>

    </form>
  </body>
</HTML>

At this point, the basic interaction is ready. However, end users might have difficulty understanding how to interact with this application. One solution is to add a Help button that triggers a prompt providing further instructions, as shown in the following code snippet.

<input type="button" value="Help" onclick="HelpPrompt.start()">
<salt:prompt id="HelpPrompt">Click on the textbox 
  before speaking.</salt:prompt>

Managing Frequently Changed Prompts

In cases where prompts need to be changed frequently, it is not necessary to recompile and redeploy a .prompts file following every change. Because a .NET speech application can use more than one .prompts file simultaneously, it is possible to put the prompts that change frequently in separate, smaller databases. When a prompt changes, recompile and redeploy just the one file containing that prompt. To add additional prompt databases, use the Manage Prompt Databases... link in the Properties window, or the Manage this application's prompt databases link on the Voice Output pane of the property builder.

Alternatively, use the SALT content tag to set a frequently changed prompt to reference a single wave file, which can be updated as necessary. To do this, enter <salt:content href='/MyWav.wav' /> as the text of the inline prompt, where MyWav.wav references the audio file.

Overriding a Single Application Control Prompt

Application controls contain built-in prompts, which makes them immediately usable with little or no modification. However, sometimes changing the built-in prompts can improve the dialog. This is easily done by writing a prompt select function for the application control; the prompt returned by the prompt select function overrides the built-in prompt. Often, however, authors want to change only one of the built-in prompts. To do this, return null from the prompt select function instead of a prompt for each prompt that should not change; the application control then plays its built-in prompt. In the following code excerpt, which is part of the prompt select function, the confirmation prompt is changed but the acknowledge prompt is not.

function NaturalNumber1_prompt_confirm(lastCommandOrException, count)
{
    return "Do you want to book a table for " + GuestNumber.value + "?";
}

function NaturalNumber1_prompt_acknowledge(lastCommandOrException, count)
{
    return null;
}

Achieving Natural-Sounding Output with Telephone Numbers

Authors want their applications to speak telephone numbers in a recorded voice. However, just recording the digits zero through nine and using them to construct phone numbers results in output that sounds strange and artificial. Clearly it is not reasonable to record all possible telephone numbers and then pick a whole recording each time. However, it is possible to produce much more natural output by recording the digits in each of the contexts that they will appear in the final phone number.

For example, take the phone number 203 535 3245. Each instance of the digit three sounds very different because of its position in the phone number. People normally read phone numbers with a rising intonation for the first block, a relatively flat intonation during the second block, and a falling intonation during the final block. The key to getting natural-sounding output is to capture each of the digits in every position in the phone number, tag these contextually different recordings appropriately, and then use them to construct the final number. Thus, to make a reasonable capture of all of the positional contexts, only twelve recordings are needed. An example of the required recordings is shown in the following list.

321 230 1234
132 302 4123 
213 023 3412 
654 879 2341 
465 798 5678 
546 987 8567 
987 546 7856 
798 465 6785 
879 654 9012 
023 213 2901 
302 132 1290 
230 321 0129

Next, identify each of the digits with a tag that can be easily constructed programmatically in the prompt functions. For example, the tag b3_4_9 refers to the digit nine in position four of the third block of numbers. The output of the prompt function is a string of the form <WITHTAG TAG=b1_1_2> 2 </WITHTAG> <WITHTAG TAG=b1_2_0> 0 </WITHTAG> and so on.

Note  This method only captures the positional context of the digits, not the coarticulatory effects of different bordering digits. For example, in natural speech the two in 206 and the two in 216 sound slightly different because of the bordering digits.
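
As an illustration, the following prompt function sketch builds such a string for a ten-digit number. The function name and the PhoneNumber semantic item are hypothetical; only the WITHTAG markup follows the form described above.

function PhoneNumber_prompt(lastCommandOrException, count)
{
    // Hypothetical semantic item: PhoneNumber.value holds a ten-digit string,
    // for example "2035353245".
    var digits = PhoneNumber.value;
    var blocks = [digits.substr(0, 3), digits.substr(3, 3), digits.substr(6, 4)];
    var prompt = "";

    for (var b = 0; b < blocks.length; b++)
    {
        for (var p = 0; p < blocks[b].length; p++)
        {
            var digit = blocks[b].charAt(p);
            // Tag each digit with its block, position, and value, for example
            // b1_1_2 for the digit two in position one of the first block.
            prompt += "<WITHTAG TAG=b" + (b + 1) + "_" + (p + 1) + "_" + digit + "> " +
                digit + " </WITHTAG> ";
        }
    }
    return prompt;
}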

Monitoring the Number of Times an Exception Occurs

Use the ActiveQA.History property to return a history of commands and exceptions. ActiveQA.History accesses an array that stores command types and recognition exceptions. The command/exception history is maintained by RunSpeech and behaves as a stack, inserting the most recent command or exception into the last position in the array. It is available only on the active QA control. Once a different QA control is activated, the array is cleared.

For example, use the following statement to access the most recent Command or exception.

lastCommand = RunSpeech.ActiveQA.History[RunSpeech.ActiveQA.History.length - 1];
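
To monitor how many times a particular exception has occurred while the current QA control is active, scan the history array. The following sketch is illustrative only; the CountException function name is hypothetical, and "NoReco" is an assumed example of an exception string recorded in the history.

function CountException(exceptionName)
{
    // Count how many entries in the active QA control's history match the
    // given command or exception name.
    var history = RunSpeech.ActiveQA.History;
    var total = 0;
    for (var i = 0; i < history.length; i++)
    {
        if (history[i] == exceptionName)
        {
            total++;
        }
    }
    return total;
}

For example, CountException("NoReco") would return the number of no-recognition exceptions recorded since the current QA control became active.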

Playing an Audio File in a Prompt

Use the SALT content element within prompt functions to play a .wav file. The content element is primarily used to refer to external content such as .wav or Speech Synthesis Markup Language (SSML) files, but it can also hold optional inline text or XML that renders if there is a problem with the external content.

To use the content element in the Prompt Function Editor, pass in the .wav file path as a parameter and build the content element around it. For example, the following statement will speak the prompt "Your recording follows" and then play the .wav file referenced by the file path parameter strPlayBackURL. If the .wav file is not available, the alternate prompt "I'm sorry, the recording wasn't found" plays.

return ("Your recording follows.<salt:content href=\"" + strPlayBackURL + "\">I'm sorry, the recording wasn't found.</salt:content>");

Handling Speech Synthesis Markup Language (SSML) Differences Between Client and Server

The prompt engine used in the development environment is different from the prompt engine deployed to Microsoft Speech Server. In some cases, especially when text normalization is used, prompts sound different in the development environment than in the production environment. To discover these differences, test prompts thoroughly in the production environment. If differences are found, try rewriting the prompts to minimize any problems.

Understanding Prompt Engine Database Searches

To find extractions to combine into a prompt, the prompt engine searches the extractions and transcriptions contained in the prompt database. The following list describes the rules for matching prompts with transcriptions and extractions.

  • The prompt engine normalizes the text for white space, case, and punctuation in the same manner as that used when loading the databases.

  • The maximum length of a search string is 1000 characters, including spaces, control characters, and line breaks (line breaks are treated as a single character).

    Note  To cause the prompt engine to speak a longer string, use the peml:id Element to specify prompts that compose the longer string.

  • XML elements that the prompt engine recognizes delimit segment boundaries. The prompt engine performs a separate search for each segment.

  • If an entry for a given segment does not exist in the databases that are loaded into memory, or if the audio file associated with the segment is not a valid .wav file, the prompt engine creates a "fallback" TTS engine that synthesizes the output.

  • .wav files containing silence can be used to produce silent output using the prompt engine.

See Also

RunSpeech Example | Creating Prompt Functions | Validating Prompts