More on TTS navigation

Whew! What a week! Finally got back to my blog. :-)

My answer to Sushant’s question is pretty long, so first I want to lay out a cheaper alternative that will be just fine for many apps.

If the specific features Sushant described aren’t important to you, and you’re more interested in just getting a general TTS navigation experience implemented quickly, then SAPI has a few functions that enable you to easily play & pause TTS, as well as skip forward/backward by sentence.

 

  • To start reading the whole document from the beginning:
    • Call SPVoice::Speak.
    • Pass in the text you want synthesized.
    • The voice may already be speaking something, so pass in the SVSFPurgeBeforeSpeak flag to make sure that gets stopped first.
  • To pause:
    • Simply call SPVoice::Pause.
  • To resume:
    • Call SPVoice::Resume.
  • To skip forward or backward by sentence:
    • Call SPVoice::Skip.
    • You specify how many sentences to skip. A positive number skips forward. A negative number skips backward.
  • To stop:
    • Call SPVoice::Speak with the SVSFPurgeBeforeSpeak flag and a null string for the text.
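
Here's a minimal sketch of how those calls might be wrapped up, written in Python. It assumes you hand it the SAPI automation voice object (for example one created with pywin32's win32com.client.Dispatch("SAPI.SpVoice")). The TtsControls and FakeVoice names are just made up for illustration; FakeVoice only records calls so the sketch can run on a machine without SAPI.

```python
# SAPI SpeechVoiceSpeakFlags values.
SVSFlagsAsync = 1
SVSFPurgeBeforeSpeak = 2


class TtsControls:
    """Thin wrapper (hypothetical) mapping navigation commands to voice calls."""

    def __init__(self, voice):
        self.voice = voice  # e.g. win32com.client.Dispatch("SAPI.SpVoice")

    def play(self, text):
        # Purge anything already queued, then speak without blocking.
        self.voice.Speak(text, SVSFlagsAsync | SVSFPurgeBeforeSpeak)

    def pause(self):
        self.voice.Pause()

    def resume(self):
        self.voice.Resume()

    def skip_sentences(self, count):
        # Positive counts skip forward, negative counts skip backward.
        return self.voice.Skip("Sentence", count)

    def stop(self):
        # An empty string plus the purge flag discards pending speech.
        self.voice.Speak("", SVSFPurgeBeforeSpeak)


class FakeVoice:
    """Stand-in that records calls, so the sketch runs without SAPI."""

    def __init__(self):
        self.calls = []

    def Speak(self, text, flags=0):
        self.calls.append(("Speak", text, flags))

    def Pause(self):
        self.calls.append(("Pause",))

    def Resume(self):
        self.calls.append(("Resume",))

    def Skip(self, item_type, count):
        self.calls.append(("Skip", item_type, count))
        return count
```

Swapping FakeVoice for a real voice object should be the only change needed to try this against SAPI, though I haven't covered error handling here.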

This is pretty much all you need. There are very few lines of code to write and debug. You may decide to get fancy and make resume and skip default to play when there’s nothing to resume or skip to. SPVoice::Status.RunningState should provide enough info for you to implement the right logic. You may also want to control rate and volume, and there are function calls for each of these too.
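
As a rough illustration of the "resume defaults to play" idea, here's a small Python sketch. It assumes the app keeps its own paused flag (I'm not relying on RunningState to report a paused voice), and that RunningState takes the SpeechRunState values SRSEDone and SRSEIsSpeaking. The resume_or_play and FakeVoice names are hypothetical.

```python
from types import SimpleNamespace

# SAPI SpeechRunState and SpeechVoiceSpeakFlags values.
SRSEDone = 1
SRSEIsSpeaking = 2
SVSFlagsAsync = 1
SVSFPurgeBeforeSpeak = 2


def resume_or_play(voice, text, paused):
    """Resume if the app paused earlier, otherwise start from the beginning.

    `paused` is a flag the application maintains itself (an assumption of
    this sketch). Returns the name of the action taken.
    """
    if paused:
        voice.Resume()
        return "resume"
    if voice.Status.RunningState == SRSEIsSpeaking:
        return "none"  # already speaking, nothing to resume or restart
    # Nothing in progress: fall back to playing the whole text.
    voice.Speak(text, SVSFlagsAsync | SVSFPurgeBeforeSpeak)
    return "play"


class FakeVoice:
    """Stand-in with a fixed running state, so the sketch runs without SAPI."""

    def __init__(self, running_state):
        self.Status = SimpleNamespace(RunningState=running_state)
        self.calls = []

    def Speak(self, text, flags=0):
        self.calls.append(("Speak", text, flags))

    def Resume(self):
        self.calls.append(("Resume",))
```

The same pattern would apply to skip: check whether anything is in progress, and fall back to play if not.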

 

Now, on to Sushant’s challenge…

 

First, I’d like to clarify and make some assumptions:

  1. Characters are difficult to use as the smallest unit of utterance (how do you pronounce the “s” in “fish”?), so phonemes are more suitable.
  2. “Line” is defined totally by the application. For this discussion I’ll take the more general approach of using sentences.
  3. How is “stop” different from “pause”? Let’s say that if you call stop, then the next time you call play or resume, synthesis starts from the beginning of the text.

Let me know if these assumptions don’t match what you had in mind.

 

What makes this challenging is that while SAPI allows you to easily skip to the beginning of sentences, the problem requires the ability to skip to the beginning of words or phonemes, and SAPI has no particular functions to do this.

 

One approach

 

One approach is to forget about phonemes and only worry about words and sentences. For English, you can come up with a reasonable word and sentence breaking algorithm based on searching for white-space and punctuation (there will be some glitches in this approach, but only in uncommon cases and nothing drastic). Word and sentence breaks are very difficult to do for a lot of written languages, so this approach isn’t generic. But for English it’s probably fine.
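
To make that concrete, here's a deliberately naive breaker in Python. It's only a sketch: the function names are mine, and it will mis-split abbreviations like "Dr." and numbers like "3.14", which are exactly the sort of glitches mentioned above. It returns (start, end) character offsets rather than strings, so the results can be mapped back into the original text.

```python
import re

# A sentence runs up to and including ., ! or ? (plus any closing quotes or
# brackets). Trailing text with no terminator also counts as a sentence.
_SENTENCE = re.compile(r"""[^.!?]*[.!?]+["'”’)\]]*|[^.!?]+$""")
# A word is any run of non-whitespace characters.
_WORD = re.compile(r"\S+")


def sentence_spans(text):
    """Return (start, end) offsets of each sentence, whitespace trimmed."""
    spans = []
    for match in _SENTENCE.finditer(text):
        chunk = match.group()
        stripped = chunk.strip()
        if stripped:
            # Shift the start past any leading whitespace in the match.
            start = match.start() + (len(chunk) - len(chunk.lstrip()))
            spans.append((start, start + len(stripped)))
    return spans


def word_spans(text):
    """Return (start, end) offsets of each whitespace-delimited word."""
    return [(m.start(), m.end()) for m in _WORD.finditer(text)]
```

For example, slicing the text with each span from sentence_spans gives the individual sentences to pass to Speak.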

 

If you’re in sentence mode, just call SPVoice::Speak with the particular sentence. When it’s done, the SPVoice::EndStream event will fire, and you can just Speak the next sentence. Skipping is easy – just Speak the appropriate sentence, using the SVSFPurgeBeforeSpeak flag. The same approach works for word mode: just call SPVoice::Speak one word at a time.
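
Here's one way the sentence-mode loop might look, sketched in Python. A few assumptions to flag: the SentencePlayer and FakeVoice names are hypothetical, the app wires its EndStream handler to on_end_stream, and the stream number returned by Speak differs between successive calls so that a late EndStream from a purged sentence can be ignored. That last point is worth verifying on your SAPI version.

```python
# SAPI SpeechVoiceSpeakFlags values.
SVSFlagsAsync = 1
SVSFPurgeBeforeSpeak = 2


class SentencePlayer:
    """Speaks a list of sentences one at a time (hypothetical helper)."""

    def __init__(self, voice, sentences):
        self.voice = voice
        self.sentences = sentences  # list of strings, already split
        self.index = 0              # sentence currently being spoken
        self.stream = None          # stream number of the sentence in flight

    def _speak_current(self):
        flags = SVSFlagsAsync | SVSFPurgeBeforeSpeak
        self.stream = self.voice.Speak(self.sentences[self.index], flags)

    def play(self):
        self._speak_current()

    def skip(self, count):
        # Clamp so skipping past either end lands on the first or last sentence.
        last = len(self.sentences) - 1
        self.index = max(0, min(last, self.index + count))
        self._speak_current()

    def on_end_stream(self, stream_number):
        """Call this from the EndStream event handler."""
        if stream_number != self.stream:
            return  # stale event from a sentence that was purged
        if self.index + 1 < len(self.sentences):
            self.index += 1
            self._speak_current()
        else:
            self.stream = None  # reached the end of the document


class FakeVoice:
    """Stand-in that hands out increasing stream numbers, for testing."""

    def __init__(self):
        self.spoken = []

    def Speak(self, text, flags=0):
        self.spoken.append(text)
        return len(self.spoken)
```

Word mode would be the same class fed a list of words instead of sentences.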

 

An alternative

 

An alternative is to work with the full string, but just pass in the tail of the text to SPVoice::Speak, starting from the point you want to skip to. For example, if you want to skip to “over” in “the quick brown fox jumped over the lazy dog”, call SPVoice::Speak with “over the lazy dog”. Of course, you’ll want to keep a cursor into the complete string so you can figure out how to jump backwards.
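
A small Python sketch of that cursor, assuming whitespace-delimited words (the TailCursor name and its methods are made up for illustration). It only remembers where the current utterance started, and returns the tail you would pass to Speak.

```python
import re


class TailCursor:
    """Remembers which word of the master string the current utterance starts at."""

    def __init__(self, text):
        self.text = text
        # Character offset at which each whitespace-delimited word begins.
        self.starts = [m.start() for m in re.finditer(r"\S+", text)]
        self.word = 0

    def jump(self, count):
        """Move the cursor by `count` words and return the tail to speak."""
        if not self.starts:
            return ""  # nothing to speak in an empty document
        last = len(self.starts) - 1
        self.word = max(0, min(last, self.word + count))
        return self.text[self.starts[self.word]:]
```

With the fox sentence, jumping forward five words from the start yields "over the lazy dog", and jumping back two from there yields "fox jumped over the lazy dog".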

 

If you take this approach, another thing you’ll need to do is keep track of every sentence, word & phoneme that’s spoken. Although SAPI can’t skip to specific words & phonemes, it will tell you when they’ve been spoken using the SPVoice::Sentence, ::Word and ::Phoneme events. The great thing about the Sentence and Word events is that they include the index of the corresponding first character in the string that was passed to SPVoice::Speak. So if you trap these events, you’ll be able to figure out exactly where the synthesizer is up to in the master string, and be able to use this as the base index for skipping forward or backward. However, because the events only fire as the text is spoken, you wouldn’t be able to skip any further forward than has already been rendered.
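
Here's a rough Python sketch of the bookkeeping for word events. It assumes the Word event handler passes along the event's CharacterPosition and Length, that CharacterPosition is a zero-based offset into the string given to Speak, and that the app records the master-string offset of each tail it speaks. PositionTracker and its method names are hypothetical.

```python
import bisect


class PositionTracker:
    """Maps Word event positions back into the master string."""

    def __init__(self):
        self.base = 0      # master offset where the current tail starts
        self.current = 0   # master offset of the word being spoken
        self.seen = []     # sorted master offsets of every word heard so far

    def start_utterance(self, base_offset):
        """Call just before speaking the tail that begins at base_offset."""
        self.base = base_offset
        self.current = base_offset

    def on_word(self, character_position, length):
        """Call from the Word event handler (length is unused in this sketch)."""
        self.current = self.base + character_position
        i = bisect.bisect_left(self.seen, self.current)
        if i == len(self.seen) or self.seen[i] != self.current:
            self.seen.insert(i, self.current)

    def skip_target(self, count):
        """Master offset `count` words from the current one.

        Limited to words already reported, so it can't go further forward
        than what has been rendered.
        """
        if not self.seen:
            return self.current
        i = bisect.bisect_left(self.seen, self.current) + count
        i = max(0, min(len(self.seen) - 1, i))
        return self.seen[i]
```

To skip, take the offset from skip_target, call start_utterance with it, and speak the master string from that offset onward. Sentences could be tracked the same way with a second list fed from the Sentence event.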