SpVoice (SAPI 5.3)

Microsoft Speech API 5.3

An application creates the SpVoice object and uses the ISpVoice interface to submit and control speech synthesis.  Applications can speak text strings, text files, and audio files.  Although the object is named "SpVoice," it is a much higher-level object than a single voice.  Conceptually, it accepts input data streams and renders them to the specified output, potentially using multiple speech synthesis voices in the process.  Each SpVoice instance has its own queue of input streams (usually just text) and its own output stream (usually an audio device).  Each call to ISpVoice::Speak adds another item to the end of the SpVoice queue.

Basic Synthesis

The main speech synthesis method is ISpVoice::Speak.  Almost everything related to controlling synthesis (for example, rate, pitch, and volume) is done through this single method.  It can speak plain text, or the application can mark up the text with synthesis markup tags.  The Speak method lets the application specify whether the call is synchronous or asynchronous.  A synchronous call does not return until all of the text has been rendered.  An asynchronous call returns immediately, and the text is rendered on a background thread.
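As a minimal sketch of the synchronous and asynchronous forms of Speak, the following C++ assumes the caller already holds an ISpVoice pointer.  The helper names WithRate and SpeakText are hypothetical and not part of SAPI.  The <rate absspeed="..."> tag is one of the SAPI synthesis markup tags, and the text passed to WithRate is assumed to contain no characters that need XML escaping.

```cpp
#include <string>
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>
#endif

// Hypothetical helper: wrap plain text in a SAPI <rate> markup tag.
// Assumes `text` contains no characters that need XML escaping.
std::wstring WithRate(const std::wstring& text, int absSpeed) {
    return L"<rate absspeed=\"" + std::to_wstring(absSpeed) + L"\">" + text + L"</rate>";
}

#ifdef _WIN32
// Hypothetical helper: queue marked-up text on an existing voice.
// With async == true, Speak returns at once and rendering continues
// on a background thread; otherwise the call blocks until rendering ends.
HRESULT SpeakText(ISpVoice* voice, const std::wstring& markup, bool async) {
    DWORD flags = SPF_IS_XML | (async ? SPF_ASYNC : SPF_DEFAULT);
    return voice->Speak(markup.c_str(), flags, NULL);
}
#endif
```

A caller that speaks asynchronously, for example SpeakText(voice, WithRate(L"Hello", 3), true), can later block on ISpVoice::WaitUntilDone(INFINITE) before releasing the voice.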

ISpVoice::SpeakStream is similar to Speak, but it adds streams of text or audio data to the rendering queue.
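One way to queue an audio stream is sketched below, assuming an existing ISpVoice pointer and a WAV file on disk.  The helper name SpeakWavFile is hypothetical.  The sketch binds an SpStream object to the file with ISpStream::BindToFile and passes it to SpeakStream; passing NULL for the format is assumed to let SAPI read the format from the WAV file itself.

```cpp
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>

// Hypothetical helper: queue an existing WAV file on the voice via SpeakStream.
HRESULT SpeakWavFile(ISpVoice* voice, const wchar_t* path) {
    ISpStream* stream = NULL;
    HRESULT hr = CoCreateInstance(CLSID_SpStream, NULL, CLSCTX_ALL,
                                  IID_ISpStream, (void**)&stream);
    if (FAILED(hr)) return hr;

    // Open the file read-only; no explicit format is supplied.
    hr = stream->BindToFile(path, SPFM_OPEN_READONLY, NULL, NULL, 0);

    // SPF_DEFAULT makes the call synchronous, so the stream can be
    // released as soon as SpeakStream returns.
    if (SUCCEEDED(hr)) hr = voice->SpeakStream(stream, SPF_DEFAULT, NULL);

    stream->Release();
    return hr;
}
#endif
```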

Overriding Defaults

SAPI automatically uses the default voice and default audio output device unless the application specifies otherwise.  The application can control the output through ISpVoice::SetOutput.  The default voice can be overridden in one of two ways: the application can call ISpVoice::SetVoice, or it can speak a <VOICE> synthesis markup tag.
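Both ways of overriding the default voice can be sketched as follows.  The helper names WithVoice and SelectVoice are hypothetical, and any voice name passed to them (for example "Microsoft Zira Desktop") is an assumption about what is installed on the machine.  SelectVoice looks up a voice token through the SpObjectTokenCategory object and returns E_FAIL when nothing matches; that error code is a choice made for this sketch.

```cpp
#include <string>
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>
#endif

// Hypothetical helper: wrap text in a <voice> tag that requests a voice by name.
// Assumes neither argument contains characters that need XML escaping.
std::wstring WithVoice(const std::wstring& voiceName, const std::wstring& text) {
    return L"<voice required=\"Name=" + voiceName + L"\">" + text + L"</voice>";
}

#ifdef _WIN32
// Hypothetical helper: select the first installed voice that matches `attribs`
// (for example L"Name=Microsoft Zira Desktop") through ISpVoice::SetVoice.
HRESULT SelectVoice(ISpVoice* voice, const wchar_t* attribs) {
    ISpObjectTokenCategory* category = NULL;
    HRESULT hr = CoCreateInstance(CLSID_SpObjectTokenCategory, NULL, CLSCTX_ALL,
                                  IID_ISpObjectTokenCategory, (void**)&category);
    if (FAILED(hr)) return hr;

    IEnumSpObjectTokens* tokens = NULL;
    hr = category->SetId(SPCAT_VOICES, FALSE);
    if (SUCCEEDED(hr)) hr = category->EnumTokens(attribs, NULL, &tokens);
    if (SUCCEEDED(hr)) {
        ISpObjectToken* token = NULL;
        ULONG fetched = 0;
        hr = tokens->Next(1, &token, &fetched);
        if (hr == S_OK) {
            hr = voice->SetVoice(token);
            token->Release();
        } else if (SUCCEEDED(hr)) {
            hr = E_FAIL;  // enumeration succeeded but no voice matched
        }
        tokens->Release();
    }
    category->Release();
    return hr;
}
#endif
```

The markup route needs no token lookup: the string returned by WithVoice is passed to ISpVoice::Speak with the SPF_IS_XML flag.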

Audio Device Sharing

When an SpVoice object is rendering to an audio device (as opposed to a stream), it attempts to cooperate with other SpVoice objects that share the same device, based on the priority of each SpVoice.  By default, a voice is set to SPVPRI_NORMAL, which means that it waits until other voices in the system have finished before it begins rendering its input queue.  A voice set to SPVPRI_ALERT interrupts a normal priority voice by stopping the normal voice, rendering its own queue, and then restarting the normal priority voice.  An SpVoice with a priority of SPVPRI_OVER simply renders its data immediately, even if another voice is currently speaking, so both voices speak at the same time.

Applications can control the priority of a voice by calling ISpVoice::SetPriority.
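A short sketch of raising a voice to alert priority follows.  The helper name SpeakAlert is hypothetical, and the voice is assumed to have been created already.

```cpp
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>

// Hypothetical helper: speak `text` on a voice raised to alert priority, so it
// interrupts normal-priority voices that share the same audio device.
HRESULT SpeakAlert(ISpVoice* alertVoice, const wchar_t* text) {
    HRESULT hr = alertVoice->SetPriority(SPVPRI_ALERT);
    // Asynchronous call: the text is queued and Speak returns immediately.
    if (SUCCEEDED(hr)) hr = alertVoice->Speak(text, SPF_ASYNC, NULL);
    return hr;
}
#endif
```

Using a separate SpVoice instance for alerts leaves the application's normal-priority voice and its queue untouched.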

Rendering to Streams

The SpVoice can render data to any object that implements ISpStreamFormat, which is a simple derivative of the COM standard IStream.  The SpStream object is provided to allow easy conversion of existing IStreams to support ISpStreamFormat, or to read or write WAV or other files.  Applications can call ISpVoice::SetOutput to force the SpVoice to render to a stream.  When rendering to a stream, the voice renders the data as quickly as possible.
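Rendering to a WAV file might look like the sketch below.  The helper names MakePcmFormat and SpeakToWavFile are hypothetical, and the 22050 Hz, 16-bit, mono format is an arbitrary choice for illustration.  The sketch assumes an existing ISpVoice pointer and a program linked with ole32.lib and sapi.lib.

```cpp
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>

// Hypothetical helper: describe uncompressed PCM audio for the output stream.
WAVEFORMATEX MakePcmFormat(DWORD samplesPerSec, WORD bitsPerSample, WORD channels) {
    WAVEFORMATEX fmt = {};
    fmt.wFormatTag = WAVE_FORMAT_PCM;
    fmt.nChannels = channels;
    fmt.nSamplesPerSec = samplesPerSec;
    fmt.wBitsPerSample = bitsPerSample;
    fmt.nBlockAlign = (WORD)(channels * bitsPerSample / 8);
    fmt.nAvgBytesPerSec = samplesPerSec * fmt.nBlockAlign;
    return fmt;
}

// Hypothetical helper: render `text` into a WAV file instead of the audio device.
HRESULT SpeakToWavFile(ISpVoice* voice, const wchar_t* text, const wchar_t* path) {
    ISpStream* stream = NULL;
    HRESULT hr = CoCreateInstance(CLSID_SpStream, NULL, CLSCTX_ALL,
                                  IID_ISpStream, (void**)&stream);
    if (FAILED(hr)) return hr;

    WAVEFORMATEX fmt = MakePcmFormat(22050, 16, 1);
    hr = stream->BindToFile(path, SPFM_CREATE_ALWAYS, &SPDFID_WaveFormatEx, &fmt, 0);
    if (SUCCEEDED(hr)) hr = voice->SetOutput(stream, TRUE);
    if (SUCCEEDED(hr)) hr = voice->Speak(text, SPF_DEFAULT, NULL);  // synchronous

    voice->SetOutput(NULL, TRUE);  // switch back to the default audio device
    stream->Close();
    stream->Release();
    return hr;
}
#endif
```

Because the voice renders to a stream as quickly as possible, the synchronous Speak call here typically returns much sooner than the spoken audio would take to play.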

Synthesis Events

The SpVoice implements the ISpEventSource interface.  It forwards events back to the application when the corresponding audio data has been rendered to the output device.  Examples of events include reaching a word boundary, speaking a phoneme, and reaching a bookmark.  Some applications can simply be notified when events occur and then call ISpVoice::GetStatus to determine the current state of the SpVoice object.  More complex applications may need to queue events.  See the documentation for ISpEventSource for information on setting interest in events.
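The queued-event approach can be sketched as follows, assuming an existing ISpVoice pointer.  The helper name CountSpokenWords is hypothetical.  It registers interest in word-boundary and end-of-input-stream events, speaks asynchronously, and drains the event queue each time the voice signals its Win32 notification event.

```cpp
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>

// Hypothetical helper: speak `text` asynchronously and count word-boundary
// events until the end of the input stream is reported.
HRESULT CountSpokenWords(ISpVoice* voice, const wchar_t* text, ULONG* wordCount) {
    *wordCount = 0;
    const ULONGLONG interest =
        SPFEI(SPEI_WORD_BOUNDARY) | SPFEI(SPEI_END_INPUT_STREAM);

    HRESULT hr = voice->SetInterest(interest, interest);
    if (SUCCEEDED(hr)) hr = voice->SetNotifyWin32Event();
    if (SUCCEEDED(hr)) hr = voice->Speak(text, SPF_ASYNC, NULL);

    bool done = false;
    while (SUCCEEDED(hr) && !done) {
        hr = voice->WaitForNotifyEvent(INFINITE);
        if (FAILED(hr)) break;

        SPEVENT ev;
        ULONG fetched = 0;
        // Drain the queue one event at a time. Neither event type requested
        // here carries a pointer payload, so nothing needs to be freed.
        while (voice->GetEvents(1, &ev, &fetched) == S_OK) {
            if (ev.eEventId == SPEI_WORD_BOUNDARY) ++*wordCount;
            if (ev.eEventId == SPEI_END_INPUT_STREAM) done = true;
        }
    }
    return hr;
}
#endif
```

An application that only needs a snapshot can skip the queue and call ISpVoice::GetStatus after being notified.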

How Created

Create the SpVoice object by calling ::CoCreateInstance with CLSID_SpVoice.
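A minimal creation sketch follows.  The helper name SayHello is hypothetical, and the program is assumed to be linked with ole32.lib and sapi.lib.  COM must be initialized on the calling thread before CoCreateInstance is called.

```cpp
#ifdef _WIN32  // SAPI is Windows-only; the guard keeps the sketch compilable elsewhere.
#include <windows.h>
#include <sapi.h>

// Hypothetical helper: create an SpVoice, speak one sentence, and release it.
HRESULT SayHello() {
    HRESULT hr = CoInitialize(NULL);
    if (FAILED(hr)) return hr;

    ISpVoice* voice = NULL;
    hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                          IID_ISpVoice, (void**)&voice);
    if (SUCCEEDED(hr)) {
        // Synchronous call on the default voice and default audio device.
        hr = voice->Speak(L"Hello world", SPF_DEFAULT, NULL);
        voice->Release();
    }

    CoUninitialize();
    return hr;
}
#endif
```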