Glossary
A
ACM (Audio Compression Manager)
Code typically used by an engine to convert PCM data to a different format.
active voice menu
A set of voice commands that can be recognized.
archiving
Storing copies of programs and data to protect against loss.
asleep state
The state in which an application listens to each sound, but responds only to commands on the sleep menu. See also awake state.
audio destination
A device such as an audio speaker or the telephone over which text is played as speech. An audio-destination object is an OLE COM object that supports audio communication interfaces in common with a text-to-speech engine.
audio signal
An electrical signal with varying voltage that becomes sound when amplified and converted to vibrations played by an audio speaker.
audio source
A device such as a microphone or telephone that provides audio data for speech recognition. An audio-source object is an OLE COM object that supports audio communication interfaces in common with a speech recognition engine.
awake state
The state in which an application recognizes and executes commands on active voice menus. See also asleep state.
B
bookmark
A marker embedded in an audio recording that can be used to locate and play back an audio segment.
C
complete-phrase value
The number of milliseconds that the engine waits after the user has stopped speaking before regarding a phrase as complete.
component object
An object defined according to the OLE Component Object Model (COM). A component object has a set of interfaces that communicate with the object, data associated with an instance of the object at run-time, and the ability to support multiple instances of the object running at the same time.
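The following minimal C++ sketch illustrates the pattern; it is a simplified stand-in rather than real OLE code, omitting QueryInterface and the Windows headers so that it stays self-contained:

    #include <cstdio>

    // An interface: a set of semantically related functions (see "interface").
    struct ISpeakSimple {
        virtual long AddRef()  = 0;
        virtual long Release() = 0;              // deletes the object at zero
        virtual void Speak(const char* text) = 0;
        virtual ~ISpeakSimple() = default;
    };

    // The component object: per-instance run-time data plus the interface.
    class SpeakObject : public ISpeakSimple {
        long m_refs = 1;                         // reference count for this instance
    public:
        long AddRef() override  { return ++m_refs; }
        long Release() override {
            long n = --m_refs;
            if (n == 0) delete this;             // the object controls its lifetime
            return n;
        }
        void Speak(const char* text) override { std::printf("speaking: %s\n", text); }
    };

    int main() {
        ISpeakSimple* obj = new SpeakObject();   // clients hold interface pointers,
        obj->Speak("hello");                     // never the concrete class
        obj->Release();                          // release the last reference
    }

Multiple instances can run at the same time because each SpeakObject carries its own reference count and data.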
COM (Component Object Model)
See OLE Component Object Model.
context-free grammar
Uses rules that predict the words that might follow the word just spoken, reducing the number of candidates that need to be evaluated to recognize the next word.
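For example, a small command grammar might look like the following (a generic BNF-style illustration, not any engine's actual grammar syntax):

    <command> = <action> <object>
    <action>  = "open" | "close" | "print"
    <object>  = "the file" | "the window"

After the speaker says "open", the engine need only evaluate the words that can begin <object> rather than its entire vocabulary.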
continuous speech
A continuous utterance without pauses between words. Some speech recognition engines can recognize continuous speech.
D
degradation
A reduction in quality or performance of a communications channel.
deterioration
The gradual loss of data stored by a speech recognition results object. The information in a results object can occupy a significant amount of memory, so an engine developer may permit the object to discard data automatically as time passes.
dictation grammar
Defines a context for the speaker by identifying the subject of the dictation, the expected style of language, and what dictation has already been done.
digital-audio format
An audio format in which sound is represented as binary or numeric data.
digital-audio stream
Continuous audio data received from or sent to an audio device.
Digital Signal Processor (DSP)
A microprocessor tailored to a particular type of operation. Applications involving communications, compression and audio are performed more efficiently on a DSP than on the host computer.
diphone
A sound consisting of two phonemes: one that leads into the sound and one that finishes the sound. For example, the word "hello" consists of these diphones: silence-h, h-eh, eh-l, l-oe, and oe-silence.
diphone concatenation
The text-to-speech engine concatenates short digital-audio segments and performs intersegment smoothing to produce a continuous sound.
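The following C++ sketch shows one smoothing approach, assuming the segments are 16-bit PCM buffers and that a linear crossfade stands in for the engine's real intersegment smoothing (a real engine also matches pitch and energy across the joint):

    #include <cstdint>
    #include <vector>

    // Joins segment a to segment b, blending the last `overlap` samples of a
    // with the first `overlap` samples of b. Assumes overlap is no larger
    // than either segment.
    std::vector<int16_t> concat_smooth(const std::vector<int16_t>& a,
                                       const std::vector<int16_t>& b,
                                       size_t overlap)
    {
        std::vector<int16_t> out(a.begin(), a.end() - overlap);
        for (size_t i = 0; i < overlap; ++i) {
            double t = double(i) / overlap;               // fade weight, 0 -> 1
            double mixed = (1.0 - t) * a[a.size() - overlap + i] + t * b[i];
            out.push_back(int16_t(mixed));
        }
        out.insert(out.end(), b.begin() + overlap, b.end());
        return out;
    }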
discrete speech
Every word must be isolated by a pause before and after it (usually about a quarter of a second) in order for the engine to recognize it.
DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.
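The following C++ sketch generates one such tone as 16-bit PCM using the standard DTMF frequency assignments; the 8 kHz sampling rate and amplitude scaling are illustrative choices:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    std::vector<int16_t> dtmf_tone(double lowHz, double highHz,
                                   double seconds, int sampleRate = 8000)
    {
        const double pi = 3.14159265358979;
        std::vector<int16_t> pcm;
        for (int n = 0; n < int(seconds * sampleRate); ++n) {
            double t = double(n) / sampleRate;
            double s = 0.5 * std::sin(2 * pi * lowHz * t)    // low-frequency tone
                     + 0.5 * std::sin(2 * pi * highHz * t);  // high-frequency tone
            pcm.push_back(int16_t(s * 16000));               // scale into 16 bits
        }
        return pcm;
    }

    // The key "5", for example, combines 770 Hz and 1336 Hz:
    //   std::vector<int16_t> five = dtmf_tone(770.0, 1336.0, 0.2);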
E
echo canceling
A method of controlling echoing on communication lines, in which the sender checks the inbound channel for a slightly delayed duplicate of its own transmission. In echo canceling, the sender adds an appropriately modified, reversed version of its transmission to the path on which it receives information. The result is to erase the echo electronically but leave incoming data intact.
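The following C++ sketch shows the core subtraction, assuming the echo is a single delayed, attenuated copy of the outbound signal; a real canceler estimates the delay and gain adaptively rather than taking them as parameters:

    #include <vector>

    // Removes an echo of the outbound signal from the inbound samples.
    void cancel_echo(std::vector<double>& inbound,          // what was received
                     const std::vector<double>& outbound,   // what was transmitted
                     long delaySamples, double echoGain)
    {
        for (long n = 0; n < long(inbound.size()); ++n) {
            long k = n - delaySamples;                  // where the echo originated
            if (k >= 0 && k < long(outbound.size()))
                inbound[n] -= echoGain * outbound[k];   // erase the echo, leaving
        }                                               // incoming data intact
    }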
energy floor
See noise floor.
engine
A program that does the actual work of recognizing speech or translating text into speech. Most speech recognition engines convert incoming audio data to engine-specific phonemes, which are then translated into text for use by an application. A text-to-speech engine performs the same process, only in reverse. An engine object is an OLE COM object that represents a mode of a speech recognition or text-to-speech engine.
engine enumerator
Enumerates the speech recognition or text-to-speech modes supported by a particular engine.
engine-specific phoneme character set
A character set that describes phonemes, pauses, and so on, and that is specific to a text-to-speech engine.
F
frequency
The rate of vibration or oscillation, measured in hertz (Hz). The normal human ear can detect sounds ranging from 20 Hz to 20,000 Hz.
G
gain
The increase in signaling power, measured in decibels (dB), that occurs as the signal is boosted by an electronic device.
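For example, by the standard decibel formulas, a device that doubles signal voltage provides 20 log10(2), or about 6 dB, of gain; a device that doubles signal power provides 10 log10(2), or about 3 dB.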
global voice menu
A voice menu that is active all of the time regardless of which window is in the foreground.
grammar
A set of words and phrases that can be recognized by an engine. A grammar object is an OLE COM object that an application uses to control how an engine uses the grammar to recognize speech.
GUID
A globally unique identifier assigned to an interface or object.
I
incomplete-phrase value
The number of milliseconds that the speech recognition engine waits after the user has stopped speaking before discarding an incomplete phrase.
interface
A set of semantically related functions that an application can call to perform the actions defined for that interface.
interference
Noise or other external signals that affect the performance of a communications channel; also, the electromagnetic signals generated by electronic devices, such as computers, that can disturb radio or television reception.
IPA (International Phonetic Alphabet)
A standard system for indicating specific sounds, first introduced in 1886. The Unicode character set includes all single symbols and diacritics in the most recent revision of the IPA, which occurred in 1989, as well as a few IPA symbols no longer in use.
L
lexicon
See pronunciation lexicon.
limited-domain grammar
Provides a set of words to recognize without using strict syntax structures. A limited-domain grammar is a hybrid between a context-free grammar and a dictation grammar.
localization
Adaptation of a software package from English to the needs of a foreign country.
M
marshaling
The packaging and transfer of interface parameters across process boundaries. If an instance uses a separate process space from that of the application that invokes it, its data must be marshaled across the process boundary. Each interface contains marshaling code that allows its parameters to be transmitted across that boundary.
matching techniques
The methods by which the engine matches a detected word to known words in its vocabulary.
N
node
A word or phoneme on a recognition path in a recognition/alternative graph generated by an engine.
noise
Any interference that affects the operation of a device. In communications, noise consists of random electronic signals, produced either naturally or by the circuitry, that degrade the quality or performance of a communications channel.
noise floor
The noise value in the signal-to-noise (SNR) ratio for an environment. In general, the higher the noise floor, the more sensitive the engine will be to background noise.
notification sink
Similar to a callback function, except the sink is implemented as an interface with a set of functions rather than as a single function.
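The following C++ sketch shows the pattern with hypothetical names (IMySink, OnPhraseStart, OnPhraseFinish) rather than any actual SAPI interface:

    #include <cstdio>

    // The sink interface: several related notification functions, where a
    // callback would offer only one.
    struct IMySink {
        virtual void OnPhraseStart() = 0;
        virtual void OnPhraseFinish(const char* text) = 0;
        virtual ~IMySink() = default;
    };

    // The application implements the sink...
    class MySink : public IMySink {
    public:
        void OnPhraseStart() override { std::printf("listening...\n"); }
        void OnPhraseFinish(const char* text) override {
            std::printf("recognized: %s\n", text);
        }
    };

    // ...and hands an IMySink* to the engine, which then calls back through
    // whichever member function fits the event as recognition proceeds.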
O
OLE Component Object Model (COM)
A specification that defines a binary standard for OLE object implementation independent of programming language.
P
PCM (pulse code modulation)
The most common method of encoding an analog voice signal into a digital bit stream. First, the amplitude of the voice signal is sampled at regular intervals. Then each sample is coded into binary data, which can be switched, transmitted, and stored digitally.
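The following C++ sketch shows the two steps, assuming the analog signal is modeled as a function of time and the output is 16-bit signed PCM:

    #include <cstdint>
    #include <functional>
    #include <vector>

    std::vector<int16_t> encode_pcm(const std::function<double(double)>& signal,
                                    double seconds, int sampleRate)
    {
        std::vector<int16_t> pcm;
        for (int n = 0; n < int(seconds * sampleRate); ++n) {
            double amplitude = signal(double(n) / sampleRate);   // step 1: sample
            if (amplitude >  1.0) amplitude =  1.0;              // clip to range
            if (amplitude < -1.0) amplitude = -1.0;
            pcm.push_back(int16_t(amplitude * 32767));           // step 2: code
        }
        return pcm;    // one second at 22,050 Hz, 16-bit mono = 44,100 bytes
    }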
perplexity
The number of choices at a given node in a recognition path.
phoneme
The smallest structural unit of sound in any language that can be used to distinguish one word from another.
phrase
An ordered list of words that are spoken in the same utterance.
pitch
The tone of a sound, which generally is determined by the sound's frequency. A high-pitched sound has a higher frequency; a low-pitched sound has a lower frequency.
pronunciation lexicon
A database of pronunciations maintained by a speech recognition or text-to-speech engine. An engine may allow an application to collect new or corrected pronunciations from the end-user.
pronunciation rule
A rule followed by a text-to-speech engine to convert text into phonemes.
prosody
The inflection, timing and accent of speech.
R
recognition mode
Each speech recognition engine supports one or more recognition modes, each of which conforms to a different code set or data set. For example, each language (or dialect) supported by the engine has a different mode.
recognition path
A sequence of words or phonemes that an engine analyzes while attempting to recognize an utterance.
recognition rule
A rule followed by a speech recognition engine using a context-free grammar to recognize speech.
recognition/alternative graph
A graph generated by a speech recognition engine that depicts the recognition paths explored by the engine in recognizing an utterance.
recursion
The number of levels of rules in a context-free grammar.
registry
The database in which configuration information is stored. The database takes the place of most configuration and initialization files for Microsoft® Windows® and new Windows-based programs.
results object
See speech recognition results object.
rules
See pronunciation rule and recognition rule.
S
SAPI
Microsoft Speech application programming interface. A set of routines, protocols, and tools that enable programmers to build speech-enabled applications for Microsoft Windows platforms.
SNR (signal-to-noise ratio)
The amount of power, measured in decibels (dB), by which a signal exceeds the amount of channel noise at the same point of transmission. It provides an indication of the clarity or accuracy with which communication can take place.
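The following C++ sketch shows the standard calculation, assuming average signal power and noise power can be measured from separate sample buffers:

    #include <cmath>
    #include <vector>

    double mean_power(const std::vector<double>& x) {
        double sum = 0.0;
        for (double s : x) sum += s * s;
        return x.empty() ? 0.0 : sum / x.size();
    }

    // SNR in dB is 10 * log10(Psignal / Pnoise); a signal with 100 times
    // the noise power has an SNR of 20 dB.
    double snr_db(const std::vector<double>& signal,
                  const std::vector<double>& noise)
    {
        return 10.0 * std::log10(mean_power(signal) / mean_power(noise));
    }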
speaker
The end-user who utters the speech to be recognized by an application. Training performed by a speaker may be stored in a speaker profile.
speaker-adaptive
The engine trains itself to recognize the user's voice while the user performs ordinary tasks.
speaker-dependent
The engine requires the user to train it to recognize his or her voice.
speaker-independent
The engine does not require training. Speaker-independent engines typically start with an accuracy above 95 percent for most users (those who speak without accents).
speaker profile
All of the information the engine has about the speaker, such as a data header, languages for which training has been done, known patterns of speech and the language model, how specific words are pronounced, phonetic training, speaker ID, and speaker preferences.
speech recognition
The ability of a computer to understand the spoken word for the purpose of receiving command and data input from the speaker.
speech-recognition engine
An OLE Component Object Model dynamic-link library (DLL) or executable file (.exe) that performs recognition from a digital-audio stream. Speech recognition engines are supplied by vendors who specialize in the software.
speech-recognition enumerator
Enumerates the engines that are available to an application.
speech-recognition mode
An engine typically provides an assortment of modes that can be used to recognize speech in different languages, dialects, and audio-sampling rates.
speech-recognition results object
Provides detailed information about a speech recognition event.
speech-recognition sharing object
Enumerates shared engine-audio source pairs, or creates new ones.
subword matching
The engine looks for subwords—usually phonemes—and then performs further pattern recognition on those.
synthesis
The text-to-speech engine synthesizes the glottal pulse produced by the human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape and tongue position.
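The following C++ sketch is a minimal source-filter illustration, assuming a 100 Hz pulse train for the glottal source and a single two-pole resonator standing in for the engine's filters; a real engine applies many such filters with time-varying parameters:

    #include <cmath>
    #include <vector>

    std::vector<double> synthesize(double formantHz, double bandwidthHz,
                                   int sampleRate = 16000, int nSamples = 16000)
    {
        const double pi = 3.14159265358979;
        // Coefficients of a standard two-pole digital resonator.
        double r  = std::exp(-pi * bandwidthHz / sampleRate);
        double b1 = 2 * r * std::cos(2 * pi * formantHz / sampleRate);
        double b2 = -r * r;

        std::vector<double> out(nSamples, 0.0);
        int pulsePeriod = sampleRate / 100;                      // 100 Hz pitch
        for (int n = 0; n < nSamples; ++n) {
            double source = (n % pulsePeriod == 0) ? 1.0 : 0.0;  // glottal pulse
            double y = source;
            if (n >= 1) y += b1 * out[n - 1];                    // resonator
            if (n >= 2) y += b2 * out[n - 2];                    // feedback
            out[n] = y;
        }
        return out;
    }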
T
tags
See text-to-speech control tags.
TAPI
Microsoft Telephony application programming interface. A set of routines, protocols, and tools that enable programmers to build telephony applications for Microsoft Windows platforms.
Telephony
Refers to computer hardware and software that perform functions traditionally performed by telephone equipment, such as voice mail or fax services.
text-to-speech
Technologies for converting textual (ASCII) information into synthetic speech output. Used in voice-processing applications requiring production of broad, unrelated, and unpredictable vocabularies, such as products in a catalog or names and addresses. This technology is appropriate when system design constraints prevent the more efficient use of speech concatenation alone.
text-to-speech control tags
Instructions that can be embedded in text sent to a text-to-speech engine to improve the prosody of the spoken text.
text-to-speech engine
An OLE Component Object Model dynamic-link library (DLL) or executable file (.exe) that provides functionality for converting text to digital-audio speech. Text-to-speech engines are supplied by vendors who specialize in the software.
text-to-speech enumerator
Enumerates the text-to-speech modes provided by all of the engines available to the application.
text-to-speech mode
Analogous to voice quality or personality. Every text-to-speech mode is different, and each allows for different properties such as timbre, accent, language and digital-audio sampling rate.
threshold
The point below which an utterance is rejected as unrecognized.
training
The process of speaking a series of pre-selected phrases for the engine. This provides the engine with more information about the voice of the speaker and can improve speech recognition.
U
Unicode
A 16-bit character set that replaces ASCII and allows any character from any language to be represented in a text string. The Unicode character set contains a subset for International Phonetic Alphabet (IPA) phonemes.
utterance
Anything heard by the engine as a finite series of sounds that the engine attempts to recognize as speech.
V
vocabulary
A set of words used in a grammar. A speech recognition engine typically supports several different sizes of vocabulary, which determine the words that the engine can recognize in a given state.
voice command
A word or phrase associated with a voice menu. When an engine recognizes a voice command, it notifies the application that owns the voice menu containing the command.
Voice Command site
A speech recognition mode and audio source that together serve as a source of Voice Command input.
voice menu
A list of voice commands to which an application can respond. A voice menu must be active before an engine can recognize its commands.
voice-text site
A text-to-speech mode and an audio destination that together serve as a destination for Voice Text output.
VU (Volume Units) Meter
An indicator that displays the volume of sound being received by the microphone or through the line-in port. Optimum reception is achieved when the meter registers in the middle area.
W
whole-word matching
The engine compares the incoming digital-audio signal against a prerecorded template of the word.
word
An atomic Unicode text string. A "word" can contain several vernacular words (such as "Los Angeles") when those words are always used together.
word separation
The degree of isolation between words required for the engine to recognize a word.
word spotting
The engine picks out a particular word or phrase from a series of words spoken in a continuous utterance, ignoring the surrounding speech.