
Recording Audio

Several aspects of the recording process affect how well prompts turn out. The actor needs good direction. Numbers and other recorded elements that will be concatenated need particular attention so that they sound natural. After the recording session or sessions, the audio files must be edited, and careful preparation makes that editing more efficient. There are also considerations for post-production and for deployment of the audio files.

Directing the Actor

The first session with the voice actor should help the actor to get into character. Is the character an assistant, a sales representative, an agent, a navigator? The back story is excellent for this purpose. The actor needs to know the back story, plus any other insights and ideas about the character.

The actor needs a sense of place. Is he or she in a call center somewhere or in an office adjacent to their employer's, with access to all of the company's information? Or is he or she in the passenger seat of a car, reading maps for the user?

Some voices are more difficult to imagine. We all have a clear idea of what a receptionist might sound like, depending on the corporate identity and style of the company and the character's description. But what does something less obvious sound like: the voice of a microwave, a bank machine, or a car? Designers will break more and more creative ground as they start to give voices to appliances, automated parking attendants and machines such as ATMs.

It is important for the actor to warm up with some top-level conversational prompts. It is helpful to work through some conversational statements before the more exacting work of recording the error handling prompts and concatenated speech. It may be necessary to try different microphones to find the one best suited to the actor's voice. Sometimes a less expensive microphone can fit with an actor's voice in a way that just works. As a rule, it is a good idea to record the voice with minimal processing, or flat in recording parlance. Any post-processing, tone-equalizing, down-sampling or other techniques will be applied during the post-production phase. When the actor is warmed up and in character and any audio issues have been straightened out, the actor is ready to record.

There are some key elements to monitor carefully during recording, so that the prompts will match and make sense, and make using the system a pleasant and productive experience. These elements include pronunciation, amplitude, emotional intensity and pacing.

Recording Numbers and Concatenated Speech

All systems benefit from the use of prerecorded speech. However, systems often must read variable data that is unpredictable, and in those cases they must rely on text-to-speech (TTS). Recent advances in TTS technology have greatly improved its quality, but it is not yet, and may never be, equal to the expressiveness of the human voice.

Speech systems that provide voice access to databases of information often rely on concatenation of prerecorded speech. Carefully planned libraries of prerecorded speech fragments can handle many types of variable data.

Numbers are a good example. Consider a simple phone number: 212-555-1203. Using the most basic scheme, any phone number can be delivered with ten prerecorded digits: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. However, a phone number recorded in this manner sounds robotic and inhuman. One reason is that, as humans, we infuse our spoken language with prosody, or expressive variations in intonation that add predictability and meaning to our phrases.

Recording an additional ten digits spoken with a downward inflection can greatly improve the concatenation. Here is the twenty-digit scheme for phone numbers:

  • rising intonation: 0,1,2,3,4,5,6,7,8,9
  • falling intonation: 0,1,2,3,4,5,6,7,8,9

So the telephone number in the previous example would be concatenated as:

  • (rising) 2 1 (falling) 2
  • (rising) 5 5 (falling) 5
  • (rising) 1 2 0 (falling) 3
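The twenty-digit scheme can be sketched in a few lines of Python. This is only an illustration: the function name `digit_prompts` and the file naming pattern (`2_rising.wav`, `3_falling.wav`) are assumptions made for the example, not names used by any particular prompt tool.

```python
def digit_prompts(phone_number: str) -> list[str]:
    """Map a hyphen-separated phone number to twenty-digit-scheme prompt files.

    The file names are hypothetical; substitute whatever the prompt
    library actually uses.
    """
    prompts = []
    for group in phone_number.split("-"):
        if not group.isdigit():
            raise ValueError(f"not a digit group: {group!r}")
        for position, digit in enumerate(group):
            # The last digit of each group falls; every other digit rises.
            is_last = position == len(group) - 1
            intonation = "falling" if is_last else "rising"
            prompts.append(f"{digit}_{intonation}.wav")
    return prompts


print(digit_prompts("212-555-1203"))
```

Each group ends on a falling digit, which matches the three lines of the example above.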

With the addition of digit pairs designed for mid- and end-phrase use, the prosody of the concatenation is improved. Another feature of continuous human speech is the tendency to run words together. Speakers merge discrete word boundaries into slurred phrases so that "six seven eight" may sound more like "sick seva nate." The only way to simulate this slurring with concatenated speech is through brute force: recording every example of slurred speech that the system may need.

By recording 1,000 three-digit combinations (000 through 999) and the numbers 00 through 99 with both rising and falling intonations, a system can reproduce any U.S. phone number. This method sounds almost completely natural, and any number is assembled by concatenating just four files:

[212] [555] [rising 12] [falling 03]
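The four-file scheme can be sketched the same way. As before, the function name `phone_number_prompts` and the file names are assumptions for illustration; the sketch also assumes the number arrives in the form NNN-NNN-NNNN.

```python
def phone_number_prompts(phone_number: str) -> list[str]:
    """Split a U.S. phone number into the four files of the scheme above.

    File names are hypothetical: three-digit recordings are named by
    their digits, two-digit recordings carry an intonation suffix.
    """
    parts = phone_number.split("-")
    if [len(p) for p in parts] != [3, 3, 4] or not all(p.isdigit() for p in parts):
        raise ValueError("expected a number in the form NNN-NNN-NNNN")
    area, exchange, line = parts
    return [
        f"{area}.wav",               # three-digit recording, e.g. 212
        f"{exchange}.wav",           # three-digit recording, e.g. 555
        f"{line[:2]}_rising.wav",    # first pair, mid-phrase intonation
        f"{line[2:]}_falling.wav",   # last pair, end-phrase intonation
    ]


print(phone_number_prompts("212-555-1203"))
```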

When recording mid- and end-phrase intonations, it may be useful to use a carrier phrase. This is a spoken word that precedes or follows the phrase and helps the actor use the correct intonation. Often the name "Patrick" is used because it begins and ends with hard consonants that cannot be slurred. The carrier phrase is then removed from the prompt after recording.

When recording files for concatenation, it is best to make test edits with a representative sample of the intonation to ensure that the pacing and pitch will splice together well and sound as natural as possible.

The Recording Script

The recording script lists the individual prompts, organized by dialogue, and is often derived from the dialogue specification. The script should be printed in text large enough for the voice actor to read easily. The recording script also serves as a map, or take sheet, to the actual recordings.

During the session, the script is usually recorded in order. By working down the script, the files are generated sequentially. It saves a step if the filenames and takes are annotated as the recording proceeds.

There are a number of software packages well suited to prompt recording, including Speech Prompt Editor in the Speech Application SDK (SASDK). Look for sound quality and the ability to manage large numbers of files. A soundproofed room is also helpful.

File Map

After the recording is complete, the filename of each take must be noted on the recording script. Often, there are multiple takes of a prompt. By keeping careful notes of the recordings, the audio producer or editor can choose the best take or combination of takes. Eventually, this file map can be used to create the import script that loads the text of the recordings, the extractions, and the correct audio files into the prompt database in the SASDK.
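As a rough sketch of how a file map might be kept in machine-readable form, the example below reads a take sheet and keeps the chosen take for each prompt. Everything here is an assumption for illustration: the CSV columns, the prompt IDs, the sample text and the filenames are hypothetical, and the sketch does not reproduce the SASDK import format.

```python
import csv
import io

# Hypothetical take sheet: one row per take, with the editor's choice marked.
TAKE_SHEET = """prompt_id,text,take,filename,selected
welcome,Welcome to the service.,1,welcome_t1.wav,no
welcome,Welcome to the service.,2,welcome_t2.wav,yes
goodbye,Goodbye.,1,goodbye_t1.wav,yes
"""


def build_file_map(sheet_csv: str) -> dict[str, dict[str, str]]:
    """Return the selected take for each prompt, keyed by prompt ID."""
    file_map = {}
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        if row["selected"] == "yes":
            file_map[row["prompt_id"]] = {
                "text": row["text"],
                "filename": row["filename"],
            }
    return file_map


for prompt_id, entry in build_file_map(TAKE_SHEET).items():
    print(prompt_id, entry["filename"], entry["text"])
```

A structure like this keeps the prompt text and the chosen audio file together, which is the information an import script would need.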

See Also

Audio Production