February 2011

Volume 26 Number 02

UI Frontiers - Sound Recording in Windows Phone 7

By Charles Petzold | February 2011

In one of the very first print advertisements for the introduction of the Macintosh in 1984, Apple touted the design of its mouse with an exceptionally compelling observation: “Some mice have two buttons. Macintosh has one. So it’s extremely difficult to push the wrong button.”

This isn’t entirely true, of course. Overloading a single button with multiple functions can be just as confusing as multiple buttons. But the impossibility of pushing the wrong button is certainly a persuasive argument for simplicity in UI design.

Stripping down the UI to essentials is even more important when programming for a smartphone. Phones aren’t very large. They simply can’t have a lot of buttons, and the fingers that push these buttons are not as accurate as a mouse. Too many buttons means that it’s easier than ever to push the wrong one.

On the minus side, limiting a UI often limits the functionality of the program, so deciding where to draw the line can be a real struggle. Life is full of compromises.

Design Evolution

I thought it would be fun to write a Windows Phone 7 program that allows recording short vocal memos, such as “Remember to pick up the dry cleaning” and “Had a great idea for a movie: Boy meets girl.”

Such a program is useful, of course, and provides yet another excuse to show off our new Windows Phones by using them in public places. More importantly for me, I thought it would be a great opportunity to get some hands-on experience using the sound recording and playback classes supported by the phone.

However, the program design turned out to be more problematic than I had anticipated. Even before I’d written a single line of code, the program went through several iterations of design and redesign in my head.

At first, I thought it would be fine to have just two buttons labeled Record and Play, both of which functioned as toggles. Press the Record button to start recording and press it again to stop. The program saves the audio data in isolated storage. Press the Play button to play it back. Each press of the Record button replaces the previous memo so the program doesn’t need a Delete button.

I even toyed around with reducing the program to just a Play button by implementing a voice-activation feature! The program would record continuously and only save the data when it contained some sounds. But differentiating background sounds from real voice data without introducing some kind of manual threshold setting seemed like a devilishly difficult job. I abandoned the single-button design.

My original plan was fine for one memo, but not for multiple memos. I then thought that the program would maintain a single audio file and tack each new memo on the end of the previous memos. Because it’s all just one big file, the Play button would play back all the memos in sequence. Of course, the program can’t let this file grow indefinitely, so this design definitely needs a Delete button that wipes out the entire file and, consequently, all memos.

No, this wasn’t good. I really needed to maintain separate files for each memo and allow these memos to be deleted individually. But that implied some way to present all the separate files to the user for playback and deletion, and all of a sudden the program got much more complex. I definitely needed a ListBox, and some way to identify each memo to the user, perhaps with user-supplied keywords or—horrors upon horrors—an actual file name.

No, no, no, not that! I glanced over at my telephone answering machine. Each call or memo is recorded separately, but they’re numbered on a simple display. The Play button is complemented with Previous and Next transport buttons to go to the previous call or next call. As each memo or call is deleted, however, the remaining ones are renumbered. I knew I didn’t want to number the memos, but I could take advantage of the larger display on the phone to show more detail about each one, including the record date, the duration and the file size.

The real breakthrough came when I realized I could put the ListBox on the program’s main screen and use it not only for selection, but for playback as well.

Using the Program

My final design was, of course, a compromise between ultimate simplicity and a complete memo-management system. The downloadable SpeakMemo project is written for Silverlight for Windows Phone and requires the Windows Phone 7 Development Tools. You can run the program on the phone emulator, and it will appear to be working fine, but it won’t actually record or play back any sounds.

The first time you run the SpeakMemo program, it displays the screen shown in Figure 1.

Figure 1 The Initial SpeakMemo Screen

One button! Or, at least one enabled button on a fairly uncluttered screen. The button shows how much space is available in isolated storage and how much recording time that space corresponds to. (No, the program will not allow you to record a memo 17 hours in length!)

Press the Record button and it changes to a flashing red display with an updating duration indicator, as shown in Figure 2.

Figure 2 SpeakMemo While Recording

Press the Record button again, and the recorded memo shows up on the screen with the date and time, duration, storage space and Play button, as shown in Figure 3.

Figure 3 SpeakMemo with One Memo

Of course, you can press the Play button to play it back, and the button toggles between Play and Pause modes.

It might not be so obvious with only one memo, but the recorded memos are stored in a ListBox in reverse chronological order as shown in Figure 4, so as you accumulate many of them, you can scroll through and play them individually.

Figure 4 The SpeakMemo ListBox

One of the powerful features of Silverlight is the DataTemplate that lets you define the appearance of items in a ListBox. This DataTemplate can include other controls, such as buttons. I was pleased to come up with a practical application of putting a Button in a DataTemplate.

You can also manage the collected memos by deleting individual ones. When a memo is selected, the Delete button is enabled. Perhaps inspired by putting a Button in a DataTemplate, I performed another Silverlight trick by putting two additional buttons inside the Delete button. These buttons become visible when you press Delete, and they perform the traditional confirmation function, as shown in Figure 5.

Figure 5 Confirming a Delete

Playing a memo causes it to be selected, but an item is not played when you select it by pressing on the area to the right of the Play button. The program lets you play one memo, record another and delete still another all at the same time.

The Phone and Sound

At one time Windows Phone 7 was supposed to have some of the speech recognition and synthesis support found in the Microsoft .NET Framework System.Speech namespaces. Perhaps you’ll see that support in the future.

Until then, you can capture sound from the phone’s microphone and play it back through the phone’s speaker using classes in the Microsoft.Xna.Framework.Audio namespace. These are XNA classes, but you can also use them in Silverlight programs. To use XNA classes in a Silverlight project, simply add a reference to Microsoft.Xna.Framework.dll to the project’s references and ignore the warning message.

The classes in the Microsoft.Xna.Framework.Audio namespace are entirely separate from those in the Microsoft.Xna.Framework.Media namespace. The Media namespace contains classes for playing music from the phone’s music library, which consists of compressed audio files in MP3 or WMA format that become objects of type Song. I show how to access the music library in Chapter 18 of my book, “Programming Windows Phone 7” (Microsoft Press, 2010), which can be downloaded for free from bit.ly/dr0Hdz. In a blog entry on my Web site, I also demonstrate how to play MP3 or WMA files that are stored within the program itself, or which are downloadable over the Internet (bit.ly/ea73Fz).

In contrast, classes in the Microsoft.Xna.Framework.Audio namespace work with uncompressed audio data in the standard PCM format, which is the same method used for audio CDs and Windows WAV files. With PCM, the analog sound amplitude is sampled at a uniform rate (usually in the range of 8,000 to 48,000 samples per second) and each sample is usually stored as an 8-bit or 16-bit value. The storage required for a particular sound is the product of the duration in seconds, the sample rate and the number of bytes per sample (multiply by two for stereo).

If you need speech-recognition support in your Windows Phone 7 application you must provide it yourself, most likely via a Web service. Similarly, a program that requires converting text to speech will probably use a Web service, or wait until the phone provides that support. The Microsoft Translator app for Windows Phone does this using the Microsoft Translator service (microsofttranslator.com). The code and documentation for the Translator Starter Kit is being released on MSDN (msdn.microsoft.com/library/gg521144(VS.92).aspx).

When using XNA audio services, a Silverlight program must call the static FrameworkDispatcher.Update method at approximately the same rate as the video refresh rate, which on Windows Phone 7 is approximately 30 times a second. There’s a description of how to do this in the article “Enable XNA Framework Events in Windows Phone Applications” within the XNA online documentation (msdn.microsoft.com/library/ff842408). In SpeakMemo, the XnaFrameworkDispatcherService class handles this job. This class is instantiated in the App.xaml file.
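
A minimal sketch of such a service follows. It is not the SpeakMemo source: the class name FrameworkDispatcherPump is hypothetical, and the 333,333-tick interval is simply one way to get roughly 30 calls per second from a DispatcherTimer.

```csharp
using System;
using System.Windows;
using System.Windows.Threading;
using Microsoft.Xna.Framework;

// Hypothetical name; SpeakMemo's own XnaFrameworkDispatcherService may differ.
public class FrameworkDispatcherPump : IApplicationService
{
    readonly DispatcherTimer timer = new DispatcherTimer();

    public FrameworkDispatcherPump()
    {
        // A tick is 100 ns, so 333,333 ticks is about 1/30 of a second.
        timer.Interval = TimeSpan.FromTicks(333333);
        timer.Tick += OnTimerTick;

        // Prime the XNA event queue once before the timer starts.
        FrameworkDispatcher.Update();
    }

    void OnTimerTick(object sender, EventArgs e)
    {
        FrameworkDispatcher.Update();
    }

    public void StartService(ApplicationServiceContext context)
    {
        timer.Start();
    }

    public void StopService()
    {
        timer.Stop();
    }
}
```

Because the class implements IApplicationService, it can be listed among the Application.ApplicationLifetimeObjects in App.xaml, so the timer starts and stops along with the application.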

Sound Recording

To record sound through the phone’s microphone, you use the Microphone class. You’ll probably obtain an instance of this class from the static Default property:

Microphone microphone = Microphone.Default;

Alternatively, the static All property provides a collection of Microphone objects, but then you’ll probably want to present the list to the user to select one.

The sample rate is fixed and is reported by the SampleRate property to be 16,000 samples per second. According to the Nyquist sampling theorem, this is suitable for recording sounds up to 8,000 Hz in frequency. This is fine for voice, but don’t expect great results with music. Each sample is 2 bytes wide and monaural, which means that each second of recorded sound requires 32,000 bytes, and each minute requires about 1.9MB.
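
The arithmetic is easy to check with a few lines of code. This is only a sketch; the PcmMath class and its BytesFor method are names made up for illustration.

```csharp
using System;

public static class PcmMath
{
    // Bytes needed for uncompressed PCM:
    // seconds x samples per second x bytes per sample x channels.
    public static long BytesFor(TimeSpan duration, int sampleRate,
                                int bytesPerSample, int channels)
    {
        long samples = (long)(duration.TotalSeconds * sampleRate);
        return samples * bytesPerSample * channels;
    }
}
```

With the phone's format (16,000 samples per second, 2 bytes per sample, one channel), BytesFor(TimeSpan.FromSeconds(1), 16000, 2, 1) gives 32,000 and BytesFor(TimeSpan.FromMinutes(1), 16000, 2, 1) gives 1,920,000.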

Microphone data is delivered to your program in buffers that are simply byte arrays. You’ll install a handler for the BufferReady event and then call Start to start recording. When the Microphone object fires the BufferReady event, your code calls GetData with a byte array. On return from GetData, the buffer has been filled with PCM data. When your program wants to stop recording, call GetData once more to get the last partial buffer; the method returns the number of bytes transferred to the array. Then call Stop.

The only option that Microphone allows you is specifying the size of the buffer that you pass to GetData. The BufferDuration property is a TimeSpan value that must be between 100 ms and 1,000 ms (one second) in increments of 10 ms. In SpeakMemo, I left it at the default value of 1,000 ms.

For your convenience, the Microphone class has two methods to convert between buffer sizes and time. Unfortunately these methods are a little confusing because the names refer to “sample.” The GetSampleDuration method basically divides a byte size by 32,000 and returns a TimeSpan indicating that many seconds. GetSampleSizeInBytes multiplies a TimeSpan duration in seconds by 32,000.

When SpeakMemo is recording, it accumulates multiple 32,000-byte buffers in a generic List collection. When recording stops, the program saves all the individual buffers to a file in isolated storage.
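
The recording sequence described above might be sketched like this. The MemoRecorder class is hypothetical and is not taken from SpeakMemo; it assumes that Microphone.Default is not null and that FrameworkDispatcher.Update is already being called regularly, because otherwise BufferReady never fires.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Xna.Framework.Audio;

// Hypothetical helper class, not SpeakMemo's actual code.
public class MemoRecorder
{
    readonly Microphone microphone = Microphone.Default;
    readonly List<byte[]> buffers = new List<byte[]>();

    public MemoRecorder()
    {
        microphone.BufferReady += OnBufferReady;
    }

    public void Start()
    {
        buffers.Clear();
        microphone.Start();
    }

    void OnBufferReady(object sender, EventArgs e)
    {
        StoreAvailableData();
    }

    // Returns every buffer captured since Start was called.
    public IList<byte[]> Stop()
    {
        StoreAvailableData();    // pick up the last partial buffer
        microphone.Stop();
        return buffers;
    }

    void StoreAvailableData()
    {
        // Allocate an array large enough for one full BufferDuration.
        byte[] buffer = new byte[
            microphone.GetSampleSizeInBytes(microphone.BufferDuration)];
        int count = microphone.GetData(buffer);

        if (count > 0)
        {
            // Keep only the bytes that were actually delivered.
            byte[] data = new byte[count];
            Buffer.BlockCopy(buffer, 0, data, 0, count);
            buffers.Add(data);
        }
    }
}
```

Allocating a fresh array for each buffer keeps the sketch short. A real program would probably reuse one scratch array for the GetData calls.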

Once I decided that I wouldn’t include a keyword feature to identify memos, I wanted the file to contain only the PCM data and not any supplementary information. However, I was quite startled to realize that the IsolatedStorageFile class in Silverlight for Windows Phone does not support the methods for accessing the file creation time or last write time, and I felt this information was crucial from the user’s perspective.

This meant that the file name itself would have to include the date and time. I first tried creating a file name from a DateTime object using the “s” and “u” formatting options, but that didn’t work. (Why it doesn’t work I’ll leave as a simple exercise for the reader.) I then fabricated my own file name string by piecing the various components of the date and time together.
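
One way to piece such a name together, and to get the DateTime back out of it later, is sketched below. The layout, the .pcm extension and the MemoFileName class are all assumptions made for illustration; the names SpeakMemo actually uses may look different.

```csharp
using System;
using System.Globalization;

public static class MemoFileName
{
    // Hypothetical layout, for example "2011-02-14 09-05-30.pcm".
    const string Pattern = "yyyy-MM-dd HH-mm-ss";
    const string Extension = ".pcm";

    // Builds the name from the individual date and time components.
    public static string FromDateTime(DateTime dt)
    {
        return String.Format(CultureInfo.InvariantCulture,
            "{0:D4}-{1:D2}-{2:D2} {3:D2}-{4:D2}-{5:D2}{6}",
            dt.Year, dt.Month, dt.Day, dt.Hour, dt.Minute, dt.Second,
            Extension);
    }

    // Recovers the recording time from a name built by FromDateTime.
    public static DateTime ToDateTime(string fileName)
    {
        string stem = fileName.Substring(0, fileName.Length - Extension.Length);
        return DateTime.ParseExact(stem, Pattern, CultureInfo.InvariantCulture);
    }
}
```

Because the components run from most significant to least significant and are zero-padded, an ordinary string sort of these names is also a chronological sort.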

XNA Sound Playback

The Microsoft.Xna.Framework.Audio namespace lets you play back pre-recorded sounds using the related SoundEffect and SoundEffectInstance classes, whose names surely betray their common function in the context of an XNA game! But the static SoundEffect.FromStream method requires a Stream object referencing a standard Windows WAV file complete with RIFF header, and I didn’t want to bother with file formats.
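
For anyone who does want to go the WAV route, the header is not much work. The following sketch, which SpeakMemo does not use, writes the common 44-byte RIFF header for PCM data in front of the samples. WavWriter is a made-up name.

```csharp
using System.IO;
using System.Text;

public static class WavWriter
{
    // Writes a 44-byte RIFF/WAVE header followed by the PCM data.
    public static void Write(Stream output, byte[] pcm,
                             int sampleRate, short channels, short bitsPerSample)
    {
        short blockAlign = (short)(channels * bitsPerSample / 8);
        int byteRate = sampleRate * blockAlign;

        BinaryWriter writer = new BinaryWriter(output);
        writer.Write(Encoding.UTF8.GetBytes("RIFF"));
        writer.Write(36 + pcm.Length);     // size of everything after this field
        writer.Write(Encoding.UTF8.GetBytes("WAVE"));
        writer.Write(Encoding.UTF8.GetBytes("fmt "));
        writer.Write(16);                  // size of the fmt chunk
        writer.Write((short)1);            // format tag 1 means PCM
        writer.Write(channels);
        writer.Write(sampleRate);
        writer.Write(byteRate);
        writer.Write(blockAlign);
        writer.Write(bitsPerSample);
        writer.Write(Encoding.UTF8.GetBytes("data"));
        writer.Write(pcm.Length);          // size of the sample data
        writer.Write(pcm);
        writer.Flush();
    }
}
```

Writing into a MemoryStream, resetting its Position to zero and handing it to SoundEffect.FromStream should then yield a playable SoundEffect, although I haven't verified that on a phone.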

For working with raw PCM data rather than WAV files, you’ll instead want to use the DynamicSoundEffectInstance class, which derives from SoundEffectInstance. This class is ideal for the data generated from the Microphone class or for programs that dynamically create their own waveform data, such as music synthesizer programs.

The DynamicSoundEffectInstance constructor requires a sample rate and a number of channels; if you’re using this class with data generated from the microphone, obviously you’ll want to keep it consistent:

DynamicSoundEffectInstance playback = 
  new DynamicSoundEffectInstance(
  microphone.SampleRate, AudioChannels.Mono);

On the other hand, if you want the playback to sound like a fast-talking chipmunk, simply multiply that first argument by two. DynamicSoundEffectInstance expects data to have a 16-bit sample size. The class has Play, Pause, Resume and Stop methods to control the playback, and a State property indicates the current state. The class works somewhat the opposite of Microphone: It fires a BufferNeeded event when it requires a new buffer. Your job is to fill up a buffer with PCM data and call SubmitBuffer.

To avoid audible gaps in the sound, in the general case you’ll want to maintain a queue of buffers in the DynamicSoundEffectInstance class and submit a new buffer while the previous buffer is still playing. The class helps out with a PendingBufferCount property that indicates the number of buffers in the queue. The BufferNeeded event is fired when the PendingBufferCount changes and is less than or equal to two.

However, if you just need to play an entire chunk of PCM data, it’s possible to call SubmitBuffer without bothering with the BufferNeeded event. At first, this was how I was using the class in the SpeakMemo program, but I discovered it wasn’t possible to determine when the buffer had completed playing. There is no “state changed” event, and even if there were, DynamicSoundEffectInstance doesn’t switch from the Playing state to the Stopped state when finished with the buffer. It’s still expecting more buffers. Not knowing this information prevented the program from correctly toggling the visuals of the Play/Pause button.

I ended up handling the BufferNeeded event, but only to take the opportunity to check the PendingBufferCount property. When PendingBufferCount gets down to zero, the buffer has completed playback.
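
In code, the idea looks something like the following sketch. MemoPlayer and its Completed event are hypothetical names and this is not the SpeakMemo source. The class assumes that nothing is queued when Play is called and that FrameworkDispatcher.Update is being called regularly.

```csharp
using System;
using Microsoft.Xna.Framework.Audio;

// Hypothetical helper class, not SpeakMemo's actual code.
public class MemoPlayer
{
    readonly DynamicSoundEffectInstance playback;

    // Raised when the submitted buffer has finished playing.
    public event EventHandler Completed;

    public MemoPlayer(int sampleRate)
    {
        playback = new DynamicSoundEffectInstance(sampleRate, AudioChannels.Mono);
        playback.BufferNeeded += OnBufferNeeded;
    }

    // Submits the whole memo as a single buffer of 16-bit PCM data.
    public void Play(byte[] pcm)
    {
        playback.SubmitBuffer(pcm);
        playback.Play();
    }

    void OnBufferNeeded(object sender, EventArgs e)
    {
        // There is no more data to supply; just watch for the queue to drain.
        if (playback.PendingBufferCount == 0)
        {
            playback.Stop();

            if (Completed != null)
                Completed(this, EventArgs.Empty);
        }
    }
}
```

A handler attached to Completed is then a natural place to switch the button visuals from Pause back to Play.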

Storage Issues

SpeakMemo stores recorded memos in isolated storage. Conceptually isolated storage is private to the application, but physically it’s part of a total storage area that’s analogous to the hard drive of a desktop computer. All the application executables are stored there, as well as the phone’s photo library, music library, video library and much more. The hardware specification for Windows Phone 7 requires the phone to have at least 8GB of flash memory for this storage area, and the phone itself will alert the user when the storage is getting low.

Storing the memo files was not my big concern. I was more worried about the program’s heap. Aside from the flash memory storage, the Windows Phone 7 hardware specification also requires 256MB of RAM. This is the memory that an application occupies when it’s running, and which provides the program’s local heap. My experimentations revealed that SpeakMemo could allocate an array up to 90MB in size before it raised an out-of-memory exception. This is equivalent to about 47 minutes of sound from the microphone.

This doesn’t mean that a Windows Phone 7 program is necessarily limited to 47 minutes of recording time. But a program that wants to record that much continuous sound must progressively save buffers to isolated storage to free up memory, and then load the file incrementally when playing it back. This was not how SpeakMemo was structured. Instead, the program saved and loaded entire files, and I didn’t feel inclined to abandon that much simpler structure.
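
For comparison, the incremental approach might be sketched as follows. This is not how SpeakMemo works, and ChunkedPcm is a made-up name. The code uses plain Stream objects; on the phone they would come from isolated storage.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ChunkedPcm
{
    // Appends one microphone buffer to the file as soon as it arrives,
    // so the whole recording never has to sit in memory at once.
    public static void Append(Stream file, byte[] buffer, int count)
    {
        file.Write(buffer, 0, count);
    }

    // Reads the file back in pieces of at most chunkSize bytes.
    // A piece can be shorter than chunkSize, notably the last one.
    public static IEnumerable<byte[]> ReadChunks(Stream file, int chunkSize)
    {
        byte[] chunk = new byte[chunkSize];
        int read;

        while ((read = file.Read(chunk, 0, chunkSize)) > 0)
        {
            byte[] data = new byte[read];
            Buffer.BlockCopy(chunk, 0, data, 0, read);
            yield return data;
        }
    }
}
```

During playback, each chunk would be passed to SubmitBuffer from the BufferNeeded handler, keeping a couple of buffers queued at all times to avoid audible gaps.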

For that reason, I simply set a 10-minute maximum on the memo duration. Once a recording reaches that length, it’s simply stopped and saved (which itself requires several seconds). To keep the program simple, there’s no warning. The recording simply stops as if the user had pressed the button. This automatic stop-and-save also occurs when the program is terminated or otherwise deactivated; for example, during tombstoning.

Of course, playing back a 10-minute memo is not exactly convenient, either. The Play button toggles between play and pause mode but there’s no way to rewind or fast forward. Those features could be added, but you know what that requires, right?

Yes: more buttons. Or perhaps even a Slider.


Charles Petzold is a longtime contributing editor to MSDN Magazine. His new book, “Programming Windows Phone 7” (Microsoft Press, 2010), is available as a free download from bit.ly/dr0Hdz.

Thanks to the following technical expert for reviewing this article: Mark Hopkins