April 2012

Volume 27 Number 04

Touch and Go - Musical Instruments for Windows Phone

By Charles Petzold | April 2012

Every Windows Phone has a built-in speaker and a headphone jack, and it would surely be a shame if their use was limited to making phone calls. Fortunately, Windows Phone applications can also use the phone’s audio facilities for playing music or other sounds. As I’ve demonstrated in recent installments of this column, a Windows Phone application can play MP3 or WMA files stored in the user’s music library, or files downloaded over the Internet.

A Windows Phone app can also dynamically generate audio waveforms, a technique called “audio streaming.” This is an extremely data-intensive activity: For CD-quality sound you need to generate 16-bit samples at a rate of 44,100 samples per second for both left and right channels, or a whopping 176,400 bytes per second!

But audio streaming is a powerful technique. If you combine it with multi-touch, you can turn your phone into an electronic music instrument, and what could be more fun than that?

Conceiving a Theremin

One of the very earliest electronic music instruments was created by Russian inventor Léon Theremin in the 1920s. The player of a theremin doesn’t actually touch the instrument. Instead, the player’s hands move relative to two antennas, which separately control the volume and pitch of the sound. The result is a spooky quivering wail that glides from note to note—familiar from movies such as “Spellbound” and “The Day the Earth Stood Still,” the occasional rock band, and season 4, episode 12 of “The Big Bang Theory.” (Contrary to popular belief, a theremin was not used for the “Star Trek” theme.)

Can a Windows Phone be turned into a hand-held theremin? That was my goal.

The classical theremin generates sound through a heterodyning technique in which two high-frequency waveforms are combined to produce a difference tone in the audio range. But this technique is impractical when waveforms are generated in computer software. It makes much more sense to generate the audio waveform directly.

After toying briefly with the idea of using the phone’s orientation to control the sound, or for the program to view and interpret hand movements through the phone’s camera à la Kinect, I settled on a much more prosaic approach: A finger on the phone’s screen is a two-dimensional coordinate point, which allows a program to use one axis for frequency and the other for amplitude.

Doing this intelligently requires a little knowledge about how we perceive musical sounds.

Pixels, Pitches and Amplitudes

Thanks to pioneering work by Ernst Weber and Gustav Fechner in the 19th century, we know that human perception is logarithmic rather than linear. Incremental linear changes in the magnitude of stimuli are not perceived as equal. What we instead perceive as equal are changes that are proportional to the magnitude, often conveniently expressed as fractional increases or decreases. (This phenomenon extends beyond our sensory organs. For example, we feel that the difference between $1 and $2 is much greater than the difference between $100 and $101.)

Human beings are sensitive to audio frequencies roughly between 20Hz and 20,000Hz, but our perception of frequency is not linear. In many cultures, musical pitch is structured around the octave, which is a doubling of frequency. When you sing “Somewhere Over the Rainbow,” the two syllables of the first word are an octave apart regardless of whether the leap is from 100Hz to 200Hz, or from 1,000Hz to 2,000Hz. The range of human hearing is therefore about 10 octaves.

The octave is called an octave because in Western music it encompasses eight lettered notes of a scale where the last note is an octave higher than the first: A, B, C, D, E, F, G, A (which is called a minor scale) or C, D, E, F, G, A, B, C (the major scale).

Due to the way these notes are derived, they are not perceptually equally distant from one another. A scale in which all the notes are equally distant requires five more notes for a total of 12 (not counting the first note twice): C, C#, D, D#, E, F, F#, G, G#, A, A# and B. Each of these steps is known as a semitone, and if they’re equally spaced (as they are in common equal-temperament tuning), each note has a frequency that is the 12th root of two (or about 1.059) times the frequency of the note below it.

The semitone can be further divided into 100 cents. There are 1,200 cents to the octave. The multiplicative step between cents is the 1,200th root of two, or 1.000578. The sensitivity of human beings to changes in frequency varies widely, of course, but is generally cited to be about five cents.
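
In formula form, assuming equal temperament and using f_0 for an arbitrary starting frequency (a symbol introduced here for illustration), a note n semitones higher and the sizes of the steps work out roughly as:

```latex
f_n = f_0 \cdot 2^{n/12}, \qquad
2^{1/12} \approx 1.0595, \qquad
2^{1/1200} \approx 1.000578, \qquad
2^{5/1200} \approx 1.0029
```

So the five-cent threshold corresponds to a frequency change of roughly 0.3 percent.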

This background into the physics and mathematics of music is necessary because the theremin program needs to convert a pixel location of a finger to a frequency. This conversion should be done so that each octave corresponds to an equal number of pixels. If we decide that the theremin is to have a four-octave range corresponding to the 800-pixel length of the Windows Phone screen in landscape mode, that’s 200 pixels per octave, or six cents per pixel, which corresponds nicely with the limits of human perception.
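
To make that arithmetic explicit, the mapping can be written as a formula. The symbols x (pixels measured from the low end of the range) and f_min (the lowest frequency) are labels introduced here for illustration:

```latex
\frac{4 \times 1200\ \text{cents}}{800\ \text{pixels}} = 6\ \text{cents per pixel},
\qquad
f(x) = f_{\min} \cdot 2^{x/200}
```

Every 200 pixels of travel doubles the frequency, which is what gives each octave the same width on the screen.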

The amplitude of a waveform determines how we perceive the volume, and this, too, is logarithmic. A decibel is defined as 10 times the base 10 logarithm of the ratio of two power levels. Because the power of a waveform is the square of the amplitude, the decibel difference between two amplitudes A_1 and A_2 is:

$$\mathrm{dB} = 20 \log_{10}\left(\frac{A_1}{A_2}\right)$$

CD audio uses 16-bit samples, which allows that ratio between maximum and minimum amplitudes to be 65,536. Take the base 10 logarithm of 65,536 and multiply by 20 and you get a 96-decibel range.

One decibel is about a 12 percent increase in amplitude. Human perception of changes in amplitude is much less sensitive than perception of changes in frequency. A few decibels are required before people notice a change in volume, so this can be easily accommodated on the 480-pixel dimension of the Windows Phone screen.
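
As a quick check of those two figures, assuming the usual amplitude form of the decibel (20 times the base 10 logarithm of the amplitude ratio):

```latex
\begin{aligned}
20 \log_{10}(65{,}536) &= 20 \times 4.816 \approx 96.3\ \text{dB} \\
10^{1/20} &\approx 1.122 \quad \text{(about 12 percent per decibel)}
\end{aligned}
```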

Making It Real

The downloadable code for this article is a single Visual Studio solution named MusicalInstruments. The Petzold.MusicSynthesis project is a DLL that mostly includes files I discussed in last month’s installment of this column (msdn.microsoft.com/magazine/hh852599). The Theremin application project consists of a single landscape page.

What type of waveform should a theremin generate? In theory, it’s a sine wave, but in reality it’s a somewhat distorted sine wave, and if you try to research this question on the Internet, you won’t find a lot of consensus. For my version, I stuck with a straight sine wave, and it seemed to sound reasonable.

As shown in Figure 1, the MainPage.xaml.cs file defines several constant values and computes two integers that govern how the pixels of the display correspond to notes.

Figure 1 Amplitude and Frequency Calculation for Theremin

public partial class MainPage : PhoneApplicationPage
{
  static readonly Pitch MIN_PITCH = new Pitch(Note.C, 3);
  static readonly Pitch MAX_PITCH = new Pitch(Note.C, 7);
  static readonly double MIN_FREQ = MIN_PITCH.Frequency;
  static readonly double MAX_FREQ = MAX_PITCH.Frequency;
  static readonly double MIN_FREQ_LOG2 = Math.Log(MIN_FREQ) / Math.Log(2);
  static readonly double MAX_FREQ_LOG2 = Math.Log(MAX_FREQ) / Math.Log(2);
  double xStart;      // The X coordinate corresponding to MIN_PITCH
  int xDelta;         // The number of pixels per semitone

  void OnLoaded(object sender, EventArgs args)
  {
    int count = MAX_PITCH.MidiNumber - MIN_PITCH.MidiNumber;
    xDelta = (int)((ContentPanel.ActualWidth - 4) / count);
    xStart = (int)((ContentPanel.ActualWidth - count * xDelta) / 2);
  }

  double CalculateAmplitude(double y)
  {
    return Math.Min(1, Math.Pow(10, -4 * (1 - y / ContentPanel.ActualHeight)));
  }

  double CalculateFrequency(double x)
  {
    return Math.Pow(2, MIN_FREQ_LOG2 + (x - xStart) / xDelta / 12);
  }
}

The range is from the C below middle C (a frequency of about 130.8Hz) to the C three octaves above middle C, about 2,093Hz. Two methods calculate a frequency and a relative amplitude (ranging from 0 to 1) based on the coordinates of a touch point obtained from the Touch.FrameReported event. 
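
To see what numbers these calculations produce, here is a minimal standalone sketch of the same mapping. It is not the code in the download: the ThereminMapping class and its method names are hypothetical, the panel size is passed in rather than read from ContentPanel, and the Pitch class is replaced by MIDI note numbers on the assumption that middle C is MIDI note 60 and A above middle C is 440Hz.

```csharp
using System;

// Hypothetical standalone version of the Figure 1 mapping (illustration only).
public static class ThereminMapping
{
    const int MinMidi = 48;   // C below middle C, assuming middle C is MIDI note 60
    const int MaxMidi = 96;   // C three octaves above middle C

    // Equal-temperament frequency of a MIDI note, assuming note 69 is 440Hz.
    public static double MidiToFrequency(int midi)
    {
        return 440.0 * Math.Pow(2, (midi - 69) / 12.0);
    }

    // Whole pixels per semitone for a panel of the given width.
    public static int PixelsPerSemitone(double width)
    {
        return (int)((width - 4) / (MaxMidi - MinMidi));
    }

    // X coordinate of the lowest note, centering the note range in the panel.
    public static int StartX(double width)
    {
        int count = MaxMidi - MinMidi;
        return (int)((width - count * PixelsPerSemitone(width)) / 2);
    }

    // Frequency for a touch at horizontal position x.
    public static double FrequencyAt(double x, double width)
    {
        double semitones = (x - StartX(width)) / PixelsPerSemitone(width);
        return MidiToFrequency(MinMidi) * Math.Pow(2, semitones / 12);
    }

    // Relative amplitude for a touch at vertical position y:
    // 0.0001 at y = 0, rising to 1 at y = height (an 80-decibel span).
    public static double AmplitudeAt(double y, double height)
    {
        return Math.Min(1, Math.Pow(10, -4 * (1 - y / height)));
    }
}
```

Assuming a panel 800 pixels wide, this gives 16 pixels per semitone with the lowest note at x = 16, so FrequencyAt(16, 800) is about 130.8 and FrequencyAt(784, 800) is about 2,093. The real ContentPanel may well be narrower than the full screen, in which case the numbers shift accordingly.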

If you just use these values to control a sine wave oscillator, it won’t sound like a theremin at all. As you move your finger across the screen, the program doesn’t get an event for every single pixel along the way. Instead of a smooth frequency glide, you’ll hear very discrete steps. To solve this problem, I created a special oscillator class, shown in Figure 2. This oscillator inherits a Frequency property but defines three more properties: Amplitude, DestinationAmplitude and DestinationFrequency. Using multiplicative factors, the oscillator itself provides gliding. The code can’t actually anticipate how fast a finger is moving, but in most cases it seems to work OK.

Figure 2 The ThereminOscillator Class

public class ThereminOscillator : Oscillator
{
  readonly double ampStep;
  readonly double freqStep;

  public const double MIN_AMPLITUDE = 0.0001;

  public ThereminOscillator(int sampleRate)
    : base(sampleRate)
  {
    ampStep = 1 + 0.12 * 1000 / sampleRate;     // ~1 db per msec
    freqStep = 1 + 0.005 * 1000 / sampleRate;   // ~10 cents per msec
  }

  public double Amplitude { set; get; }
  public double DestinationAmplitude { get; set; }
  public double DestinationFrequency { set; get; }

  public override short GetNextSample(double angle)
  {
    this.Frequency *= this.Frequency < this.DestinationFrequency ?
                                     freqStep : 1 / freqStep;
    this.Amplitude *= this.Amplitude < this.DestinationAmplitude ?
                                     ampStep : 1 / ampStep;
    this.Amplitude = Math.Max(MIN_AMPLITUDE, Math.Min(1, this.Amplitude));
    return (short)(short.MaxValue * this.Amplitude * Math.Sin(angle));
  }
}

Figure 3 shows the handler for the Touch.FrameReported event in the MainPage class. When a finger first touches the screen, Amplitude is set to a minimum value so the sound rises in volume. When the finger is released, the sound fades out.

Figure 3 The Touch.FrameReported Handler in Theremin

void OnTouchFrameReported(object sender, TouchFrameEventArgs args)
{
  TouchPointCollection touchPoints = args.GetTouchPoints(ContentPanel);

  foreach (TouchPoint touchPoint in touchPoints)
  {
    Point pt = touchPoint.Position;
    int id = touchPoint.TouchDevice.Id;

    switch (touchPoint.Action)
    {
      case TouchAction.Down:
        oscillator.Amplitude = ThereminOscillator.MIN_AMPLITUDE;
        oscillator.DestinationAmplitude = CalculateAmplitude(pt.Y);
        oscillator.Frequency = CalculateFrequency(pt.X);
        oscillator.DestinationFrequency = oscillator.Frequency;
        HighlightLines(pt.X, true);
        touchID = id;
        break;

      case TouchAction.Move:
        if (id == touchID)
        {
           oscillator.DestinationFrequency = CalculateFrequency(pt.X);
           oscillator.DestinationAmplitude = CalculateAmplitude(pt.Y);
           HighlightLines(pt.X, true);
        }
        break;

      case TouchAction.Up:
        if (id == touchID)
        {
          oscillator.DestinationAmplitude = 0;
          touchID = Int32.MinValue;
          // Remove highlighting
          HighlightLines(0, false);
        }
        break;
    }
  }
}

As you can see from the code, the Theremin program generates just a single tone and ignores multiple fingers.

Although the theremin frequency varies continuously, the screen nevertheless displays lines to indicate discrete notes. These lines are colored red for C and blue for F (the colors used for harp strings), white for naturals and gray for accidentals (the sharps). After playing with the program awhile, I decided it needed some visual feedback indicating what note the finger was actually positioned on, so I made the lines widen based on their distance from the touch point. Figure 4 shows the display when the finger is between C and C# but closer to C.

Figure 4 The Theremin Display

Latency and Distortion

One big problem with software-based music synthesis is latency, the delay between user input and the subsequent change in sound. This is pretty much unavoidable: Audio streaming in Silverlight requires that an application derive from MediaStreamSource and override the GetSampleAsync method, which supplies audio data on demand through a MemoryStream object. Internally, this audio data is maintained in a buffer. The existence of this buffer helps ensure that the sound is played back without any disconcerting gaps, but of course playback of the buffer will always trail behind the filling of the buffer.

Fortunately, MediaStreamSource defines a property named AudioBufferLength that indicates the size of the buffer in milliseconds of sound. (This property is protected and can be set only within the MediaStreamSource derivative prior to opening the media.) The default value is 1,000 (or 1 second) but you can set it as low as 15. A lower setting increases the interaction between the OS and the MediaStreamSource derivative and might result in gaps in the sound. However, I found that the minimum setting of 15 seemed to be satisfactory.

Another potential problem is simply not being able to crank out the data. Your program needs to generate tens or hundreds of thousands of bytes per second, and if it can’t do this in an efficient manner, the sound will start breaking up and you’ll hear a lot of crackling.

There are a couple ways to fix this: You can make your audio-generation pipeline more efficient (as I’ll discuss shortly) or you can reduce the sampling rate. I found that the CD sampling rate of 44,100 was too much for my programs, and I took it down to 22,050. Reducing it further to 11,025 might also be necessary. It’s always good to test your audio programs on a couple different Windows Phone devices. In a commercial product, you’ll probably want to give the user an option of reducing the sampling rate.

Multiple Oscillators

The Mixer component of the synthesizer library has the job of assembling multiple inputs into composite left and right channels. This is a fairly straightforward job, but keep in mind that each input is a waveform with a 16-bit amplitude, and the output is also a waveform with a 16-bit amplitude, so the inputs must be attenuated based on how many there are. For example, if the Mixer component has 10 inputs, each input must be attenuated to one-tenth of its original value.

This has a profound implication: Mixer inputs can’t be added or removed while music is playing without increasing or decreasing the volume of the remaining inputs. If you want a program that can potentially play 25 different sounds at once, you’ll need 25 constant mixer inputs.
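
The attenuation rule can be illustrated with a minimal sketch. This is not the Mixer class from the Petzold.MusicSynthesis library; MixerSketch and Mix are hypothetical names, and the sketch handles a single channel only. It sums one sample from each input and divides by the number of inputs, which amounts to attenuating each input to 1/N of its value.

```csharp
using System;

// Hypothetical single-channel mixer (illustration only).
public static class MixerSketch
{
    // Mixes one 16-bit sample from each input into one 16-bit output sample.
    public static short Mix(short[] inputs)
    {
        if (inputs.Length == 0)
            return 0;

        // An int accumulator cannot overflow for any realistic number of
        // 16-bit inputs.
        int sum = 0;
        foreach (short sample in inputs)
            sum += sample;

        // Dividing by the input count keeps the result within 16-bit range
        // even when every input is at full amplitude.
        return (short)(sum / inputs.Length);
    }
}
```

With a fixed set of four inputs, a lone sample of 10,000 comes out as 2,500 whether the other three inputs are silent or not in use. If the silent inputs were removed instead, the divisor would change and the same sample would suddenly come out four times louder, which is the volume jump described above.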

This is the case with the Harp application in the MusicalInstruments solution. I envisioned an instrument with strings that I could pluck with my fingertip, but which I could also strum for the common harp glissando sound.

As you can see from Figure 5, visually it’s very similar to the theremin, but with only two octaves rather than four. The strings for the accidentals (the sharps) are positioned at the top, while the naturals are at the bottom, which somewhat mimics the type of harp known as the cross-strung harp. You can perform a pentatonic glissando (at the top), a chromatic glissando (in the middle) or a diatonic glissando (on the bottom).

Figure 5 The Harp Program

For the actual sounds, I used 25 instances of a SawtoothOscillator class, which generates a simple sawtooth waveform that grossly approximates a string sound. It was also necessary to make a rudimentary envelope generator. In real life, musical sounds don’t start and stop instantaneously. The sound takes a while to get going, and then might fade out by itself (such as with a piano or harp), or might fade out after the musician stops playing it. An envelope generator controls these changes. I didn’t need anything as sophisticated as a full-blown attack-decay-sustain-release (ADSR) envelope, so I created a simpler AttackDecayEnvelope class. (In real life, a sound’s timbre—governed by its harmonic components—also changes over the duration of a single tone, so the timbre should be controlled by an envelope generator as well.)
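
For readers who have not met envelope generators before, here is one way a two-stage envelope might look. This is only a sketch and not the AttackDecayEnvelope class in the download: the class name, the linear attack, the exponential decay and the 60-decibel decay target are all assumptions made for illustration, and it uses floating point for clarity.

```csharp
using System;

// Hypothetical attack-decay envelope (illustration only).
public class SimpleAttackDecayEnvelope
{
    readonly int attackSamples;
    readonly double decayFactor;
    int position = -1;          // -1 means the envelope is idle
    double level;

    public SimpleAttackDecayEnvelope(int sampleRate, double attackMs, double decayMs)
    {
        attackSamples = Math.Max(1, (int)(sampleRate * attackMs / 1000));
        int decaySamples = Math.Max(1, (int)(sampleRate * decayMs / 1000));

        // Per-sample factor that lowers the level by 60 dB over the decay time.
        decayFactor = Math.Pow(0.001, 1.0 / decaySamples);
    }

    // Starts (or restarts) the envelope, as when a string is plucked.
    public void Trigger()
    {
        position = 0;
        level = 0;
    }

    // Returns the gain (0 to 1) to apply to the next oscillator sample.
    public double GetNextValue()
    {
        if (position < 0)
            return 0;

        if (position < attackSamples)
        {
            // Attack: rise linearly to full level.
            position++;
            level = (double)position / attackSamples;
        }
        else
        {
            // Decay: fall exponentially, then go idle once inaudible.
            level *= decayFactor;
            if (level < 0.0001)
            {
                position = -1;
                level = 0;
            }
        }
        return level;
    }
}
```

Each sample from the oscillator would be multiplied by GetNextValue before it reaches the mixer, so an idle string contributes silence while still occupying its mixer input.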

For visual feedback, I decided I wanted the strings to vibrate. Each string is actually a quadratic Bezier segment, with the central control point collinear with the two end points. By applying a repeating PointAnimation to the control point, I could make the strings vibrate.

In practice, this was a disaster. The vibrations looked great but the sound degenerated into extreme crackling ugliness. I switched to something a little less severe: I used a DispatcherTimer and offset the points manually at a much slower rate than a real animation.

After playing around with the Harp program a little while, I was unhappy with the flicking gesture required for plucking the strings, so I added some code to trigger the sound with just a tap. At that point, I probably should have changed the name of the program from Harp to HammeredDulcimer, but I let it go.

Avoiding Floating Point

On the Windows Phone device that I was using for most of my development, the Harp worked fine. On another Windows Phone it was extremely crackly, indicating that the buffers could not be filled quickly enough. This analysis was confirmed by halving the sampling rate. The crackling stopped with a sampling rate of 11,025Hz, but I wasn’t ready to sacrifice the sound quality.

Instead, I started taking a close look at the pipeline that provided these thousands of samples per second. These classes—Mixer, MixerInput, SawtoothOscillator and AttackDecayEnvelope—all had one thing in common: They all used floating-point arithmetic in some way to compute these samples. Could switching to integer calculations help speed up this pipeline enough to make a difference?

I rewrote my AttackDecayEnvelope class to use integer arithmetic, and did the same thing with the SawtoothOscillator, which is shown in Figure 6. These changes improved performance significantly.

Figure 6 The Integer Version of SawtoothOscillator

public class SawtoothOscillator : IMonoSampleProvider
{
  int sampleRate;
  uint angle;
  uint angleIncrement;

  public SawtoothOscillator(int sampleRate)
  {
    this.sampleRate = sampleRate;
  }

  public double Frequency
  {
    set { angleIncrement = (uint)(UInt32.MaxValue * value / sampleRate); }
    get { return (double)angleIncrement * sampleRate / UInt32.MaxValue; }
  }

  public short GetNextSample()
  {
    angle += angleIncrement;
    return (short)((angle >> 16) + short.MinValue);
  }
}

In the oscillators that use floating point, the angle and angleIncrement variables are of type double where angle ranges from 0 to 2π and angleIncrement is calculated like so:

$$\text{angleIncrement} = \frac{2\pi \cdot \text{frequency}}{\text{sampleRate}}$$

For each sample, angle is increased by angleIncrement.

I didn’t eliminate floating point entirely from SawtoothOscillator. The public Frequency property is still defined as a double, but that’s only used when the oscillator’s frequency is set. Both angle and angleIncrement are unsigned 32-bit integers. The full 32-bit values are used when angleIncrement increases the value of angle, but only the top 16 bits are used as a value for calculating a waveform.
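
The integer approach relies on the 32-bit angle wrapping around on overflow, which takes the place of subtracting 2π in the floating-point version. The following standalone sketch shows that behavior; PhaseAccumulatorDemo and its method names are hypothetical, and only the two expressions borrowed from Figure 6 come from the article.

```csharp
using System;

// Hypothetical demonstration of the 32-bit phase accumulator (illustration only).
public static class PhaseAccumulatorDemo
{
    // Same conversion as the Frequency setter in Figure 6.
    public static uint ToIncrement(double frequency, int sampleRate)
    {
        return (uint)(UInt32.MaxValue * frequency / sampleRate);
    }

    // Same conversion as GetNextSample in Figure 6: the top 16 bits of the
    // angle, shifted down into the signed 16-bit range.
    public static short ToSample(uint angle)
    {
        return (short)((angle >> 16) + short.MinValue);
    }

    // Counts how many times the angle wraps around during one second of
    // samples. Each wrap is one complete sawtooth cycle.
    public static int CountCycles(double frequency, int sampleRate)
    {
        uint increment = ToIncrement(frequency, sampleRate);
        uint angle = 0;
        int wraps = 0;

        for (int i = 0; i < sampleRate; i++)
        {
            // unchecked makes the wraparound explicit regardless of
            // compiler settings.
            uint next = unchecked(angle + increment);
            if (next < angle)
                wraps++;
            angle = next;
        }
        return wraps;
    }
}
```

An angle of 0 maps to short.MinValue and an angle of UInt32.MaxValue maps to short.MaxValue, so the output ramps across the full 16-bit range once per cycle. Because the increment is truncated to an integer, the cycle count for one second can come out a hair under the requested frequency.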

Even with these changes, the program still doesn’t run as well on what I now think of as my “slow phone” as compared to my “fast phone.” Sweeping a finger across the whole screen still causes some crackling.

But what’s true with any musical instrument is also true with electronic music instruments: You need to become familiar with the instrument, and know not only its power but its limitations.

Charles Petzold is a longtime contributor to MSDN Magazine. His Web site is charlespetzold.com.

Thanks to the following technical experts for reviewing this article:  Mark Hopkins