Creating Karaoke, Windows Media Player Style

 

Stephen Toub

Microsoft Corporation

September 2004

 

Applies to:

   Microsoft® Windows Media® Player 10

 

Summary: Describes how to write a Microsoft Windows Media Player Audio DSP plug-in that can remove primary vocals from audio files.

Download the KaraokeMaker Source Code.msi sample file (196 KB)

 

Contents

Introduction

Using the Windows Media Player 10 SDK

Understanding Digital Sound

Modifying Code in the Audio DSP Plug-in Wizard

Conclusion

Introduction

Years ago I came across a WinAmp plug-in that attempted to remove primary vocals from audio files and did so with relative success. Since that time I've upgraded to a superior music platform, Microsoft® Windows Media® Player, but have often thought back to that plug-in and have hoped for something similar in the Player. While Microsoft has never included such a feature in the product, with the advent of Windows Media Player 9 Series and its software development kit (SDK), adding such functionality is easily done in a lazy New York summer afternoon.

Using the Windows Media Player 10 SDK

The Windows Media® Player SDK, now at version 10, describes itself as "document[ing] programming technologies that can be used to extend the capabilities of Windows Media Player." It also includes a wizard for Microsoft Visual C++® 6 and Visual C++ .NET that generates all of the boilerplate C++ needed to get up and running with a new Windows Media Player 10 plug-in. The wizard can create four types of plug-ins, as shown in Table 1 (variations also exist within each type; for example, the wizard can create both Audio DSP and Video DSP plug-ins).

Table 1. Windows Media Player plug-in types

Type                      Description
Custom Visualizations     Display imagery synchronized to audio.
User Interface Plug-ins   Add new functionality to the Now Playing pane of the player.
DSP Plug-ins              Process/modify audio and video data before it's rendered.
Rendering Plug-ins        Decode and render custom data contained in a Windows Media format stream.

These plug-ins allow for the augmentation of many aspects of the Player's functionality. For our purposes (a plug-in to remove primary vocals), we're interested in modifying the input audio data before it is played, and therefore, we want to create an Audio DSP plug-in. (For more information about the types of plug-ins that are available, see Building Windows Media Player and Windows Media Encoder Plug-ins.)

The boilerplate code generated by the Windows Media Player 10 SDK wizard when you create an Audio DSP plug-in compiles to a fully functional plug-in that allows you to modify the volume of the current audio. Compiling the project and opening the Player reveals that the new plug-in is registered, available, and working.

To download the Windows Media Player 10 SDK, go to the Windows Media Downloads Web page.

Understanding Digital Sound

To understand the data passed to an Audio DSP plug-in, one must have at least a minimal understanding of how sound works. "Sound" is composed of a series of pressure waves traveling through some medium, often air. The amplitude of these waves corresponds to what we hear as volume, while the frequency of the waves corresponds to what we hear as pitch. Digital sound is represented as a series of samples, taken at equally spaced intervals, each recording the amplitude of the wave at that instant. When the sound is played, the samples are converted back to waves, and the information missing between samples is filled in by interpolating from the captured samples. Therefore, the more samples taken and the more information captured in each sample, the better the quality of the digital audio and the more faithful its representation of the original sound. For example, the standard CD audio format includes 44,100 samples per second, with each sample occupying 16 bits. The standard DVD audio format has 96,000 samples per second, each at 24 bits.
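To get a feel for how much data these figures imply, the arithmetic can be sketched as a small helper. The function name PcmBytesPerSecond is hypothetical, and the channel count is an assumption added for this sketch, because the figures above give only the sample rate and the bit depth for a single channel.

```cpp
#include <cstdint>

// Hypothetical helper: bytes of uncompressed PCM audio per second.
// Assumes each sample is stored in a whole number of bytes, so a
// 20-bit sample is counted as 3 bytes.
std::uint32_t PcmBytesPerSecond(std::uint32_t samplesPerSecond,
                                std::uint32_t bitsPerSample,
                                std::uint32_t channels)
{
    // Round the bit depth up to whole bytes.
    std::uint32_t bytesPerSample = (bitsPerSample + 7) / 8;
    return samplesPerSecond * bytesPerSample * channels;
}
```

With two channels, the CD figures come to 176,400 bytes per second and the 96,000-sample, 24-bit figures come to 576,000 bytes per second, which gives a rough idea of how many bytes a plug-in has to touch for every second of audio.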

When Windows Media® Player plays an audio file, it constructs an array of bytes representing the stream of amplitude samples and supplies this data to registered DSP plug-ins. A plug-in can then interpret and massage this data however it sees fit, sending the modified data back to the Player, at which time it can be passed to another plug-in in the filter chain. Eventually, when there are no more plug-ins, this data is rendered and presented to the user.

Note that DSP plug-ins are disabled when playing protected content, such as Windows Media Audio (WMA) files associated with digital rights management (DRM) licenses. As a result, Audio DSP plug-ins won't work for commercial DVD audio tracks or for most audio files purchased from Microsoft MSN® Music.

Modifying Code in the Audio DSP Plug-in Wizard

The code generated by the Audio DSP plug-in wizard simply modifies the amplitude of each sample, multiplying it by a user-supplied value, and consequently, uniformly scaling the volume of the audio. To see this, examine a portion of the DoProcessOutput method, as shown here:

// Return number of bytes actually copied to output buffer.
*cbBytesProcessed = dwSamplesToProcess * sizeof(BYTE);

// 8-bit sound is 0..255 with 128 == silence.
while (dwSamplesToProcess--)
{
    // Get the input sample and normalize to -128 .. 127.
    int i = (*pbInputData++) - 128;

    // Apply scale factor to sample.
    i = int( ((double) i) * m_fScaleFactor );

    // Truncate if exceeded full scale.
    if (i > 127) i = 127;
    if (i < -128) i = -128;

    // Convert back to 0..255 and write to output buffer.
    *pbOutputData++ = (BYTE)(i + 128);
}

The DoProcessOutput method is called each time the Player supplies the plug-in with a new sequence of audio bytes to be processed. Because the data supplied to the plug-in can come from a variety of different sources in a variety of different formats, the DoProcessOutput method must be capable of interpreting the bytes based on the number of bits used for each sample. Therefore, if the plug-in accepts audio with bit depths of 8, 16, 20, etc., its DoProcessOutput method must be capable of interpreting the byte array accordingly. The previous sample shows the handling for 8-bit data, though if you look at the entire DoProcessOutput in the wizard-generated code, you'll see similar code for each bit depth. For example, the code to process 16-bit samples looks like this:

// Return number of bytes actually copied to output buffer.
*cbBytesProcessed = dwSamplesToProcess * sizeof(short);

// 16-bit sound is -32768..32767 with 0 == silence.
short *pwInputData = (short*) pbInputData;
short *pwOutputData = (short*) pbOutputData;
 
while (dwSamplesToProcess--)
{
    // Get the input sample.
    int i = *pwInputData++;

    // Apply scale factor to sample.
    i = int( ((double) i) * m_fScaleFactor );
            
    // Truncate if exceeded full scale.
    if (i > 32767) i = 32767;
    if (i < -32768) i = -32768;

    // Write to output buffer.
    *pwOutputData++ = i;
}

In this section of the code, the plug-in is working with 16-bit audio, so it must first interpret the bytes as such. The byte pointers are therefore cast to short pointers, so that each value read represents one sample from one channel. The plug-in then loops through all of the samples, scaling each input value by the user-supplied multiplication factor (in the wizard-generated code, this is limited to a real value between 0 and 1, inclusive) and storing the result in the output array. With DSP plug-ins, the input array should not be modified; rather, all data to be rendered by the Player must be copied to the output array.
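Because the wizard code lives inside a COM object, it can be awkward to experiment with. The following is a minimal stand-alone sketch of the same 16-bit scaling step, written so that it can be exercised outside the Player. The function name ScaleSamples16 and its parameters are hypothetical; only the scale-and-clamp logic mirrors the wizard excerpt above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone version of the 16-bit scaling loop.
// Reads from 'input' and writes to 'output'; the input is never
// modified, matching the rule for DSP plug-ins described above.
void ScaleSamples16(const std::int16_t* input, std::int16_t* output,
                    std::size_t sampleCount, double scaleFactor)
{
    for (std::size_t n = 0; n < sampleCount; ++n)
    {
        // Apply the scale factor; the conversion truncates toward zero.
        int i = static_cast<int>(input[n] * scaleFactor);

        // Clamp to the 16-bit range.
        if (i > 32767) i = 32767;
        if (i < -32768) i = -32768;

        output[n] = static_cast<std::int16_t>(i);
    }
}
```

The wizard limits the factor to the range 0 through 1, so the clamp never triggers there, but keeping it means the sketch still behaves sensibly if a larger factor is passed in.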

My karaoke plug-in is only slightly more complex than the sample generated by the wizard. It still needs to loop through all of the values in the input data, tweak them, and write them to the output array.

The "tweak" used in this plug-in is certainly not new: it is the classic phase-cancellation method found in many similar applications, both software- and hardware-based. The process relies on the fact that in many stereo recordings (at least in popular music) the lead vocal is mixed approximately equally onto the left and right channels. By canceling out the parts of the channels that are identical, I hope to eliminate the lead vocals from the sound. To do this, I simply invert the left channel and mix it equally with the right channel (see Figure 1).

Figure 1. Attempting to cancel out lead vocals
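One way to see why the inversion works is to write each channel as a sum. The symbols here are notation introduced for this sketch, not names from the plug-in: V is the material mixed identically into both channels (ideally the lead vocal), A and B are whatever is unique to the left and right channels, and s is the scale factor applied to the result.

```latex
\begin{aligned}
L[n] &= V[n] + A[n] \\
R[n] &= V[n] + B[n] \\
\mathrm{out}[n] &= s\,\bigl(R[n] - L[n]\bigr) = s\,\bigl(B[n] - A[n]\bigr)
\end{aligned}
```

The shared term V drops out exactly only when it really is identical in both channels, so a vocal that is panned or has stereo effects applied will only be reduced. Anything else mixed dead center cancels along with the vocal, and what remains is a single difference signal rather than a stereo pair.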

The code for this process that parallels the 16-bit code shown previously can be seen here:

// Return number of bytes actually copied to output buffer.
*cbBytesProcessed = dwSamplesToProcess * sizeof(short);

// 16-bit sound is -32768..32767 with 0 == silence.
short   *pwInputData = (short *) pbInputData;
short   *pwOutputData = (short *) pbOutputData;

// Process both channels at the same time, so continually
// grab next two samples (one is left channel, one is right).
while (dwSamplesToProcess-- && dwSamplesToProcess--)
{
    // Get the input sample.
    int left = *pwInputData++;
    int right = *pwInputData++;

    // Invert left and mix it with right. Any waves that match in
    // the two channels cancel completely, leaving only the
    // difference between the channels. The post-scale factor then
    // scales that difference, which on its own can exceed the
    // 16-bit range.
    int i = int(((-1 * left) + right) * m_fPostScaleFactor);

    // Truncate if exceeded full scale.
    if (i > 32767) i = 32767;
    else if (i < -32768) i = -32768;

    // Write to output buffer (for both left and right).
    *pwOutputData++ = i;
    *pwOutputData++ = i;
}
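To try the cancellation on raw sample data without loading the Player, the loop can be pulled out into a free function. This is a sketch under stated assumptions: the name RemoveCenter16 and its parameters are hypothetical, and the buffer is assumed to hold interleaved stereo frames (left, right, left, right, and so on), as the loop above implies.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone version of the vocal-removal loop.
// 'input' and 'output' each hold frameCount interleaved stereo
// frames, that is, 2 * frameCount 16-bit samples.
void RemoveCenter16(const std::int16_t* input, std::int16_t* output,
                    std::size_t frameCount, double postScaleFactor)
{
    for (std::size_t f = 0; f < frameCount; ++f)
    {
        int left = input[2 * f];
        int right = input[2 * f + 1];

        // Inverted left plus right: identical content cancels.
        int i = static_cast<int>((right - left) * postScaleFactor);

        // Clamp, because the difference can span -65535..65535.
        if (i > 32767) i = 32767;
        else if (i < -32768) i = -32768;

        // The same difference signal goes to both output channels.
        output[2 * f] = static_cast<std::int16_t>(i);
        output[2 * f + 1] = static_cast<std::int16_t>(i);
    }
}
```

Counting in frames rather than in individual samples is a design choice of this sketch: it makes it impossible to read a left sample without its matching right sample.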

With DoProcessOutput implemented, I've completed almost everything I need to have a fully functional, vocals-removing plug-in. The wizard generates the necessary DllRegisterServer and DllUnregisterServer functions for you:

STDAPI DllRegisterServer(void)
{
    CComPtr<IWMPMediaPluginRegistrar> spRegistrar;
    HRESULT hr;

    // Create the registration object.
    hr = spRegistrar.CoCreateInstance(
        CLSID_WMPMediaPluginRegistrar, NULL, CLSCTX_INPROC_SERVER);
    if (FAILED(hr)) return hr;
    
    // Load friendly name and description strings.
    CComBSTR    bstrFriendlyName;
    CComBSTR    bstrDescription;

    bstrFriendlyName.LoadString(IDS_FRIENDLYNAME);
    bstrDescription.LoadString(IDS_DESCRIPTION);

    // Describe the type of data handled by the plug-in.
    DMO_PARTIAL_MEDIATYPE mt = { 0 };
    mt.type = MEDIATYPE_Audio;
    mt.subtype = MEDIASUBTYPE_PCM;

    // Register the plug-in with Windows Media Player.
    hr = spRegistrar->WMPRegisterPlayerPlugin(
           bstrFriendlyName, // Friendly name (for menus, etc.).
           bstrDescription,  // Description (for Tools->Options->Plug-ins).
           NULL,             // Path to application that uninstalls the plug-in.
           1,                // DirectShow priority for this plug-in.
           WMP_PLUGINTYPE_DSP, // Plug-in type.
           CLSID_KaraokeMaker, // Class ID of plug-in.
           1,                // Number of media types supported by plug-in.
           &mt);             // Array of media types supported by plug-in.

    if (FAILED(hr))  return hr;

    // Registers object, typelib and all interfaces in typelib.
    return _Module.RegisterServer();
}

STDAPI DllUnregisterServer(void)
{
    CComPtr<IWMPMediaPluginRegistrar> spRegistrar;
    HRESULT hr;

    // Create the registration object.
    hr = spRegistrar.CoCreateInstance(
        CLSID_WMPMediaPluginRegistrar, NULL, CLSCTX_INPROC_SERVER);
    if (FAILED(hr)) return hr;

    // Tell Windows Media Player to remove this plug-in.
    hr = spRegistrar->WMPUnRegisterPlayerPlugin(
        WMP_PLUGINTYPE_DSP, CLSID_KaraokeMaker);

    return _Module.UnregisterServer();
}

These take care of calling the IWMPMediaPluginRegistrar methods to register the plug-in with Windows Media® Player 10, creating the registry entries necessary for the Player to find the plug-in and to list it on the Plug-ins list on the Options menu (see Figure 2).

Figure 2. Plug-in registered in Windows Media Player 10

The ValidateMediaType method, also wizard-generated, takes care of conveying to the Player the types of audio the plug-in supports. For my plug-in, this requires a minor change to let the Player know that the plug-in only supports audio with two channels. To do this I can return DMO_E_TYPE_NOT_ACCEPTED if pWave->nChannels != 2.
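The channel check itself is a single comparison. The sketch below isolates it behind stand-in types so that it compiles without the Windows headers. WaveFormatSketch and IsSupportedChannelCount are hypothetical names; in the real plug-in the comparison would be made against the wave format structure that pWave points to, and the failing case would return DMO_E_TYPE_NOT_ACCEPTED instead of false.

```cpp
// Hypothetical stand-in for the two wave format fields of interest;
// the real structure comes from the Windows headers.
struct WaveFormatSketch
{
    unsigned short nChannels;      // 1 = mono, 2 = stereo
    unsigned short wBitsPerSample; // 8, 16, and so on
};

// Returns true only for two-channel audio, which is the one case
// the left/right cancellation can work with.
bool IsSupportedChannelCount(const WaveFormatSketch& wave)
{
    return wave.nChannels == 2;
}
```

Rejecting other formats up front means the processing loop never has to handle mono or multichannel buffers.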

My plug-in also doesn't require a property page that allows for user input (the wizard generates a property page with a single input text box for the user to enter an amplitude scaling value), so I can delete all code relevant to that.

You can test the plug-in by running it on the Like Humans Do track that ships with Microsoft Windows® XP. Select and clear KaraokeMaker for Windows Media Player 10 under Plug-ins on the Tools menu to hear the difference.

Conclusion

Writing a simple audio plug-in like this for Windows Media® Player 10 is a snap. More complicated plug-ins that require customizations to the ProcessInput and ProcessOutput methods, as well as to some of the other required interface methods, will of course take more time. However, even adding the appropriate modifications to them is a fairly straightforward task, leaving the bulk of the time to write the logic specific to your plug-in. Plug-ins like this one, while enjoyable to write, are also very enjoyable to use. Please excuse me while I go sing my favorite Broadway tunes, accompanied by the London Symphony Orchestra!

About the author

Stephen Toub is the Technical Editor for MSDN Magazine, for which he also writes the .NET Matters column.