From the July 2002 issue of MSDN Magazine

DirectShow: Core Media Technology in Windows XP Empowers You to Create Custom Audio/Video Processing Components

 

This article assumes you're familiar with C++ and COM

Level of Difficulty    1   2   3

Download the code for this article: DirectShow.exe (45KB)

SUMMARY DirectShow is an API that enables Windows applications to control a wide variety of audio/video input devices including (but not limited to) DV camcorders, Web cams, DVD drives, and TV tuner cards. It provides out-of-the-box support for a variety of formats, from WAV and AVI to Windows Media. DirectShow is also extensible, enabling third parties to support their own specialized devices, formats, or processing components. This article introduces the basic concepts behind DirectShow and gives a step-by-step tutorial showing how to create your own video effect filter.


One of the major features of Microsoft® Windows® XP is enhanced digital audio and video support, which is especially visible through improvements to Windows Media Player and Movie Maker. You may not realize that these features are partly based on a core-level technology that has been around for over six years. This technology, DirectShow®, is a powerful streaming architecture designed for audio and video data, widely used by the major digital media ISVs, and supported by virtually all hardware digital media devices that run on Windows.

DirectShow offers a high-level application model that enables rapid development of digital media applications, and a low-level plug-in model that enables third parties to create custom audio or video processing components. We'll show you both sides of DirectShow, from the high-level basics of application development to the low-level manipulation of individual pixels in a video frame.

The Evolution of DirectShow

The first support in Microsoft Windows for video capture was provided through Video for Windows (VfW) for Windows 3.1. VfW was pretty good for its time, but it had certain limitations. Chief among these was the fact that the Video Compression Manager (VCM) was not designed to handle codecs that put video frames into a different order during the compression process. This meant that it would be difficult to write VCM-based MPEG codecs.

Windows also provided simple audio-video playback support through the Media Control Interface (MCI) command set, which used the mciavi driver. Although the MCI infrastructure allowed for MPEG decoders, it was never fully ported to the 32-bit world, and it was not based on COM. To address these limitations, Microsoft began a project known as Quartz whose initial charter was to provide MPEG-1 file playback support for Windows.

Put yourself in the shoes of the Quartz team for a moment. You see on the horizon a multitude of new devices, such as digital video camcorders, new media formats such as MPEG-2 and DVD, and new technologies such as video conferencing. You see that what is really needed is a framework in which all these new technologies can work together seamlessly, with maximum efficiency, not only with each other but also with legacy technologies. Furthermore, this framework should be extensible so that third parties can provide support for special hardware capabilities, nonstandard formats, and software processing operations. The framework should simplify application development as much as possible, while still providing the ability to control low-level streaming operations and modify audio-video data when necessary.

To address these broader requirements, the Quartz team started with an existing project called Clockwork. Clockwork was a modular framework in which semi-independent components worked together, following a prescribed set of rules, to process digital media streams in whatever way was required by an application. The Quartz team adapted this model to work with Windows and continually evolving third-party devices. The result was a COM-based streaming architecture that over the past six years has served as the basis for hundreds of Windows digital media applications.

The architecture was initially named ActiveMovie®, and it first shipped in 1995 with the DirectX® Media SDK, around the same time that Microsoft Internet Explorer 3.0 was released. In 1996, ActiveMovie was renamed DirectShow. In 1998, with DirectX Media 6.1, support was added for DVD and analog television applications. In 2000, DirectShow became part of the DirectX SDK with DirectX 8.0, and support was added for Windows Media Format and DirectShow Editing Services, a video editing API. The current version of DirectX is 8.1, with version 9.0 available in beta.

Media Streams and Filters

Digital video streams are sequences of video frames that may be uncompressed RGB bitmaps or, if the stream is compressed, a set of numerical values that enable a decoder to reconstruct the final image. The exact contents of the data vary depending on the video format. In all formats, full-motion video is typically played back (rendered) at approximately 25 or 30 frames per second. Uncompressed digital audio streams consist of sequences of samples, each of which is an integer that represents the quantized (rounded-off) amplitude of an analog signal at a given point in time. For CD-quality audio, the samples have a precision of 16 bits and are recorded and played back at 44.1 kHz. A compressed audio stream does not contain actual sequences of samples, but, as with video streams, contains values the decoder uses to reconstruct the original stream before passing it to the sound card.

Besides compression and decompression, which are really just a necessary evil in this world of limited storage and bandwidth, digital audio and video streams may be processed in varied and interesting ways. Streams can be combined, analyzed, reorganized, copied, generated, and modified in ways that are either impossible or much more complex in the analog world. And the richness of digital media lies precisely in this seemingly limitless potential for stream processing operations of all kinds.

In DirectShow, any and all stream operations are encapsulated as filters, COM objects that have a standard behavior along with whatever custom capabilities they may be given. File readers, demultiplexers, compressors and decompressors, audio and video renderers, even device drivers are filters in the sense that they know how to communicate with—and stream data to—other filters. Applications are built by connecting these filters together in order to perform a given task.

Filters come in three basic types: source filters for input, transform filters for any intermediate processing step, and renderer filters for output.

A source filter introduces data into the stream. This data may originate in a file or in a device such as a camcorder, Web cam, TV tuner, network stream, or any type of existing or still-to-be-invented device. DirectShow is tightly coupled with the Windows Driver Model (WDM); any media device with a correctly implemented WDM driver is automatically exposed to user-mode applications as a DirectShow source filter, complete with whatever interfaces the hardware vendor has exposed to enable applications to get and set properties on the device. DirectShow also provides source filters for inputting data from files, from DVDs, and from VfW devices.

A transform filter receives input data from some other filter, performs some operation on the data, and then passes the data to another filter downstream. Transformations may be parsing of streams, encoding or decoding, overlay of text, or any type of analysis or manipulation of the audio or video bits. DirectShow provides many transform filters for handling various compression and file formats, including analog and digital television signals. Later, we'll show you how to create your own transform filter that accepts color video frames as input and outputs black and white video. You can use this filter to give your home movies that old 1950s look!

A renderer filter accepts data from either a source or transform filter and outputs it to the screen, the speakers, a file, a device, or some other location. The "Direct" in DirectShow reflects the fact that the renderer filters use DirectDraw® and DirectSound® technologies to efficiently pass data to graphics and sound cards. In addition, DirectShow supports kernel streaming, which enables capture devices such as TV tuners and DVD drives to pass data to output devices entirely in kernel mode, saving the expense of kernel-to-user mode transitions when the application does not require them.

The division of labor into independent filter modules obviously maximizes code reuse. For example, in any file playback scenario, there will be two common operations—reading the raw byte stream from the file and outputting the final results to the graphics and/or sound card. No matter what the file format may be, you can always use the same DirectShow filters to accomplish these tasks. All that changes are the intermediate filters.

Application Model

DirectShow is designed to make life as easy as possible for developers while still giving them the ability to control lower-level operations when necessary. DirectShow simplifies development by locating virtually all of the streaming and media-processing intelligence in individual filters; the main task of an application is to get these filters working together. For that task, applications use a high-level object called the Filter Graph Manager (FGM) to connect filters together in whatever combination is required to perform a given task, whether it be simple playback or more complex tasks such as format conversion, video analysis, color correction, and so on. The collection of connected filters is called a filter graph. Data in a filter graph always moves downstream from the source filter to the renderer.

You can use the FGM to add filters individually, or you can use the Intelligent Connect logic built into the FGM to automate the graph building process. Using Intelligent Connect, you can build complex filter graphs with only one or two method calls. Once the graph is built, the application controls its operation through Run, Stop, and Pause methods on the FGM, which handles the low-level synchronization details. The FGM relays events from the filter graph back to the application.

DirectShow gives you the ability to control lower-level streaming operations by exposing COM interfaces on individual filters. Typically, applications use these interfaces to configure a filter before streaming begins. For example, the DV Encoder filter exposes the IDVEnc interface, which enables an application to set various encoding parameters (such as NTSC or PAL) and the dimensions of the output video rectangle.

Figure 1 shows a filter graph that performs simple playback of a digital video file saved in AVI format. You can create this graph with a single call to the FGM's IGraphBuilder::RenderFile method.

Figure 2 shows the code to create this filter graph. As you can see, DirectShow packs quite a bit of functionality into the RenderFile method. Given a file name, the FGM checks the file extension, as well as the actual bits in the file stream, to verify the file format. Then it searches the registry for a filter associated with that format type. If it finds one, it adds the filter to the graph.

The FGM then examines the output type from that filter. Typically this differs from the input type because some kind of transformation has been applied to the data. (In the case of splitter or demultiplexer filters, there may be multiple output streams.) The FGM then goes back to the registry and searches for filters that can accept this next type as input. The process is repeated until a renderer filter has been found and connected.

By default, the FGM will select a renderer that outputs to the graphics or sound card. If you want to output to a file, you can simply add a file writer filter to the graph (without connecting it) before calling RenderFile. Before searching the registry, the FGM first tries filters already present in the graph.
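
To see how little application code this requires, here is a minimal playback sketch built around RenderFile (Figure 2 contains the article's full version; error handling, window handling, and cleanup on failure are trimmed here):

#include <dshow.h>

void PlayFile(LPCWSTR wszFile)
{
    IGraphBuilder *pGraph = NULL;
    IMediaControl *pControl = NULL;
    IMediaEvent *pEvent = NULL;

    CoInitialize(NULL);

    // Create the Filter Graph Manager and get its control interfaces.
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void**)&pGraph);
    pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);
    pGraph->QueryInterface(IID_IMediaEvent, (void**)&pEvent);

    // Intelligent Connect builds the entire playback graph from the file name.
    pGraph->RenderFile(wszFile, NULL);

    // Run the graph and wait for playback to finish.
    pControl->Run();
    long evCode = 0;
    pEvent->WaitForCompletion(INFINITE, &evCode);

    pControl->Stop();
    pEvent->Release();
    pControl->Release();
    pGraph->Release();
    CoUninitialize();
}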


Figure 3: Capturing Live Video from Web Cam

DirectShow also provides helper objects for constructing more complex types of graphs. Graphs can get data from a live capture source, such as a camcorder or television tuner, and save the stream to disk or preview it on the screen. Figure 3 shows a graph that captures live video from a Web cam, writes it to an AVI file, and displays a preview window so that you can view what you are capturing. The code to create this is shown in Figure 4. It is very similar to the previous graph, but now we use a hardware device as a source, so we use some DirectShow helper objects to locate the device and build the graph. We render both a file-writing stream and a preview stream so that we can view the video on the screen while we are recording it.
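
As a rough, hedged sketch of the kind of code Figure 4 contains, the capture graph can be built with ICaptureGraphBuilder2 and the system device enumerator along these lines (COM initialization, error handling, running the graph, and cleanup are omitted; the output file name is a placeholder):

ICaptureGraphBuilder2 *pBuilder = NULL;
IGraphBuilder *pGraph = NULL;
IBaseFilter *pCap = NULL;

// Create the capture graph builder and a filter graph for it to manage.
CoCreateInstance(CLSID_CaptureGraphBuilder2, NULL, CLSCTX_INPROC_SERVER,
                 IID_ICaptureGraphBuilder2, (void**)&pBuilder);
CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                 IID_IGraphBuilder, (void**)&pGraph);
pBuilder->SetFiltergraph(pGraph);

// Use the system device enumerator to find the first video capture device.
ICreateDevEnum *pDevEnum = NULL;
IEnumMoniker *pEnum = NULL;
CoCreateInstance(CLSID_SystemDeviceEnum, NULL, CLSCTX_INPROC_SERVER,
                 IID_ICreateDevEnum, (void**)&pDevEnum);
pDevEnum->CreateClassEnumerator(CLSID_VideoInputDeviceCategory, &pEnum, 0);

IMoniker *pMoniker = NULL;
if (pEnum && pEnum->Next(1, &pMoniker, NULL) == S_OK)
{
    pMoniker->BindToObject(NULL, NULL, IID_IBaseFilter, (void**)&pCap);
    pGraph->AddFilter(pCap, L"Capture");
    pMoniker->Release();
}

// Hook up the file-writing stream (AVI Mux plus File Writer)...
IBaseFilter *pMux = NULL;
pBuilder->SetOutputFileName(&MEDIASUBTYPE_Avi, L"C:\\Capture.avi", &pMux, NULL);
pBuilder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video, pCap, NULL, pMux);

// ...and the preview stream, which displays the live video in a window.
pBuilder->RenderStream(&PIN_CATEGORY_PREVIEW, &MEDIATYPE_Video, pCap, NULL, NULL);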

Now that you have an idea of how to construct a basic DirectShow application, let's stop to consider some of the messy little details we didn't touch on:

  • Efficient file I/O, handled by the File Source (Async) filter
  • Format parsing and decompressing details, handled by the intermediate transform filters
  • Connecting to a device driver, handled by the capture graph builder
  • Audio-video synchronization, handled by the source and renderer filters
  • Coordination of streaming operations during play, pause, stop, and run state transitions, handled by the FGM

While it would be an exaggeration to say that all application development in DirectShow is trivial, it should be clear how the DirectShow application model, together with the large selection of standard filters, can dramatically shorten development times for AV applications on the Windows platform.

The Media Type

The other main benefit of DirectShow is its extensibility. Let's look more closely at the inner workings of the filter graph. (It bears repeating that if you are simply writing applications, the details that follow are ones you can, in most cases, happily ignore.)

The modular design of DirectShow requires that all filters follow certain rules for communicating with other filters. These rules make up the connection and streaming protocols. But before filters can communicate with these protocols, they must speak a common language. This language is the media type. In DirectShow parlance, a media type is the shorthand way of referring to a group of data structures that collectively describe the data in a stream. Filters communicate by passing media types back and forth to indicate what types of data they can process.

The high-level descriptor is the AM_MEDIA_TYPE structure:

typedef struct _MediaType {
    GUID      majortype;
    GUID      subtype;
    BOOL      bFixedSizeSamples;
    BOOL      bTemporalCompression;
    ULONG     lSampleSize;
    GUID      formattype;
    IUnknown  *pUnk;
    ULONG     cbFormat;
    BYTE      *pbFormat;
} AM_MEDIA_TYPE;

The first two members, majortype and subtype, are optimizations used during the connection protocol to identify a stream type. They are used primarily by the FGM in its Intelligent Connect logic and by filters when negotiating a connection. Several majortype GUIDs are defined; the three primary ones are Audio, Video, and Stream. The first two are self-explanatory. Stream indicates either an as-yet-unknown media type or a data stream that has not yet been demultiplexed and therefore still contains multiple elementary streams of different types. Subtypes provide a finer level of detail for filters or the FGM to quickly determine whether a given filter can do what is required. Dozens of subtypes are defined for various audio and video formats.

The lSampleSize member specifies the size of a media sample. Typically, an audio sample contains 200 to 2000 milliseconds of audio data, and a video sample contains one complete frame. This member is used to determine what size buffers to use when setting up the allocator, which we'll discuss later. If bFixedSizeSamples is false, it indicates that the stream is compressed and that not all frames, at this point in the stream, are of equal size. In this case, the lSampleSize member is ignored and the filter, usually a decoder, will need some other way to determine the buffer size.

The bTemporalCompression member indicates that the data stream is compressed using an inter-frame compression scheme such as MPEG. This member isn't really used in practice, since the media subtype and format block header provide the same information as bTemporalCompression.

The really interesting information about a media stream is contained in the pbFormat member, which points to a dynamically allocated format block that may be one of the several structures defined by DirectShow (or your own structure, as long as you have filters that can understand it). The formattype member identifies what type of data structure is being pointed to, and the cbFormat member indicates the size in bytes of that structure (see Figure 5).


Figure 5: Info

Let's look at the VIDEOINFOHEADER structure as an example, since we'll be using it in our own filter, and also because it is the basis for other video-related structures:

typedef struct tagVIDEOINFOHEADER {
    RECT              rcSource;
    RECT              rcTarget;
    DWORD             dwBitRate;
    DWORD             dwBitErrorRate;
    REFERENCE_TIME    AvgTimePerFrame;
    BITMAPINFOHEADER  bmiHeader;
} VIDEOINFOHEADER;

As you can see here, a VIDEOINFOHEADER is essentially a BITMAPINFOHEADER with some additional members that specify certain streaming parameters. This reflects the fact that a video stream is basically a sequence of device independent bitmaps (DIBs) displayed in succession at some specified rate, generally between 10 and 30 frames per second. The rcSource and rcTarget rectangles allow filters to stretch or shrink incoming video bitmaps to smaller subareas within the video rectangle. The BITMAPINFOHEADER, familiar to anyone who has done any GDI graphics programming, specifies the width and height of the rectangle, the bits per pixel, and the type of compression, if any, that is used. For audio streams, the WAVEFORMATEX structure provides the same level of information. DV, MPEG-1, and MPEG-2 video streams use other structures that carry additional information that is specific to those formats.
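
As an example of how a filter reads this information, the following hedged fragment pulls the frame dimensions and frame rate out of a negotiated media type (pmt is assumed to be a pointer to the AM_MEDIA_TYPE):

// Assumes pmt points to an AM_MEDIA_TYPE whose format block is a VIDEOINFOHEADER.
if (pmt->formattype == FORMAT_VideoInfo &&
    pmt->cbFormat >= sizeof(VIDEOINFOHEADER) &&
    pmt->pbFormat != NULL)
{
    VIDEOINFOHEADER *pVih = (VIDEOINFOHEADER*)pmt->pbFormat;
    LONG width  = pVih->bmiHeader.biWidth;
    LONG height = pVih->bmiHeader.biHeight;   // may be negative (top-down image)
    WORD bits   = pVih->bmiHeader.biBitCount;

    // AvgTimePerFrame is a REFERENCE_TIME, measured in 100-nanosecond units.
    double fps = (pVih->AvgTimePerFrame > 0) ?
                 10000000.0 / pVih->AvgTimePerFrame : 0.0;
}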

The Connection Protocol

Before two filters can stream data, they must use the language of the media type to agree on what that data is. Filters accomplish this in the connection protocol. To be precise, it is not the filter that performs the connection protocol; rather, it is a separate object that is created and owned by the filter, called a pin. Pins come in two varieties, input and output. Whenever a filter needs to connect with another filter, it does so through a pin.

The FGM or the application is responsible for adding filters to the graph and specifying which two pins should attempt to connect. But all the actual work of connecting is performed by the pins themselves. When two pins connect, they must agree on three things: a media type, which describes the format of the data; a transport, which defines how the pins will exchange data; and an allocator, a new object owned by one of the pins, which will create and manage the buffers that the two pins will use to exchange data.

The connection protocol begins when the FGM calls IPin::Connect on the output pin (see Figure 6). The FGM passes in a pointer to the downstream input pin and, optionally, a pointer to a complete or partially filled-in AM_MEDIA_TYPE structure. The output pin can examine the majortype and the subtype to determine whether it is even worth attempting to go ahead with the connection process. For example, in the case of a video filter, the FGM might hand over a media type whose major type is MEDIATYPE_Audio. The filter can then fail the Connect call immediately without wasting anyone's time.


Figure 6: Filter Connection Protocol

If the media type suggested by the FGM is acceptable, the output pin will call the input pin's IPin::EnumMediaTypes method (see Figure 6) to get a list of the input pin's preferred media types. The output pin loops through this list, examining each media type returned. If it sees something it likes, it can specify a complete media type based on the preferred type in a call to the input pin's IPin::ReceiveConnection method. If none of the media types are acceptable to the output pin, it will propose media types from its own list. If none of these media types succeed, the connection will fail.

Once the two pins agree on a media type, they move on to the next step, which is determining the transport. Basically, a transport defines whether the upstream filter will push data into the shared buffers, or whether the downstream filter must pull or request data from the upstream filters. The most common transport is the push model, represented by the IMemInputPin interface. If the output pin determines through a call to QueryInterface that the input pin supports this interface, then it knows to push the data downstream. At this point, the two pins are ready to negotiate the number and size of the memory buffers they will share when streaming begins. The upstream filter will write data into those buffers, and the downstream filter will read the data.

In most cases, the process of allocating and managing buffers is delegated to a separate allocator object, which is owned by one of the pins. The pins negotiate which allocator will be used and the size and number of buffers that it will require.

The output pin is responsible for selecting the allocator and setting its properties. It might ask the input pin to propose an allocator or ask for the input pin's buffer requirements. Then it notifies the input pin about which allocator and buffer properties it selected. The input pin can agree to these, or it can reject the connection. The allocators for most pins simply allocate memory from the heap on the host system, but sometimes pins on source and renderer filters have allocators that create buffers in hardware devices. These filters will always insist on using their own allocator. Most transform filters, such as the one we will create, have no special allocator requirements.

If the allocator negotiation succeeds, then the original Connect call from the FGM returns successfully, and the FGM begins the process again for the next filter in the graph. After all the filters have been connected in this way, the graph is ready to run.

The Streaming Protocol

Data in a filter graph is pushed through the graph either by a source filter (when the source is live, such as a TV tuner or camcorder) or a parser filter (when the data is read from a file). In the latter case, the parser filter requests the data from the file reader filter and pushes it down the rest of the graph. Pushing means that the filter fills up the downstream filter's buffers with data as fast as it can (see Figure 7).

Each block of data is encapsulated in a COM object that exposes the IMediaSample interface, which is used to store the buffer address and to get and set time stamps on the data. The time stamp indicates the time at which a video frame or audio sample should be rendered on the hardware.
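
As a small illustration, a filter that receives a sample can read its time stamps like this (pSample is assumed to be the IMediaSample pointer handed to the filter; times are REFERENCE_TIME values in 100-nanosecond units):

REFERENCE_TIME rtStart = 0, rtStop = 0;
HRESULT hr = pSample->GetTime(&rtStart, &rtStop);
if (hr == S_OK)
{
    // Both stamps are set: render this sample when stream time reaches rtStart.
}
else if (hr == VFW_S_NO_STOP_TIME)
{
    // Only the start time is set.
}
else
{
    // VFW_E_SAMPLE_TIME_NOT_SET: the sample carries no time stamps.
}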


Figure 7: Streaming Protocol

For most video processing operations on modern PCs, the processing power is more than enough to render 30 frames per second, so the problem is how to slow down the graph so that it runs at the correct speed. The renderer filters look at the time stamps on the media samples and wait until the current time matches the time stamp before rendering a frame or an audio sample. While they are waiting, the entire graph is blocked once all the buffers between each pair of filters are filled. The clock used by the renderers for this purpose is typically supplied by the audio renderer or sound card, which has a much finer resolution than the system clock on the CPU. When the renderers present a video frame or a group of audio samples to the hardware, this frees one buffer for the upstream filter, which fills it and releases its pointer to its own upstream buffer, which becomes available for a new sample from the next filter upstream, and so on up the chain.

Of course, sometimes you just want to process data without viewing it, so you don't want to control the rate of processing. In this case, you call a method on the FGM that tells the renderer not to pay any attention to the time stamps and just let the graph run as fast as it can. In other cases, it may be that some CPU-intensive filter, such as a video decoder, is not able to keep pace with the specified frame rate. For this reason, decoders watch the time stamps of the samples they receive. If a decoder receives a late sample—that is, one whose presentation time is already in the past—it will simply drop the sample and move on to the next one to keep the graph streaming at the correct speed and to keep the audio and video in sync.

Since the source filters and renderers provided with DirectShow are well-suited for most scenarios, most of the time you'll want to extend the architecture by adding your own custom transformation on a stream. Here you actually have two choices for implementation: a DirectX Media Object (DMO) or a DirectShow filter. As with most choices in life, each option has its advantages and disadvantages. We are going to show how to write a DirectShow filter, but you should also be aware of DMOs as a solution for certain types of tasks.

A DMO is a COM object that processes digital media data. DMOs are similar in many ways to DirectShow filters, but they are self-contained. They do not require a filter graph and can be used outside of DirectShow. You can also use DMOs in DirectShow, through a wrapper filter that DirectShow provides, although it has some limitations. Whether you should write a DMO or a filter really depends on the specific needs of your application. DMOs are definitely recommended for encoders and decoders. For other scenarios, a filter might be the appropriate choice. In general, the DMO APIs are simpler than the DirectShow filter APIs, although the DirectShow filter base classes handle most of the complex aspects of filter development. If you are planning to write a filter, you should at least investigate DMOs to see if they fit your needs.

Writing a Grayscale Transform Filter

Now let's write a custom filter for DirectShow. The example will be a video transform filter that converts color images to grayscale.

When you design a filter, one of the first things to decide is what media types the filter will support. Our grayscale filter will support UYVY, which is a YUV media type. YUV loosely describes a set of image formats that are common in video processing. In a YUV format, the brightness levels of the image are encoded separately from the color information. It's easy to convert a YUV image to grayscale because you simply discard the color information and keep the brightness information.

The UYVY format uses 16 bits per pixel. Each pixel has its own brightness byte, and two adjacent pixels share a pair of color bytes, so on average one byte per pixel holds brightness and one holds color. The byte layout looks like this

  U, Y, V, Y, U, Y, V, Y ...

where U and V are color information, and Y is brightness.

As described earlier, DirectShow requires that every format be expressed as a media type. For UYVY, the major type is MEDIATYPE_Video and the subtype is MEDIASUBTYPE_UYVY. To keep the example simple, our filter will only support a VIDEOINFOHEADER structure in the format block. The VIDEOINFOHEADER contains a BITMAPINFOHEADER structure, which describes each frame of the video.

The important members of the BITMAPINFOHEADER structure are as follows: biBitCount is 16 bits for UYVY, biCompression is set to the four-character code UYVY, biWidth and biHeight depend on the image size, and biSizeImage is the stride in bytes × the height in pixels. The stride of an image is the offset from each row to the address of the next row. For many image formats, the start of each row must be aligned to a DWORD boundary, so the stride can be larger than the image width. Also, the graphics hardware might impose a larger stride to take advantage of the GPU architecture. As you'll see later, DirectShow provides a mechanism for the video renderer to specify the stride.
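
For example, a 320 × 240 UYVY frame with no extra padding has a stride of 640 bytes (320 pixels × 2 bytes per pixel), so biSizeImage is 640 × 240 = 153,600 bytes.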

Our filter is a transform filter, so it receives video frames from an upstream filter, probably a decoder. The upstream filter must determine the image size—there is no way for us to know ahead of time. All of these details are handled in the connection protocol.

The VIDEOINFOHEADER structure contains two other members of interest. rcSource defines the clipping rectangle, which is the portion of the video to display. This might be a subrectangle of the original video image. rcTarget defines the destination rectangle. This is the portion of the output buffer that will contain the actual image data. Depending on the stride, it might be a subrectangle inside the buffer.

The Base Class Library

Because DirectShow handles so many aspects of streaming, a DirectShow filter is a fairly complex object. A lot of wiring must be provided. Luckily, much of this work has already been done. The DirectShow SDK includes source code for a C++ class library, which greatly simplifies the writing of a new filter. To use the base classes, compile them into a static library and link it to your project. The SDK documentation explains how to do this in more detail.

The DirectShow base class library includes C++ classes for source filters, transform filters, and renderers. Each of these derives from the generic CBaseFilter class. Our filter will use the CTransformFilter class, which is designed for transform filters that have exactly one input pin and one output pin. The CTransformFilter class uses two other classes, CTransformInputPin and CTransformOutputPin, which define the pins for the filter. Both of these pin classes inherit the generic CBasePin class, and they both use IMemInputPin for streaming (see the hierarchy in Figure 8). The CTransformFilter class always uses separate allocators for the two pin connections.


Figure 8: Class Hierarchy

To create our filter, we must derive a new class from CTransformFilter and override some of its methods. Which specific methods we need to override depends on how much we want to customize the filter. For this example, we'll perform the minimum set of actions that are required:

  • Pin connections: negotiate media types and buffer sizes.
  • Streaming: receive the video frame from the upstream filter, transform the image to grayscale, and deliver the transformed image to the downstream filter.
  • Some miscellany to support COM.

Negotiating the Media Type

A pin connection starts when the Filter Graph Manager calls IPin::Connect on an output pin, giving it a pointer to an input pin. The output pin can reject the connection or call IPin::ReceiveConnection on the input pin.

When the output pin calls ReceiveConnection, it must propose a media type. The input pin might already have a list of types that it prefers. If so, it advertises this list through the IPin::EnumMediaTypes method. In the base filter classes, the output pin starts by examining the input pin's list of preferred types. If none of these are suitable, the output pin tries its own list. The logic here is that the input pin's list contains formats it already likes, so if the output pin likes one of those too, you're in luck.

The CBasePin class implements the framework for the connection protocol. A pin that derives from CBasePin just needs to fill in some of the details, as defined in two virtual methods. CheckMediaType checks whether a particular media type is acceptable to the pin ("Is this format OK?"). GetMediaType retrieves a preferred media type from the pin in order of preference ("Give me a format you like").

We mentioned earlier that the CTransformFilter class uses two specialized pin classes, CTransformInputPin and CTransformOutputPin, which inherit CBasePin. These classes refine the connection process further by making some assumptions about how transform filters work.

CTransformInputPin::GetMediaType does not return any preferred types. Instead, the input pin relies on the upstream filter to propose a type. For video, this is a reasonable behavior because the upstream filter knows the image dimensions, which are part of the format. There is no way for our filter to guess the size of the image. On the other hand, for something like PCM audio, where the range of possible types is somewhat limited, it might make sense for the input pin to offer some preferred types. In this case, you can still override the pin class.

CTransformInputPin::CheckMediaType delegates to a pure virtual member function on the CTransformFilter class, CheckInputType. Our derived filter class must implement this method.

The CTransformOutputPin class refuses any pin connections until our filter's input pin is connected. The idea is that, for most transform filters, you don't know which formats you can output until you know the format you will receive. After the input pin is connected, the output pin will accept connections. At that point, the output pin delegates to a pair of pure virtual methods on the filter, CTransformFilter::GetMediaType and CTransformFilter::CheckTransform. As stated earlier, our filter class must implement both of these methods.

To summarize, there are three methods that our filter must implement to support pin connections: CTransformFilter::CheckInputType examines a proposed input type and accepts or rejects it. CTransformFilter::GetMediaType returns a preferred output type. CTransformFilter::CheckTransform examines a proposed output type and accepts or rejects it.

Figure 9 shows our implementation of the CheckInputType method. In this method, we test whether the proposed media type is a valid UYVY format. The parameter is a pointer to a CMediaType class, which is a thin wrapper for the AM_MEDIA_TYPE structure. You can directly access the structure members from this class.

The major type should be MEDIATYPE_Video and the subtype should be MEDIASUBTYPE_UYVY. As mentioned earlier, our filter only supports the FORMAT_VideoInfo format type. In that case, the format block is a VIDEOINFOHEADER structure. We examine it to make sure the caller has set the correct bit count, compression, and image size.

If we pass all of these hurdles, we can return S_OK. Otherwise, we did not get an acceptable type from the upstream filter, so we return VFW_E_TYPE_NOT_ACCEPTED.
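
Pulling those checks together, a hedged sketch of CheckInputType might read as follows (Figure 9 contains the article's actual implementation):

HRESULT CYuvGray::CheckInputType(const CMediaType *pmt)
{
    // Reject anything that is not video, UYVY, and described by a VIDEOINFOHEADER.
    if (pmt->majortype  != MEDIATYPE_Video ||
        pmt->subtype    != MEDIASUBTYPE_UYVY ||
        pmt->formattype != FORMAT_VideoInfo ||
        pmt->cbFormat < sizeof(VIDEOINFOHEADER) ||
        pmt->pbFormat == NULL)
    {
        return VFW_E_TYPE_NOT_ACCEPTED;
    }

    // Check the bitmap information: 16 bits per pixel, UYVY four-character code.
    VIDEOINFOHEADER *pVih = (VIDEOINFOHEADER*)pmt->pbFormat;
    if (pVih->bmiHeader.biBitCount != 16 ||
        pVih->bmiHeader.biCompression != MAKEFOURCC('U','Y','V','Y'))
    {
        return VFW_E_TYPE_NOT_ACCEPTED;
    }

    return S_OK;
}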

The CheckTransform method (see Figure 10) checks whether the filter can convert a given input format to a given output format. Our filter just removes color information, so we want the formats to match. As we mentioned earlier, however, the video renderer might request a different stride, based on the video hardware. If so, it sets biWidth equal to the new stride. The rcSource and rcTarget rectangles define the actual image size.

The CTransformFilter::GetMediaType method should return one of the filter's preferred output types, which the caller requests by index value. Again, we want the output format to match the input format. The output pin will not call this method unless our input pin is connected, so after validating the iPosition parameter we call the input pin's IPin::ConnectionMediaType method. This method returns a copy of the media type.

HRESULT CYuvGray::GetMediaType(int iPosition, CMediaType *pMediaType)
{
    if (iPosition < 0)
    {
        return E_INVALIDARG;
    }
    else if (iPosition == 0)
    {
        return m_pInput->ConnectionMediaType(pMediaType);
    }
    return VFW_S_NO_MORE_ITEMS;
}

Choosing a Buffer Size

After two pins agree on a media type, they must decide some things about memory allocation, including which pin will provide the allocator object, the number of buffers to allocate, and the size of the buffers. For an IMemInputPin connection, the output pin has the initiative and the final say. Optionally, it can call one of two methods on the input pin. IMemInputPin::GetAllocatorRequirements returns a structure that describes the input pin's buffer requirements, including the number of buffers and the size of each buffer. IMemInputPin::GetAllocator requests an allocator from the input pin. The input pin may or may not be able to provide one. If it can't provide one, then the output pin is responsible for creating the allocator.

To configure the allocator, the output pin calls IMemAllocator::SetProperties, which sets the number of buffers and the size of each buffer. Next, the output pin must inform the input pin by calling IMemInputPin::NotifyAllocator. This ensures that both pins are using the same allocator.

The CTransformFilter class implements most of these steps for you. The filter's input pin has no buffer requirements, so its implementation of GetAllocatorRequirements always returns E_NOTIMPL. The input pin relies on the fact that the upstream filter must create buffers that can hold the frames it will be delivering. For the output pin, you need to implement another pure virtual method, CTransformFilter::DecideBufferSize, which is called when the output pin connects to the downstream filter (see Figure 11).

Before DecideBufferSize gets called, our filter's output pin selects an allocator. It also asks the downstream input pin for that pin's buffer requirements, if any. The first argument to DecideBufferSize is a pointer to the allocator that was chosen. The second argument is an ALLOCATOR_PROPERTIES structure that contains the downstream filter's buffer requirements. If that filter doesn't have any requirements, the structure is zeroed out. Our job in DecideBufferSize is to select the properties that we want and set them on the allocator. We can ignore some or all of the downstream filter's request, but ignoring any of it could possibly cause the connection to fail.

There are four buffer properties: alignment, buffer count, buffer size, and prefix. (The prefix is an optional block of memory reserved in front of the buffer. Prefix is usually zero, but some filters request a prefix. For example, the AVI Mux filter uses the prefix to write RIFF headers in front of each sample.) Our filter enforces the following three requirements. First, the alignment and buffer count must be non-zero. Second, the buffer size must be large enough for both the upstream filter and the downstream filter. We retrieve the size of the input buffers, compare it with the downstream filter's request, and take the larger value. This should guarantee that the buffer can hold the video frame but is also large enough for the downstream filter. And third, we don't care about prefix, so we use whatever the downstream filter requested. The diagram in Figure 12 illustrates each of these steps.


Figure 12: DecideBufferSize

After you call SetProperties, make sure to check the results. The allocator might not be able to configure itself according to your request, especially if your filter doesn't own the allocator. SetProperties returns the actual properties in a separate structure (ActualProp). When DecideBufferSize returns, the base class takes care of calling NotifyAllocator on the input pin.
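
Putting these steps together, a hedged sketch of DecideBufferSize might look like the following (Figure 11 has the article's actual implementation; the error codes used here and the way the input buffer size is retrieved are our assumptions):

HRESULT CYuvGray::DecideBufferSize(IMemAllocator *pAlloc,
                                   ALLOCATOR_PROPERTIES *pProp)
{
    // The input pin must already be connected, or we don't know the frame size.
    if (!m_pInput->IsConnected())
    {
        return E_UNEXPECTED;
    }

    // Find out how big the upstream (input) buffers are.
    ALLOCATOR_PROPERTIES inputProps;
    IMemAllocator *pInputAlloc = NULL;
    HRESULT hr = m_pInput->GetAllocator(&pInputAlloc);
    if (FAILED(hr))
    {
        return hr;
    }
    hr = pInputAlloc->GetProperties(&inputProps);
    pInputAlloc->Release();
    if (FAILED(hr))
    {
        return hr;
    }

    // Alignment and buffer count must be non-zero; the buffer size must be the
    // larger of the input buffer size and the downstream request. The prefix is
    // left at whatever the downstream filter asked for.
    if (pProp->cbAlign == 0)  pProp->cbAlign = 1;
    if (pProp->cBuffers == 0) pProp->cBuffers = 1;
    if (pProp->cbBuffer < inputProps.cbBuffer)
    {
        pProp->cbBuffer = inputProps.cbBuffer;
    }

    // Set the properties and check what the allocator actually gave us.
    ALLOCATOR_PROPERTIES actual;
    hr = pAlloc->SetProperties(pProp, &actual);
    if (FAILED(hr))
    {
        return hr;
    }
    if (actual.cbBuffer < pProp->cbBuffer || actual.cBuffers < pProp->cBuffers)
    {
        return E_FAIL;
    }
    return S_OK;
}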

Streaming and Threading

Earlier, we mentioned the streaming protocol that all DirectShow filters must support. At any given time, a filter is in one of three possible states: stopped, paused, or running. A filter must be able to switch between these states in response to commands from the FGM. When the filter graph pauses, samples move through the graph until they reach the renderer. At that point, streaming will block until the application issues a run command. A transform filter can ignore the distinction between paused and running. In either state, it should accept new samples from upstream, process them, and deliver them downstream. When the filter is stopped, it should reject new samples.

Several other events can happen during streaming. Flushing occurs when the filter graph needs to clear out any stale data that the filters might have. For example, the graph is flushed if the application seeks a new point in a source file. Also, new segment messages can be issued by a filter to inform all the downstream filters that the next batch of data will form a continuous stream from the same source. The new segment message includes the start and stop times and the playback rate. Furthermore, various control messages can travel upstream, against the normal flow of data. These include seek requests.

The CTransformFilter class manages all of these activities. All you need to provide is a method to transform the input images, discussed next. If you decide to extend your filter with new functionality, you should understand the overall streaming model.

During streaming, the video renderer might request a format change. For instance, it might need to change the stride. The exact mechanisms for this are described in the documentation. The implementation of CTransformFilter translates the request into a call to the CheckTransform method described earlier. Assuming that our filter agrees to the new format, the video renderer will attach the media type to the next output sample. Before our filter processes a sample, it must check for a media type and update its internal state accordingly.

DirectShow is inherently multithreaded. Data moves through the filter graph on a worker thread called the streaming thread. Sometimes this thread is created by a source filter, which pushes samples downstream. In other cases, it is created by a parser filter, which pulls data from a source filter. Most transform filters do not create any threads. Instead, they process samples on the original streaming thread.

Even though our filter does not create any threads, every DirectShow filter operates in a free-threaded environment. Samples are received and delivered on the streaming thread, and state changes occur on the application thread. This can require some hard thinking about the proper use of critical sections to avoid deadlocks or race conditions. For a simple transform filter, the base classes hold critical sections at the right times, but you should be aware of this issue. For more information about threading in DirectShow, read "Threads and Critical Sections" in the MSDN Library.

Transforming the Image

The upstream filter delivers samples to our filter by calling IMemInputPin::Receive on our filter's input pin. The input pin delegates this call to the filter, which does several things. First it gets a new sample from our output pin's allocator. This sample will hold the converted grayscale image. Next, it copies all of the sample properties from the input sample to the output sample, including the time stamps. Then it calls CTransformFilter::Transform, a pure virtual method, to convert the image. And finally, it sends the output sample downstream.

At this point we need to implement the Transform method as shown in Figure 13. The first step is to check for a media type on the output sample by calling IMediaSample::GetMediaType. If there is one, it means the video renderer wants us to change formats. In that case, we call SetMediaType on our own filter's output pin to update its internal state.

Next, we get the address of the underlying buffers using the IMediaSample::GetPointer method. Then we call ProcessFrame, a private method that performs the conversion. The result is a grayscale image in the output buffer. We call IMediaSample::SetActualDataLength to set the length of the valid image data. (The allocated buffer might be larger than the actual image, so the downstream filter needs to know the actual size.)
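
A hedged sketch of Transform along these lines follows (Figure 13 has the article's actual code; the ProcessFrame signature used here, which returns the number of valid bytes, is our assumption):

HRESULT CYuvGray::Transform(IMediaSample *pSource, IMediaSample *pDest)
{
    // Did the video renderer attach a new media type, such as a stride change?
    // If so, update our output pin's format.
    AM_MEDIA_TYPE *pmt = NULL;
    if (pDest->GetMediaType(&pmt) == S_OK && pmt != NULL)
    {
        CMediaType mt(*pmt);
        m_pOutput->SetMediaType(&mt);
        DeleteMediaType(pmt);
    }

    // Get the addresses of the underlying buffers.
    BYTE *pBufferIn = NULL;
    BYTE *pBufferOut = NULL;
    HRESULT hr = pSource->GetPointer(&pBufferIn);
    if (FAILED(hr)) return hr;
    hr = pDest->GetPointer(&pBufferOut);
    if (FAILED(hr)) return hr;

    // Convert the frame. ProcessFrame is our private helper, described next.
    long cbBytes = 0;
    hr = ProcessFrame(pBufferIn, pBufferOut, &cbBytes);
    if (FAILED(hr)) return hr;

    // Tell the downstream filter how much of the buffer holds valid image data.
    return pDest->SetActualDataLength(cbBytes);
}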

ProcessFrame is a private method that we have defined in our filter (see Figure 14); it is not part of the DirectShow base classes. ProcessFrame steps across each row of pixels in the source image and calculates the output pixels. We set the color component to the neutral value (0x80) and leave the brightness value untouched.

Because the source and target images have the same dimensions, this loop is fairly easy to write, but we do need to consider some details about how video images work in Windows.

The first consideration is alignment. Every RGB DIB is aligned to a DWORD boundary. The same is true for YUV bitmaps if the bit depth is evenly divisible by 8. In that case, the biWidth value is given in pixels. In some YUV formats, the bit depth is something weird like 12, and then biWidth is given in bytes, not pixels.

Next is the target rectangle. The VIDEOINFOHEADER structure's rcTarget member defines a subrectangle within the buffer. The filter should draw the image within this rectangle. If the target rectangle is empty, that means the entire buffer should be used for the drawing operation. The video renderer uses rcTarget to indicate the stride of the buffer.

Orientation refers to the layout of the image in memory. An image is bottom-up if the first thing in the buffer is the bottom row of pixels, followed by the next row up, and so on. An image is top-down if the buffer starts at the top row and goes downward (see Figure 15). An uncompressed RGB image can be either bottom-up or top-down. The orientation is indicated by the sign of the biHeight member: bottom-up DIBs have a positive biHeight, and top-down DIBs have a negative biHeight. YUV images are always top-down, and biHeight should always be negative. However, some codecs may incorrectly set a positive biHeight—regardless, you should ignore the sign on a YUV image.


Figure 15: Image Orientation

Working with bitmaps can get especially confusing if the input format differs from the output format. When you mix top-down and bottom-up formats, it is easy to draw an upside-down image by mistake. And if you forget to account for the stride, the image might render as a bunch of diagonal lines.

The easiest way to handle bitmaps is to find the address of the first pixel in the top row, then calculate the stride, which is the offset from any given row to the next row. In a top-down bitmap, stride is a positive number. In a bottom-up bitmap, stride is negative. In our filter, we use a handy function named GetVideoInfoParameters that calculates these values, taking into account the alignment, target rectangle, and orientation. You can find the source code for this function in the code download (at the link at the top of this article), and you'll find it very useful if you do any video processing of the sort described in this article. The function lets you loop over the pixels in a consistent way and not worry about variations in formats.
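
As a simplified illustration of the inner loop, here is a hedged sketch that converts one UYVY frame to grayscale, assuming a top-down image, an empty target rectangle, and a plain DWORD-aligned stride (the real ProcessFrame in Figure 14 uses GetVideoInfoParameters to handle the general case):

void ConvertUyvyFrameToGray(const BYTE *pSrc, BYTE *pDst,
                            const VIDEOINFOHEADER *pVih)
{
    LONG width  = pVih->bmiHeader.biWidth;    // pixels per row
    LONG height = pVih->bmiHeader.biHeight;
    if (height < 0) height = -height;         // YUV frames are top-down

    // 16 bits per pixel, rows padded to a DWORD boundary.
    LONG stride = ((width * 2) + 3) & ~3;

    for (LONG y = 0; y < height; y++)
    {
        const BYTE *pSrcRow = pSrc + y * stride;
        BYTE *pDstRow = pDst + y * stride;

        // Bytes alternate chroma and luma: U Y V Y U Y V Y ...
        for (LONG x = 0; x < width * 2; x += 2)
        {
            pDstRow[x]     = 0x80;            // neutral chroma: no color
            pDstRow[x + 1] = pSrcRow[x + 1];  // keep the brightness
        }
    }
}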

COM Glue

At this point, you almost have a complete, working filter. The only thing left is to transform the filter into a proper COM object. Filters must support the usual COM plumbing, such as class factories and self-registration, but the base class library does almost everything for you. You only need to provide some information about your filter. First, create a new GUID for the filter's CLSID. (Do not reuse GUIDs from other components!) You can use the Guidgen or Uuidgen utility for this purpose:

static const GUID CLSID_YuvGray =
{ 0xa6512c9f, 0xa47b, 0x45ba, { 0xa0, 0x54, 0xd, 0xb0, 0xd4, 0xbb, 0x87, 0xf7 } };

Next, write the filter's constructor method. Our filter does not have any new variables to initialize, so we simply call the CTransformFilter constructor:

CYuvGray::CYuvGray(LPUNKNOWN pUnk, HRESULT *phr) :
    CTransformFilter(NAME("YUV Gray"), pUnk, CLSID_YuvGray)
{
}

In addition to the constructor, you must provide a static class method that creates a new instance of the filter. This method is used by the DirectShow class factory object's IClassFactory::CreateInstance method and must follow a particular syntax. The CreateInstance method will look about the same for every filter you create, so we won't go into details here. Consult the SDK documentation for more information:

CUnknown * WINAPI CYuvGray::CreateInstance(LPUNKNOWN pUnk, HRESULT *pHR)
{
    CYuvGray *pFilter = new CYuvGray(pUnk, pHR);
    if (pFilter == NULL)
    {
        *pHR = E_OUTOFMEMORY;
    }
    return pFilter;
}

The DirectShow class factory object also requires a template for the filter. This is basically just a series of nested structures that contain information to be written into the registry. The registry information is used by COM when you call CoCreateInstance and also by the FGM when it builds a graph (see Figure 16).
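
Figure 16 contains the article's version of this template; as a hedged sketch, the setup structures and factory table for a filter like this one typically look as follows (the pin details and the merit value shown here are our assumptions):

// Media type accepted and produced by the filter.
static const AMOVIESETUP_MEDIATYPE sudPinTypes =
{
    &MEDIATYPE_Video,        // major type
    &MEDIASUBTYPE_UYVY       // minor type
};

// One input pin and one output pin, each handling the media type above.
static const AMOVIESETUP_PIN sudPins[] =
{
    { L"Input",  FALSE, FALSE, FALSE, FALSE, &CLSID_NULL, NULL, 1, &sudPinTypes },
    { L"Output", FALSE, TRUE,  FALSE, FALSE, &CLSID_NULL, NULL, 1, &sudPinTypes }
};

static const AMOVIESETUP_FILTER sudYuvGray =
{
    &CLSID_YuvGray,          // filter CLSID
    L"YUV Gray Filter",      // filter name shown in GraphEdit
    MERIT_DO_NOT_USE,        // merit (assumed): how Intelligent Connect ranks the filter
    2,                       // number of pins
    sudPins                  // pin information
};

// The class factory uses this table to create instances of the filter.
CFactoryTemplate g_Templates[] =
{
    { L"YUV Gray Filter", &CLSID_YuvGray, CYuvGray::CreateInstance, NULL, &sudYuvGray }
};
int g_cTemplates = sizeof(g_Templates) / sizeof(g_Templates[0]);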

The last step is to implement the DllRegisterServer and DllUnregisterServer methods, which are needed by COM. If you are putting your filter into the generic DirectShow Filters category, you can simply call the AMovieDllRegisterServer2 function. If your filter belongs to any other category, such as video compression or audio compression, use the IFilterMapper2::RegisterFilter and UnregisterFilter methods, both fully documented in the SDK.

STDAPI DllRegisterServer(void)
{
    return AMovieDllRegisterServer2(TRUE);
}

STDAPI DllUnregisterServer()
{
    return AMovieDllRegisterServer2(FALSE);
}

The base classes implement IUnknown for you, including reference counting and QueryInterface. If you want to support any additional COM interfaces, you will need to override QueryInterface. The final step is to build and register the DLL. You can register it with the Regsvr32 utility.
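
For example, assuming the compiled filter DLL is named YuvGray.dll (the name is a placeholder), registration from a command prompt looks like this:

regsvr32 YuvGray.dll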

Watch the Filter in Action

Now you can watch your filter in action, using the GraphEdit utility provided with the DirectShow SDK. First, try using GraphEdit to render a file normally, without the filter:

  • Launch GraphEdit from the Start menu, under Programs | Microsoft DirectX 8.1 | DirectX Utilities.
  • From the File menu, choose Render Media File. Select a video file on your hard drive. You can use Butterfly.mpg, which is included with the SDK. It is located in the folder DXSDK\samples\Multimedia\Media.
  • Click Open. GraphEdit builds a file-playback graph for the file you selected.
  • Click the Run button. A window will pop up to display the video you selected.

Now try the same file using the YUV Gray filter:

  • From the File menu, choose New to create a new filter graph. (You'll be prompted by GraphEdit to save the old graph. Click No.)
  • From the Graph menu, choose Insert Filters.
  • Expand the DirectShow Filters list and select YUV Gray Filter.
  • Click Insert Filter, then click Close to dismiss the dialog. This adds the YUV Gray filter to the filter graph, which ensures that DirectShow will use this particular filter when it builds the playback graph.
  • From the File menu, choose Render Media File and select the same video file. If the decoder for that file supports UYVY, the YUV Gray filter should be connected to the graph automatically, between the decoder and the video renderer. If not, it might mean that the decoder does not support UYVY—clear the graph and try another file.
  • Click the Run button. The video should now appear in black and white.

Conclusion

So that's it. We've shown you how to create a basic playback application using DirectShow and how to create your own custom component to process the video stream. From here, it's a small step to creating live capture applications using Web cams, DV camcorders, TV tuners, or virtually any other type of video input device. And once you're capturing a video stream, it's just another small step to creating timeline-based sequences, 3-D graphics effects, and all the other cool applications that no one has even thought of yet.

For related articles see:
Getting Started with DirectShow
DirectShow Architecture
Microsoft Windows Media and DirectShow: Options for Application Developers
For background information see:
The DirectX node in the MSDN Online library, specifically Graphics and Multimedia/SDK Documentation/DirectX 8.1 (C++)/DirectShow
Michael Blome is a lead programmer/writer at Microsoft in the New Media Products Division. He has been a technical writer for over 12 years, and has been working on DirectShow since 1999. His hobbies include biking with his family and deciphering the manual for his DV camcorder.
Mike Wasson is a programmer/writer at Microsoft in the New Media Products Division, and has worked on DirectShow for three years, specializing in video capture and editing. Previously he worked as a copy editor for the magazine Hospital Practice. He claims to know how to hyphenate all bacteria names correctly.