Basics of A/V synchronization in DirectShow

All filters in a DirectShow graph should be synchronized to the same clock, the reference clock. The filter graph manager selects one component to act as the reference clock, in the following simplified order: a user-specified clock, a renderer that exposes a clock (usually the audio renderer), or the system clock if neither of those is available.

The stream time is derived from the reference clock, but is relative to the time the graph last started running (so the stream time doesn't advance while the graph is paused). If a media sample entering a renderer has a time stamp t, it should be rendered at stream time t. This is the basic mechanism by which A/V synchronization occurs.
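As a rough illustration of that relationship, here is a minimal C++ sketch. It uses a plain 64-bit integer as a stand-in for REFERENCE_TIME (100-nanosecond units). StreamTime, TimeUntilDue, and kUnitsPerMs are names made up for this example, not DirectShow APIs, and runStart is assumed to be the start time the filter was given when the graph began running.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for DirectShow's REFERENCE_TIME (100-nanosecond units).
using RefTime = std::int64_t;

constexpr RefTime kUnitsPerMs = 10000;  // 100 ns units per millisecond

// Stream time while running: reference clock time minus the time at
// which the graph started running (hypothetical helper).
RefTime StreamTime(RefTime clockNow, RefTime runStart) {
    assert(clockNow >= runStart);
    return clockNow - runStart;
}

// How long a renderer should still wait before presenting a sample
// stamped sampleStart. A negative result means the sample is late.
RefTime TimeUntilDue(RefTime sampleStart, RefTime streamNow) {
    return sampleStart - streamNow;
}
```

With a clock reading of 5000 ms and a run start of 2000 ms, the stream time is 3000 ms, so a sample stamped 3040 ms is due in 40 ms and a sample stamped 2990 ms is already 10 ms late.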

Audio hardware usually has its own crystal, though, and there is no guarantee that the hardware timer will match the system clock. That is why the audio renderer is usually the reference clock for the whole DirectShow graph. If the audio renderer receives a sample late, or if the audio clock is consistently drifting from the system clock, then the audio renderer will issue stream time adjustments.

An audio renderer implementation will usually inherit from the CBaseReferenceClock class, and will call the SetTimeDelta() function whenever it needs to adjust the stream time. Note that it should apply a low-pass filter before sending adjustments to the master clock, so that no unnecessary jitter is introduced.
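One possible shape for such a low-pass filter is sketched below, purely as an illustration. It is an exponential moving average with a threshold. DriftSmoother, its parameters, and the chosen constants are all assumptions made for this example, not part of the DirectShow base classes. A real renderer would pass any non-zero result on to SetTimeDelta().

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

using RefTime = std::int64_t;  // stand-in for REFERENCE_TIME (100 ns units)

// Hypothetical helper: smooths raw clock-error measurements and only
// reports a correction once the smoothed error crosses a threshold.
class DriftSmoother {
public:
    // alphaShift = 3 means each measurement contributes 1/8 of its value.
    DriftSmoother(int alphaShift, RefTime threshold)
        : shift_(alphaShift), threshold_(threshold) {
        assert(alphaShift >= 0 && alphaShift < 32 && threshold >= 0);
    }

    // Feed one raw measurement (audio clock minus reference clock).
    // Returns the correction to apply now, or 0 if the smoothed error
    // is still below the threshold.
    RefTime Update(RefTime rawError) {
        smoothed_ += (rawError - smoothed_) / (RefTime{1} << shift_);
        if (std::abs(smoothed_) < threshold_) return 0;
        RefTime correction = smoothed_;
        smoothed_ = 0;  // the error has been handed to the clock
        return correction;
    }

private:
    int shift_;
    RefTime threshold_;
    RefTime smoothed_ = 0;
};
```

With a 1 ms threshold, a single 4 ms spike is absorbed without any adjustment, while the same 4 ms error repeated over several measurements eventually produces a correction.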

The video renderer uses the incoming timestamps to schedule samples for presentation, its scheduler is based on stream time, and the audio renderer is able to adjust the stream time. As a result, the video and audio renderers share the same timeline.

About the Video Renderer & Frame Dropping

If the video is running slow and every video frame is being rendered, then in theory the video renderer will receive samples with timestamps in the past and schedule them for immediate rendering. If this keeps happening, the video will fall behind the audio. This is why frame dropping is needed.
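To make that concrete, here is a small toy model, not taken from any DirectShow code. It assumes a fixed frame interval, a fixed decode-plus-render cost per frame, and a renderer that shows late frames immediately. LatenessWithoutDropping is a hypothetical name, and the time unit is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using RefTime = std::int64_t;  // any consistent time unit

// Returns how late each frame is shown when no frame is ever dropped.
std::vector<RefTime> LatenessWithoutDropping(int frameCount,
                                             RefTime frameInterval,
                                             RefTime costPerFrame) {
    assert(frameCount >= 0 && frameInterval > 0 && costPerFrame >= 0);
    std::vector<RefTime> lateness;
    RefTime pipelineFree = 0;  // when the pipeline can start the next frame
    for (int i = 0; i < frameCount; ++i) {
        RefTime due = i * frameInterval;               // frame timestamp
        RefTime ready = pipelineFree + costPerFrame;   // frame is decoded
        RefTime shown = std::max(due, ready);          // late => show now
        lateness.push_back(shown - due);
        pipelineFree = ready;
    }
    return lateness;
}
```

With a 40-unit frame interval and a 50-unit cost per frame, the lateness in this model grows by 10 units on every frame, so the video never catches up unless some frames are skipped.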

In fact, audio and video synchronization in DirectShow works by a combination of two elements:

  • The audio renderer controlling the DirectShow stream time;
  • The IQualityControl and IDMOQualityControl interfaces guiding the frame-dropping algorithm.

Dropping frames at the video renderer level is, of course, not very effective. With overlay flipping surfaces, for instance, dropping a frame does little to help you catch up, because the flip itself is very cheap. Even with blits it helps very little, since rendering time is small compared to decoding time. That is why the renderer's state and lateness need to be reported to upstream filters/DMOs, which is done through quality notification messages.

The video renderer originates the notification messages (since it is the filter that needs to run in real time) and sends them upstream. If the upstream filter is a decoder and it can handle the message, it does not pass it further upstream. If it cannot handle the message, it passes it on upstream. Note that the video renderer will drop frames anyway if it is very late.

Here's a coarse example of how a decoder filter can use the quality interface to drop frames:

HRESULT CDecoderFilterPin::Notify(IBaseFilter *pSender, Quality q)
{
    if (quality sink has been set) // m_pQSink
    {
        status = Pass Notify on to the quality sink (the sender is now the decoder filter)
    }
    else
    {
        if (has frame dropping algorithm)
        {
            status = Call decoder filter to do frame dropping
        }
        else
        {
            if (upstreamQualityControl)
            {
                status = Pass Notify() on to upstream quality control interface (the sender is now the decoder filter)
            }
            else
            {
                status = not handled;
            }
        }
    }
    return status
}
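The same routing order can be modeled in a few lines of portable C++ without any DirectShow headers. This is only a toy: QualityTarget, QualityMsg, Status, and DecoderPin are hypothetical stand-ins invented for the example, not the real IQualityControl interface or Quality structure, and "doing frame dropping" is reduced to counting requests.

```cpp
#include <cstdint>

using RefTime = std::int64_t;  // stand-in for REFERENCE_TIME

struct QualityMsg {            // simplified stand-in for Quality
    RefTime Late;
    RefTime TimeStamp;
};

enum class Status { Handled, NotHandled };

struct QualityTarget {         // hypothetical stand-in for IQualityControl
    virtual ~QualityTarget() = default;
    virtual Status Notify(const QualityMsg& q) = 0;
};

struct DecoderPin : QualityTarget {
    QualityTarget* sink = nullptr;      // plays the role of m_pQSink
    QualityTarget* upstream = nullptr;  // upstream quality control, if any
    bool canDropFrames = false;         // decoder has a dropping algorithm
    int dropRequests = 0;               // how many messages it handled

    Status Notify(const QualityMsg& q) override {
        if (sink) return sink->Notify(q);          // sink takes priority
        if (canDropFrames) {                       // handle it locally
            ++dropRequests;
            return Status::Handled;
        }
        if (upstream) return upstream->Notify(q);  // pass it on upstream
        return Status::NotHandled;
    }
};
```

Chaining a pin that cannot drop frames to an upstream pin that can shows the message stopping at the first component able to act on it.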

Given the current time (reported through the IQualityControl/IDMOQualityControl interface), the decoder needs to decide whether or not to drop frames. Frame-dropping algorithms range from the extremely simple to the very complicated. An example of an extremely simple one follows:

The quality notification message will indicate how late we were when we last rendered a video frame. A very simple algorithm would be to drop all frames until you arrive at that "time", so that catch-up happens fast:

CatchupPTS = q.Late + q.TimeStamp

Drop all frames until PTS_frame >= CatchupPTS
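A minimal C++ sketch of this rule might look like the following. QualityInfo only mirrors the two fields of the Quality structure used here, and CatchUpDropper with its method names is hypothetical. Ignoring messages whose Late value is not positive is an assumption of the sketch, not something the rule above requires.

```cpp
#include <cstdint>

using RefTime = std::int64_t;  // stand-in for REFERENCE_TIME (100 ns units)

// Mirrors only the fields of the Quality structure used by this rule.
struct QualityInfo {
    RefTime Late;       // how late the last rendered frame was
    RefTime TimeStamp;  // time stamp of that frame
};

// Hypothetical helper implementing "drop until we reach the catch-up time".
class CatchUpDropper {
public:
    // Called when a quality message arrives from the renderer.
    void OnQuality(const QualityInfo& q) {
        if (q.Late > 0) {
            catchUpPts_ = q.Late + q.TimeStamp;
        }
    }

    // Called for each frame before decoding: true means skip this frame.
    bool ShouldDrop(RefTime framePts) const {
        return framePts < catchUpPts_;
    }

private:
    RefTime catchUpPts_ = 0;  // nothing is dropped until a late report
};
```

For example, after a report that the frame stamped 1000 was 300 units late, frames stamped before 1300 are dropped and decoding resumes with the first frame at or after 1300.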

Of course, there are many variations of this. If B-frames are available, start trying to catch up by dropping B-frames. If the decoder has decoding decoupled from output generation (or color conversion, etc.), a first step when you're late is to skip the output generation and try to catch up that way. If neither works, then you may have to drop P-frames, in which case you'll have to wait until the next I-frame. That is never a good user experience, because the spacing between I-frames may be large.
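One way to arrange those escalation steps is sketched below, as an illustration only. The two lateness thresholds, the TieredDropper class, and the FrameType and Action enums are assumptions made for this example. It also assumes that no other frame depends on a B-frame, which is not true for every codec.

```cpp
#include <cassert>
#include <cstdint>

using RefTime = std::int64_t;  // stand-in for REFERENCE_TIME (100 ns units)

enum class FrameType { I, P, B };
enum class Action { DecodeAndOutput, DecodeOnly, Drop };

// Hypothetical policy: drop B-frames first, then skip output generation,
// and only as a last resort drop P-frames and wait for the next I-frame.
class TieredDropper {
public:
    TieredDropper(RefTime moderateLate, RefTime severeLate)
        : moderate_(moderateLate), severe_(severeLate) {
        assert(moderateLate > 0 && severeLate >= moderateLate);
    }

    Action Decide(FrameType type, RefTime late) {
        if (type == FrameType::I) waitingForKeyFrame_ = false;
        if (waitingForKeyFrame_) return Action::Drop;  // references missing
        if (late <= 0) return Action::DecodeAndOutput; // on time
        if (type == FrameType::B) return Action::Drop; // cheapest to lose
        if (late < moderate_) return Action::DecodeAndOutput;
        if (late < severe_) return Action::DecodeOnly; // keep as reference
        if (type == FrameType::P) {
            waitingForKeyFrame_ = true;                // resync at next I
            return Action::Drop;
        }
        return Action::DecodeOnly;                     // severely late I
    }

private:
    RefTime moderate_;
    RefTime severe_;
    bool waitingForKeyFrame_ = false;
};
```

The thresholds would need tuning per decoder. The point of the sketch is only the ordering of the steps, from the least visible to the most visible.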

When doing trick modes, the algorithm for dropping frames will usually be more forgiving, since frames are arriving at rates other than 1.0. In fact, at high rates, all samples that the decoder receives are probably going to be key frames anyway. In this case, you may drop any frame without a significant penalty, which makes catching up much easier.