How It Works: SQL Server Checkpoint (FlushCache) Outstanding I/O Target

I ran into an issue yesterday that brought up questions as to how the checkpoint's max outstanding, I/O target is determined.   There is different behavior across the various versions of SQL Server and it would take an entire white paper to fully explain the checkpoint process.  I am going to try to boil it down to the most relevant information.

I previously outlined some of the checkpoint behavior in the following white papers.

https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx
https://www.microsoft.com/technet/prodtechnol/sql/2005/iobasics.mspx

Checkpoint sweeps the hashed data buffers.  When it finds a dirty buffer it calls the routine (WriteMultiple).  WriteMultiple does hash lookups for adjacent, dirty pages.  The outstanding page count is incremented when a page write request is made.  When the write completes the value is decrement.

The I/Os are written asynchronously.  Checkpoint limits the number of I/Os it can post by comparing the number of outstanding requests with the target.   When the target is reached checkpoint waits.  Once an I/O completes Checkpoint can post another I/O.

Some high level pseudo-code helps.

Void FlushCache()
{

bufferPos = 0;
while(bufferPos < maxBuffers)
{

if(buf[bufferPos] is dirty)
{

pagesInMotion = WriteMultiple(buf[bufferPos]);
}

       SafeAdd(totalCheckpointPagesOutstanding, pagesInMotion);

while (totalCheckpointPagesOutstanding >= maxOutstandingPages)

{

  WaitFor..(5 milli-seconds)

}

bufferPos++;

}

}

Void CheckpointIoCompletion(….)
{

       Check success of WriteOperation

       SafeDecrement(totalCheckpointPagesOutstanding);

}

Max Outstanding I/O Target

Version Max Outstanding I/Os for Checkpoint
SQL Server 2000 SP3 100
SQL Server 2000 SP4 64 * Logical Schedulers
SQL Server 2005 *Formula

As I stated there are other parts to FlushCache.  For example SQL Server 2000 SP3 and SP4 attempt to maintain a rate that meets the configured recovery interval.  This may include additional sleeps.

See: KB 906121: Checkpoint resumes the behavior that it exhibited before you installed SQL Server 2000 SP3 when you enable trace flag 828 on a computer that is running SQL Server 2000 SP4
https://support.microsoft.com/kb/906121

Generally speaking the loop will continue without additional waits. This means that the I/O load during SQL Server 2000 checkpoints is held steady. As soon as one of the I/Os complete Checkpoint can issue a new request. 

The reason I specifically point this out is that I have seen systems where some of the I/Os become stalled.   Lets say the first 50 I/Os are stalled and the last 50 complete.  Checkpoint puts out 50 more I/O requests that can again complete.   This can make the performance counters for average disk sec/transfer and disk queue length more challenging to interpret.

*Formula (SQL 2005)

Checkpoint (FlushCache) was significantly upgraded for SQL Server 2005.  It can honor a target (see manual checkpoint) and various other parameters can affect Checkpoint.   I am going to describe one of these behaviors.

The starting point for max outstanding I/O level is loosely calculated as:

  1. Try to checkpoint entire SQL Server memory footprint in 300 seconds assuming 20ms I/O target.   Divide total SQL Server commit buffers to determine the number buffer rate.

  2. Divide outcome of by the number of logical schedulers

  3. If outcome of is less than 20 default to 20

  4. Multiply outcome by 10  

    Note: Portions of this calculation are used to determine max outstanding lazy writes.

Each time SQL Server 2005 checkpoint executes the calculated target establishes the max outstanding I/O level.   Depending on the type of checkpoint this value is adjusted as the I/Os complete. 

SQL Server 2005 checkpoint adds duration tracking to the I/O requests.  As each I/O completes the duration is used to adjust a running I/O average.  Here is some pseudo-code showing a very high level running average calculation.   A running average is used to accommodate the 'immediate' state of the I/O path so Checkpoint can react quickly to the current state and not the historical states.

Void CheckpointIoCompletion(….)
{

       Check success of WriteOperation

             duration = Current Time - I/O Start Time

             runningAvg = (duration + (runningAvg * 7)) / 8;

SafeDecrement(totalCheckpointPagesOutstanding);

}

Using this running average, Checkpoint adjusts the number of allowed max outstanding I/Os with a goal to keep I/O response time below 20ms.

 

Bob Dorr
SQL Server Senior Escalation Engineer