Sequential Read Ahead

Balancing CPU and I/O throughput is essential to achieve good overall performance and to maximize hardware utilization.  SQL Server includes two asynchronous I/O mechanisms - sequential read ahead and random prefetching - that are designed to address this challenge.

To understand why asynchronous I/O is so important, consider the CPU to I/O performance gap.  The memory subsystem on a modern CPU can deliver data sequentially at roughly 5 Gbytes per second per socket (or for non-NUMA machines for all sockets sharing the same bus) and (depending on how you measure it) can fetch random memory locations at roughly 10 to 50 million accesses per second.  By comparison, a high end 15K SAS hard drive can read only 125 Mbytes per second sequentially and can perform only 200 random I/Os per second (IOPS).  Solid State Disks (SSDS) can reduce the gap between sequential and random I/O performance by eliminating the moving parts from the equation, but a performance gap remains.  In an effort to close this performance gap, it is not uncommon for servers to have a ratio of 10 or more drives for every CPU.  (It is also important to consider and balance the entire I/O subsystem including the number and type of disk controllers not just the drives themselves but that is not the focus of this post.)

Unfortunately, a single CPU issuing only synchronous I/Os can keep only one spindle active at a time.  For a single CPU to exploit the available bandwidth and IOPs of multiple spindles effectively the server must issue multiple I/Os asynchronously.  Thus, SQL Server includes the aforementioned read ahead and prefetching mechanisms.  In this post, I'll take a look at sequential read ahead.

When SQL Server performs a sequential scan of a large table, the storage engine initiates the read ahead mechanism to ensure that pages are in memory and ready to scan before they are needed by the query processor.  The read ahead mechanism tries to stay 500 pages ahead of the scan.  We can see the read ahead mechanism in action by checking the output of SET STATISTICS IO ON.  For example, I ran the following query on a 1GB scale factor TPC-H database.  The LINEITEM table has roughly 6 million rows.

SET STATISTICS IO ON

SELECT COUNT(*) FROM LINEITEM

Table 'LINEITEM'. Scan count 3, logical reads 22328, physical reads 3, read-ahead reads 20331, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Repeating the query a second time shows that the table is now cached in the buffer pool:

SELECT COUNT(*) FROM LINEITEM

Table 'LINEITEM'. Scan count 3, logical reads 22328, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

For sequential I/O performance, it is important to distinguish between allocation ordered and index ordered scans.  An allocation ordered scan tries to read pages in the order in which they are physically stored on disk while an index ordered scan reads pages according to the order in which the data on those index pages is sorted.  (Note that in many cases there are multiple levels of indirection such as RAID devices or SANS between the logical volumes that SQL Server sees and the physical disks.  Thus, even an allocation ordered scan may in fact not be truly optimally ordered.)  Although SQL Server tries to sort and read pages in allocation order even for an index ordered scan, an allocation ordered scan is generally going to be faster since pages are read in the order that they are written on disk with the minimal number of seeks.  Heaps have no inherent order and, thus, are always scanned in allocation order.  Indexes are scanned in allocation order only if the isolation level is read uncommitted (or the NOLOCK hint is used) and only if the query process does not request an ordered scan.  Defragmenting indexes can help to ensure that index ordered scans perform on par with allocation ordered scans.

In my next post, I'll take a look at random prefetching.

Comments

  • Anonymous
    September 25, 2008
    I hope the question I emailed you got you interested in blogging on this subject. Thanks for the reply btw.You said, "The read ahead mechanism tries to stay 500 pages ahead of the scan." Is SQL Server doing 500 pagesized asynchronous I/O's in this case? A single (500*pagesize) I/O? Or some other combination? It would seem that this should vary depending on whether you are doing OLTP (better response time) or OLAP (better throughput).

  • Anonymous
    September 26, 2008
    SQL Server tries to combine up to 32 contiguous pages (256 Kbytes) into a single (asynchronous) I/O.  So, in a best case scenario, it can read ahead 500 pages in just 16 I/Os.  However, if the pages in the table are not contiguous (e.g., due to fragmentation), SQL Server cannot combine the I/Os and must issue one I/O per page (8 Kbytes).

  • Anonymous
    October 07, 2008
    In my last post , I explained the importance of asynchronous I/O and described how SQL Server uses sequential

  • Anonymous
    October 29, 2008
    Craig,Few moments here I think need further elaboration.First - "The memory subsystem on a modern CPU can deliver data sequentially at roughly 5 Gbytes per second per core"There are NUMA and non-NUMA hardware systems. Non-NUMA share FSB for exclusive access to RAM, so it becomes 5 Gbytes perl ALL CPUs on board. For NUMA systesms - all cores on the same NUMA node share that 5Gbytes. Thus its not per core.Second is "Solid State Disks (SSDS) can reduce the gap between sequential and random I/O performance by eliminating the moving parts from the equation, but a performance gap remains." Do you mean performance = throughput here?Thank you, Serge

  • Anonymous
    October 30, 2008
    First, I should have written 5 Gbytes per second per SOCKET.  And, you are correct.  With a shared bus, this bandwidth does not scale.  I will clarify the post.  Thanks.Second, yes, I'm referring to the gap between memory and disk/SSD throughput.

  • Anonymous
    February 25, 2009
    In this post from last year, I discussed how random I/Os are slower than sequential I/Os (particularly

  • Anonymous
    August 25, 2009
    Your comment: "SQL Server tries to combine up to 32 contiguous pages (256 Kbytes) into a single (asynchronous) I/O." BOL ('Reading Pages'): "The read-ahead mechanism allows the Database Engine to read up to 64 contiguous pages (512KB) from one file." Is it 32 or 64 contiguous pages?

  • Anonymous
    August 25, 2009
    Your comment: "If the pages in the table are not contiguous (e.g., due to fragmentation), SQL Server cannot combine the I/Os and must issue one I/O per page (8 Kbytes)." 'Contiguous' loses its meaning, however, in a world based on RAID striping and proportional fill. Precisely what is meant by contiguous and fragmented? Are we only talking about page splits?

  • Anonymous
    August 31, 2009
    BOL is correct.  SQL Server does combine up to 64 contiguous pages into a single read.  However, it will combine only 32 contiguous pages into a single write.  Thanks for the correction.  I apologize for the error. From the perspective of SQL Server, the terms contiguous and fragmented refer to the logical layout within a database file.  SQL Server has no visibility to the physical layout.

  • Anonymous
    September 16, 2012
    How does SQL Server determine the page numbers of read-ahead pages when scanning an index in index-order? Each page only contains a pointer to the next page, not to the next 500 pages. How does SQL Server know which pages to prefetch? Great blog!

  • Anonymous
    September 17, 2012
    Hi Tobi, The non-leaf nodes of the B-tree (specifically those nodes one level above the leaf nodes) have pointers to and can be used to prefetch multiple pages at a time.  Of course, this optimization does work for heaps which do not have non-leaf nodes. HTH, Craig