A faster CHECKDB – Part II

Note: Validation for this post was performed in the SQL Server Customer Advisory Team (SQL CAT) Customer Lab on an HP ProLiant DL385 G7 (overview | quickspec), AMD Opteron 6176 SE 2.3GHz dual socket (12 cores per socket) with 128GB RAM, using HP StorageWorks P2000 G3 MSA Array Systems (4 shelves, 10TB raw, 8Gb/s Fibre Channel) (overview | quickspec).

In December of last year, I announced some changes made to improve the performance of DBCC CHECKDB. To be perfectly transparent, I wrote only a very short account of these changes because the Cumulative Update had just shipped and I was going to be out of the office for the rest of the year. Now, a month or so into the new year of 2012, I wanted to provide more of the history and details behind these changes. As a quick reference, the fix for this problem can be found in this article.

Last year I was invited to spend a week with our SQL product executives in Japan visiting some of our top and prospective customers. Having a passion and charter for support, I naturally looked for opportunities to see the problems customers are having with the SQL product. I found out quickly from some of our biggest customers in Japan that the performance of CHECKDB was an issue. As I said in the previous blog post, as part of this trip I met Cameron Gardiner, who told me how prevalent a problem this had become for some of his largest accounts running SQL Server with SAP.

So after coming back to the US, I sought a solution. One that wouldn’t require a re-architecture of CHECKDB. Something we could do that was targeted but would provide value. One thing Cameron had pointed out to me was that it did not appear as though CHECKDB was driving the disk hard enough. He used an old trick to find this: backing up a database to the NUL device. How do you do this? Try out this syntax:

BACKUP DATABASE <db> TO DISK = 'NUL'

Some people know about this already; I’ve seen postings on the web from customers who have used it to test disk read performance. WARNING: Do not run this command on a production server without using the WITH COPY_ONLY option. Otherwise, it can affect your differential backups.
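For reference, here is a minimal sketch of the safer form of this test (the database name is just a placeholder):

-- Read-throughput test: read all allocated pages and throw the backup away.
-- COPY_ONLY keeps this test from disturbing the differential backup chain.
BACKUP DATABASE [YourDatabase]
TO DISK = 'NUL'
WITH COPY_ONLY;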

So what does this do? The BACKUP code in the engine is designed to do two things: 1) read all the allocated pages for the database from the database files and 2) write them to a target media. A filename of ‘NUL’ is an old DOS trick (for those of you reading this who don’t know what DOS is, I’m sorry, my age must be showing). The Windows OS effectively recognizes this as a special name that means “throw it away”. So by using this as our backup write target, we can test the performance of reading from our database files on disk, since no time is spent writing. BACKUP reads the files differently (and buffers them differently) than the code in the engine that reads database pages for queries (i.e. read-ahead). So, as we will see, it is not quite a 100% fair comparison to what CHECKDB could do. But it is a close enough approximation of the best read throughput we could possibly get from our database files.

What Cameron saw in his testing was that he could achieve on a common system about 50MB/sec (using the perfmon counter PhysicalDisk: Disk Read Bytes/sec) on each drive where his SAP SQL database was stored (in his test he had 8 drives with two files on each drive). But when he ran DBCC CHECKDB WITH PHYSICAL_ONLY, he could only get at best about 20MB/sec on each drive. Another observation he made was that when he monitored PhysicalDisk: Avg Disk Bytes/Read, it would show several instances of 8KB reads. He asked me why we would be doing so many single-page reads. So from his perspective it appeared as though CHECKDB wasn’t reading the files as fast as BACKUP, or as efficiently.

When he first told me this, I explained that CHECKDB has different purposes and different work to do than BACKUP, so it is not a fair comparison. But he told me that at no point during the CHECKDB run did we ever get close to the 50MB/sec rate for each drive. He would have expected that even with some of the “overhead” CHECKDB incurs to perform consistency checks, at some point it would get close to BACKUP. It was hard to argue this point. So he and I went down a path of doing more tests on SAP SQL test databases he had available in Microsoft labs. I realized after some testing that there was an opportunity for improvement. But I knew I needed help. So along come some heroes in this story.
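For reference, the CHECKDB side of this comparison was simply the PHYSICAL_ONLY form of the command, run while watching the PhysicalDisk counters in Performance Monitor (the database name is just a placeholder):

-- Physical-only consistency check: reads every allocated page and validates
-- page-level consistency (checksums, torn pages) without the full logical checks.
DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;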

First, I needed to secure lab resources for very large databases and compute resources. So along came my first hero, Jimmy May. Jimmy is a Senior Program Manager in the Customer Programs team (i.e. SQLCAT). He is my go-to person when it comes to large lab resources such as this. But it is not just Jimmy. Throughout this entire process, Steven Schneider worked tirelessly to keep providing me the hardware resources I needed to make this happen. Without Jimmy and Steven, I’m not sure I would have ever made it through this.

With Cameron’s help getting an actual SAP SQL database and the right hardware resources, I was ready to investigate and tackle the problem. For so many years, Paul Randal and then Ryan Stonecipher were my “go-to” SQL developers for investigating CHECKDB issues. When I contacted Ryan about it, he told me another developer owned CHECKDB now. So my next hero (and the real hero who made this happen) came into the story: Andrew Cherry. Andrew is a developer working within the SQL Engine. He and I started a dialogue on the problem and what could be done about it.

We went through several different passes at what could be a possible bottleneck keeping CHECKDB from driving the disk harder. The first things Andrew told me about our work were that 1) we could really only optimize this work for the WITH PHYSICAL_ONLY option and 2) CHECKDB was never designed to push the hardware resources to their limit. In fact, CHECKDB was designed to ensure it didn’t consume too many resources, so it would not affect the performance of the overall system. One case in point is the behavior of “batching”. Instead of trying to process all objects in the database at one time, CHECKDB processes them in “batches” to avoid using too much tempdb space. Andrew noticed that due to the sheer number of IAM pages in the SAP SQL database we were using, the batch concept was slowing down performance. All the IAM pages were in cache from the “CHECKALLOC phase” at the beginning, but as we started processing each “batch”, some IAM pages were getting thrown out of cache. So later batches would have to perform very slow single 8KB page reads to bring them back into cache. This explained the behavior Cameron saw for 8KB reads. So we felt one change we could make was to give the customer an option to use only “one batch” (and in fact that is one of the changes in the eventual trace flag 2562).

We stopped at this point and made sure we agreed on our goals and boundaries:

  • We could not make changes so large as to risk the correctness and architecture of CHECKDB
  • We would only target WITH PHYSICAL_ONLY even though we believed it could help the “FULL” CHECKDB scenario.
  • Our changes had to be acceptable to fit into a Cumulative Update (i.e. no “big” changes)
  • We would try to push hardware resources harder
  • Therefore, we would use trace flags as options for these changes as we didn’t want the default to be “push the hardware resources harder”
  • We would try to get as close as possible to the “BACKUP ‘NUL’” test for the portion of CHECKDB that reads in all the pages from disk. We know we cannot achieve that same performance but let’s strive to get as close as possible.

The results of the “object batching” work certainly helped, but we both felt it was not enough. We now saw the Avg Disk Bytes/Read counter stay steady around 64KB, which told us that standard read-ahead algorithms were kicking in and the single-page read problem was gone. But it was apparent that we still could not come close to the Disk Read Bytes/sec number that BACKUP was achieving. Andrew went off to investigate what could be causing this and whether we could do something about it.

After some period of time, we had a meeting and he came back with his findings. He observed that during the time CHECKDB reads pages, there was what he thought was an unusually high number of non-page latch waits, surfaced as what are called DBCC_MULTIOBJECT_SCANNER latches. This latch is used for several purposes, but one of them is to protect an internal bitmap of all pages that need to be read. When CHECKDB “runs in parallel” we need to make sure we coordinate which pages to read across all the threads. We noticed that during the time we read pages, these non-page latch waits were in the tens of thousands. We didn’t know if this was needed, but it was certainly an area to investigate. What Andrew found is that we were not as efficient as we could be in using this latch. Therefore, he created a concept of “page batching” where all the worker threads doing the reading for CHECKDB could read more pages at a time and hold the latch less frequently and for less time.

The result was a big breakthrough. Our non-page latch count went down to the hundreds. Our SQL Server: Buffer Manager – Readahead pages/sec went up. And most importantly, our Disk Read Bytes/sec was very close to BACKUP. The overall time to run CHECKDB WITH PHYSICAL_ONLY is not the same as BACKUP because CHECKDB has other work it must perform, namely CHECKALLOC, building a “fact table”, actually checking the page consistency, and of course building up and presenting results. Furthermore, BACKUP is designed to read large chunks of the database file and buffer them in its own memory, while CHECKDB uses standard BPool read-ahead algorithms. Trying to change that would have violated the goals and boundaries we established up front.
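If you are curious about this latch activity on your own system, here is a minimal sketch of one way to look at it (snapshot the DMV before and after a CHECKDB run and compare the deltas; the values are cumulative since the instance started):

-- Cumulative wait statistics for the latch class CHECKDB uses to coordinate
-- which pages its parallel worker threads read next.
SELECT latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_latch_stats
WHERE latch_class = 'DBCC_MULTIOBJECT_SCANNER';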

The overall result of this work on our test SAP SQL database was very good, to the point where I felt we had the necessary changes that would be of value to customers. We started with a 4.3TB SAP SQL database on which CHECKDB WITH PHYSICAL_ONLY took well over 13 hours. With Andrew’s “object batching” and “page batching” changes we were now down to 5 hours and 57 minutes (BTW, BACKUP to ‘NUL’ for this database took 4 hours and 35 minutes).

I was very satisfied with these results, so I felt we were ready to get these changes to customers. Before just moving them into the Cumulative Update cycle, we wanted some “live” customer testing. So we asked Cameron Gardiner to have some of his largest customers test a private build with these changes. He chose one of his customers with a 14TB SAP SQL database. The result was staggering. Before our private build, CHECKDB WITH PHYSICAL_ONLY took over a day. With our change, it ran in 8 hours!

With these tests completed, we moved into the cycle of getting this pushed into Cumulative Updates for SQL Server 2008 R2 and SQL Server 2008. The result of this work was a SQL Server 2008 R2 CU release back in December of 2011. The SQL Server 2008 CU updates will be in the next release due here in March of 2012. In addition, Andrew baked the “page batching” changes into SQL Server 2012 as the default, leaving trace flag 2562 to enable “object batching” (since that results in larger tempdb usage).

There is no question we optimized and tested these changes for the “large” (1TB+) SQL Server user. But I think you might find them useful for smaller databases too. My caution to you is that we push the hardware resources much harder, so these changes are more for users who run CHECKDB “offline” (for example, on a backup server).
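If you do want to try this on an appropriate system, here is a minimal sketch of enabling the single-batch behavior for a run (the database name is just a placeholder; see the article referenced above for the full details of the trace flag):

-- Enable the "one batch" behavior described earlier for this instance.
DBCC TRACEON (2562, -1);

-- Physical-only check of the database.
DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;

-- Turn the trace flag back off when the check is done.
DBCC TRACEOFF (2562, -1);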

There are also some factors that can affect whether these changes actually make your execution of CHECKDB faster:

  • maxdop for CHECKDB – The more worker threads CHECKDB can use, the faster we will read the pages. Therefore, machines where SQL Server uses more cores can expect better performance (see the sketch after this list).
  • Optimized I/O – We push the disk harder for reads, but if you have a slow disk or a bottleneck in the I/O system, these changes won’t help you much.
  • Memory – We use the SQL buffer pool to read pages. The less memory available to read them into, the slower the performance. This, however, is not as significant a factor as maxdop and I/O.
  • Tempdb performance – CHECKDB has to use tempdb for fact tables and aggregation of results. If tempdb I/O is slow, it can also slow down the performance of CHECKDB.
  • Spreading files across disks – As with the general performance of server queries, if we are spreading out our reads across separate physical disks for files, we can achieve better I/O performance.
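
On the maxdop point above: CHECKDB has no MAXDOP option of its own in these releases, so its degree of parallelism follows the instance-wide setting (or a Resource Governor limit). A quick sketch of checking that setting:

-- "max degree of parallelism" is an advanced option, so surface it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Display the current instance-wide setting (0 means use all available schedulers).
EXEC sp_configure 'max degree of parallelism';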

This was a complete team effort at Microsoft to make these changes happen. From SQLCAT, to our labs, to developers, and our release services team. Many people contributed to what may seem like a small improvement, but one that will help customers with very large Tier-1 databases.

Bob Ward
Microsoft