Paul S. Randal
QI've noticed some very strange behavior on one of our SQL Server 2005 databases. We run a DBCC CHECKDB on the database as part of our nightly database maintenance and some nights the DBCC returns corruption errors. The weird thing is that we don't see page checksums failures during the previous day and if I manually run DBCC CHECKDB the morning after the maintenance job failure, the corruptions have disappeared. Can you explain what's going on? This has been happening for about a month and I'm concerned that I can't trust DBCC CHECKDB.
AIt's good to see that you have DBCC CHECKDB as part of your regular maintenance and that page checksums are enabled. The behavior you describe can be very disconcerting when it occurs—it almost looks like DBCC CHECKDB is giving incorrect results. But this scenario does occur every so often—there are no run-time indications of corruption, DBCC CHECKDB finds corruption, and then many hours later the corruptions are gone when DBCC CHECKDB is run again. Let's take this step by step.
First, it's very common for corruptions to be reported by DBCC CHECKDB that have not been reported during regular database operations. Although page checksums are a great method of detecting corruptions caused by the I/O subsystem, they are effective only when the data pages are read by SQL Server after a corruption has occurred.
Imagine that a data page has a page checksum applied to it and it is then corrupted at some later point by the I/O subsystem. Even though the page checksum will detect the corruption, it is only when the page is read into memory by SQL Server that the page checksum is validated and the corruption is discovered. If that page is never read by SQL Server, the corruption will never be discovered. This is why it is vital to enable page checksums and regularly perform consistency checks—consistency checks will read all pages in the database, validating their page checksums and discovering corruption as early as possible.
In your case, it looks as if the corruptions have occurred in data pages that were not read as part of regular database operations and so the corruptions were not discovered until DBCC CHECKDB read the corrupt pages. Although it seems as if the page checksums didn't detect the corruption as they should, that's not the case.
Second, "disappearing" corruptions can occur quite easily between successive executions of DBCC CHECKDB, but only on databases where changes occur between the two DBCC runs. Say a page is indeed corrupt and DBCC CHECKDB reports it as being corrupt. Now say the page is then deallocated from a table (for example, because it becomes empty). A subsequent DBCC CHECKDB will not read it this time and thus will not report it as being corrupt. DBCC CHECKDB reads only allocated pages (meaning pages that are currently in use). In your case, I'm guessing that the rest of the nightly maintenance job includes index rebuilds or reorganizing—both of which can drastically alter the set of pages that are allocated to a particular object or index. This would account for the behavior you're seeing. An index rebuild deallocates the corrupt pages as a side effect and then the next DBCC CHECKDB comes back clean.
To catch the corrupt pages, alter your maintenance job so that it stops if the DBCC CHECKDB step fails. This way, you'll be able to manually verify the corruptions and take further action.
QWe're planning to upgrade from SQL Server 2000 directly to SQL Server 2008, skipping SQL server 2005. But I'm worried about tempdb. This already caused problems for us in SQL Server 2000, as we have lots of short-lived temp tables. From what I understand, tempdb is used a lot more heavily now (in both SQL Server 2005 and SQL Server 2008) and so special provisioning has to be done to prevent performance dropping after an upgrade. Can you explain why this is the case and what we should do?
AFrom your question, I'm guessing that you've had to implement the one-file-per-processor tempdb architecture that is commonly required when many short-lived temporary tables are created by multiple database connections. And you might have even had to enable the 1118 trace-flag (see "Concurrency enhancements for the tempdb database" for more information on this).
Although there is the potential for heavier tempdb use in SQL Server 2005 onward, you might not see this unless you specifically use features that rely on tempdb. Note that if you take a solution that experienced tempdb performance problems with SQL Server 2000 and upgrade it to SQL Server 2008 on the same hardware, you should in many cases see the performance problems alleviated somewhat because of some tempdb-specific changes to the SQL Server Storage Engine in SQL Server 2005.
The features that do make heavy use of tempdb on SQL Server 2005 and SQL Server 2008 are:
- Online index operations
- DML triggers
- Multiple-active resultsets ( MARS)
- Snapshot isolation (both transaction-level and statement-level)
All four of these features use an underlying technology called versioning to store various point-in-time versions of data records. Simply put, these record versions are stored in the version store in tempdb, with all user databases sharing the same version store in the single tempdb. The more you use these features, the heavier the use of the version store and hence tempdb will be and the more potential there will be for a performance impact.
The key to any successful upgrade is to implement a representative production load on a test system with the new schema and measure the performance to avoid surprises you might encounter if you were to go straight into production.
Unfortunately, this discussion is way beyond the scope of this column to cover in-depth, but there are some excellent resources available that I recommend you check-out:
- The white paper "Working with tempdb in SQL Server 2005"
- In-depth blog posts on the Storage Engine team blog
- Various SQL Server 2008 Books Online topics, starting with "Capacity Planning for tempdb"
- The general SQL Server upgrade page
QWe're implementing a nightly database maintenance plan that includes improving index performance. I've heard that setting the "fill factor" option for indexes can completely remove the need for maintaining indexes. Is this true? It seems that some indexes in our database suffer from fragmentation and some don't. Should we set a default fill factor for the database that will apply to all indexes and if so, what value should we use?
AThe fill factor setting can indeed be used to partially mitigate the need for index maintenance, but rarely can it be used to completely remove the need. In a nutshell, the fill factor setting instructs the Storage Engine to leave a certain percentage of free space in pages of clustered and nonclustered indexes when they are created or rebuilt. ( Note that the fill factor setting is not maintained during regular insert/update/delete operations.) A fill factor of 90, for instance, leaves 10% free space. Fill factors of 0 or 100 both leave no free space (this has been the source of much confusion).
The idea is that space is left in the pages, which allows records on the page to expand or new records to be inserted on the page without causing an expensive, fragmentation-causing operation called a page split. You specify a percentage of free space so the pages can become more steadily full until the next index maintenance operation occurs, which resets the fill factor again. The trick is to choose a percentage that minimizes page splits between index maintenance operations.
For an OLTP (online transaction processing) database, there's no easy answer except to choose a fill factor for each index based on trial and error. For data warehouses, where the indexes don't change, the fill factor should be 100% (meaning no free space is left on the pages). It is pretty uncommon that the default fill factor for a database is changed from the default of 100%, as the best fill factors for various indexes are usually different. The SQL Server 2008 Books Online topic "Fill Factor" has a lot more information on this.
One other option is to change the index so that page splits do not occur. This might involve changing the index key so that inserts are not random (for instance, by not using a random GUID primary key) or disallowing operations that change the size of variable-length columns.
QWe're going to be moving to SQL Server 2008 as soon as SP1 comes out and one of the features we're looking forward to using is FILESTREAM, as it removes the 2GB size limit for column values. Before we start designing the next version of our schema to use the FILESTREAM datatype, are there any drawbacks or problems we should be aware of that could lead to issues in production?
AIt's always a smart idea to determine all the characteristics of a new feature before designing it into a new schema or application, especially one that utilizes technologies outside of SQL Server itself, as is the case with FILESTREAM. Most of the information you need is contained in a white paper I wrote for the SQL Server team called "FILESTREAM Storage in SQL Server 2008." I highly recommend you read that paper for a complete discussion, but I will summarize the major areas of concern here.
First, the FILESTREAM data is stored in the NTFS file-system, rather than inside SQL Server data files. There are various configuration steps that must be taken to ensure that NTFS performs well with very large numbers of files in a single directory, such as turning off 8.3 name generation, setting the NTFS cluster size appropriately, and possibly separating the FILESTREAM data onto separate physical disks from other data.
Second, you should ensure that the average size of data stored using FILESTREAM will be 1MB or higher. Research has shown that for data sizes of less than 256KB, and sometimes between 256KB and 1MB, better performance can be gained by storing the data directly inside SQL Server rather than using a mechanism such as FILESTREAM.
Third, you should consider the operations that will be performed on the FILESTREAM data. Partial updates are not supported for FILESTREAM data, so even updating a single byte of a 200MB data value will result in an entirely new 200MB value being created. Apart from being an expensive operation, this can lead to NTFS-level fragmentation, which can further reduce performance. If partial updates are commonplace in the application, some kind of batching mechanism may be required to avoid repeated updates to a single FILESTREAM value.
Last, you should consider the cross-feature compatibility of FILESTREAM with high-availability technologies. FILESTREAM fully supports backup and restore operations (including point-in-time recovery), log shipping, and replication. It is not, however, compatible with database mirroring in any way in SQL Server 2008. ( I've been told this will be fixed in the next version of SQL Server.)
This is just a taste of the things to consider. For the full picture, you should read the white paper. As with any new feature, though, before designing an application around it, be sure to do some extensive testing to see whether its capabilities match your requirements. Given that FILESTREAM also encompasses NTFS storage, I'd also do preproduction performance and disaster-recovery testing to ensure that nothing trips you up when you go live.
Paul S. Randal is the Managing Director of SQLskills.com and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. Paul wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Paul is an expert on disaster recovery, high availability, and database maintenance, and is a regular presenter at conferences around the world. He blogs at SQLskills.com/blogs/paul.