SQL Server 2000 I/O Basics

Author: Bob Dorr, Microsoft SQL Server Escalation

Note: You can also read more about I/O and SQL Server 2005 in "SQL Server I/O Basics Chapter 2".

SUMMARY: Learn the I/O requirements for Microsoft SQL Server database file operations. This will help you increase system performance and avoid I/O environment errors.

On This Page

Introduction
Terms
Microsoft SQL Server I/O Design
Microsoft SQL Server Core I/O Requirements
Problems, Pitfalls, and Samples
Utilities
Conclusion
Other References

Introduction

Because new devices and storage solutions continue to appear in the marketplace, the environments in which Microsoft® SQL Server™ runs have become extremely diverse. To ensure data integrity for our customers who are running SQL Server, it is important that I/O environments provide appropriate functionality.

The purpose of this paper is to explain the Input/Output (I/O) requirements for SQL Server database file operations so that vendors and customers can evaluate and adjust their environments to meet the needs of SQL Server.

Important When planning, deploying, and maintaining a Microsoft SQL Server installation, ensure that the I/O system supports all the factors outlined in this article.

Both Microsoft Knowledge Base and Microsoft SQL Server Books Online (BOL) contain many details related to SQL Server I/O operations. Links to information that is important to your understanding of the material covered in this paper are included where appropriate.

Important It is highly recommended that you fully review the referenced material before continuing with subsequent sections of this paper.

Books Online (BOL) References: All BOL references cited in this paper are taken from the Microsoft SQL Server 2000 Books Online site (updated to include SP3 functionality), which can be found at:

https://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

Click the Product Documentation link. Then follow the instructions to either download the BOL or browse online.

Terms

This section defines common terms that are used both in this paper and in documents that are referenced in this paper.

ACID Properties

The ACID (Atomicity, Consistency, Isolation, and Durability) properties are a core requirement for SQL Server and other transactional, reliable database products.

Atomicity

A transaction must be an atomic unit of work; either all of its data modifications are performed or none of them are performed.

Consistency

When completed, a transaction must leave all data in a consistent state. In a relational database, all rules must be applied to the transaction's modifications to maintain data integrity. All internal data structures, such as B-tree indexes or doubly linked lists, must be correct at the end of the transaction.

Isolation

Modifications made by concurrent transactions must be isolated from modifications made by all other concurrent transactions. A transaction either sees the data in the state it was in before another concurrent transaction modified it, or it sees the data after the second transaction has completed, but it does not see an intermediate state. This is referred to as serializability because it provides the system with the capability to reload the starting data and replay a series of transactions to end up with the data in the same state it was in after the original transactions were performed.

Durability

After a transaction has completed, its effects are permanently in place in the system. The modifications persist even in the event of a system failure.

Write-Ahead Logging (WAL) Protocol

Write-Ahead Logging is a key technique for providing the ACID properties. Briefly, WAL requires that all the transaction log records associated with a particular data page be flushed to stable media before the data page itself can be flushed to stable media. WAL will be described in more detail in this paper.

Point in Time

A time interval that can be captured as if time were frozen in that moment.

Stable Media

Stable media is often confused with physical storage. SQL Server defines stable media as storage that can survive system restart or common failure. Stable media is commonly physical disk storage, but other devices and certain caching facilities qualify as well.

Many high-end disk subsystems provide high-speed cache facilities to reduce the latency of read and write operations. This cache is often supported by a battery-powered backup facility. The battery backup provides the necessary power to maintain the data in the cache for several days, but implementations vary by manufacturer. Manufacturers can switch out batteries to increase the life of the cache when necessary.

The key is that after the system problem has been corrected, the pending writes in the cache are treated as if the failure or restart never occurred. Most manufacturers' implementations immediately flush pending writes to physical disk during the restart operations.

The following are examples of successful recovery situations actually encountered by Microsoft SQL Server Product Support Services personnel.

Example 1: Hardware failure (CPU board)

Battery backup on the caching controller maintained the data. A new computer was built and the controller and I/O subsystem were attached to the new computer. On restart, the caching controller flushed all cached writes to disk and DBCCs were clean.

Example 2: Power supply failure

Battery backup on the caching controller maintained the data for four days (with battery swaps) until the power supply could be properly updated. On restart, the caching controller flushed all cached writes to disk and DBCCs were clean.

Important Always consult with your hardware manufacturer for proper stable media strategies.

If you have a hardware or power failure, Microsoft strongly recommends the execution of a full DBCC CHECKDB suite and acquisition of the necessary backups to ensure data integrity.

For detailed information about the use of caching controllers and SQL Server, see the following articles on the Microsoft Support Web site.

Write Ordering

Write ordering (or write dependency) is the capability of the I/O subsystem to preserve the order of I/O operations. As described previously, stable media can include caching. If a point in time is studied, the stable media should reveal preserved I/O order characteristics.

The order of the I/O operations associated with SQL Server must be maintained. The system must maintain write ordering or it breaks the WAL protocol as described in this paper. (The log records must be written out in correct order and the log records must always be written to stable media before the data pages that the log records represent are written.) After a transaction log record is successfully flushed, the associated data page can then be flushed as well. If the subsystem allows the data page to reach stable media before the log record does, data integrity is breached.

For example, if a computer running SQL Server had to restart after the data page had reached stable media but before the log record, the database recovery may fail. Because the log record for a page modification does not exist, the recovery process cannot determine the proper transactional state of the page. Worse, the log record did not get flushed to stable media so the recovery process is not aware that the page requires rollback and it cannot attempt to correct the problem, thereby leaving the database in an unknown state.

Multichannel and Load-Balancing Systems

Many high-end media environments implement load-balanced systems that can have multiple channels to support I/O requests. These systems must maintain support for I/O ordering. Many of these systems support I/O ordering with a stable media cache and subsequently combine and/or split I/O requests across available subsystem resources to complete the storing to physical media.

For specific information on I/O tuning and balancing, you will find more details in the following document.

Torn I/O

Torn I/O is often referred to as a torn page in SQL Server documentation. A torn I/O occurs when a partial write takes place, leaving the data in an invalid state. SQL Server 2000/7.0 data pages are 8 KB in size. A torn data page for SQL Server occurs when only a portion of the 8 KB is correctly written to or retrieved from stable media.

SQL Server always checks I/O completion status for any operating system error conditions and proper data transfer size and then handles errors appropriately. Torn pages generally arise after system outages where the subsystem does not complete the entire 8-KB I/O request.

Disk drive manufacturers generally limit data transfer guarantees to sector boundaries of 512 bytes, so if the stable media is a physical disk and the controller does not contain a cache with battery backup capabilities, the I/O request is limited to the final spin/transfer rate of the physical drive. Thus, if an I/O is issued to write 8 KB (for a total of sixteen 512-byte sectors), but only the first three sectors make it to stable media, then the page becomes torn resulting in data corruption. A subsequent read of the 8-KB page would bring in 3 sectors of the new version of the page and 13 sectors of an older version.

SQL Server can enable torn page detection on a per-database basis. A portion of the first 512-byte sector of the page contains the page header. When torn page detection is enabled, the header member contains information about each of the other 512-byte sectors or segments of the 8-KB page. When the page is read in, the torn page information is used to detect a torn page situation.

Torn page detection incurs minimal overhead and is a recommended practice for SQL Server installations.

To read more about torn page detection, see “Torn Page Detection” in SQL Server Books Online.

Log Parity

Hardware manufacturers guarantee sector-size writes so SQL Server 2000 transaction log files are always written with sector-size alignment. Each sector of the transaction log contains a parity flag. This flag can be used to determine the last sector that was correctly written.

During a recovery operation, the log file(s) are scanned for the final sector that was written; the log records can then be used to return the database to the appropriate transactional state.

Mirroring and Remote Mirroring

Mirroring is a common data redundancy and emergency recovery practice. Mirroring can be implemented at a software or hardware level. Mirroring installations have historically been physical implementations of locally attached hardware cabinets. Advances in remote mirroring have made it possible to maintain mirrors across long distances.

Several types of mirroring implementations are available on the market. Some implementations are cache-based; others ensure that the I/O takes place at all mirrored locations before the I/O request is considered complete. Whatever the implementation, write ordering must be maintained.

SQL Server considers a mirror to be a stable-media, point-in-time copy of the primary data. Point in time is an important aspect of this. Strict maintenance of the WAL requirements must occur on the mirrored subsystem to maintain the ACID properties of the data. The mirrored subsystem must mimic the exact points in time as experienced in the primary data.

For example, many high-end installations contain multiple I/O storage devices. If database log files are placed on one mirror set and data files on another mirror set, write ordering cannot be directly maintained across the separate hardware components. Without extended capabilities, the log and data page write order on the mirror devices cannot be maintained with point-in-time capabilities. Extended mirror capabilities are needed to ensure that write ordering can be maintained across multiple physical mirrored devices. These are often referred to as Mirror Groups or Consistency Groups.

The SQL Server 2000 High Availability Series expounds upon remote mirroring. For more information, see the following document.

Forced Unit Access (FUA)

Forced Unit Access (FUA) occurs when a file is opened (by using CreateFile) with the FILE_FLAG_WRITETHROUGH flag. SQL Server opens all database and log files with this flag.

The flag indicates that on any write request the FUA bit should be sent with the request to the subsystem. This bit indicates to the subsystem that data must reach stable media before the I/O is to be considered complete and the operating system signaled that the I/O is complete. No intermediate cache should be used that is not considered stable media. In other words, the write should go directly to the stable media, a process that is called writethrough.

This prevents problems that occur when a cache (such as an operating system cache that is not battery backed) accepts the I/O and informs SQL Server that the I/O is complete when in fact it has not yet been stored to stable media. Without this capability, SQL Server could not rely on the system to support WAL protocol needs.

Check with your manufacturer to determine their FUA handling. Some hardware treats the FUA status as an indication that the data must be stored on physical disk (even if a battery-backed cache is available). Others consider a battery-backed cache to be stable media and FUA-compliant. Different implementations can have profound effects on SQL Server performance. A subsystem that supports a stable media cache allows write operations to complete faster than a subsystem that requires that the data be written to the physical media.

Important IDE disk specifications and implementations do not have clear standards for how the FUA request is handled. SCSI disk specifications and implementations use the FUA request to disable physical disk caches and other caching mechanisms. In many IDE implementations, the FUA request is simply discarded by the IDE hardware, thereby making this type of subsystem unsafe to use with SQL Server or with any other product that relies on the FUA behavior. Because of the need to honor the FUA setting, some IDE drive manufacturers have utilities that allow IDE drive caches to be disabled, making them safe for use with SQL Server.

Important Some versions of Microsoft Windows® do not always propagate the FUA bit to the hardware. Corrections have been made starting with Windows 2000 Service Pack 3 (SP3) to always propagate the FUA bit. For backward compatibility, Microsoft has produced a utility to control the operating system behavior.

Use of the utility to disable FUA behavior for an SQL Server installation should be limited to stable media caches.

This utility is located on the Microsoft Web site. See the following:

Data Pages

The SQL Server database page size is 8 KB. Each page contains a header with fields such as Page Number, Object Id, LSN, Index Id, Torn bits, and Types. The actual row data is located on the remaining portion of the page. The internal database structures track the allocation state of the data pages in the database.

Data pages are also referred to as pages.

Page Number

A page number is a value from 0 through ((Max File Size/8 KB)-1). The page number multiplied by 8 KB provides the offset in the file to the first byte in the page.

When a page is read from disk, the page number is immediately checked to ensure that the proper offset was returned (the page number in the header is compared to the expected page number). If this is not the case, SQL Server will generate Error 823.

Object ID

This is the ID of the object to which the page is assigned within the schema of the database. A page can be assigned to only a single object. When the page is read from disk, the object ID is checked on the page. If the object ID does not match the expected object ID, SQL Server will generate Error 605.

SQL Server often performs its writes on page-sized, 8-KB, or larger boundaries.

Extents

SQL Server generally (except for nonmixed extents) allocates space an extent at a time. An extent is eight 8-KB pages, or 64 KB. SQL Server often performs reads in extent-sized (64 KB or 128 KB) boundaries as well.

Buffer Pool

The buffer pool is also referred to as the BPool. The BPool consumes the majority of the user mode address space leaving only a few 100 MB of the virtual address range free for thread stacks, DLLs, and other activities. The buffer pool is reserved in large chunks but the working page size of 8 KB is maintained to match the database page size.

Hardware Read Cache

The hardware read cache is commonly a read-ahead cache used by the controllers. Depending on the size of the available cache, the read-ahead cache is used to retrieve more data than the actual read request may have asked for.

The hardware read cache and read-ahead will be helpful to an application whose data is commonly of a contiguous nature and is retrieved in a reasonably contiguous manner, such as OLAP scans or a reporting application.

Because the hardware read cache removes some of the cache memory that could be used to support write requests, it can have a negative effect on transactional installations (OLTP) that require that data be written at high rates.

Important Some controllers will not do read-ahead if the size of the read request is larger than 16 KB. If the primary application is Microsoft SQL Server, the hardware read-ahead activity does not provide any benefit because I/O read requests can occur for sizes larger than 16 KB. Check with your hardware vendor for complete details and recommendations when running SQL Server.

Hardware Write Cache

The hardware write cache not only caches write requests but also fulfills read requests if the data is still available in the hardware write cache. This is a commonly used I/O caching mechanism.

Hardware write cache capabilities can be critical in maintaining OLTP performance goals. With the proper battery backup and algorithms, the hardware write cache can secure the data safely (on stable media) as well as increase the speed of an application like SQL Server by hiding much of the true physical I/O time.

Error 823

SQL Server error 832, "I/O error <error> detected during <operation> at offset <offset> in file '<file>'" occurs when:

  • A ReadFile, WriteFile, ReadFileScatter, WriteFileGather, or GetOverlappedResult operation results in any operating system error code.

  • The page number on the page read from disk is not the expected page ID.

  • The data transfer size is not valid.

  • A torn read is detected when torn page detection is enabled.

  • A stale read is detected when stale read detection is enabled.

For extended details on the 823 error, see the Microsoft Web site topic:

Error 605

SQL Server error 605, "Attempt to fetch logical page (x:yyy) in database 'dddd' belongs to object 'aaa', not to object 'tttt'." occurs when:

  • The object ID on the page does not match the object ID that is expected on the page.

SQL Server error 605 is detected each time a page is retrieved for a scan. The scan is associated with a specific object. If the scan ID does not match the object ID that is stored on the page, the error is encountered. This occurs the first time the page is used and may occur during subsequent in-memory lookups of the page.

Microsoft SQL Server I/O Design

To fully understand SQL Server I/O design, it is important to understand the basics of database file I/O operations, transaction log I/O operations, and transactional consistency maintenance.

The I/O and transactional design of SQL Server ensures that the ACID properties are fully maintained. This section focuses on the Durability property and its associations with the operating system and hardware environments.

Write-Ahead Logging (WAL) Protocol

A key to ACID property maintenance is the WAL protocol. The WAL protocol requires that all transaction log records associated with a particular data page be flushed to stable media before the data page can be flushed to stable media.

Microsoft SQL Server 2000 and Microsoft SQL Server 7.0 use 8-KB data pages and sector-aligned transaction log buffers. Earlier versions of SQL Server use 2-KB data and log pages.

The following example is a code example taken from the following Microsoft Knowledge Base article. See the article for more details.

Consider the following example that demonstrates how SQL Server upholds the WAL protocol for in INSERT statement. For this example, assume that there is no index and the page that is affected is page 150.

BEGIN TRANSACTION
   INSERT INTO tblTest VALUES (1)
COMMIT TRANSACTION

Statement

Actions performed

BEGIN TRANSACTION

A BeginTran log record is written to the log cache but there is no need to flush to stable media because SQL Server has not made any physical changes.

INSERT INTO tblTest

  1. Page 150 is not currently present in the SQL Server cache so data page 150 is retrieved into the SQL Server data cache.

  2. Appropriate locks are obtained and the page is latched.

  3. An Insert log record is built and added to the log cache.

  4. A new row is added to the data page and the page is marked dirty.

  5. The latch is released.

  6. The log records associated with the transaction do not need to be flushed at this point because all changes remain in volatile storage.

COMMIT TRANSACTION

 7.   A Commit log record is generated. The log records associated with the transaction (and all previous log records) must be written to stable storage. The transaction is not considered committed until the log records are correctly flushed to stable media. (Log is hardened.)

 8.   Data page 150 remains in the SQL Server data cache and is not immediately flushed to stable media. Flushing the data page to stable media is not necessary because after the log records are properly secured, it is possible for recovery to redo the operation based on those log records.

 9.   Transactional locks are released and the batch is considered complete.

Locking and latching are separate issues when maintaining the WAL protocol. Locking maintains the transactional integrity of the data while latching maintains the physical integrity of the data. In the previous example, SQL Server 7.0 and SQL Server 2000 hold the latch on page 150 for only the time necessary to perform the physical changes on the page, not the entire time of the transaction. The appropriate lock type is established to protect the row, range, page, or table as necessary.

For more details on lock types, see the sections that cover locking in Microsoft SQL Server Books Online.

Looking at the previous WAL example in more detail, you might ask what happens when the lazy writer process or the checkpoint process runs before the COMMIT takes place but after the page has been modified.

Statement

Actions performed

BEGIN TRANSACTION

A BeginTran log record is written to the log cache area but there is no need to flush to stable media because SQL Server has not made any physical changes.

INSERT INTO tblTest

  1. Page 150 is not currently present in the SQL Server data cache so data it is retrieved into the data cache.

  2. Appropriate locks are obtained and the page is latched.

  3. An Insert log record is built and added to the log cache.

  4. A new row is added to the data page and the page is marked dirty.

  5. The latch is released.

  6. The log records associated with the transaction do not need to be flushed at this point because all changes remain in volatile storage.

Lazy writer or checkpoint locates page 150 in the buffer pool

Page 150 is currently marked as dirty so both of the processes are aware that the database page should be flushed to stable storage.

  1. Latch page 150 to prevent further modifications.

  2. Issue a request to log manager to flush all log records up to and including the LSN value stored on page 150. (Log is hardened.)

  3. Wait for all log records to be flushed successfully to stable media.

Issue I/O request to flush page 150 to stable media.

Note that the example does not require a commit. If the page is dirty, the log records can be flushed and subsequently the page as the WAL protocol specifies. The locks protect the commit view of the data and if a rollback occurs, the compensating recovery activities will restore the page to the proper state. If this did not work in this way, SQL Server would be limited to a modification that was no larger than physical memory.

The lazy writer and checkpoint processes both issue all appropriate flushes to stable media for transaction log records associated with a dirty page. This maintains the WAL protocol rule that specifies that a data page can never be written to stable media until the associated transaction log records have been flushed.

To read more about the transaction log and write-ahead logging see the topic, “Write-Ahead Transaction Log” in SQL Server Books Online.

Additional information is located on the Microsoft Web site in the following document:

Log Sequence Number

The log sequence number (LSN) value is a three-part, uniquely incrementing value. It is used for maintaining the sequence of the transaction log records in the database. This allows SQL Server to maintain the ACID properties and to perform appropriate recovery actions.

When a modification occurs, a log record is generated with a new LSN value. The same LSN value is stored (replaces the previous LSN value in the page header) in the data page header so the last log record and matching data page can be associated with each other.

To learn more about the transaction log architecture, see “Transaction Log Logical Architecture” in SQL Server Books Online.

Latching

SQL Server uses latches to provide data synchronization. A latch is a user-mode reader-writer lock implemented by SQL Server. Each data page in memory has a buffer (BUF) tracking structure. The BUF structure contains status information (Dirty, On LRU, In I/O) as well as a latch structure.

Locking maintains the appropriate lock activity; latching controls physical access. For example, it is possible for a lock to be held on a page that is not in memory. The latch is only appropriate when the data page is in memory (associated with a BUF).

The following illustration shows a high-level view of the SQL Server 2000 buffer pool.

Figure 1

Figure 1

Scatter-Gather

Starting with Microsoft SQL Server 7.0, the Microsoft Win32 APIs WriteFileGather and ReadFileScatter are used. The WriteFileGather function gathers data from multiple discontinuous buffers and writes the data to a file. The ReadFileScatter function reads data from a file and distributes the data into multiple discontinuous buffers.

These APIs allow SQL Server to avoid multiple physical I/O requests. For example, during the checkpoint process up to sixteen 8-KB pages can be flushed with a single WriteFileGather invocation. Before using WriteFileGather, SQL Server had to issue an I/O request for each data page or it would have to sort and buffer a larger request itself.

Important Scatter-gather capabilities are hardware specific. When the hardware does not support scatter-gather capabilities, the operating system must intervene and issue separate I/O requests. To maximize Microsoft SQL Server I/O performance, ensure that your subsystem natively supports scatter-gather I/O operations.

To learn more about SQL Server’s use of scatter-gather to improve performance, see the following document.

Transaction Log I/O—WriteFile

For transaction log operations, SQL Server uses WriteFile instead of WriteFileGather. WriteFileGather requires I/O in operating system, page size boundaries. This requirement would mean that each log record would be at least 4 KB. WriteFile is used instead to allow smaller writes on sector boundaries.

Asynchronous I/O

All SQL Server transaction log and database file I/O is performed by using an OVERLAPPED structure, which facilitates the use of asynchronous I/O. The SQL Server buffer pool and file manager contain sophisticated I/O facilities. Appropriate reader/writer latches are used to maintain data integrity during asynchronous I/O operations.

SQL Server uses the Win32 API calls as follows.

API

Common Usage

CreateFile

Used to create and open database and log files. The flags FILE_FLAG_OVERLAPPED, FILE_FLAG_WRITETHROUGH, and FILE_FLAG_NO_BUFFERING are specified to avoid nonstable media caching.

WriteFile

Primarily used by the log manager and backup manager to handle I/Os.

ReadFile

Primarily used by the log manager and backup manager to handle I/Os.

WriteFileGather

Primarily used by the buffer pool to write page groups (up to sixteen 8-KB pages in a group).

ReadFileScatter

Primarily used by the buffer pool to read pages into the buffer pool. Can be used for single page requests as well as read-ahead requests. Read-ahead requests are generally 128 pages for each group but can be as many as 1,024 pages when running Microsoft SQL Server Enterprise Edition.

HasOverlappedIoCompleted

Used to determine the status of I/O requests.

GetOverlappedResults

Used to determine success of the I/O requests.

Note Sort and spool operations share some mechanisms with the SQL Server buffer pool and file manager to perform necessary I/O operations.

Also, note that SQL Server does not always service the completion of the I/O on the same worker that posted the I/O request. Completion of an I/O in SQL Server is done by a yielding worker on the same User Mode Scheduler (UMS) in which the I/O was posted. Mechanisms in SQL Server establish callback routines that are invoked when the I/O completes. The callback is an SQL Server-specific mechanism and not alert-based like the ReadFileEx or WriteFileEx functions.

For example, if a data page read completes, the callback routine will check for an operating system return code of zero (GetLastError value), ensure that the bytes transferred are correct, check for torn page errors, ensure that the page number is correct, and perform other sanity checks.

Flushing a Data Page To Disk

Three primary mechanisms trigger the flush of a data page to disk. However, each mechanism uses the same internal routine in the buffer pool to achieve the transfer.

  • Lazy write (LRU and memory-pressure based)

  • Checkpoint (recovery-interval based)

  • Eager write (nonlogged I/O based)

To efficiently flush writes to disk, WriteFileGather is used. This allows SQL Server to bundle consecutive dirty pages into a single write request.

SQL Server uses the following steps to flush a single page.

  1. Latch the page to prevent further modifications.

  2. Ensure that log records up to and including the LSN on the page are flushed to stable media.

  3. Establish proper entries for the WriteFileGather invocation.

SQL Server uses the following steps to set up another page for flushing and repeats for up to 16 total pages inclusive of the first page.

  1. Do a hash lookup for the next contiguous page. For example, if the page to be flushed were page 100, SQL Server searches for page 101 in the buffer hash array.

  2. If the page is not located, then the end of contiguous I/O block is established and the I/O is posted.

  3. If the page is located, acquire latch to prevent further modifications if the page may be dirty.

  4. Check to ensure that the page is dirty and needs to be written. If not, release the latch and consider the end of contiguous I/O block as established and submit the asynchronous I/O request.

  5. If dirty, follow the steps preceding these that tell you how to flush a single page.

After the set of pages to be flushed is determined, the WriteFileGather function is invoked to post (Async / OVERLAPPED) the I/O request with the associated callback function to complete the I/O operation.

When SQL Server determines that HasOverlappedIoCompleted returns TRUE, GetOverlappedResults is used to gather completion information from the system and the callback function is invoked. The callback makes an appropriate determination as to the success of the I/O operation and releases the latch on each page.

Lazy Writer

The lazy writer for SQL Server 2000 and SQL Server 7.0 attempts to locate up to 16 unique pages for each sweep to return to the free list. If the page reference count has dropped to zero, the page is eligible for return to the free list. If the page is marked dirty, the log records and page date will be flushed.

Thus, it is possible for the lazy writer to flush 16 * 16 pages to disk in a single sweep. This is efficient because many of the pages will remain in the SQL Server buffer pool but are now in a clean state. I/O is done in the background from the primary SPID (server process ID). When the lazy writer needs additional buffers for the free list, the buffers may not need to be flushed to disk but are unhashed and returned to the free list.

Checkpoint

The SQL Server 2000 checkpoint process periodically sweeps the buffer pool for buffers that contain pages from a specified database and flushes all dirty buffers to stable media. This makes recovery shorter because roll forward operations have less physical work to complete.

As described earlier, the checkpoint process uses the same I/O approach to post up to 16 pages in a single I/O. Because the I/O is posted (OVERLAPPED), checkpoint does not wait for each I/O request to complete immediately. Checkpoint continues to track posted and completed I/Os but attempts to maintain a high level of outstanding I/Os (for example 100 write requests continuously outstanding). This maximizes I/O throughput and reduces checkpoint runtime.

Before the introduction of WriteFileGather, SQL Server sorted the buffers for a specified database in page order and issued I/O requests in page order. This required many physical I/O requests because the page order of the flush is not in contiguous memory location order. However, it often kept the physical subsystem mechanisms in physical locations that are in close proximity to make the I/O requests complete faster.

With the older design, an elevator seek can be a problem. Issuing many I/Os in page order generally results in a similar "on disk" order. By pushing the subsystem hard with many I/Os located at similar locations, the disk drive would service these I/Os before I/Os that might have been outstanding longer.

With WriteFileGather, SQL Server can sweep the buffer pool without requiring any physical ordering relationship to the pages on disk. By gathering 128 KB (sixteen 8-KB pages) in a group, SQL Server is able to transfer blocks of data with much fewer physical I/O requests. This allows the checkpoint process to maintain its speed while the random nature of the I/O requests deters any elevator seeking that could affect other I/O operations.

All databases, except for tempdb are checkpointed. Tempdb does not require recovery (it is recreated every time SQL Server starts) so flushing data pages to disk is not optimal for tempdb and SQL Server avoids doing so.

Checkpoint protects the system from I/O flooding by serializing checkpoint processes. Only one checkpoint at a time can run to completion. The checkpoint and lazy writer processes also communicate with each other to control I/O queue depths.

Eager Write

Microsoft SQL Server 2000 uses eager writes to post data pages associated with nonlogged operations (commonly bulk insert/select into). This allows the async capabilities of the I/O patterns to keep the dirty pages flowing to disk without dirtying large and unwanted portions of the buffer pool. The same mechanism as that used by the lazy writer and checkpoint processes to post I/O operations is used.

Microsoft SQL Server 7.0 does not implement the eager write actions but instead issues a checkpoint at commit-transaction time to flush all buffers for the database. This can cause multiple nonlogged operations to be serialized because only a single checkpoint can be active.

Important Lazy writer, checkpoint, and eager write do not wait for the I/O to immediately complete. They always post the I/O with WriteFileGather with the OVERLAPPED option and continue with other work, checking for I/O completion success at a later point in time. This allows SQL Server to maximize both CPU and I/O resources for the appropriate tasks.

Shutdown and Recovery Interval

An orderly shutdown action performs a checkpoint of all databases, closes out all internal database tracking structures, and exits the SQL Server process.

The recovery interval controls the checkpoint target. When the interval is elongated, more dirty pages are allowed in memory before checkpoint is triggered.

The current version of Microsoft Windows generally requires a successful, orderly shutdown of SQL Server in 60 to 120 seconds.

Clustering Implementations

Clustering implementations can be affected by the shutdown processing.

  • If this interval is exceeded for an instance in a failover cluster, the cluster processing may engage in a forced termination of the SQL Server resources and the system will engage in appropriate failover activities as well.

  • The SQL Server cluster resource marks the status of SQL Server as failed. In doing so under a "move group" scenario, the cluster process takes over and forces the resources to be physically controlled by the subsequent node.

  • Adjustment of the recovery interval can result in checkpoint requiring a lengthy time to complete, exceeding the provided interval, and resulting in aggressive failover activities. Aggressive failover scenarios do not pose any specific dangers.

  • Microsoft strongly recommends maintaining the default recovery interval to ensure optimal recovery metrics and to function cleanly within the current cluster resource constraints. If the recovery interval is extended beyond 60 seconds, it is possible that clustered instances will be forced offline in an aggressive manner. While the aggressive actions are not unreasonable, they are undesirable. If the recovery process seems to be having significant negative affects on transactional throughput, Microsoft recommends that the database layout, physical file placement, and physical I/O channel be adjusted instead of the recovery interval to facilitate throughput needs.

Recovery Interval Adjustments

Several specific conditions (side effects) can arise when the recovery interval is adjusted. Weigh these carefully before adjusting the recovery interval.

Dirty Page Latency – A page is considered dirty when data modifications have taken place. A dirty page cannot be removed from the SQL Server buffer pool until the associated log records have been written and the page itself written to stable media. Increasing the checkpoint interval (by increasing the recovery interval) on a busy system moves the pressure of handling dirty pages to the lazy writer code line. This can result in overall performance degradation because the lazy writer is not designed to perform checkpoint-like activities.

The lazy writer does perform proper activity on the dirty pages to ensure data integrity and free list maintenance but, unlike the checkpoint process, it is not designed to remove the dirty page I/O latency. Checkpoints allow dirty pages to be written more aggressively. Leaving the checkpointing actions to the lazy writer introduces latency because the lazy writer is forced to perform I/O to age a buffer instead of simple, in-memory operations to maintain the free list(s). If you have adjusted the recovery interval, you should watch the lazy writer performance counter(s) activity closely.

Longer Checkpoint – Increasing the recovery interval can result in more in-memory dirty pages. The more pages that are dirty, the more pages checkpoint will have to flush to disk during its sweep of the SQL Server buffer pool. Longer checkpoint duration is not a problem but exposes the potential for longer recovery intervals and drives the disk harder for a longer interval.

Longer Recovery – Increasing the recovery interval can result in increased recovery time. If SQL Server experiences an unorderly shutdown (the process terminates unexpectedly or a power outage occurs, for example), on startup, the recovery manager is responsible for ensuring that the database returns to the proper transactional state. If checkpoint is running less frequently, resulting in more active dirty pages, a nonorderly shutdown requires that the recovery manager perform more roll-forward and possibly rollback operations to return the database to the correct and consistent transactional state. A key factor affecting the performance of this work is the fact that the SQL Server buffer pool is cold (there are no pages in memory) at the time of recovery. Recovery has to read in the appropriate database pages and make modifications. This can add recovery latency, which is undesirable in a production environment. The increased recovery interval time is often counterproductive to the overall goal of maximized up time.

For example, assume that under a constant load, checkpoint needs to flush 250 MB of dirty buffers every minute. Based on the checkpoint algorithm, if you extend the recovery interval to 10 minutes, the amount of data to be flushed will be 2,500 MB if all other factors remain constant. If the checkpoint duration is longer, it drives the disk for a more pronounced interval but individual pages are handled in the same way, ensuring that overall concurrency is maintained. The 2,500 MB of dirty pages require a significant amount of more work from recovery as well.

Recovery Model Considerations

Microsoft SQL Server 2000 introduced updated database recovery models (the Full, Bulk-Logged, and Simple recovery models). The application workload and recovery models directly affect the types of I/O patterns. The application, hardware solution, and recovery model that maximizes recovery business requirements should be selected.

Flushing a Log Record To Disk

Log records are flushed to disk in much the same manner as data pages. The log manager is responsible for all writes for all transaction log records for all databases. You can select from sysprocesses to see the log manager SPID assignment.

When a request to flush all log records up to a certain LSN is requested by any worker on the system, the request is queued to the log manager. The worker then waits for a response from the log manager specifying that the I/O has completed successfully. The log manager retrieves from the queue and formats the request; it then posts the I/O on sector-aligned boundaries.

The I/O is posted with WriteFile using the OVERLAPPED (async) mechanisms. The log manager can then return to service other queued requests. When the I/O is completed, the completion routine is run to check the success of the write. If the write is successful, the waiting workers can be signaled to continue their operations.

Write ordering is critical at this stage. Because multiple log write requests can be posted for the same transaction log, the LSN order must be maintained.

For example, pages 50, 100, and 200 are modified by separate transactions. Page 50 was modified first, then 100, and then 200. LSN flush requests occurred for page 50, 100, and 200 and were put in motion in the same order. If the log record(s) for page 50 and 200 are flushed to stable media, only the flush to LSN for page 50 is considered done and SQL Server may only flush page 50. The LSN 100 must be flushed to stable media before 100 and then 200 can be considered flushed to the LSN. (Log is hardened.)

Write ordering is a key to data integrity and SQL Server design.

Read-Ahead

SQL Server 2000 uses ReadFileScatter to perform read-ahead operations. SQL Server uses sophisticated algorithms to retrieve data pages that are going to be used in the immediate future.

For example, if you run a query that can use an index to determine applicable rows, a read-ahead may occur on the actual data pages that are needed to complete the select list. As index entries are identified, SQL Server can post OVERLAPPED (async) I/O operations for the data pages that will be used in upcoming steps of the query plan. This is how a query using a bookmark lookup operator uses read-ahead.

This example is just one of many read-ahead situations that SQL Server can use. Allowing the index searching to continue while the data page I/O is in progress maximizes the CPU and I/O of the system. The I/O is often completed by the time it is needed so other steps in the plan have direct memory access to the needed data and do not have to stall while waiting on I/O.

When a read-ahead is posted, it can be from 1 to 1,024 pages. SQL Server limits a single read-ahead request depth to 128 pages on most editions. However, Microsoft SQL Server Enterprise Edition raises the limit to 1,024 pages.

SQL Server uses the following steps to set up read-ahead.

  1. Obtain the requested amount of buffers from the free list.

  2. For each page:

    1. Determine the in-memory status of the page by doing a hash search.

    2. If found to be already in memory, set up the read-ahead request to immediately return the buffer to the free list upon I/O completion.

    3. Establish the proper I/O request information for ReadFileScatter invocation.

    4. Acquire I/O latch to protect buffer from further access.

    5. If the page is not found in hash search then insert it into the hash table.

  3. Issue the ReadFileScatter operation to read the data.

When the I/O operation is complete, each page is sanity checked for a valid page number and torn page errors. In addition, various other data integrity and safety checks are performed. The I/O latch is then released so the page is available for use if it is located on the hash chain. If the page was determined to be already in memory, the page is immediately discarded to the free list.

This process shows the key factors of SQL Server I/O patterns. Read-ahead goes after pages that can already be in memory or not allocated. Because SQL Server maintains the in-memory buffers and hash chains, SQL Server tracks the page’s state. Importantly, read-ahead processing opens the door for overlapping read and write requests at the hardware level.

If a page is already in memory when the read-ahead request is posted, the contiguous read is still needed and is faster than breaking up read requests into multiple physical requests. SQL Server considers the read to be unusable for the page in question, but many of the pages around it may be usable. However, if a write operation is in progress when the read is posted, the subsystem has to determine which image of the read to return. Some implementations return the current version of the page before the write is complete; others make the read wait until the write completes; and yet others return a combination, showing partially new data and partially old data. The key is that SQL Server will discard the read as unusable but the subsystem needs to maintain the proper image for subsequent read operations. The in-progress write, when completed, must be the next read image returned to the server running SQL Server.

Do not confuse read-ahead with parallel query plans. Read-ahead occurs independently of the parallel query plan selection. The parallel plan may drive I/O harder because multiple workers are driving the load, but read-ahead occurs for serial and parallel plans. To ensure that parallel workers do not work on the same data sets, SQL Server implements the parallel page supplier to help segment the data requests.

SQL Server has added increased diagnostics to report previously unreported read failures. The Microsoft Web site contains the following Knowledge Base article that provides diagnostic installation and usage instructions.

Microsoft SQL Server Core I/O Requirements

The SQL Server 2000 I/O subsystem has a core set of requirements for maintaining data integrity. If your system is fully compliant in the following areas, then SQL Server is able to maintain the ACID properties of your database(s).

Stable Media

Any system running SQL Server must provide stable media capabilities for the database log and data files. If the system has something like a nonbattery backed cache or perhaps caching enabled on disk, it is not safe for SQL Server installations.

Ensure that your system has proper caching attributes to secure the data to stable media.

Write Ordering

SQL Server is designed to maintain the WAL protocol as described earlier in this paper. The underlying environment plays a pivotal role in the maintenance of the protocol. Write ordering must be maintained.

With any system, but specifically for remote mirroring implementations, write ordering is critical to maintain a point-in-time view of the data. Many remote mirroring vendors implement generation logic to maintain physical write order even when the write operations are transmitted between remote caches with alternate protocols.

Torn I/O Prevention (Split I/Os)

For SQL Server, the 8-KB blocks need to be handled as a single block of data. Systems that split I/Os must be configured to avoid splitting I/O requests into smaller blocks. Some dynamic disk and volume managers can be configured with block and cluster sizes that are smaller than 8 KB or not on 8-KB boundaries. These systems may split the SQL Server I/O request across physical system components. In doing so, they can introduce the possibility of torn pages and they can destroy true write ordering precedence.

Ensure that your system does not allow data to be broken down in such a manner as to introduce torn page possibilities.

Problems, Pitfalls, and Samples

The problems, pitfalls, and samples described in this section are issues encountered by the Microsoft SQL Server Support and Development staff. Many of them have resulted in configuration changes and/or other changes such as firmware, driver, or other levels of correction outside SQL Server.

Because a wide variety of issues can arise and a number of manufacturers are involved in the field of stable media storage, Microsoft strongly recommends that you contact your hardware and software vendors and make sure that you are running an implementation that is compliant with SQL Server 2000.

Stale Reads and Lost Writes

What is a hardware-level stale read compared to a lost write?

Because a stale read can appear as a lost write or a lost write can appear as a stale read, explicit definition of the terms is appropriate.

The following definitions assume that all operating system calls are successful and that the API routines are used correctly by the user mode application.

Stale Read

Data returned through a ReadFile or ReadFileScatter call does not represent the last successful write operation.

Lost Write

Data sent through a WriteFile or WriteFileGather is never represented by stable media.

Examine the following set of diagrams closely. Assume that the ExpandFile image is what the stable media looks like.

A stale read is when the Write Image, shown in the diagram as "A" payload, is successfully written to disk to replace the initial "Z" payload. However, the next read of the same bytes (the offset) of the file still returns the "Z" payload Read Image.

  • It is a stale read if the stable media contains the "A" image but the hardware cache returns "Z" payload.

  • It is a lost write if the stable media contains the "Z" payload and the write of the actual "A" payload was lost.

    Cc966500.sqliofig2(en-us,TechNet.10).gif

    Figure 2

Stale Read Example

This is an example of a stale read situation. Remember that read-ahead for SQL Server will read in a large block but it will disregard pages that are of unknown state. On completion of a read, it will return the buffer immediately to the free list when a page is deemed already in memory at the start of the read-ahead I/O request.

For the example, assume that pages 107 and 108 are in the SQL Server buffer pool and are dirty.

Write Page 107

Lazy writer posts write request.

Write Page 108

Lazy writer posts write request.

Page 107 write completes

I/O request completes and the lazy writer removes the page from the cache and places its buffer on the free list.

Read posted for 100 through 115

Read-ahead is used to read in the data. Only page 108 is found in memory. Read-ahead will steal the needed work buffer but will not hash it as it is already in memory.

Read completes

The read completes. KEY: The data for page 108 is unknown, as it has not completed at the hardware level. SQL Server does not need the duplicate, so the work buffer is returned directly to the free list at the completion of the read.

Note This is the version of page 108 stored in hardware read-ahead cache now.

Page 108 write completes

BUG: Read-ahead hardware cache is not invalidated

Lazy writer ejects page 108 from cache and places its buffer on the free list.

Read request for page 108

Requires that a physical read take place but the stale image of the page is returned from the hardware read-ahead cache, not the on-disk version.

Trace Flag 818

The instrumentation added with trace flag -T818 tracks the last 2,048 page write operations. During a successful write I/O completion (proper page ID, bytes transferred successfully, and the proper OS error codes), the DBID, Page ID, and LSN are recorded in a ring buffer. If a failure occurs, error 823 is raised.

When an 823 or 605 error is detected, SQL Server looks in the ring buffer for the LSN value that was on the page during the last write. If not correct, extra information is added to the SQL Server error log. The information indicates the type of error along with both the expected and the retrieved LSN.

The extended LSN information outlined here appears in the SQL Server error log. The LSN returned is from the read and is clearly older (stale) than the value used during the last write.

SQL Server has detected an unreported OS/hardware level read or write 
problem on Page (1:75007) of database 12 
LSN returned (63361:16876:181), LSN expected (63361:16876:500)
Contact the hardware vendor and consider disabling caching mechanisms to
correct the problem

SQL Server 2000 Service Pack 4 (SP4) refines the design of –T818 into a hash table implementation. A finite amount of hash bucket entries is allowed but more than 2,048 are allowed on 32-bit installations and even more entries are allowed on 64-bit installations.

Because it is implemented as a hash table, the stale read check can occur on every read, not just on those pages identified with a previous 823 or 605 error. This check occurs like other standard checks for object ID on the page (605) and page number (823) error conditions. Because it does the check on every read, it can catch situations where the page ID and object ID are correct but the rows on the page would have been damaged.

For example if a row were inserted, the page flushed, and a stale read occurred (row missing), the object ID and page ID would be valid so the older implementation may not have flagged the stale read condition. The new design will detect the LSN mismatch and flag it as a new 823 error condition.

Torn Pages

SQL Server database pages are 8 KB in size, whereas the typical transfer size for the hardware is 4 KB using 512-byte sectors. When you start adding alternate RAID configurations, the hole can show up as torn reads. The timing of the reads and writes can overlap on different drives so you may get partial old and partial new data. Again, the bug is that after the write, not all portions of the read-ahead cache are invalidated. The damaged image remains in the read-ahead hardware cache until it is forced out.

Flushing Hardware Cache Example

It is possible to perform operations that cause the hardware cache to be flushed. The flushing action can resolve error conditions if they are transient.

Some techniques to accomplish this can be triggered with SQL Server.

  • Issue dbcc dropcleanbuffers to remove all the buffers from the SQL Server buffer pool.

  • Issue dbcc checkdb on the database in question.

This can correct the problem of transient, repeated corruption behavior. The technique causes checkdb to do a large number of read requests, which forces the hardware cache to turn over. This turnover ejects the problem sectors from the hardware cache and forces the proper physical read. This obtains the correct image and SQL Server magically appears to correct the problem.

SQLIOStress.exe

The SQLIOStress test program (version 4.00.020 and later) contains special patterns to detect stale-read/lost-write problems quickly. The (-H) parameter was added to be more aggressive at the read-ahead logic simulation and quickly triggers the bad behavior.

The following is a sample of SQLIOStress output with the (-H) enabled on a controller that is exhibiting the problem.

When the problem is encountered, the write and the following read image are dumped to the log as well as the results of the API calls that were made.

10/02/03 16:56:26 00001832    ERROR: Stale read check failure. Page image returned does not match previous write. Check hardware and caches. Read size was 8192

10/02/03 16:56:26 00001832    

10/02/03 16:56:26 00001832    ---------------- Write Image

10/02/03 16:56:26 00001832    Overlapped: 0x0AA91000 Used: Y Complete: 1 Event: 0 Offset: 41951232 OffsetHigh: 0 Internal: 0 Internal High: 8192

10/02/03 16:56:26 00001832    Sector: 0      LSN: 8        Page: 5121  Address: 0x0D16DFA8

10/02/03 16:56:26 00001832     [AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAA]

10/02/03 16:56:26 00001832    Sector: 15      LSN: 8        Page: 5121  Address: 0x0D16FDA8

10/02/03 16:56:26 00001832     [AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAA]

10/02/03 16:56:26 00001832    ---------------------------------------------------------------------------

10/02/03 16:56:26 00001832    ---------------- Read Image

10/02/03 16:56:26 00001832    Overlapped: 0x0D16DEEC Used: Y Complete: 1 Event: 868 Offset: 41951232 OffsetHigh: 0 Internal: 0 Internal High: 8192

10/02/03 16:56:26 00001832    Sector: 0      LSN: 0        Page: 5121  Address: 0x0AA81000

10/02/03 16:56:26 00001832     [ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZ]

10/02/03 16:56:26 00001832    Sector: 15      LSN: 0        Page: 5121  Address: 0x0AA82E00

10/02/03 16:56:26 00001832     [ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ ZZZZZZZZZZZZZZZ]

10/02/03 16:56:26 00001832    ---------------------------------------------------------------------------

10/02/03 16:56:26 00001832    

10/02/03 16:56:26 00001832    Dumping API Trace Information

10/02/03 16:56:26 00001832    ---------------------------------------------------------------------------

10/02/03 16:56:26 00001832    Slot     | TID      | Handle   | Enter Ticks          | Bytes Req| Exit Ticks           | Bytes Ret| API                       | Ret Code | OSError  | Internal | Int High | Offset   | Off High | Event

10/02/03 16:56:26 00001832    10083    | 1832     | 916      | 36747447522368       | 8192     | 36747447613236       | 0        | WriteFileGather           | 0        | 997      | 259      | 0        | 41951232 | 0        | 0

10/02/03 16:56:26 00001832    20088    | 1832     | 0        | 36747476369460       | 0        | 36747476369616       | 0        | HasOverlappedIoCompleted  | 0        | 0        | 259      | 0        | 41951232 | 0        | 0

10/02/03 16:56:26 00001832    50118    | 1832     | 0        | 36747589924652       | 0        | 36747589925212       | 0        | HasOverlappedIoCompleted  | 1        | 0        | 0        | 8192     | 41951232 | 0        | 0

10/02/03 16:56:26 00001832    50119    | 1832     | 916      | 36747589926084       | 0        | 36747589935944       | 8192     | GetOverlappedResult       | 1        | 0        | 0        | 8192     | 41951232 | 0        | 0

10/02/03 16:56:26 00001832    50120    | 1832     | 916      | 36747589955240       | 8192     | 36747590068768       | 0        | ReadFileScatter           | 0        | 997      | 259      | 0        | 41951232 | 0        | 868

10/02/03 16:56:26 00001832    50121    | 1832     | 0        | 36747590069764       | 0        | 36747590069956       | 0        | HasOverlappedIoCompleted  | 0        | 0        | 259      | 0        | 41951232 | 0        | 868

10/02/03 16:56:26 00001832    Duplicates: 459

10/02/03 16:56:26 00001832    50581    | 1832     | 0        | 36747590384800       | 0        | 36747590384928       | 0        | HasOverlappedIoCompleted  | 0        | 0        | 0        | 8192     | 41951232 | 0        | 868

10/02/03 16:56:26 00001832    50582    | 1832     | 0        | 36747590416768       | 0        | 36747590416900       | 0        | HasOverlappedIoCompleted  | 1        | 0        | 0        | 8192     | 41951232 | 0        | 868

10/02/03 16:56:26 00001832    50583    | 1832     | 916      | 36747590417460       | 0        | 36747590417764       | 8192     | GetOverlappedResult       | 1        | 0        | 0        | 8192     | 41951232 | 0        | 868

SQL Error Messages

SQL Server detects many flavors of I/O problems or data integrity damage at run time. An I/O problem or data integrity breach is detected and reported by SQL Server as an error message. The more common errors are 605, 823, 624, and log restore failures.

SQL Server cannot detect some situations. These situations appear as logical problems with the data and commonly generate unexpected application-level error conditions.

Run-Time Errors

Errors 605 and 823 are the results of straightforward run-time checks for the proper page and object IDs. These are easy to catch at run time without a performance penalty. The problem is that stale read data opens up a much larger window of potential error conditions.

Logical Errors

There are many cases (such as 6xx errors) where the data becomes logically inconsistent. (An example is error 624: Could not retrieve row from page by RID because the requested RID has a higher number than the last RID on the page. %S_RID.%S_PAGE, DBID %d.) For example, if you insert a new row into a table but the stale read takes place on the index page, the index insert is lost. It then looks like the index update never took place, so you have more data rows than index rows for this table. It could look like the index keys are improper. If part of a heap table, the RIDs (relative identifiers) could be incorrect because the actual page data is aligned differently.

You can take this scenario into other areas when you start looking at actual data relationships. If one of the pages becomes a stale-read victim, it could look like the insert never took place at all. This might appear as unexpected row counts or PK/FK violations. Perhaps this would appear as if a debit or credit to a user's account never took place.

When the data is suspect, the potential problems become very broad. The data can appear damaged during a checkdb. It could cause run-time problems because index RIDs point to unexpected data areas and it could cause many other problems such as exceptions or assertions.

Log Shipping Errors and Restore Failures

Data corruption or integrity problems can be exposed while attempting a restore operation. For example, stale reads can cause a transaction log restore to fail.

As outlined earlier, the LSN must be unique in one database. In the case of a stale read, it is possible to find two separate log records with the same previous page LSN. SQL Server recovery detects this as a problem and will stop the recovery operation. It is a clear sign that the page was changed, flushed to disk, and read back in; however, the change was not seen due to stale read. The second change uses the stale LSN and the capability to recover the page is lost.

Important This is important because a stale read can quickly invalidate the backup strategy and should be addressed aggressively.

Replication

Replication can use the transaction log as a data source. When something like a stale read condition damages the transaction log, it also damages the replication data source. The damaged source can lead to problems with logical consistency of the replicated data.

For example, when replication is involved, the problem can span multiple computers. In a merge replication scenario, if changes are lost because of a stale read situation on one computer, it could affect the way the merge is treated across the entire replication topology. This could manifest itself in many ways, including sync failures.

Filter Drivers

Many implementations of backup software, antivirus programs, and other applications are deployed as I/O system filter drivers. This allows interception of the I/O request and appropriate processing. Inappropriate processing by a filter driver may cause stale reads or lost writes.

Often these types of problems require reproductions and kernel-level debugging efforts to determine the root cause of the problem, which might be something like a stuck IRP. This can be time-consuming and invasive.

Microsoft recommends appropriate backup and security strategies but you should also ensure that the software manufacturer is Microsoft SQL Server I/O-compliant.

Stuck I/O

For SQL Server, the most common problem with filter drivers in use by the operating system, is that the I/O becomes stuck and the filter driver does not provide error messages or logging facilities.

For example, SQL Server Support staff encountered one such problem with a high-end disk subsystem, which was attempting to load balance I/O requests across multiple Host Bus Adapters (HBA). The software had a bug that lead to a lost I/O. SQL Server waited on I/O forever, producing I/O-based latch timeouts in the error log.

Important Microsoft recommends that you understand the logging capabilities of any filter driver installed on the system.

Synchronous I/O

SQL Server relies heavily on asynchronous I/O capabilities to maximize resource usage. SQL Server Support has debugged problems with some filter drivers that do not allow the I/O request to be completed asynchronously. Instead, the filter driver requires the I/O to complete before returning control to SQL Server.

This is easily observed by watching the disk queue lengths. When SQL Server is running, it commonly keeps more than one I/O posted. When I/Os become synchronous, the disk queue is often held at one outstanding I/O. This causes the SQL Server code to unnecessarily block.

Because the disk sec/transfer time may not be a fair statistic, use it with caution. When less I/O is outstanding on a drive, the disk sec/transfer is often fast. The longer the disk queue length, the more variations that can occur on disk sec/transfer. However, because SQL Server understands the asynchronous nature of the I/O patterns, longer disk queue lengths, and somewhat longer disk sec/transfer can result in overall better throughput and resource utilization.

Data Integrity

When filter drivers are installed on the system, they have direct access to the I/O data. Therefore, the integrity of the data can be compromised by a troublesome filter driver. Ensure that the filter driver is Microsoft SQL Server-compliant.

Important If you are experiencing problems with I/O stability or speed, Microsoft SQL Server Support may recommend disabling the filter driver to test results. Be aware that the only safe way to disable many filter drivers is to uninstall them.

Compression versus Encryption

Windows allows data file compression and/or encryption to be established on a per-file basis. The details about and support of each option by SQL Server 2000 are outlined in this section.

Compression

Windows allows files to be compressed on disk to save physical space. Compression is not supported in SQL Server 2000 or prior versions. The issue is that when compression is used, the actual file data is handled by the operating system in large chunks (64 KB for example). Therefore, when SQL Server updates an 8-KB data page, the system is really working on a large chunk of data and rewriting it.

The data rewrite breaks the WAL protocol because data written to disk will be rewritten, thereby breaking write ordering rules. The rewrite can lead to caching and other activities that are not appropriate for database ACID properties. The rewrite also destroys the design around sector boundary safety.

All SQL Server database log and data files should remain in an uncompressed state.

Speed Changes

The compression overhead also results in I/O performance bottlenecks. Microsoft SQL Server Support has diagnosed severe I/O bottleneck problems reported by customers attempting to use compression. In one specific case, the checkpoint timing for flushing only a few thousand buffers went from a few seconds to minutes. The work required to handle the compression not only resulted in extra system overhead, but increased the latch time needed to complete the I/O and negatively affected the overall SQL Server system.

Read-Only Databases

Microsoft SQL Server 2000 is not optimized to allow read-only, compressed databases.

Because the SQL Server 2000 optimizer does not account for compressed files when it builds a plan, the compressed files can add significant overhead. For example, if the plan requires single-page reads, each read requires that a chunk of the file be uncompressed. If the optimizer was aware of a compressed file, the plan might be changed to do more read-ahead or to revise data scan operations so the compression would occur fewer times.

Important Tempdb should never be placed on a compressed drive because even if a database is marked read-only, query support activities can occur in tempdb.

Future versions of Microsoft SQL Server may support read-only, compressed databases.

Encryption

Windows allows file encryption. SQL Server 2000 supports encrypted database log and data files. The operating system only encrypts and decrypts the data; it does not rewrite the actual data blocks or change sector boundary operations.

Warning Be sure to have a strong backup strategy for every SQL Server installation, but be aware that encryption can limit the types of disaster recovery options that are available.

Page Files and Paging

Microsoft SQL Server Support has uncovered hardware that is not resilient to paging and page file actions in various ways, such as write ordering not being preserved. SQL Server tries to prevent paging activities by reducing its committed footprint when possible. However, it is still possible for the process memory associated with SQL Server to be paged out. Like the standby and hibernate actions, paging involves other disk storage devices; these other devices, although they are not used by SQL Server directly, should also be Microsoft SQL Server I/O-compliant to ensure data integrity.

It is difficult to protect against the in-memory data corruption that is caused by the paging of a data page. Future versions of SQL Server may combine latch enforcement techniques with in-memory checksums to provide better protection against paging damage. However, checking the page for a checksum each time it is accessed by SQL Server adds significant overhead.

Important For installations of SQL Server, Microsoft recommends that the page file be placed on devices that are Microsoft SQL Server I/O-compliant.

Trace flag -T815

To help detect unwanted changes to in-memory SQL Server data pages, latch enforcement is enhanced with the –T815 trace flag. When a page is latched for modification, the VirtualProtect on the page is set to PAGE_READWRITE. At all other times the protection is PAGE_READONLY. This can help catch actions such as memory overwrites (scribblers).

Starting with Microsoft SQL Server 2000 build 8.00.0922, you can dynamically turn on or turn off trace flag -T815 by using the DBCC TRACEON Transact-SQL statement and the DBCC TRACEOFF Transact-SQL statement.

Important Latch enforcement is only valid for non-AWE (Address Windowing Extensions) environments.

Online File Copy (Snapshot)

Several backup and hardware vendors provide the capability to mirror or make a copy of a file while the file is still opened. Because SQL Server opens all database log and data files exclusively, this is only possible at a filter driver or hardware level.

A common problem with these vendor solutions is that they do not support a true snapshot—a point-in-time copy—of the entire file. The data is being copied while SQL Server is still executing. This breaks the write ordering and point-in-time semantics, and generally makes the copy of the database file(s) unrecoverable.

Some manufacturers have implemented true snapshot capabilities by using the documented SQL Server Virtual Device Interface (VDI) object model. This is the only safe way to freeze I/O to capture a point-in-time image of all the specific database files.

Read Retry

Microsoft SQL Server 2000 supports read retry only for sort operations. Future versions of SQL Server may include read retry for other I/O operations.

Microsoft SQL Server Support has diagnosed problems related to read retry. On some systems, invalid data is detected on a read. SQL Server sort logic can issue the same read (a retry of the read) and the subsequent data is valid. Because the data should have been returned correctly on the first read, this indicates a problem with the storage.

Warning If your system is showing symptoms of read retry success it should be considered a serious condition and addressed.

DBCC with REPAIR Options

When I/O operations are failing it is common practice to run database consistency checks more frequently. This can provide more detail about the extent of the failure and data points affected.

Warning It should not be common practice to use the REPAIR option of DBCC to correct corruption problems. Any corruption, especially persistent, recurring corruption, is symptomatic of a larger problem, one involving the supposedly stable media.

Utilities

Microsoft provides a few utilities for testing the general properties of your system.

SQLIOStress.exe

SQLIOStress.exe simulates various patterns of SQL Server 2000 I/O behavior to ensure rudimentary I/O safety.

The SQLIOStress utility can be downloaded from the Microsoft Web site. See the following article.

SQLIO.exe

SQLIO.exe is a SQL Server 2000 I/O utility used to establish basic benchmark testing results.

The SQLIO utility can be downloaded from the Microsoft Web site. See the following:

Conclusion

The proper configuration and maintenance of the I/O subsystem is critical to a successful SQL Server deployment. Understanding how SQL Server performs I/O operations on database log and data files will help you fine-tune the I/O subsystem. Always ensure that the subsystem upholds the WAL protocol foundations so that SQL Server will maintain ACID properties properly.

Other References

iSCSI

Network / NAS Storage