Detecting and Resolving Stalled and Stuck I/O Issues in SQL Server 2000 SP 4

 

Robert Dorr
Microsoft Corporation

May 2005

Summary: Guest columnist Robert Dorr explores how the reporting tool in SQL Server 2000 Service Pack 4 significantly reduces the time it takes to identify and determine the underlying causes of stalled and stuck I/O operations. (7 printed pages)

Introduction

A database management system such as SQL Server relies on the timeliness of file input/output operations. Faulty or improperly configured hardware, firmware settings, filter drivers, compression, bugs, and other conditions within the I/O path can lead to stuck or stalled I/O problems that quickly and negatively impact the SQL Server performance.

The impact of these problems on SQL Server varies widely based on the specifics of the problem, but they often cause blocking, latch contention and timeouts, sluggish response time, and resource over-utilization.

Stuck I/O is an I/O request (often an I/O Request Packet (IRP)) that does not complete without outside intervention. This condition often requires a complete system restart or similar action to resolve and is a strong indication of faulty hardware or a bug in an I/O path component.

Stalled I/O is an I/O request (again, often an IRP) that completes without intervention but takes longer then expected. This condition is generally due to hardware configuration, firmware settings, or filter driver interventions requiring assistance from the hardware or software vendor to trace and resolve.

SQL Server 2000 SP4 includes logic for database and log file I/Os (reads and writes) to detect stalled and stuck conditions. When an I/O operation has been pending completion for 15 or more seconds, SQL Server detects this and reports the condition. The following message is logged in the SQL Server error log:

2004-11-11 00:21:25.26 spid1 SQL Server has encountered 192 
occurrence(s) of IO requests taking longer than 15 seconds to complete 
on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS 
file handle is 0x00000000000074D4. The offset of the latest long IO is: 
0x00000000022000". 

This message indicates that the current workload requirements are exceeding I/O path or current system configurations and capabilities or that the I/O path has malfunctioning software (firmware, driver) or hardware component(s).

The logged error message provides the following information:

  • ### occurrences—The number if I/O requests that did not complete the read or write operation in less than 15 seconds.
  • File information—The complete file name, database name, and DBID of the affected file.
  • File handle—The operating system handle of the file. This can be used with debuggers and other utilities to track IRP requests.
  • Offset—The offset of the last stuck or stalled I/O. This can be used with debuggers and other utilities to track IRP requests. (Note: This I/O may no longer be stuck or stalled by the time the message is logged.)

For more complete details on SQL Server 2000 I/O patterns, see SQL Server 2000 I/O Basics.

Recording vs. Reporting

The reporting and recording of I/O is rendered on a per-file basis. The detection and reporting of stalled and stuck I/O requests are two separate actions.

Detection (recording) is handled in two locations within SQL Server. The first is when the I/O is actually completed. If the request took more than 15 seconds, the record operation occurs. The second is by the lazy writer process. When the lazy writer executes, it includes a new check of all pending data and log file I/O requests, and, if the 15-second threshold has been exceeded, the record operation occurs.

Reporting is done on intervals of no less than 5 minutes. Reporting takes place when the next I/O request is made on the file. If a record action has taken place and 5 minutes or more have passed since the last report took place, a new report (the error message shown above) is written to the error log.

The 15-second threshold is currently not adjustable. While not recommended, stalled and stuck I/O detection can be completely disabled with trace flag 830. Set the startup parameter –T830 to disable stalled/stuck I/O detection during SQL Server startup. Use dbcc traceon(830, -1) to disable detection for a currently running SQL Server instance. Dbcc traceon is only effective until SQL Server is restarted.

Note   A given I/O request that becomes stalled or stuck is only reported one time. If the message reports 10 I/Os as stalled, these 10 reports will not occur again. If the next message reports 15 I/Os stalled, it indicates that 15 new I/O requests have become stalled.

Performance and Plan Actions

Overall system performance can play a key role in I/O processing. The general health of the system should be considered when investigating reports of stalled or stuck I/O. Excessive loads can cause the overall system to be slow, including I/O processing. The behavior of the system at the time of the problem can be a key to determining root cause. For example, if the CPU usage becomes or remains high at the time of the problem, it could be an indication that a process on the system is consuming so much CPU that it negatively impacts other processes in various ways.

Review the performance counters Average Disk Sec/Transfer and Average Disk Queue Length or Current Disk Queue Length for specific I/O path information. For example, the Average Disk Sec/Transfer on a SQL Server machine is commonly less than 15ms. If this value climbs, it may indicate that the I/O subsystem is not keeping up with I/O demand.

Keep in mind that SQL Server takes full advantage of Windows' asynchronous I/O capabilities and pushes disk queue lengths heavily, so high values for these performance counters are not in and of themselves an indication of a problem.

Indexes and Parallelism

All too often, bursts of I/O occur because an index is missing and the resulting scans, hashes, and sorts stress the I/O system. A pass with Index Turning Wizard can often help resolve I/O pressure on your system. If adding an index can help a query avoid a table scan and perhaps even a sort or a hash, the system can gain several advantages:

  • Reduces the physical I/O required to complete the action, which directly equates to performance benefits for the query
  • Fewer pages in data cache have to be turned over, so those pages in cache remain relevant to active queries
  • Avoids unnecessary sorts and hashes
  • Can reduce tempdb usage and contention
  • Reduces resource usage and/or parallel operations. Since SQL Server does not guarantee parallel query execution and load on the system is considered by the server when determining whether to parallelize a query, it is best to optimize all queries for serial execution. In Q/A environments, max degree of parallelism should be set to 1 in order to force tuning for the worst-case scenario of receiving no parallel plans at all from the server. If queries are certified in the test environment to execute efficiently in serial fashion, parallel plans in the production environment can provide an unexpected performance improvement. Many times, however, SQL Server selects parallel execution because the sheer amount of data being traversed is quite large. This data volume is often affected directly by indexing. In the case of a missing index, for example, a large sort may result. It's easy to see how multiple workers performing a sort operation can result in quicker response than processing the sort serially, just understand that this action can increase pressure on the I/O system dramatically. Large read requests from many workers can cause an I/O burst along with increased CPU usage when multiple workers run concurrently. Many times, the query can be tuned to run faster and use fewer resources if an index is added or other tuning action takes place. This not only improves the performance of the query in question, but also of the system as a whole.

Practical Examples from Microsoft SQL Server Support

Microsoft SQL Server and Platforms Escalation Support have handled the following scenarios, which are intended to provide a frame of reference and help set expectations about stalled and stuck I/O situations, and how your system could be affected. There is no specific hardware or set of drivers that pose any specific or increased risk over another; all systems are the same in this respect.

Example 1 - Log Write Stuck for 45 seconds

An attempted SQL Server log file write got stuck periodically for 45 seconds. The log write was not completing in a timely fashion, resulting in a blocking condition leading to 30-second client query timeouts.

A commit was submitted and became stuck (log write pending), causing the query to continue holding locks and block incoming requests from other clients. Other clients started timing out and compounding the problem because the application was not designed to rollback open transactions when a timeout occurred. This led to hundreds of open transactions holding locks and severe blocking. (For more details on transaction handling and blocking, refer to INF: Understanding and Resolving SQL Server 7.0 or 2000 Blocking Problems). The application serviced a Web site using connection pooling, so, as more connections became blocked, the Web site created more connections that, in turn, became blocked, and the cycle continued.

After approximately 45 seconds, the log write would complete, but by this time hundreds of connections have backed up and caused blocking problems that resulted in several minutes of recovery time for SQL Server and the application. When combined with the application problems, the stalled I/O condition had a very negative impact on the system.

Resolution: This was tracked to a stuck I/O request in an HBA driver. The machine had multiple HBA cards with failover support. The failover timeout value was configured to 45 seconds. When one HBA was behind or not communicating with the SAN for 45 seconds or more, the I/O request was routed to the second HBA for handling and would complete quickly. The recommended failover setting of the hardware manufacture was 5 seconds in order to prevent such stall conditions.

If we had had SQL Server 2000 SP4's new automatic reporting of such conditions, we would have known that the fundamental issue was a stuck or stalled I/O operation occurring due to issues outside of SQL Server much sooner in the troubleshooting process. As it was, we spent a considerable amount of time troubleshooting an issue that initially presented itself as a garden-variety performance problem.

Example 2 - Filter Drive Intervention

Many anti-virus software and backup products use I/O filter drivers. These filter drivers become part of the I/O request stack and have access to the IRP request. Microsoft support has seen various issues from bugs that lead to a stuck I/O to stalled conditions in a filter driver implementation.

One such situation seen by Microsoft SQL Server Support involved a filter driver used for backup processing that allowed backup of files that are open at the time of the backup. The system administrator had mistakenly included the SQL Server data file directory in the file backup selections. When the backup took place, it attempted to gather a consistent image of the file at the time the backup started. In doing so, it would delay subsequent I/O requests, allowing them to complete one at a time as the software handled them.

When the backup started, SQL Server's performance would drop dramatically because the I/Os to SQL Server were forced to complete one at a time. Compounding the issue, the single-I/O logic was such that I/O in general could not be performed asynchronously, so when SQL Server expected to post an I/O request and continue, the UMS worker was stuck in the read or write call until the I/O completed. SQL Server read ahead was effectively disabled by the filter driver's actions. Furthermore, another bug in the filter driver left the single-I/O behavior in place even when the backup was finished. The only way to restore SQL Server performance was to close the database and reopen it or restart SQL Server so the file handle was released and reacquired without the current filter driver interactions in place.

Resolution: SQL Server's data files were excluded from the file backup process, and the bug in the filter driver was fixed that caused files to be left in single-I/O mode.

At the time, if we had had SQL Server 2000 SP4's reporting on stalled I/O operations, we would have known what the fundamental issue was much sooner in the troubleshooting process.

Example 3 - Hidden Errors

Many higher-end systems have multi-channel I/O paths to handle load balancing and similar facilities. Microsoft SQL Server Support has seen situations with this type of software where an I/O request fails but the software does properly handle the error condition and attempts infinite retries. The I/O becomes stuck, and SQL Server is unable to complete the specified action. Much like the log write condition described above, many poor system behaviors can arise after such a condition negatively impacts the system.

Resolution: In situations like this, a restart of SQL Server may provide some relief, but, occasionally, a restart of Windows is required to restore processing to normal. Ultimately, of course, the bug in the I/O subsystem needs to be addressed by the I/O vendor.

SQL Server 2000 SP4's new automatic reporting of such conditions makes detecting problems like this much easier. Rather than merely seeing a general performance slowdown across the server, the new messages logged by SP4 give us immediate insight into the true nature of the problem and that it is most likely outside of SQL Server.

Example 4 - Remote Storage / Mirroring/ RAID Drives

Many systems use mirroring or similar technology to help protect against data loss. Some of these systems are software-based, and others hardware-based. A situation commonly encountered by Microsoft SQL Server Support for these systems is increased latency.

When I/O must complete successfully to a mirror before an I/O operation is considered complete, this obviously increases the overall I/O time. For remote mirror installations, network latency and retries can become a factor. When drive failures occur and the RAID subsystem is rebuilding, I/O throughput can be impacted.

Resolution: What we usually recommend in situations like this is to use strict configuration settings (which vary from vendor to vendor and between devices) to reduce latency to mirrors and RAID rebuild operations.

RAID overhead and latency can cause I/O slowdowns that SQL Server can do nothing about. It's a client of the RAID hardware and drivers just as any other application is. The new stalled and stuck I/O reporting features in SP4 will help pinpoint when this type of issue is slowing down the server inordinately.

Example 5 - Compression

Microsoft does not support SQL Server 7.0 or 2000 data and log files on compressed drives. NTFS compression is unsafe not only because it breaks the Write Ahead Log (WAL) protocol but also because it requires increased processing for each I/O request. Compression prevents asynchronous I/O, resulting in all SQL Server I/O with the affected data or log files being carried out synchronously.

Resolution: In this situation, we always recommend that customers uncompress their data and log files.

NTFS compression can cause I/O slowdowns that SQL Server can do nothing about. It's a client of the file system just as any other user-mode application is. The new stalled and stuck I/O reporting features in SP4 will help pinpoint when compression is having a negative impact on SQL Server I/O operations.

Additional Data Points

The wait type information provided in sysprocesses can be helpful in diagnosing I/O bottlenecks. Buffer I/O latch wait types and write log waits are key indicators to investigate I/O path performance. The Microsoft Knowledge Base article, 822101: The waittype and lastwaittype fields in the sysprocesses table, outlines wait types and details the I/O wait types related to diagnosing stalled or stuck I/O conditions.

Conclusion

Although rare in SQL Server deployments, stuck and stalled I/O problems are historically very difficult to troubleshoot when they do occur. Given that the root cause is usually found in a driver or hardware device, investigating and resolving these types of problems can take a significant amount of time and require expertise outside the typical database administrator's ability. Using the new tools in SQL Server 2000 SP4 can significantly reduce the time required to resolve these types of problems, and can at least point the DBA in the right direction.

References Section

231619: How to use the SQLIOStress utility to stress a disk subsystem such as SQL Server

826433: PRB: Additional SQL Server Diagnostics Added to Detect Unreported I/O Problems

230785: INF: SQL Server 7.0 and SQL Server 2000 Logging and Data Storage Algorithms Extend Data Reliability

 

SQL Server for Developers

Robert Dorr is a Senior SQL Server Escalation Engineer on the Microsoft SQL Server Support team since 1994. Robert has published a variety of articles on the SQL Server engine on MSDN and Microsoft Technet, is a presenter at SQL Pass, and is the author of key SQL Server utilities (Read80Trace and SQLIOStress). Robert lives in suburban Dallas, TX with his wife and children.