What do those "IO requests taking longer than 15 seconds" messages on my SQL box mean?

You may be sometimes seeing stuck/stalled IO messages on one or more of your SQL Server boxes. This is something it is important to understand so I am providing some background information on it.

 

Here is the message you may see in the SQL error log:

SQL Server has encountered xxx occurrence(s) of IO requests taking longer than 15 seconds to complete on file [mdf_or_ldf_file_path_name] in database [dbname] (dbid). The OS file handle is 0x.... The offset of the latest long IO is: 0x....”.

The message indicates that SQL Server has been waiting on at least one I/O for 15 seconds or longer. The exact number of times you have exceeded this time for the specified file since the last message is included in the message. The messages will not be written more than once every five minutes. Keep in mind that read IOs on an average system should take no more than 10-20ms and writes should take no more than 3-5ms (the exact acceptable values vary depending on your business needs and technical configuration). So anything measured in seconds indicates a serious performance problem. The problem is NOT within SQL Server, this message indicates SQL has sent off an IO request and has waited more than 15 seconds for a response. The problem is somewhere in the disk IO subsystem. For example, the disk IO subsystem may have more load than it is designed to handle, there is a "bad" hardware or firmware somewhere along the path, filter drivers such as anti-virus software are interfering, your file layout is not optimal, or some IO subsystem setting such as HBA queue depth is not set optimally.

 

Though the root cause is IO, you can see other symptoms that are a side effect and may lead you down the wrong troubleshooting path. For example, if enough IO is backed up behind the stalled IO then you may see blocking in SQL Server (because locks that are usually taken for very short periods of time are now held for seconds), new connections may not be allowed, and the CPU usage can increase (because many threads are waiting), and a clustered SQL Server can fail over (because the IsAlive checks which are just SQL queries fail to complete like all the other queued queries). You may see other errors returned to the user or in the various logs, such as timeouts.

 

There are two ways to approach this problem. You can either reduce the IO on the system (change indexes or queries or archive data for example) or you can make the underlying system able to handle the IO load (fix hardware/firmware problems, change configurations, add disks or controllers, change the file layout, etc.).

Background:

· 897284 Diagnostics in SQL Server 2000 SP4 and in later versions help detect stalled and stuck I/O operations

https://support.microsoft.com/default.aspx?scid=kb;EN-US;897284

· Detecting and Resolving Stalled and Stuck I/O Issues in SQL Server 2000 SP 4 https://msdn.microsoft.com/en-us/library/aa175396(SQL.80).aspx

 

Troubleshooting:

· Every Windows 2003 SP1 or SP2 system should have this storport fix: 941276 A Windows Server 2003-based computer stops responding when the system is under a heavy load and when the Storport driver is being used https://support.microsoft.com/default.aspx?scid=kb;EN-US;941276

· Use PerfMon to look at the disk counters for sec/read, sec/write, bytes/sec, current disk queue length, reads/sec, writes/sec

· Collect data from sys.dm_io_virtual_file_stats and sys.dm_io_pending_io_requests.

· Ask your storage admins to monitor the entire IO subsystem from the Windows system all the way through to the underlying disks.