How To Talk To SAN Administrators About Disk Performance Issues

Editor's note: Courtesy of our friends over at https://mcpmag.com, Clint Huffman, a Microsoft Premier Field Engineer, provides his take on how you can talk intelligently to SAN administrators and vendors when disk performance issues surface. Here’s a teaser. Please make sure you check out the full article.

Traditionally, disk-queue-related performance counters such as "Avg. Disk Queue Length", "% Idle Time", and "% Disk Time" have been staples in the IT professional's tool belt. They have great value when analyzing single-spindle disks, but are less effective when more spindles are added to a LUN or when spindles are shared between LUNs. For example, if "Avg. Disk Queue Length" is 2 and there are 10 spindles behind the LUN, then the LUN should have no problem handling the load. This would be like having 10 check-out lines with only 2 people in line. Likewise, "% Idle Time" and "% Disk Time" are simply measures of how often the disk queue is completely empty or not empty, respectively.
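The check-out line analogy can be put into numbers. The following is a minimal sketch, assuming the number of spindles behind the LUN is known (on a SAN it often has to come from the SAN administrator); the function name queue_per_spindle is hypothetical and is not part of any Microsoft tool.

```python
def queue_per_spindle(avg_disk_queue_length: float, spindles: int) -> float:
    """Spread the observed queue length across the spindles behind the LUN.

    Both inputs are assumptions supplied by the caller: the counter value
    comes from "Avg. Disk Queue Length", the spindle count from the SAN team.
    """
    if spindles <= 0:
        raise ValueError("spindles must be a positive integer")
    return avg_disk_queue_length / spindles


# The example from the text: a queue length of 2 across 10 spindles.
print(queue_per_spindle(2, 10))  # 0.2 outstanding requests per spindle
```

A result well below 1 per spindle corresponds to the "10 check-out lines with only 2 people" picture. When spindles are shared between LUNs the divisor is no longer meaningful, which is one reason the queue counters lose value there.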

“Avg. Disk sec/Read” and “Avg. Disk sec/Write” are performance counters that measure the I/O request packet response times for read and write operations respectively. Response times are our best indicator of poor disk performance because the response times reliably increase when the disk subsystem is overwhelmed.

The following table shows the access times in milliseconds and I/Os per second (IOPS) for common hard drives. The access times are the longest that any I/O request should take to respond on the given hardware. Hardware and software features of the disk subsystem, such as short stroking and cache, can dramatically improve these response times and throughput. For example, my 5400 RPM USB disk drive can sustain 150 IOPS and stay under a 5 ms average response time.

Device                         IOPS*    Access Time (ms)*
3.5” floppy disk USB drive         8    120
5400 RPM hard disk                59    17
7200 RPM hard disk                77    13
10K RPM hard disk                125    8
15K RPM hard disk                143    7
Solid state drive (SSD)         5000    0.2

* Does not reflect actual products.
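The IOPS column in the table above appears to follow directly from the access time column if one assumes a single request is serviced at a time: 1000 ms divided by the access time. That relationship is an observation about the numbers, not a method stated in the article, and the helper name below is hypothetical.

```python
def iops_from_access_time(access_time_ms: float) -> float:
    """Estimate IOPS assuming one I/O completes per access time (no queuing, no cache)."""
    if access_time_ms <= 0:
        raise ValueError("access_time_ms must be positive")
    return 1000.0 / access_time_ms


# (device, IOPS from the table, access time in ms from the table)
TABLE = [
    ("3.5-inch floppy disk USB drive", 8, 120),
    ("5400 RPM hard disk", 59, 17),
    ("7200 RPM hard disk", 77, 13),
    ("10K RPM hard disk", 125, 8),
    ("15K RPM hard disk", 143, 7),
    ("Solid state drive (SSD)", 5000, 0.2),
]

for device, table_iops, access_ms in TABLE:
    estimate = round(iops_from_access_time(access_ms))
    print(f"{device}: table={table_iops}, estimated={estimate}")
```

Under this assumption each estimate rounds to the table value. Real disk subsystems with cache or multiple spindles can exceed these figures, as the 5400 RPM USB drive example shows.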

Based on the table above, we generally use sustained values of 15 ms or more as a warning threshold and sustained values of 25 ms or more as a critical threshold for disk response times using the “Avg. Disk sec/Read” and “Avg. Disk sec/Write” performance counters.
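These thresholds can be applied to collected counter samples. The sketch below rests on two assumptions: the samples are in seconds, as the counter names indicate (so 0.015 corresponds to 15 ms), and "sustained" is taken to mean that every sample in the window is at or above the threshold. The article does not define "sustained" precisely, and the function name is hypothetical.

```python
WARNING_MS = 15.0   # warning threshold from the text
CRITICAL_MS = 25.0  # critical threshold from the text


def classify_response_time(samples_sec: list[float]) -> str:
    """Classify a window of "Avg. Disk sec/Read" or "Avg. Disk sec/Write" samples.

    Assumption: a condition is "sustained" only if the lowest sample in the
    window still meets the threshold, so isolated spikes are ignored.
    """
    if not samples_sec:
        raise ValueError("at least one sample is required")
    floor_ms = min(samples_sec) * 1000.0  # convert seconds to milliseconds
    if floor_ms >= CRITICAL_MS:
        return "critical"
    if floor_ms >= WARNING_MS:
        return "warning"
    return "ok"


print(classify_response_time([0.018, 0.022, 0.016]))  # warning
print(classify_response_time([0.030, 0.041, 0.027]))  # critical
print(classify_response_time([0.004, 0.030, 0.006]))  # ok (one spike, not sustained)
```

Using the window minimum is a deliberately strict reading of "sustained"; a percentile or a moving average would be a reasonable alternative if occasional fast samples should not reset the alert.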

Note: All of the counters mentioned in this article are found on the LogicalDisk and PhysicalDisk counter objects.

For complete details, check out the full article at: https://mcpmag.com/articles/2011/05/12/how-to-speak-san-ish.aspx