Analyzing Storage Performance
Introduction
This article provides prescriptive guidance on troubleshooting logical and physical disk response times as part of Windows performance analysis.
Start with the following performance counters to analyze disk response times:
- \LogicalDisk\Avg. Disk Sec/Read
- \LogicalDisk\Avg. Disk Sec/Write
- \LogicalDisk\Disk Bytes/Sec
- \LogicalDisk\Disk Reads/Sec
- \LogicalDisk\Disk Writes/Sec
- \LogicalDisk\Split IO/sec
- \LogicalDisk\Disk Transfers/sec
These counters are generally the first ones to look at because we are looking for the following attributes of the Input/Output profile:
- Average disk seconds/read and average disk seconds/write (response times): Are the users having to wait for application responses? Are we exceeding established thresholds for disk drive performance degradation (generally > 15 ms)?
- Throughput: Are we saturating any of the pipes, such as the mainboard bus, SCSI connection, SAN connection, or another link between servers and storage? Are we reaching the throughput limit of the disk subsystem?
- Transfers/second: Are the server and its applications generating more I/O than the disk subsystem can keep up with? For example, suppose you have allocated 4 disk drives to a single logical disk group that is configured as RAID 1+0. Assuming a given disk drive is capable of 200 Input/Output Operations per second (IOPs), that RAID group will be capable of around 400 IOPs for writes, because every write must go to both disks of a mirrored pair (cache and read-ahead may increase that number somewhat). Even being very generous and saying that with cache and optimizations a disk drive can perform 400 IOPs, the most that could be hoped for in write operations on a 4 disk RAID 1+0 is 800 IOPs. If transfers/second exceeds that number at the same time as response times are deteriorating, chances are there just are not enough disk drives to back that logical disk and the assigned workload.
- Reads/second and writes/second: These give you an indication of the mix of workload that you are dealing with. Certain disk subsystem types handle certain workloads better than others. For example, some RAID-5 controllers can handle large I/O writes and sequential reads relatively well.
- Split I/O: Does the operating system have to perform more than one command for each I/O? Split I/O is a good indicator of fragmentation, which can reduce performance by causing excessive seek time.
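The RAID 1+0 arithmetic in the Transfers/second item can be sketched in a few lines of Python. This is only a rough estimate under stated assumptions: the function name `raid10_write_iops` is made up for illustration, every write is assumed to land on both disks of a mirrored pair, and cache and read-ahead effects are ignored.

```python
def raid10_write_iops(disk_count: int, iops_per_disk: float) -> float:
    """Rough write IOPS ceiling for a RAID 1+0 group.

    Assumption: each logical write costs two physical writes (one per
    mirror half), so the usable write rate is half of the raw total.
    Controller cache and read-ahead are not modeled.
    """
    if disk_count < 2 or disk_count % 2 != 0:
        raise ValueError("RAID 1+0 needs an even number of disks (at least 2)")
    return disk_count * iops_per_disk / 2


# The two cases from the text: 200 IOPs per disk, then a generous 400.
baseline = raid10_write_iops(4, 200)
generous = raid10_write_iops(4, 400)
print(baseline, generous)  # 400.0 800.0
```

Comparing the observed Disk Transfers/sec against such a ceiling, while also watching response times, gives a quick first indication of whether the logical disk simply has too few drives behind it.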
This article is grouped by symptoms, then by possible causes.
Symptoms: Long disk response times and High I/O
Applies to:
- Windows Server 2003 (all editions) unless otherwise specified
- Windows XP (all editions) unless otherwise specified
- Windows 2000 Server (all editions) unless otherwise specified
Symptom Details:
- Long disk response times: A “LogicalDisk\Avg. Disk Sec/Read” or “LogicalDisk\Avg. Disk Sec/Write” value greater than 15 ms, though occasional spikes are not necessarily cause for immediate concern.
- High I/O: “\LogicalDisk\Disk Transfers/sec” is at or near the number of I/O operations per second that the physical spindles are designed to handle, which is typically between 80 and 180 per disk.
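As a rough illustration, the two symptom details above can be combined into a single check. The sketch below is not part of any Microsoft tool: the function name, the `near_fraction` cutoff for "at or near", and the default of 180 IOPs per spindle are assumptions chosen for the example. Note that the Avg. Disk Sec counters report seconds, so 15 ms is 0.015.

```python
RESPONSE_TIME_LIMIT_S = 0.015  # 15 ms threshold from the text, in seconds


def high_io_symptom(avg_sec_read: float, avg_sec_write: float,
                    transfers_per_sec: float, spindles: int,
                    iops_per_spindle: float = 180,
                    near_fraction: float = 0.9) -> bool:
    """True when response times are long AND I/O is near spindle capacity.

    Assumptions (not from the article): "at or near" is taken to mean at
    least `near_fraction` of spindles * iops_per_spindle.
    """
    slow = max(avg_sec_read, avg_sec_write) > RESPONSE_TIME_LIMIT_S
    capacity = spindles * iops_per_spindle
    busy = transfers_per_sec >= near_fraction * capacity
    return slow and busy


# Hypothetical averages for a 4-spindle logical disk.
print(high_io_symptom(0.022, 0.010, transfers_per_sec=700, spindles=4))
print(high_io_symptom(0.022, 0.010, transfers_per_sec=200, spindles=4))
```

The second call has the same long read response time but low I/O, which points to the symptom group "General poor response from storage subsystem" instead.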
Possible Causes | How to Diagnose | Possible Solutions and/or Recommendations
Storage response time reduced because of misaligned partitions. | |
Symptoms: General poor response from storage subsystem
Applies to:
- Windows Server 2003 (all editions) unless otherwise specified
- Windows XP (all editions) unless otherwise specified
- Windows 2000 Server (all editions) unless otherwise specified
Symptom Details:
- Long disk response times: A “LogicalDisk\Avg. Disk Sec/Read” or “LogicalDisk\Avg. Disk Sec/Write” value greater than 15 ms, though occasional spikes are not necessarily cause for immediate concern.
- Low I/O: “\LogicalDisk\Disk Transfers/sec” is well below the number of I/O operations per second that the physical spindles are designed to handle, which is typically between 80 and 180 per disk.
- Split I/O: “PhysicalDisk\Split IO/sec” can be an indicator of volume fragmentation.
- High queue lengths with poor response times: “LogicalDisk\Avg. Disk Queue Length” is averaging higher than 2-3 plus the number of spindles while, at the same time, “LogicalDisk\Avg. Disk Sec/Read” or “LogicalDisk\Avg. Disk Sec/Write” values greater than 15 ms are observed.
- Low throughput, high number of transfers: The “Disk Transfers/sec” counter is relatively high, but the overall “Disk Bytes/sec” is low.
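Two of the symptom details above come down to simple arithmetic on the counters, sketched here in Python. The helper names are hypothetical, and the queue rule reads "2-3 plus the number of spindles" as spindles + 3, which is one interpretation rather than a fixed rule. Dividing Disk Bytes/sec by Disk Transfers/sec gives the average I/O size, and a very small value alongside a high transfer rate matches the "low throughput, high number of transfers" pattern.

```python
def average_io_size_kb(disk_bytes_per_sec: float,
                       transfers_per_sec: float) -> float:
    """Average size of one I/O in KB (Disk Bytes/sec / Disk Transfers/sec)."""
    if transfers_per_sec <= 0:
        return 0.0  # no I/O in the sample, so no meaningful size
    return disk_bytes_per_sec / transfers_per_sec / 1024


def queue_is_high(avg_queue_length: float, spindles: int,
                  allowance: int = 3) -> bool:
    """Assumed reading of the text: high means above spindles + allowance."""
    return avg_queue_length > spindles + allowance


# Hypothetical sample: 2,048,000 bytes/sec spread over 500 transfers/sec.
print(average_io_size_kb(2_048_000, 500))  # 4.0
print(queue_is_high(9, spindles=4))        # True
```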
Possible Causes | How to Diagnose | Possible Solutions and/or Recommendations
High disk fragmentation | |
Lack of free space | |
Insufficient number of disks | |
Flooding the I/O channel and causing retries or “busy” from the storage device | |
Low throughput, high number of transfers | |
More Information
Perfmon Log Capture Interval: Generally speaking, when capturing performance data on a live system using the Windows Performance Monitor, the sampling interval should be kept fairly non-intrusive, such as every 10 seconds. The problem with sampling at 10 seconds or longer is that we tend to miss a lot of data. In a testing environment, we should set the capture interval to 1 second and capture both Physical Disk and Logical Disk counters. When capturing at short intervals, such as 10 seconds or less, we may not want to capture other counters at the same time, so as not to impose too much monitoring overhead on the system.
Capturing Logical Disk versus Physical Disk Counters: The other thing to keep in mind with Performance Monitor is that gathering a specific performance counter has a performance cost. At the same time, there is little additional host performance cost in capturing the entire performance object. So when measuring storage performance, capture the whole Physical Disk and Logical Disk objects and not just individual performance counters. If the physical disks have only 1 partition per disk, then there is really no need to capture the Logical Disk counters. The exception, of course, is when you are making use of Mount Points within Windows and you need to measure the performance of individual physical disks.
When capturing performance data, there is sometimes a concern about the size of the capture file. If capturing only Physical Disk and Logical Disk counters, even at a 1 second interval, the resulting counter log file will typically not grow excessively large: perhaps 100 MB or so, depending on the number of disk devices.
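The capture advice above can also be scripted. The following Python sketch only builds a command line for `typeperf`, a command-line counter logger included with Windows XP and later, as an alternative to configuring the log in the Performance Monitor UI; it is an illustration, not the procedure the article's authors used. The helper name, output file name, and default sample count are assumptions. It requests the whole PhysicalDisk and LogicalDisk objects for all instances.

```python
def build_typeperf_command(output_csv: str, interval_seconds: int = 1,
                           samples: int = 600) -> list:
    """Build (but do not run) a typeperf command line.

    -si is the sample interval in seconds, -sc the number of samples,
    -f the output format, and -o the output file.
    """
    counters = [
        r"\PhysicalDisk(*)\*",  # every counter, every physical disk
        r"\LogicalDisk(*)\*",   # every counter, every logical disk
    ]
    return ["typeperf", *counters,
            "-si", str(interval_seconds),
            "-sc", str(samples),
            "-f", "CSV",
            "-o", output_csv]


cmd = build_typeperf_command("disk_capture.csv")  # hypothetical file name
print(" ".join(cmd))
# On a Windows test system this could be run with:
#   import subprocess; subprocess.run(cmd, check=True)
```

With the defaults this collects 600 one-second samples (10 minutes). On a live system, a longer interval such as 10 seconds is the less intrusive choice described above.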
References
- Ruling Out Disk-Bound Problems
https://technet.microsoft.com/en-us/library/5bcdd349-dcc6-43eb-9dc3-54175f7061ad.aspx
- How to Identify a Disk Performance Bottleneck Using the Microsoft Server Performance Advisor (SPA) Tool
https://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA1&referringTitle=How%20Tos
- Download Details for Microsoft Service Performance Advisor (SPA)
https://www.microsoft.com/downloads/details.aspx?FamilyID=09115420-8c9d-46b9-a9a5-9bffcd237da2&DisplayLang=en
Written By: Robert Smith
Contributors: Clint Huffman, Jimmy May, and Ken Brumfield
Comments
- Anonymous
January 27, 2014
Great article.
- Anonymous
June 23, 2015
Thanks, Clint. Terrific content as usual. Use Clint's PAL tool for help with analyzing perf data: http://pal.codeplex.com/
- Anonymous
July 16, 2015
Why would you use LogicalDisk and not PhysicalDisk? Specifically if you are monitoring I/O, wouldn't looking at LogicalDisk prevent you from really knowing what's going on?
- Anonymous
September 10, 2015
Hello all,
Recently I came across an issue over a table that was being inserted into quite intensively