NT Server and Disk Subsystem Performance

Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.

Curt Aubley

Chapter 6 from Tuning & Sizing NT Server, published by Prentice Hall PTR

When developing an NT Server solution, the disk subsystem is the portion of the solution that offers the greatest flexibility and that you can most directly influence. It is also one area in which understanding NT Server alone is not enough: the old divide between understanding software and understanding hardware blurs when it comes to the disk subsystem, and the variety of available configurations is enormous. To get a handle on this important topic, we will follow the data flow from the disk drive to the CPU. If you understand the flow of data through the server hardware and NT Server itself, how the disk subsystems work becomes much more apparent. With this knowledge, it is then easier to detect where bottlenecks really are and how to tune and size your NT Server solution.

We will start with a high-level view of the data paths involved in getting data from the disk drive to the CPU. Using this data path as a guide, each of the critical hardware components is reviewed from a performance perspective, illustrating some of the technology's strengths and weaknesses. Once the disk subsystem hardware is discussed, we'll investigate how NT Server takes advantage of it. After these foundations are in place, key sizing and tuning concepts that revolve around the disk subsystem are explored.

On This Page

Disk Subsystem Technology: Following the Data
SCSI Bus Technology from a Practical Perspective
I/O Bus Technology
Redundant Array of Inexpensive Disks (RAID)
Scaling RAID Arrays
How NT Server Uses the Disk Subsystem
NT Server Device Drivers
Sizing NT Server's Disk I/O Subsystem
Disk Storage Capacity versus Disk Performance Capacity
Detecting Disk I/O Bottlenecks
General NT Server Tuning Considerations and Recommendations
SCSI Channel, Disk Subsystem, and Host Bus Adapter Tuning Considerations
RAID Tuning Considerations
Thinking Outside of the Box
Summary
About the Author

Disk Subsystem Technology: Following the Data

Not all of the data needed by an enterprise-level server can be kept in RAM, even with NT Server Enterprise Edition. A fast disk subsystem therefore goes a long way toward providing significant transaction rates and excellent overall server performance. The Standard High Volume (SHV) server logical diagram used in earlier chapters is used here as our reference server architecture. Adding the primary components required to access the disk subsystem, we arrive at the following logical diagram (Figure 6–1).

It is important to understand the different transfer rates of each component of the server's disk subsystem. This information helps you to identify potential bottlenecks that can throttle your server's overall performance. In Figure 6–1 below, data travels from the actual disk drive to the embedded disk controller located on the disk drive unit (<10 Mbytes/sec), up the Ultra Fast/Wide SCSI channel 2 at 40 Mbytes/sec, through PCI Slot 1 on PCI Bus 1 at 133 Mbytes/sec to the Memory subsystem (533 Mbytes/sec), and then transferred to the CPU at a P6 system bus speed of 533 Mbytes/sec.


Figure 6-1: Server data paths

Review each link in the data path. It is the weakest link that throttles the overall disk subsystem performance.

If one component tries to send more data than the next component can handle, there is a bottleneck. A plumbing analogy applies: if the primary pipe carrying water away from your basement is five inches in diameter and five two-inch pipes are feeding water into it faster than it can drain, water will spill out.

By completing a little mathematical word problem, you can avoid bottlenecks even before they begin. For example, placing two 3-channel Ultra Fast and Wide SCSI cards (theoretical aggregate maximum throughput of 2 x [3 x 40 Mbytes/sec] = 240 Mbytes/sec) onto a single PCI bus can overwhelm that PCI bus data link if all of the SCSI channels are active. A single PCI bus can only support a theoretical maximum of 133 Mbytes/sec; jamming 240 Mbytes/sec of data into it just does not work very well. If this configuration were actually implemented, you would have created a bottleneck from the start. Placing each of the 3-channel Ultra Fast and Wide SCSI cards onto its own respective PCI bus spreads the disk I/O activities across 266 Mbytes/sec of total aggregate PCI bus throughput.
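
To make the word problem concrete, here is a minimal sketch of the same arithmetic. The bandwidth figures are the theoretical burst numbers quoted above; real sustained throughput will be lower.

```python
# Rough sanity check: does the planned SCSI channel layout oversubscribe a PCI bus?
PCI_BUS_BW = 133          # Mbytes/sec per 32-bit/33 MHz PCI bus (theoretical burst)
ULTRA_WIDE_SCSI_BW = 40   # Mbytes/sec per Ultra Wide SCSI channel (theoretical)

def check_pci_bus(channels_on_bus: int) -> None:
    aggregate = channels_on_bus * ULTRA_WIDE_SCSI_BW
    status = "potential bottleneck" if aggregate > PCI_BUS_BW else "fits"
    print(f"{channels_on_bus} channels -> {aggregate} vs {PCI_BUS_BW} Mbytes/sec PCI: {status}")

check_pci_bus(6)  # two 3-channel HBAs on one PCI bus: 240 Mbytes/sec -> bottleneck
check_pci_bus(3)  # one 3-channel HBA per PCI bus: 120 Mbytes/sec -> fits
```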

Disk Drive Technology

The disk drive itself is the slowest link of the data path. A disk drive is a series of stacked platters (Figure 6–3) with very small heads that read and write the data to the various platters. These platters rotate or spin at very high speeds, currently up to 10,000 rpm. As disk requests come down from NT Server, the heads move accordingly over the platters to obtain the requested data. The following is an illustration of a disk drive as if you were looking at it from a bird's eye view (Figure 6–2) and from the side (Figure 6–3):

Figure 6-2: Top view of a disk drive

Figure 6-3: Side view of a disk drive

A variety of terms describe the different disk operations. One of a disk vendor's goals is to lower the amount of time required to retrieve data from a disk drive. Table 6.1 defines the most common terms.

Table 6.1 Disk drive technology definitions.

| Disk Drive Related Term | Definition |
| --- | --- |
| Track | A logically defined band of data on the disk platter. Envision tracks as concentric circles on the disk platter (Figure 6-4). |
| Sector | A logical track is divided into smaller chunks of data referred to as sectors (Figure 6-4). |
| Latency | The time required for the disk platter to spin one complete revolution. |
| Average Latency | The time the disk head waits for a sector to arrive as the disk platter rotates. On average, you will wait approximately one half the rotational time of the drive platter for the sector to arrive. |
| Seek Time | The time required for the disk head to move to/from a particular track on the disk platter (Figure 6-4). |
| Average Seek Time | The time required for the disk head to move to/from a particular track on the disk drive platter. On average, this is the time it takes the disk head to travel halfway across the disk platter. |
| Average Access Time | The average time it takes the disk to seek to the required track plus the time it takes for the disk to spin the data under the head (Average Seek Time + Average Latency); see the worked example following this table. |
| Transfer Rate | The speed at which data bits are transferred to/from the internal disk drive buffer. |
| NT Server Transfer Rate | The speed at which data bits are transferred externally from the disk drive through the SCSI data path to the server. |
| Disk Platter Rotation, Revolutions Per Minute (RPM) | The rotational speed of a disk drive's platters on a per-minute basis; how fast the disk platters spin. |
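
To make these definitions concrete, the sketch below derives average latency from rotational speed and combines it with a quoted average seek time. The 7200 RPM and 8 ms figures are illustrative values taken from Table 6.2.

```python
# Average latency is roughly half of one platter revolution; access time adds average seek time.

def average_latency_ms(rpm: float) -> float:
    revolution_ms = 60_000.0 / rpm      # time for one full revolution in milliseconds
    return revolution_ms / 2.0          # on average the head waits half a revolution

def average_access_time_ms(avg_seek_ms: float, rpm: float) -> float:
    return avg_seek_ms + average_latency_ms(rpm)

# Illustrative 7200 RPM drive with an 8 ms average seek time (see Table 6.2):
print(round(average_latency_ms(7200), 2))          # ~4.17 ms
print(round(average_access_time_ms(8, 7200), 2))   # ~12.17 ms
```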

Basic Disk Drive Operation

Each disk drive is a stack of disk platters that are broken into cylinders, tracks, sectors, and clusters. When the disk drive is low-level formatted, logical tracks (logical concentric circles on the disk platters used for locating data on the disk itself) are established and the disk geometry of the disk drive is determined. Low-level formatting is not normally a function available directly from NT Server but is provided by the host bus adapter vendor. For example, the low-level format tool for Adaptec SCSI host bus adapters is available by pressing [CTRL]+A during the server's power on self test (POST), when the Adaptec SCSI adapter is detected by the server. This type of host bus adapter tool is either started during the POST of the server or by booting the server into MS-DOS using a boot floppy and then running the host bus adapter tool. Track numbers start at 0, where track 0 is the outermost track of the disk. These outermost tracks are actually the fastest portion of the disk drive.

This is one of the reasons disk drive manufacturers generally provide throughput in ranges (for example, 3–5 Mbytes/sec) when describing their products. Performance is not really affected until the available disk drive space is less than 30 percent. When 70 percent of the disk drive is used, the remaining free space is located closer to the disk's spindle at the center of the disk drive, which operates at a lower performance level. Each track is broken down into sectors, normally 512 bytes in size. Part of the process completed when you format a disk drive under NT Server using the "format" command or Disk Administrator is that these sectors get grouped into clusters. Clusters are referenced under NT Server as the Allocation Unit Size (ALU) for the file system. Cylinders are the vertical stacking of tracks. Figure 6–4 depicts tracks, sectors, and clusters:

Figure 6-4: Disk spindle, track, sector, and cluster

Disk Drive Performance

How do these disk drive definitions relate to overall performance? When the data requested by an NT Server process is identified on the disk drive, the disk head (which is a small transducer that reads magnetic medium) seeks to the appropriate track. The head then waits for the platter to spin until the proper sector of data is under the disk head. Once the data is under the head, the data is read. The combination of the seek and rotational time associated with obtaining the data is the access time. This data is then transferred to the embedded disk controller that is physically located on the disk drive unit. The embedded controller is the equivalent of a miniaturized computer, as it contains a small CPU and its own memory to obtain and transfer data from the physical disk drive.

The most costly portions of this process are the time it takes for the head to move to the correct track (seek time) and the rotation time of the disk's platters (latency). Cost here is measured in time: the longer the physical activity takes to complete, the longer the process under NT Server must wait for its data. To get an understanding of what contributes to the relative time required to get data from the disk subsystem up to the PCI bus, look at the pie chart in Figure 6–5.

Figure 6-5: Relative size of disk I/O time components for random disk I/O.

Source: Page 266, Brian Wong, Configuration and Capacity Planning for Solaris Servers.

Disk Drive Selection

Now that we have reviewed some of the key technologies used to implement disk drives, how do you take advantage of them? Later in this chapter, we will reference this information when explaining RAID technology and NT Server disk subsystem tuning techniques. For now, use this knowledge when selecting the disk drive technology to implement in your server. Unless you're a vendor, you will not actually be changing the internal disk characteristics. What you can do, though, is select the fastest disk drive possible that fits into your budget. Table 6.2 charts some of the current disk drive technologies found after a quick check on the Internet.

Table 6.2 Disk drive technology specifications.

| Brand | Capacity (GB) | Embedded Disk Controller Buffer Size | RPM | Seek Time (ms) | SCSI Type |
| --- | --- | --- | --- | --- | --- |
| IBM UltraStar | 4.3 | 512 Kbytes | 5400 | 8 | Ultra SCSI |
| Seagate Barracuda | 4.5 | 512 Kbytes | 7200 | 8 | Ultra SCSI |
| Seagate Barracuda | 4.5 | 512 Kbytes | 10,000 | 7.5 | Ultra SCSI |
| Quantum Viking | 4.5 | 512 Kbytes | 7200 | 8 | Ultra SCSI |

Actual NT Server Disk Subsystem Performance

Performance information provided by vendors can be impressive. For some of the drives listed in Table 6.2, the vendors list corresponding sustained data rates of up to 20 Mbytes/sec. Unfortunately, this performance information is not measured for actual NT Server application environments, so mileage will vary. When a disk drive vendor provides a throughput measure for a specific disk drive, they typically quote the burst speed from the embedded disk controller cache to the SCSI bus. This does not take into consideration the most costly portion of the disk activity: the actual physical activities (seek and rotation) associated with obtaining data. Vendors are in business to sell disks, so this reporting practice is accepted and common. Sometimes more relevant data is available. When choosing or comparing disk drive units, look closely at the fine print so that you can determine what is really being reported.

To determine a general rule of thumb about the performance that is actually provided for your environment, you can create a complete server-level stress test, develop your baseline, then change a single component (disk drive or disk array) and determine what effect it has. This technique is illustrated in the Chapter 8 file server case study when one disk array is replaced by another disk array. You can also test components in your NT Server on a per-component basis in a controlled environment. There are various publicly available component-testing tools such as Intel's Galileo. More information about Galileo can be obtained by visiting Intel's website at https://developer.intel.com/design/servers/devtools/iometer. The NT Server Resource Kit also contains a disk-testing tool called Response Probe. The NT Server Resource Kit provides a thorough explanation of the use of Response Probe, but some trade magazines, such as the Windows NT Systems March 1998 issue, provide condensed instructions on its use.

Many of the disk-testing tools that exercise disk I/O do so by setting the appropriate application code flags to bypass the NT Server file system cache and test the physical disk drive directly. Although this technique is useful in pure performance testing environments, I prefer to use tools that test server components in ways that closely represent actual NT Server environments. In most environments, NT Server's file system cache is used. There are times, however, when the file system cache is either not effective because the environment is truly random in nature, or it is defeated due to the sheer size of the files being moved to and from the disk drives. In either case, the file system cache influences the overall throughput achieved. Even applications that do bypass the file system cache implement their own file caching strategies, which produces similar effects.

Disk Subsystem from an Application Perspective

To determine the throughput that a single disk drive can achieve, I employed the Neal Nelson Business Benchmark suite for NT. This test suite is not publicly available, but is a commercial product. If you are interested in this product, you can contact the company directly at (312) 755-1000 (sorry, no external website URL available). Using this suite of tests, various disk loads were placed on an NCR S46 server with 128 Mbytes of RAM, two Pentium Pro CPUs, and three Ultra Fast and Wide SCSI 7200 RPM disk drives. One disk drive was configured with NT Server, one housed the Pagefile, and one drive was used as the workload target. The test suite was run up to a 20-copy workload level which corresponds to 400 users and a disk work file size of 780 Mbytes. At this heavy load, the file system cache on a 128 Mbyte server is defeated, and the actual performance of the disk drive is obtained in a simulated multiuser environment. It is important to note that this is a multiuser/threaded test which is much more realistic of an NT Server environment compared to tests that only exercise the disk subsystem under a single user/thread workload. Table 6.3 shows the results for various disk characteristic environments.

Table 6.3 Physical disk performance.

| Test Characteristics Using 780 Mbyte Work File | Single 7200 RPM Fast/Wide SCSI Disk Drive Throughput as Reported by the Application (Mbytes/sec) |
| --- | --- |
| Sequential Reads of 1024 Byte Records | 2.01 |
| Sequential Writes of 1024 Byte Records | 0.66 |
| Sequential Reads of 8192 Byte Records | 1.93 |
| Pseudo Random Reads of 4096 Byte Records | 0.61 |

The disk vendor rated this particular disk drive at 3.9 Mbytes/sec to 6.6 Mbytes/sec. The throughput measured here is substantially less than the disk specifications provided, but remember that this test measured the disk performance from an application, not pure hardware, perspective. Vendors are not really trying to deceive you; they are just playing the benchmark marketing game and measuring their performance from a pure hardware, not application, perspective. For example, to achieve the 3.9 Mbytes/sec to 6.6 Mbytes/sec, the vendor utilized a block size of 2 Mbytes. Most commercial SCSI host bus adapters default to a block size of 512 Kbytes, a big difference.

The throughput levels obtained when completing disk activities are based on your users' access patterns and application environment. How efficiently the application uses disk I/O and how it implements block and buffer sizes directly influences the amount of throughput possible. If an application utilizes very large block and buffer sizes, much higher throughput levels can be achieved. Table 6.3 provides good information that we will utilize later in this chapter when detecting bottlenecks, tuning, and sizing NT Server's disk subsystem. Also, be careful to distinguish single-user from multiuser environments: multiuser environments are much more difficult to support and subsequently much more stressful on NT Server's disk subsystem.
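
A rough sketch of how block and buffer size influence measured throughput is shown below: it times sequential reads of the same file using different buffer sizes. The file path is hypothetical, and the file must be large enough to defeat the file system cache if you want to approximate raw disk behavior.

```python
# Minimal sketch: time sequential reads of one file with several buffer sizes.
import time

def measure_read_throughput(path: str, buffer_size: int) -> float:
    """Return Mbytes/sec achieved reading the whole file with the given buffer size."""
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.time() - start
    return (total / (1024 * 1024)) / elapsed if elapsed > 0 else 0.0

# Hypothetical work file; larger read sizes generally report higher throughput.
for size in (1024, 8192, 65536):
    rate = measure_read_throughput(r"D:\testdata\workfile.dat", size)
    print(size, "byte reads:", round(rate, 2), "Mbytes/sec")
```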

Quick Disk and SCSI Bus Sizing Example

If, for example, your environment is sequentially read intensive, a single Ultra Wide SCSI channel (40 Mbytes/sec) could support a configuration of up to 15–20 SCSI disk drives (30–40 Mbytes/sec aggregate). Of course, your mileage may vary, but this is a good sizing starting point. If the server application and workload mix is very efficient, this may be too many disk drives for a single SCSI channel. Conversely, if the application and workload mix is more random in nature or less efficient, additional disk drives per SCSI channel can be configured.

Disk Workload Supported

Throughput is only one measure of disk drive performance. The number of disk I/O operations that a disk drive can support is also very important. An I/O request is a general term that refers to a read or write request. As the number of I/Os per second increases, the disk drive provides higher and higher throughput until the Logical Disk: % Disk Time counter increases beyond the 60–80 percent range. % Disk Time is the percentage of elapsed time that the disk drive is busy servicing read or write requests. Above 80 percent Disk Time, the disk drive's response time can begin to degrade beyond acceptable levels. The degradation experienced is an increase in the amount of time an NT Server process must wait for a disk operation to complete.

What actually occurs is that the disk drive itself can only support a certain level of disk activity, as defined by the number of I/O operations per second. As the amount of time the disk drive is busy increases, so do the I/O operations per second and the throughput. This sounds great: your throughput is increasing and more data is being retrieved from or sent to the disk drive. Unfortunately, it takes longer to service those disk I/O requests as the disk gets busy. More work is completed (throughput), but as the disk drive's utilization increases past 80 percent, the response time back to the requesting process falls off. Depending on which vendor you approach and your environment, Table 6.4 outlines the number of I/Os per second a single disk drive can support. You can determine the I/Os per second by looking at Perfmon's Logical Disk counter Disk Transfers/sec. To determine how long the I/O operations are taking to complete, review Avg. Disk sec/Read, Avg. Disk sec/Write, and Avg. Disk sec/Transfer. Take note that the amount of work a single disk drive can support does not noticeably increase as disk capacity increases. A quick sketch following Table 6.4 shows how these counters and the table's guidelines can be combined.

Table 6.4 Disk workload chart.

| Disk Characteristics | 4 GB Fast/Wide 7200 RPM SCSI Disk Drive Supported I/Os per Second |
| --- | --- |
| Sequential Writes (typified by log files) | 150–190 |
| Random Read/Writes (typified by multiuser databases) | 50–80 |
| Mixed Environment Average | 100 |
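
The sketch below is one way to combine the Table 6.4 guidelines with the Perfmon counters just mentioned to flag an overworked drive. The counter values shown are illustrative, and the 20 ms response-time threshold is an assumption, not a figure from the text.

```python
# Rough overload check for a single drive using the Table 6.4 guidelines.

RANDOM_IOPS_LIMIT = 80        # Table 6.4: random read/write ceiling per drive
SEQUENTIAL_IOPS_LIMIT = 190   # Table 6.4: sequential write ceiling per drive

def disk_overloaded(transfers_per_sec: float, avg_sec_per_transfer: float,
                    workload: str = "random") -> bool:
    limit = RANDOM_IOPS_LIMIT if workload == "random" else SEQUENTIAL_IOPS_LIMIT
    # Flag the drive if it is past its workload ceiling or responses exceed ~20 ms (assumed threshold).
    return transfers_per_sec > limit or avg_sec_per_transfer > 0.020

print(disk_overloaded(65, 0.011))   # False: within the 50-80 random range, ~11 ms responses
print(disk_overloaded(95, 0.067))   # True: past 80 I/Os per second, ~67 ms responses
```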

Figure 6–6 was obtained from a server as the user load was increased in blocks of 50 users against the same disk I/O subsystem, which consisted of thirty 4-GB disk drives. Measurements were taken of the various disk utilization levels as the user loads increased. Note how the amount of time a process's disk request must stay in the disk queue before receiving service increased.

This graph illustrates that when the disk is under increased workload levels, noted by %Disk Time, the time disk requests must wait in the disk queue increases drastically. What was interesting in this specific NT Server environment was that the throughput provided increased along with the Average Time in Queue for the disk requests. So there is a tradeoff: response time in essence degrades, but throughput increases. What is best? It depends on your environment.

The %Disk Time counter under Perfmon can be deceiving. A high % disk time alone is not a true qualifier of an overworked disk drive in an NT Server environment. Table 6–4 also helps to provide insight into determining when a disk drive is overworked. The Perfmon data in Figure 6–7 was collected and graphed from another NT Server environment.


Figure 6-6: Disk subsystem average queue time.


Figure 6-7: Single disk workload graph.

Grasping how long a millisecond is does not come easily for everyone (including me), so Avg. Disk sec/Transfer is scaled for this graph (Figure 6–7) by a factor of 1000. The %Disk Time counter reaches a maximum of 100 percent and plateaus. When %Disk Time grew above 80 percent, the I/Os per second shown by Disk Transfers/sec grew to 65 and Avg. Disk sec/Transfer began to increase. The disk drive at this point began to get overloaded. The workload for this environment was random in nature; thus, a single disk drive can support roughly 50-80 I/O operations per second. When Disk Transfers/sec is lower than 80, the response time stays below a normalized value of 11 (11 ms), which is acceptable in most environments. But as the I/Os per second (workload) increase beyond what the single disk drive can support, the response time becomes unacceptable. In fact, as the I/O workload moves past 80 I/Os per second, the disk drive saturation is so great that the response time from the disk increases six times! Imagine a process having to wait a normalized 67 (67 ms) for a disk access. That is an eternity from the CPU's perspective. Follow the sizing and tuning recommendations later in this chapter to avoid this overworked disk situation.

SCSI Bus Technology from a Practical Perspective

SCSI stands for Small Computer System Interface. It is a peripheral bus and the predominant technology used to connect disk devices to servers today. Other technologies, once on the horizon, are now beginning to show themselves in the marketplace, but none is as prevalent as SCSI. In fact, according to Greg Shultz's article in a recent SysAdmin column (March 1998), SCSI is predicted to be the dominant storage interface until 1999, when Fibre Channel should start gathering more momentum. (I will briefly touch on Fibre Channel later in this chapter.) SCSI buses are available in a variety of technology standards.

One of the advantages of SCSI is that the intelligence of the bus is actually spread across the peripheral devices. Each SCSI device, such as a SCSI disk drive, normally has an embedded controller built into it. SCSI has various electrical standards associated with it as well as a standard protocol, which allows a host bus adapter to communicate with the devices on the bus. Intelligent SCSI devices receive SCSI commands and can disconnect from the SCSI bus while completing the requested action, such as obtaining a block of data from a disk array. In this way, other devices are not kept waiting to access the bus while one device makes the calculations necessary to complete its request. Modern SCSI bus technology (beyond SCSI-2) allows these SCSI commands to be queued so that the embedded controller can decide which commands to execute first for optimum performance.

As with any protocol implementation, the protocol adds some overhead to the bus for each transaction. When smaller Allocation Unit (ALU) sizes are in use by NT Server, more of the available SCSI bandwidth is used for SCSI protocol overhead. Conversely, when larger ALU sizes are implemented under NT Server, less SCSI bandwidth is sacrificed to SCSI command overhead; thus, more bandwidth is available for actual data. The larger the block size used for data transfer, the higher the overall throughput that can be achieved. Due to this SCSI command protocol overhead, a general rule of thumb is not to expect more than 70-90 percent of the theoretical SCSI I/O channel bandwidth to actually be available for data. Table 6.5 charts SCSI technologies.

Table 6.5 SCSI Technology Chart.

| SCSI Interface | Theoretical Maximum Bandwidth (Mbytes/sec) | Practical Bandwidth (Mbytes/sec) | Bus Width (Bits) | Maximum Number of SCSI Devices | Maximum Bus Length, Single Ended (Meters) | Maximum Bus Length, Differential (Meters) |
| --- | --- | --- | --- | --- | --- | --- |
| SCSI-1 | 5 | 4 | 8 | 8 | 6 | 25 |
| Fast Narrow SCSI-2 | 10 | 8 | 8 | 8 | 6 | 25 |
| Fast Wide SCSI-2 | 20 | 16 | 16 | 16 | 6 | 25 |
| Ultra SCSI-3 | 20 | 16 | 8 | 8 | 1.5 | 25 |
| Ultra SCSI-3 | 20 | 16 | 8 | 4 | 3 | NA |
| Ultra Wide SCSI-3 | 40 | 32 | 16 | 16 | NA | 25 |
| Ultra Wide SCSI-3 | 40 | 32 | 16 | 8 | 1.5 | NA |
| Ultra Wide SCSI-3 | 40 | 32 | 16 | 4 | 3 | NA |
| Ultra2 SCSI Fast 40 | 40 | 32 | 8 | 2 | NA | 25 |
| Ultra2 SCSI Fast 40 | 40 | 32 | 8 | 8 | NA | 12 |
| Wide Ultra2 SCSI Fast 40 | 80 | 64 | 16 | 16 | NA | 12 |
| Fibre-Channel FC-AL (SCSI_FCP) | 100 | N/A | 16 | 126 | 30 (copper) | 2,000-10,000 |

Primary data from SysAdmin Journal, February 1998, Greg Shultz, p. 8

Implementing SCSI Technology

So which SCSI technology is best for you? Fortunately, SCSI technology auto-negotiates the protocol, and therefore the subsequent speed, for devices on the SCSI bus. This is good and bad. The auto-negotiation lowers the number of problems encountered when migrating or updating your disk technology. Unfortunately, you can't have it all in the SCSI world. If the SCSI bus you are implementing has a combination of Fast Wide SCSI-2 devices and Ultra Wide SCSI-3 devices, you are limited to running the SCSI bus at the lowest common speed, which, in this case, would be Fast Wide SCSI-2. If you try to run a SCSI channel at Ultra SCSI speeds (normally set with the HBA configuration software, not via NT Server) with a staggered set of disk technologies, SCSI timeouts will occur due to the variances in the impedance of the Ultra SCSI and non-Ultra SCSI devices. These SCSI timeouts are routed to NT Server's event log, to the HBA-specific monitoring software under NT Server, or to both.

Is there that much discrepancy between Ultra Wide SCSI-3 operating at 40 Mbytes/sec and Fast Wide SCSI-2 operating at 20 Mbytes/sec? The primary difference is the number of disk drives that each SCSI bus can support before becoming saturated. If you placed a single disk drive on either bus in a typical NT Server environment, you would not notice a significant performance difference. Why? As observed in Figure 6–5, "Relative Size of Disk I/O Time Components for Random Disk I/O," most of the time associated with getting data from the disk drive is due to the disk seek movements and the latency incurred from platter rotation. Initial tests show that transfers using Ultra Wide SCSI-3 are between 0.1 to 0.3 milliseconds faster than Fast Wide SCSI-2. This is not quite enough to rip out your current Fast Wide SCSI-2 investment and replace it with Ultra Wide SCSI-3. However, the price of Ultra Wide SCSI-3 is quite competitive with Fast Wide SCSI-2, so if you are planning a new disk subsystem, the slightly faster Ultra Wide SCSI-3 solution is easily justified.

What Ultra Wide SCSI-3 does bring is the capability of supporting more disk drives on a single SCSI bus than was previously possible. If you expect 2 Mbytes/sec of throughput from each disk drive, then you can now configure 20 disk drives per SCSI bus channel versus a previous maximum of 10. These disk-drive-to-SCSI-bus ratios are based on the initial best-case throughput estimate of 2 Mbytes/sec. To support high numbers of disk drives on a single SCSI channel, intelligent RAID devices are required.

Determining SCSI Bus Throughput Under NT Server

How can you tell how many disk drives to configure per SCSI bus? First, understand your server environment. NT Server does not instinctively throw up a red flag and alert you to the fact that your SCSI channel is saturated. A good rule of thumb is to estimate 2 Mbytes/sec of throughput per disk drive for general sequential environments and 0.6 Mbytes/sec per disk drive for random disk I/O environments. For example, if you are implementing a single Fast Wide SCSI-2 channel that will support a random disk I/O environment, you can configure up to 27 disk drives per Fast Wide SCSI-2 channel as an initial NT Server disk subsystem sizing estimate.
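
A minimal sketch of this rule of thumb follows. The per-drive figures are those from the text; the 80 percent factor used to derive the channel's practical bandwidth is an assumption consistent with the practical column of Table 6.5.

```python
# Initial sizing estimate: how many drives can share one SCSI channel before it saturates?

def drives_per_channel(channel_theoretical_mb: float, per_drive_mb: float,
                       practical_factor: float = 0.8) -> int:
    usable = channel_theoretical_mb * practical_factor   # practical channel bandwidth
    return int(usable // per_drive_mb)

print(drives_per_channel(20, 0.6))   # Fast Wide SCSI-2, random workload -> about 26-27 drives
print(drives_per_channel(40, 2.0))   # Ultra Wide SCSI-3, sequential workload -> about 16 drives
```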

How can you be sure that you have not overconfigured or underconfigured the Fast Wide SCSI channel? Proactively monitor your server's performance during regular business operations. One technique to determine the throughput of the SCSI bus under NT Server is to use Perfmon's Logical Disk object's Disk Bytes/sec counter. This counter reports the rate at which bytes are transferred to or from the disk during write or read operations. Using Perfmon, select LogicalDisk: Disk Bytes/sec and then select all of the disk drives connected to the SCSI channel you are investigating. You will need to understand the physical hardware server connections to correlate which disk drives are on which SCSI bus. Then, sum the average, minimum, and maximum values reported under Disk Bytes/sec for each of the disk drives on a particular SCSI channel by hand, or export the chart to a spreadsheet file and then sum the results. Once you get the final result, compare this value to the theoretical and practical SCSI throughput values shown in the SCSI technology chart (Table 6.5). If the value of the summed Disk Bytes/sec is within 20 percent of the practical maximum throughput of the bus, consider adding another SCSI channel. If the summed values are much less than the practical SCSI throughput shown in the SCSI technology chart, you have room to add more drives to the current SCSI channel.
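
One way to automate the summation step is sketched below: it averages the exported LogicalDisk\Disk Bytes/sec columns for the drives on one channel and compares the sum against the channel's practical bandwidth. The file name and column names are assumptions about how your Perfmon chart was exported.

```python
# Minimal sketch: sum average Disk Bytes/sec per drive from a Perfmon CSV export.
import csv

PRACTICAL_CHANNEL_BW = 16 * 1024 * 1024   # Fast Wide SCSI-2 practical bandwidth, bytes/sec

def summed_disk_bytes_per_sec(csv_path, drive_columns):
    totals, rows = {c: 0.0 for c in drive_columns}, 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for c in drive_columns:
                totals[c] += float(row[c])
    # Average each drive's column over time, then sum the drives on the channel.
    return sum(totals[c] / rows for c in drive_columns) if rows else 0.0

total = summed_disk_bytes_per_sec("perfmon_export.csv",
                                  ["E: Disk Bytes/sec", "F: Disk Bytes/sec"])  # hypothetical columns
if total > 0.8 * PRACTICAL_CHANNEL_BW:
    print("Within 20 percent of the channel's practical bandwidth: consider another SCSI channel.")
```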

FibreChannel

FibreChannel is an up-and-coming technology poised to eventually replace SCSI as the disk subsystem interconnect of choice. FibreChannel is based on ANSI standards, similar to SCSI, and is the next step in the disk I/O connectivity evolution. Parallel SCSI will eventually reach its electrical limitations, slowing its continued performance enhancements. Of course, every time you hear this, someone at a place like Bell Labs comes up with a creative solution to overcome accepted electrical characteristics. Unless that happens, FibreChannel brings new options to the disk subsystem realm.

FibreChannel's primary strength lies in the fact that it can be implemented over fiber optics and is serial in nature, as opposed to the parallel nature of SCSI. With the use of fiber optics, FibreChannel provides increased throughput, fewer distance limitations, and more topology implementation options than traditional SCSI technologies. Currently, throughput for a FibreChannel bus ranges up to 100 Mbytes/sec. When implementing FibreChannel, various topologies can be used, which encompass arbitrated loop (most common), point-to-point, and switched implementations. Where differential SCSI is limited to 25 meters from the server to the disk drive, the fiber optic implementation of FibreChannel allows for distances up to 10,000 meters. This increased distance allows for more flexibility in planning for disasters and overall system architectures.

Even with the new features and possibilities that FibreChannel brings, the same concepts apply that we used when reviewing SCSI technologies. The primary performance hindrance of a disk drive attached to a FibreChannel bus is still the physical limitation of the disk drive itself. With FibreChannel, more disk drives per FibreChannel channel can be supported due to the higher available throughput.

FibreChannel is initially being offered as a host bus adapter to attach external disk drives with little intelligence in the host bus adapter. Without any inherent intelligence, FibreChannel host bus adapters are not currently providing much in the way of hardware RAID support. This limits the current viability of this technology in the midrange NT Server space. In the higher end of the NT Server space, FibreChannel host bus adapters are more of a possibility when paired with an intelligent external disk array device. In this configuration, the intelligent external disk array device provides the hardware RAID implementation and the FibreChannel host bus adapter provides the connectivity back into the server. As the FibreChannel technology matures and more products become available, this technology will become more of a viable option in the price sensitive NT Server Super Server product space.

Host Bus Adapters

The next step in the data path is to get the requested data from the SCSI bus onto the PCI bus. The HBA performs this operation. HBAs provide a range of functions from basic SCSI connectivity to complex RAID support. There are four primary distinguishable features that separate the various HBAs available today: the number and type of SCSI channels supported, the amount of server CPU overhead introduced, RAID support level, and I/O workload supported.

The channel density that a SCSI HBA supports impacts the number of disk devices that NT Server can support and how they are configured in the server. Today, one HBA can support from one to four onboard SCSI channels. This allows for the conservation of those precious PCI slots available on the server. Utilizing these higher-density HBAs, an SHV-based server can support up to five HBAs for a total of 20 SCSI channels. Whether configuring that many SCSI channels in a single SHV-based NT Server is a good idea depends on your environment. If you do implement high numbers of HBAs in a single Intel-based SHV server, you will most likely run into a shortage of interrupts. Each server vendor implements the setup of interrupts in a slightly different manner, so I will refer you to your vendor's server documentation for the steps to follow for interrupt control.

Server Interrupt Configuration Strategy

Regardless of how you set the interrupts, if you must share interrupts between devices configured in your server, try to share the interrupts between devices that use the same NT device driver. This helps to improve overall performance. The concept of plug-and-play PCI takes a backseat when configuring high-end NT Servers that require sharable PCI interrupts to support higher numbers of peripheral devices. NT Server does provide a tool, Windows NT Diagnostics (found under Start|Programs|Administrative Tools|Windows NT Diagnostics), that helps to determine which interrupts are associated with which devices. There is a wealth of low-level NT Server information provided by this tool. If you have not already investigated this tool, I urge you to take a moment to see what it has to offer. This tool can also be run from the command line on local or remote NT Servers and is an easy technique for getting a quick inventory of an NT Server. Winmsd.exe is the command-line version of Windows NT Diagnostics. Run the command Winmsd.exe /? from the command line and a pop-up window with the available options is shown. Another, more flexible version of winmsd.exe is winmsdp.exe. Figure 6–8 shows a screen shot of the interrupt information provided by Windows NT Diagnostics for a small NT Server.


Figure 6-8: Using Windows NT Diagnostics to determine interrupt levels for devices.

Host Bus Adapter and CPU Power

Each HBA interrupts NT Server when an operation requires server CPU intervention. The amount of CPU cycles required varies between HBAs. The CPU time required to service an HBA is considered as the overhead the HBA places onto the server. This overhead is actually good in one sense because it is used to get data from the disk drives to the CPU so that productive work can be completed. Other I/O devices, such as network interface cards, also introduce this CPU interrupt overhead. It is important to be aware that CPU cycles are required to drive the I/O subsystem. For smaller servers that are configured with small disk subsystems, the CPU overhead introduced by the disk arrays is not significant enough to worry about. As NT Server solutions get larger and larger, the amount of CPU cycles required to drive the disk subsystem becomes more of an influencing factor. If the server application requires a great deal of computational power and there is a large amount of disk I/O activity, CPU power must be planned for to support both.

Various vendor specification sheets for four- or eight-CPU Pentium Pro servers state that multiple terabytes of disk can be connected to one server. One vendor stated support for 7 terabytes. From a connectivity perspective this is possible, but the server would never be able to drive the disk subsystem if all of the disks and applications were active. Somewhere a balance must be found. If you examine the TPC-C benchmarks closely, for example, you will notice that for the larger two- to four-CPU NT Server benchmarks, 300 to 400 GB of disk drive space on average is configured. In these benchmark environments, vendors wish to get the highest performance rating at the lowest cost. For this reason, disks are normally not configured into the solution unless they are actually in use. If you match the server hardware you have obtained against a published industry benchmark, such as TPC-C, that emulates your environment, you can gather some insight into how many active disk drives your server can realistically support.

These test results are subject to many factors such as NT Server version, database version, and numbers of clients, but still provide a rough estimate of the relative sizing between CPUs, SCSI channels, and disk drives in a client server database environment. Table 6.6 provides some insight to the CPU power and disk subsystem configuration used for some TPC-C benchmarks.

Table 6.6 Utilizing TPC-C results to determine relative overall server performance.

| Number & Type of CPUs | Memory | Number of HBAs | Number and Type of Disk Drives | Amount of Disk Space Configured (approx.) | Disk Drives per SCSI Channel |
| --- | --- | --- | --- | --- | --- |
| 2 Pentium Pro 200 MHz/512 K | 2 GB | 5 2-channel | 79 4.3-GB, 1 9.1-GB | 348 GB | 9 |
| 2 Pentium Pro 200 MHz/512 K | 2 GB | 6 2-channel | 79 4.3-GB, 4 9.1-GB | 348 GB | 8 |
| 2 Pentium Pro 200 MHz/512 K | 1 GB | 4 3-channel | 72 4.3-GB, 1 9.1-GB | 318 GB | 6 |
| 4 Pentium Pro 200 MHz/512 K | 2 GB | 3 4-channel | 114 4.3-GB | 490 GB | 10 |
| 8 Pentium Pro 200 MHz/1 MB | 4 GB | 2 2-channel, 7 1-channel | 2 4.3-GB, 200 9.1-GB | 1800 GB | 20 |

HBA and Disk Drive Workload Sizing Example

There was more than just throughput to consider when reviewing a disk drive's performance, and there is more to consider than just the number of SCSI channels that an HBA can support. When sizing the number of HBAs required, consider the theoretical I/Os per second that they can support as well. For example, if you are configuring 10 disk drives for a sequentially intensive disk subsystem environment, do the associated math. Each 7200-RPM SCSI disk drive in a sequential environment should be able to support up to 190 disk I/O operations per second and 2 Mbytes/sec of throughput. A 10-disk-drive implementation would therefore require an HBA or HBAs that can support 1,900 I/Os per second and 20 Mbytes/sec of aggregate throughput. To fulfill this requirement, a 2-channel Ultra Wide SCSI-3 Mylex DAC960PJ HBA could be used. It meets the throughput requirement (40 Mbytes/sec per channel), allows room for growth by providing a second usable SCSI channel, and has a workload rating of 2,400 I/Os per second, which also provides room for growth. You can obtain this type of performance data from the server or HBA vendor's website, or by contacting them directly. The information used in this example was downloaded directly from the Mylex website (https://www.mylex.com).
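
The same sizing check, written out as a short sketch; the per-drive and HBA figures are those quoted in the example above.

```python
# Do 10 sequential-workload drives fit within one HBA's I/O and throughput ratings?

DRIVES = 10
IOPS_PER_DRIVE = 190          # sequential ceiling per 7200 RPM drive (Table 6.4)
MB_PER_DRIVE = 2.0            # rule-of-thumb sequential throughput per drive

required_iops = DRIVES * IOPS_PER_DRIVE   # 1,900 I/Os per second
required_mb = DRIVES * MB_PER_DRIVE       # 20 Mbytes/sec

HBA_IOPS_RATING = 2400                    # workload rating quoted for the Mylex DAC960PJ
HBA_MB_PER_CHANNEL = 40                   # Ultra Wide SCSI-3 channel

print(required_iops <= HBA_IOPS_RATING)   # True: the workload rating has headroom
print(required_mb <= HBA_MB_PER_CHANNEL)  # True: the load fits on a single channel
```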

I/O Bus Technology

From the peripheral adapter (HBA, NIC, etc.), data is passed to the Peripheral Component Interconnect (PCI) based I/O bus on its way to the system bus. The two PCI buses illustrated in the logical SHV diagram (Figure 6–1) each provide 133 Mbytes/sec (32-bit bus operating at 33 MHz) of burst bandwidth and 80 Mbytes/sec of sustained throughput. A PCI slot is not the same as a PCI bus. This distinction is often not clear in vendor literature, particularly for smaller 2-CPU servers. If there are two distinct peer PCI buses, 266 Mbytes/sec of aggregate I/O throughput is available to the server. The number of PCI slots per PCI bus is based on the number of electrical loads the PCI bus can support. On average, between three and five PCI slots are available per PCI bus. Occasionally, the PCI devices connected to a specific PCI bus are somewhat hidden, as they are connected to the PCI bus via a direct connection on the motherboard, not through a PCI slot.

To increase the number of PCI slots available on a single PCI bus, vendors can implement a PCI bridge. What a PCI bridge provides is an electrical extension of a single PCI bus. This increases the number of PCI slots available, but does not increase the bandwidth provided. On a single PCI bus based server that utilizes a PCI bridge to extend the PCI bus, the I/O throughput is still limited to 133 Mbytes/sec.

I/O Bus Importance

As larger NT Server solutions are developed to support higher numbers of concurrent users, the I/O bus is a common physical server resource that becomes saturated. To overcome the basic SHV limitation of providing only two peer PCI buses, higher-end NT Servers are now available that provide more than two peer PCI I/O buses. Depending on the vendor, servers with up to 4 peer PCI buses are now available, providing 532 Mbytes/sec of aggregate I/O throughput. With additional peer PCI buses, more PCI slots also become available. For example, the NCR 4380 supports 14 PCI slots spread across 4 peer PCI buses.

Placing 10 Ultra Wide SCSI-3 channels (spread across multichannel host bus adapters) across the 4 peer PCI buses in an NCR 4380 provides 400 Mbytes/sec of available SCSI throughput. This 400 Mbytes/sec of potential SCSI throughput feeds 532 Mbytes/sec of total aggregate I/O subsystem throughput; thus there are no I/O bottlenecks in this portion of the data path. These super NT Servers are configurable with up to 8 Pentium Pro CPUs. Although CPUs are normally the center of attention, it is the increased I/O throughput that provides the real advantage for databases and other I/O intense server applications. Vendors currently shipping their own 8-CPU NT Server designs that support 4 peer PCI buses are Axil Computer Corporation, a division of Hyundai, and NCR Corporation, based in Dayton, OH (https://www.NCR.com).

I/O Bus Selection

From a legacy perspective, Intel-based servers have supported numerous system bus and I/O bus technologies throughout the years. Some environments still require the use of legacy peripheral adapters such as those based on the Industry Standard Architecture (ISA, 8 Mbytes/sec), Extended Industry Standard Architecture (EISA, 33 Mbytes/sec), and Micro Channel (MCA, 80 Mbytes/sec) I/O buses. EISA and ISA bus support is still designed into many PCI-based servers today. The SHV-based server provides a bridge to an EISA I/O bus, which provides several slots for EISA/ISA-based I/O cards. Micro Channel based servers are still available from a handful of vendors, but are rarely being produced with the most recent Intel or DEC CPU architectures supported by NT Server.

Avoid using any legacy-bus-based I/O cards. Unless you have no other choice for your environment, legacy I/O cards (ISA, EISA, and MCA) should never be used in your enterprise-class NT Servers. These older I/O cards do not provide nearly the same level of performance as modern 32-bit PCI cards that support bus mastering, direct DMA, and PCI burst speeds (133 Mbytes/sec).

Next Generation PCI I/O Bus

The current PCI bus employs a 32-bit data path operating at 33 MHz. The next generation PCI bus is based on a 64-bit data path operating at 66 MHz and supports the ability to hot swap PCI adapter cards. To accommodate this next generation of I/O bus, both NT Server and the server hardware must be capable of taking advantage of the new I/O design. NT Server 5.0 will support the next generation I/O bus, and there will most likely be a service pack update for NT Server 4.0 to support it as well. This, of course, is subject to change by Microsoft. What if Microsoft does not support the new I/O bus technology in previous versions of NT Server? There will have to be enough vital Microsoft customers clamoring for support of the new I/O bus technology if they are to influence Microsoft's decision. From a server hardware perspective, current servers that support the 32-bit PCI I/O bus will require a motherboard swap or an incredibly creative solution (or a kludgey one, depending on how you look at it) to support the forthcoming 64-bit PCI bus.
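
A quick sketch of where the PCI bandwidth figures come from: bus width in bytes multiplied by clock rate gives the burst bandwidth.

```python
# Burst bandwidth of a PCI bus from its width and clock (approximate arithmetic).
def pci_burst_mb_per_sec(bus_width_bits: int, clock_mhz: float) -> float:
    return (bus_width_bits / 8) * clock_mhz   # Mbytes/sec

print(round(pci_burst_mb_per_sec(32, 33.33)))   # ~133 Mbytes/sec: current 32-bit/33 MHz PCI
print(round(pci_burst_mb_per_sec(64, 66.66)))   # ~533 Mbytes/sec: next generation 64-bit/66 MHz PCI
```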

Next Generation Server I/O Technology: Intelligent I/O Initiative (I2O)

There are two primary, limiting factors when working with I/O operations besides the physical device. Software is actually the first area that slows down I/O operations. There are many operating system layers and device driver layers that slow down I/O operations before they ever reach the actual hardware. Secondly, even with bus mastering technology which allows PCI devices to detach from the bus to perform operations, peripheral devices still rely on a high number of CPU interrupts to complete the requested I/O tasks. This software and hardware combination lowers the overall scalability of the server.

To overcome the current server limitations, the Intelligent I/O Initiative (I2O) special interest group was formed. Although the I2O standard is not fully completed, many vendors are already endorsing it. I2O technology attempts to improve overall I/O performance. The first major change is in the operating system software architecture. In the I2O software architecture, the driver is split. The Operating System Module portion of the driver handles I/O interaction with the operating system, thereby controlling I/O subsystem access to the CPU. The Hardware Device Module is the other part of the I2O subsystem; it manages how hardware controllers interact with I2O-compatible I/O devices to gain access to I/O services. The goal of this new design is to lower the overhead associated with I/O operations traversing the operating system hierarchy to the device and to standardize device driver development and management.

Another benefit of I2O technology is the advent of truly intelligent peripherals. These I2O based peripheral devices will be able to operate without constantly interrupting the CPU for their operations. Many of these devices will include their own CPU and memory subsystems, which offload the actual device activities from the server's CPU. This can improve the performance of the peripheral card, lower the traffic on the system bus, and lower the amount of CPU power required for I/O operations. It allows the server's CPU to focus its attention on more productive application and user related tasks and improves overall server throughput by lowering system bus contention. I2O Disk I/O devices will still need to interrupt the CPU, not for their internal operations, but when the requested data is available for transfer up the data path.

NT Server 4.0 does not currently take advantage of I2O technology unless either Microsoft or other vendors implement the new software architecture to interact with I2O-ready devices. Until a service pack or the next release of NT Server arrives, this technology is only on the horizon. The server itself must also be aware of the I2O technology and be equipped with an intelligent PCI-based I2O device. Most of the top-tier vendors have already pledged their support for this new technology. From a hardware perspective, for the server to be I2O ready, the server's BIOS must be ready to handle the new peripheral devices or be capable of being upgraded when the technology is available. For the NT Server marketplace, I2O technology will remain on the horizon until the combination of NT Server, server hardware, intelligent I2O-aware peripheral devices, and their associated drivers becomes available. Why be concerned with this technology? When obtaining new servers, you should ensure that they support I2O technology either now or in the future. Adding I2O technology to currently deployed servers shows the promise of improving their I/O subsystems and overall performance levels.

The Server's System Bus

The system bus is the final data highway that leads to both the memory and CPU subsystems. The SHV-based reference server's system bus is 64 bits wide and supports a bandwidth of 533 Mbytes/sec. (The system bus was reviewed in detail in Chapter 4.) It should be very clear that the server's system bus is the backbone of any NT Server. It is startling when vendor literature refers to the system bus as the PCI bus; it makes one wonder which bus is actually meant.

Next Generation Server System Bus

The next generation of standard high volume server bus is currently being touted as operating at 100 MHz. This technology, utilized in conjunction with a new generation of CPUs and memory technology, will improve overall NT Server performance. A new system bus for the server's motherboard is not a trivial plug-in upgrade. For a server to be upgraded to the new system bus, the server's motherboard must be completely replaced, and the power and cooling systems must be capable of accommodating the new electrical loads. By the time the motherboard is replaced, it might be less expensive to purchase a new server with the latest technology and simply redeploy the older technology.

Redundant Array of Inexpensive Disks (RAID)

It should be obvious from following the data path through an SHV based server that a single disk drive cannot provide enough performance capacity to meet the performance requirements of an enterprise NT Server or even begin to tax the server's I/O subsystem. RAID allows for the grouping of multiple smaller, or inexpensive, disk drives into larger logical disk devices. Adding RAID technology to your server's disk I/O subsystem is becoming a more common choice for enterprise servers to increase performance, disk management, and availability. RAID is a particularly excellent choice if multiple disk drives are available and it is not possible to break up the data files across separate disk drives to balance the use of the disk subsystem. Implementing a RAID array, you can place data files onto a single logical drive under NT Server that takes advantage of the multiple drives that compose the actual RAID array. NT Server allows for the use of both software and hardware based RAID solutions.

The following outlines the performance and fault tolerance tradeoffs of the various RAID levels.

RAID 0–Disk Striping

RAID level 0 stripes the disk activity across two or more disk drives. This logical layout provides the advantages of better performance for read, write, random, and sequential environments. RAID 0 also provides direct disk capacity improvement. When two disk drives are configured in a RAID 0 array, the total available disk space is the sum of the capacity of each individual disk drive in the array. If three 9 GB disk drives are configured as a RAID 0 array, there is 27 GB of user available data space.

For optimum performance and efficient capacity usage, the disk drives used in the array should always have the same performance characteristics and capacity. When a RAID 0 array is implemented, the average seek time required to access the data is lowered when compared to that of a single disk drive. This is one of the ways in which a RAID 0 or any of the RAID levels improves performance. The tradeoff for using RAID 0 is it provides no fault tolerance. If you lose one drive in your RAID 0 disk array, you will lose the data for the entire array. In a RAID 0 environment, increasing the number of drives in the array improves the overall I/O performance.

RAID 1–Disk Mirroring

RAID level 1 mirrors the disk activity across two or more disk drives. This logical layout provides for better read performance, especially in multiuser environments, but lower performance in write intensive environments. When data is read from a RAID 1 mirror, it does not matter which disk drive in the array provides the data; each individual drive can perform simultaneous read operations, thus the read operations are typically dispatched across the different drives in the array. Once the data is provided by any disk drive in the array, the host bus adapter considers it valid and continues on.

Writing to the RAID 1 array works slightly differently. Each write must be sent to each drive in the RAID 1 mirror so that all data is kept identical. Because of this, the HBA must wait until all drives in the array acknowledge a successful write before moving on. This contributes to lower write performance when compared to a single disk drive configuration.

RAID level 1 provides complete data redundancy, even if you are using only two disks. The tradeoff for this high level of fault tolerance is that the capacity of a RAID 1 two-disk mirror is lowered by 50 percent. For example, if you have two 9 GB disk drives in a RAID 1 mirror, there are only 9 GBytes of storage space available to NT Server.

RAID Levels 2, 3, 4, and others

These RAID levels are not commonly used in general computing environments, so they are not addressed here. If you are curious, seek a more in-depth understanding of RAID technology, or are just looking for some trivia facts, search the web for "RAID technology." You can also visit the web pages of one of the many hardware-based RAID manufacturers such as Adaptec, AMI, DPT, Mylex, or Symbios (just to name a few) for a complete definition of all of these RAID levels. The RAID Advisory Board also keeps a website located at https://www.raid-advisory.com/. As you peruse these different websites, you may find it interesting how each site interprets RAID technology just a little differently, each adding its own spin.

RAID 5–Disk Striping with Parity

RAID level 5 stripes the data and calculated parity across three or more disk drives in a rotating fashion. When data is written to a RAID 5 array, the data is broken up and written across the disk drives in the array. For each write, the parity information must be generated and updated. This is accomplished by determining which data bits were changed by the write operation and then completing a logical XOR operation with this information and the existing parity. This parity data is then written to the RAID 5 array.

In RAID 5, the parity information is rotated between all of the disk drives in the disk array. For example, if there are three disk drives in logical RAID 5 array E: composed of disk 1, disk 2, and disk 3, as the first set of data is written to the array it is placed onto disk 1 and disk 2, while the parity information is written to disk 3. Subsequently, when the second set of data is written to the RAID 5 array E:, data is written to disk 1 and disk 3 and the parity information is written to disk 2, and so on, in a rotating order. In this manner, parity information is spread across all of the disk drives in the RAID 5 array. To lower the confusion level with this concept, refer to Figure 6–9, which shows data placement in RAID 5 array.

Figure 6-9: Writing data to a RAID 5 array.

If one drive were to fail, there is enough parity information available on the surviving disk drives in the RAID 5 array to continue disk operations and to rebuild the failed disk drive's data when the failed disk drive is replaced with a working one. If the server is experiencing its regular workload during the reconstruction of a failed drive from the parity located on the other disk drives in the array, the performance provided by the RAID 5 array to the server is lowered. Under normal RAID 5 operations, every write generates four disk transfers (I/Os). Each write is actually composed of read data, read parity, calculate parity by logically XORing the read information, write data, and then write parity. Not surprisingly, write operations take longer to complete in a RAID 5 array than in any other RAID configuration due to this read/compare/write characteristic.

On the bright side, when data is read from a RAID 5 array the performance provided is very similar to that of a RAID 0 stripe set. Data is spread across the drives in the RAID 5 array so you glean the advantage of having multiple disks service your requests at the same time.

RAID level 5 provides data redundancy, allowing for the loss of one of the array's member disk drives without the loss of any data. If two drives in the same RAID 5 array fail, all data in that array is lost. Of course, that is what good backup strategies are for. The tradeoff for this redundancy is that the capacity of a RAID 5 stripe with parity is lowered by the equivalent of one member disk drive, a factor of 1/(number of drives in the array). For example, if you have three 9 GB disk drives in a RAID 5 array, there is 18 GB of usable storage space. If there were ten 9 GB disk drives in a RAID 5 array, there would be 81 GB of usable disk space, a loss of only one drive's worth of capacity out of 90 GB raw. In RAID 5 environments, increasing the number of drives in the array improves the random read I/O performance.
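To make this capacity and write-penalty arithmetic concrete, here is a minimal Python sketch. It is only an illustration under the assumptions above (identical member drives, four physical transfers per logical write); the drive counts and sizes are example values, not figures from any particular product.

    # Illustrative sketch: RAID 5 usable capacity and write amplification.
    # Assumes all member drives are identical; sizes below are example values.

    def raid5_usable_gb(drive_count, drive_size_gb):
        # One drive's worth of capacity is consumed by the rotating parity.
        return (drive_count - 1) * drive_size_gb

    def raid5_transfers_per_write():
        # Each logical write costs: read data, read parity, write data, write parity.
        return 4

    for drives in (3, 10):
        print(drives, "x 9 GB drives ->", raid5_usable_gb(drives, 9), "GB usable")
    print("Physical disk transfers per logical write:", raid5_transfers_per_write())

Running the sketch reproduces the 18 GB and 81 GB figures quoted above.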

RAID 10: Combining RAID 1 Mirroring and RAID 0 Stripes (1+0)

RAID 10 stripes data across two or more drives then mirrors that logical drive set with another stripe set for fault tolerance. This logical layout provides better overall performance than a direct implementation of RAID 1. The tradeoff for this performance improvement and high level of fault tolerance is that the capacity of a RAID 10 mirror is lowered by 50 percent, just as it is for a direct RAID 1 mirror implementation. For example, if you have two 9 GB disk drives in a RAID 0 stripe that are mirrored to another two 9 GB disk drives in a RAID 0 stripe, there is only 18 GB of usable storage space. This RAID configuration provides for excellent performance in all disk environments and complete fault tolerance. It is, however, the most costly RAID implementation in cost per megabyte.

Just a Bunch of Disks (JBOD)

JBOD borders on a trivia fact, but is referenced in some RAID literature so I will mention it here. Depending on the vendor, JBOD is referred to in many different contexts. Someone must have become tired of referring to a single disk drive operating on its own without the influence of any RAID level, thus the phrase JBOD, or just a bunch of disks. In any case, now you can sound extra technical when referring to any single stand-alone disk drive not associated with any RAID levels.

Implementing Software Based RAID With NT Server

You can use NT Server to implement RAID levels 0 or 1 without any concern of a significant performance overhead penalty. NT Server refers to RAID 0 as a stripe set, RAID 1 as a mirror set, and RAID 5 as a stripe set with parity. Do not use NT Server to implement a software based RAID 5 solution unless there is a significant amount of server CPU capacity available that can be allotted to the parity calculations required by RAID 5. To implement RAID level 0, 1, or 5 under NT Server, use the Disk Administrator found under Start|Programs|Administrative Tools|Disk Administrator. The steps to configure one of the supported NT Server software based RAID levels are omitted here since NT Server provides good step-by-step instructions under the Disk Administrator Help menu option.

A good rule of thumb is to implement all RAID arrays, especially RAID 5, using hardware based RAID solutions. Hardware based RAID solutions offload the RAID calculations from the server CPU to a CPU on the RAID host bus adapter and do not occupy any precious RAM on the server as software RAID implementations do. Hardware based RAID solutions also shield NT Server from problems in the underlying array disk drives. NT Server's software based RAID does, however, provide a solid and cost effective solution for smaller NT Server environments.

Other Benefits of Hardware Based RAID

When a RAID 1 mirror is implemented under NT Server and the primary disk drive fails, what happens when the server reboots? It doesn't. First, the failed drive must be replaced with a new disk drive. Then you must utilize an NT Server boot diskette so that you can tell NT Server to boot from the second disk drive in the RAID 1 mirror. Hopefully, you created this diskette and tested its functionality before there was a problem. Once the NT Server is back on its feet, you can use NT Server's Disk Administrator to recreate the mirror on the new disk drive. This process unfortunately requires a server reboot. It is an inconvenient process, but still quite functional.

Use of a hardware based RAID controller avoids this problem. The hardware based RAID controller shields NT Server from disk problems that are covered by the various RAID levels. In generic terms, when a disk drive fails in a hardware based RAID implementation, the following sequence of events occurs:

  • One disk drive in the RAID 1 mirror fails

  • The fault indicator light above the failed drive illuminates

  • A hardware RAID controller provided service running under NT Server detects the disk problem and sends an alert to the event log and console (if not through other alternate mechanisms as well)

  • At this point, someone hopefully notices the problem

  • The old disk drive is removed

  • If the drive is hot pluggable, the hardware RAID controller shields NT Server from the fact that one of its drives was just removed, thus a server reboot is not required

  • A new drive is inserted to replace the failed drive

  • Either the drive begins an automatic rebuild or the hardware RAID vendor supplies a tool to initiate the rebuilding of the new drive from the mirrored information of the old drive

Is a hardware based RAID host bus adapter the best option for you? It depends on the performance level required and budget available. With many of the components of entry level NT Servers commoditized, a hardware based RAID solution is a sound investment. I prefer to use hardware RAID controllers on all enterprise class NT Servers.

Tuning Flexibility When Utilizing Hardware Based RAID Solutions

Another advantage of using a hardware based RAID solution is the increased overall performance that it can provide, particularly on servers that are CPU bound. Besides offloading server CPU and memory resources to the RAID HBA, read and write performance of the array can be improved through the use of another level of cache (fast RAM) on the RAID HBA. The benefit of this extra layer of cache varies for each server environment. The advantage of having a RAID host bus adapter level cache is that it gives the algorithms in the hardware based RAID more flexibility when working with the arrays it is controlling. Some hardware based RAID solutions provide a tunable read ahead caching algorithm that can complement NT Server's file system cache.

Where hardware based RAID caches exhibit the highest levels of added value is when they are used in conjunction with RAID 5 implementations. Caches on RAID HBAs are normally set to a write through mode by default, which allows data to pass directly through the cache. This does not provide any significant performance improvement. Write through operation is the default setting chosen by most vendors to ensure data integrity in the face of a power outage. If the RAID HBA does not include a battery backup for the cache and power is lost to the server, there is a potential for data loss. If the cache is supported by a battery backup system on the RAID HBA, you can set the cache for write back functionality with confidence.

The write back cache setting is configured on most NT Servers by booting from an MS-DOS diskette and running the particular vendor's RAID setup tools. Write-back cache signals back to the requesting process that the data is committed to the disk drive once it is resident in the hardware based RAID HBA cache, before the data is physically committed to the disk drive. When this occurs, the NT Server process reaps the benefit of writing data to a faster cache holding area. The RAID HBA can then calculate the required parity and commit the data to the RAID 5 array as needed. As long as the write-back cache is activated, the caching RAID controller helps compensate for the inherent disk write overhead associated with implementing a RAID 5 solution.

RAID Performance Under NT Server

How do you decide which RAID level provides the highest performance? It depends on your environment, your budget, and the level of fault tolerance required. As more and more disk drives are in use, the probability that one might fail actually increases. Windows NT Magazine ran a good article in October of 1997 on RAID performance under Windows NT Server, "Optimizing Exchange to Scale on NT" by Joel Sloss (sidebar: "RAID Performance Under NT"). The results of this exhaustive series of tests are shown in Figure 6–10.

Cc767920.ntser10(en-us,TechNet.10).gif

Figure 6-10: Relative RAID performance chart.

This test series was run on a Tricord NT Server that used eight disk drives for each RAID array test. You can utilize this graph to help determine the relative performance impact that implementing different RAID levels in your NT Server might have, as well as the effect that different RAID stripe sizes can have on your RAID performance. RAID stripe sizes are reviewed in the next section, Optimizing RAID Stripes. To read this graph, compare the performance results for similar stripe size settings. If you were curious about the relative performance difference between implementing a RAID 5 array and a RAID 0 array for a random read environment using a stripe size of 64 sectors/stripe, you would notice that the performance difference is negligible. If, however, you compared a RAID 5 array and a RAID 0 array in a random write environment at the 64 sectors/stripe size, you would see a huge difference of over 5 Mbytes/sec. As always, how any disk subsystem solution performs will vary with the NT Server environment it is operating in, but using test results such as these, you can gain insight as to how to initially configure your server and help in planning for future disk subsystem growth.

Optimizing RAID Stripes

When data is written to a RAID array, for example, a RAID 0 array composed of four disk drives, the data is broken into chunks and these chunks of data are placed onto each drive in the array. The size of this logical, continuous chunk of data placed on each drive in the array is referred to as the stripe size. The stripe width is the number of disk drives in the array. There will be no test on this later, but try to keep stripe size (chunk) and stripe width straight in your mind. In a RAID 5 array, the stripe width is one less than the number of drives in the array, since one drive is logically used for parity.

The optimum stripe size (chunk) is a function of the server application in use. Ideally, we would want the typical I/O size to be the same as the stripe width times the stripe size (chunk). So to determine the optimum stripe size (chunk), divide the typical I/O size by the stripe width.

When the optimum stripe size (chunk) is found, use this data to low level format the hardware based RAID array with the tools provided by the RAID vendor. The hardware based RAID technology can then provide the highest level of performance through the efficient use of writes and reads. If the data is much larger than the optimum stripe size, more writes are required. If the data is smaller than the optimum stripe size, then reads are less efficient because a portion of each stripe goes unused. You can determine what size I/O operations are occurring in your server by separating disk activities across as many individual disk devices as needed to focus in on the different disk activities that are occurring. Next, use Perfmon to monitor the Logical Disk object's counters Avg. Disk Bytes/Write and Avg. Disk Bytes/Read. These counters show the average number of bytes transferred to or from the disk during write and read operations.
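If you would rather script the arithmetic, the following Python sketch shows the idea. The Perfmon reading and the stripe width are placeholder values you would replace with your own measurements; this is a sketch of the calculation described above, not a vendor tool.

    # Illustrative sketch: derive a candidate RAID stripe size (chunk).
    # avg_io_bytes would come from the Perfmon counters mentioned above;
    # the numbers below are example values only.

    def optimum_stripe_size(avg_io_bytes, stripe_width):
        # Ideal case: typical I/O size = stripe width x stripe size (chunk).
        return avg_io_bytes / stripe_width

    avg_io_bytes = 64 * 1024   # example: 64 Kbyte average transfer size
    stripe_width = 4           # example: four data drives in the array
    chunk = optimum_stripe_size(avg_io_bytes, stripe_width)
    print("Candidate stripe size (chunk):", chunk / 1024, "Kbytes per drive")

With these example numbers the candidate chunk works out to 16 Kbytes per drive.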

Stripe Size Optimization Short Cut

OK, determining the optimum stripe size (chunk) is time consuming, but it can be worth the effort. Looking at the relative RAID performance in Figure 6–10, when the stripe size was raised from 64 sectors/stripe to 128 sectors/stripe, there is a significant performance increase across most of the different RAID configurations and disk loads. At the point where the stripe size (chunk) closely matches the application's I/O characteristics, the best performance is achieved. If you are not interested in this level of tuning, try to at least determine if the general disk characteristics are either sequential or random in nature. In general, larger stripe sizes provide better performance for large sequential data transfers, while smaller random data transfers benefit from smaller stripe sizes (chunks).

Figure 6–10 can be a challenge to decipher, so a condensed reference chart (Table 6–7) was compiled. This performance guide shows the relative performance ratings when comparing the various RAID options using a stripe size of 128 sectors/stripe. Use this guide when selecting the appropriate performance level that matches your server's disk I/O characteristics and overall disk subsystem goals.

Table 6.7 Condensed version of the relative RAID performance chart.

RAID Level                  Random Read   Random Write   Sequential Read   Sequential Write
Stripe (0)                  3rd           1st            3rd               1st
Mirror (1)                  4th           3rd            4th               3rd
Stripe w/parity (5)         2nd           4th            2nd               4th
Mirrored Stripe Set (10)    1st           1st            1st               2nd

Scaling RAID Arrays

Now that the respective performance levels of the various RAID levels are outlined, how well does NT Server take advantage of additional disk drives in a RAID array? NT Server will in fact take advantage of additional drives in RAID arrays. To characterize what performance levels you could expect, I ran a series of disk subsystem tests using the Neal Nelson Business Benchmark Suite of NT Server tests. The server used for this disk test was an NCR 4300 with two Pentium Pro 200 MHz/512 K cache CPUs, 128 Mbytes of RAM, and two internal 7200 RPM Fast and Wide SCSI disk drives configured on a single SCSI channel. Sizing the server in this manner ensured that there was enough horsepower to drive the disk-intensive test while keeping NT Server's file system cache from growing beyond 128 Mbytes in size. As in previous tests performed in this book using this test suite, the disk subsystem tests were configured to run a 20 copy workload test that generates a 780 Mbyte work file. The combination of this large work file size and the relatively small amount of RAM configured in the server ensured that the test suite would defeat NT Server's file system cache and that actual physical level disk drive activity would occur.

For disk test 1, a single 7200 RPM Fast and Wide SCSI disk drive was configured on its own SCSI channel. For disk test 2, a 10 disk (7200 RPM, FW SCSI) Symbios Logic Series 3 intelligent external RAID array with 16 Mbytes of on board cache was configured as a single RAID 5 array. From NT Server's perspective, this 10-disk array appeared as one logical disk device that provided 36 Gbytes of available storage. The external disk array was attached to the server via an Adaptec 2940 Ultra Fast Wide differential SCSI adapter. Both disk configurations were formatted under NT Server with an NTFS file system and an ALU size of 64 Kbytes. This ALU option is not available under the Disk Administrator and must be completed from the command line. The command used to format the drives as described was "c: >format F: /fs:ntfs /a:64k." Formatting drives from the command line is also a great way to format multiple drives simultaneously by running the format command from multiple command prompt windows. Table 6.8 outlines these disk results.

Table 6.8 NT Server RAID Disk Performance Results.

Test Characteristics using a                 Percent (%)   10 Disk RAID 5 Array   Single Disk
780 Mbyte Work File                          Improvement   (Mbytes/sec)           (Mbytes/sec)

Sequential Reads of 1024 Byte Records        310.9         6.25                   2.01
Sequential Writes of 1024 Byte Records       407.5         2.69                   0.66
Sequential Reads of 8192 Byte Records        481.8         9.30                   1.93
Sequential Writes of 8192 Byte Records       191.5         2.92                   1.67
Pseudo Random Reads of 4096 Byte Records     790.1         4.82                   0.61

The two rightmost columns show the physical disk throughput measured in Mbytes/sec.

There are performance improvements across the board. As is typical for a RAID 5 environment, the greatest gains came from the random read environment, while the slightest performance improvement resulted from the write intensive test. This is not an earth-shattering discovery, but is very useful when sizing an NT Server disk I/O subsystem. With this information in hand, a general idea of what you might experience when implementing larger disk arrays is now available.

Using Perfmon to Investigate the Disk Array Scaling

From an NT Server perspective, the highest throughput measured under Perfmon for this test by the Logical Disk: Disk Bytes/sec counter was 9.59 Mbytes/sec, which is very close to what the test suite application reported for the Sequential Reads of 8192 Byte Records test. From an I/O workload perspective, Perfmon reported that the maximum Transfers/sec was 690. For RAID 5 arrays, the Transfers/sec reported is not directly representative of the actual workload the disk array is experiencing. To determine RAID 5 stripe with parity I/Os per second, use this formula: (Disk Reads/sec + 4 x Disk Writes/sec) / (number of drives in the RAID 5 array). For this sequential read environment, the I/Os per second for each disk drive in the array works out to (690 + 4 x 40) / 10 = 85 I/Os per second.

The RAID 5 array was composed of 10 disk drives. From NT Server's perspective, the RAID array appears only as a single, logical device which is now capable of supporting a higher number of concurrent I/O requests because of the multiple drive configuration. There were no other devices connected to the SCSI channel containing the RAID 5 array. From the provided information and an understanding of the configuration, you can conclude that the 20 Mbytes/sec SCSI bus was not near saturation and the disk array could support a slightly increased workload. Each disk in the disk array can support approximately 100 to 190 I/O operations per second on its own in a sequential read intensive workload environment.

How NT Server Uses the Disk Subsystem

NT Server implements several techniques to improve overall I/O performance when working with the disk subsystem: file system caching, disk read ahead, lazy writes, and lazy commits. NT Server's I/O Manager manages all requests of the disk subsystem. The process that NT Server actually utilizes when interacting with the physical device is quite complex; in fact, entire books are dedicated to the topic. For an in-depth review of the I/O Manager, peruse Windows NT File System Internals by Rajeev Nagar. For this discussion, I will focus on some of the performance aspects of NT Server's I/O process.

Caching Read Requests

Unless a process specifically requests that the file system cache should not be used to obtain data from the disk subsystem, the I/O manager passes the request for data to the Cache Manager. When the Cache Manager obtains the request, it first checks the file system cache that resides in physical memory (RAM) to see if the requested data is available there. In Chapter 5, the file system cache was investigated in depth. If the data is found in the file system cache, it is returned to the requesting process. This is the best way to implement disk I/O: Avoid physical disk activity completely. By avoiding actual physical disk drive activity, the data is presented to the requesting process faster and overall server performance increases. In testing where the file system cache is effective, applications can achieve over 100 Mbytes/sec of throughput. Having a server application that takes advantage of a properly tuned NT Server results in a night and day difference in overall performance.

If the data is not found in the file system cache, it is read from the physical disk drive itself. To increase the chances that the next request for data from the process is found in the file system cache, the Cache Manager employs a disk read ahead caching technique. Statistically, there is a good chance that the next piece of data requested by the process will be close to the last. This concept is referred to as locality of reference. To capitalize on this, the Cache Manager reads ahead on the disk drive and obtains slightly more data than required, in the hopes that the next data requested will already be in the file system cache. Depending on the data access patterns of the process, the read ahead functionality can operate in the forward or reverse direction of the file data.

Caching Write Requests

When data is written to the disk subsystem, the I/O Manager again invokes the Cache Manager unless requested not to. The Cache Manager will place the data into the file system cache and report back to the process that the data write is complete. This is a lazy write. By somewhat falsely reporting back to the process that the data is committed to disk, the process can trot off on its merry way faster than if it had to wait on the actual physical disk I/O. Again referencing Table 6.8, the throughput recorded when completing an actual sequential write to the disk drive was 0.66 Mbytes/sec compared to 40 Mbytes/sec when the data could be cached (not shown in the chart, but recorded during the same test suite). There is a tradeoff for this performance enhancement. If the server suddenly lost power, data could be lost. The NTFS file system does utilize techniques to compensate for potential data inconsistencies that could occur due to a power loss. Data that has not actually been committed to the disk drives still resides in the file system cache; this is one of the reasons it is recommended to use the shutdown command to turn off an NT Server as opposed to just shutting off the power.

At some point the data destined for the disk subsystem must be moved to the actual physical disk drives. The Cache Manager has a timer that controls the flushing of data destined for the disk drive from the file system cache. Once every one to three seconds the Cache Manager schedules a scan through the cache to find candidates that should be flushed to the disk subsystem. This scan is adaptive and will flush up to one-fourth of the cache at a time.1 After four scans, or earlier if all the writeable data is flushed, all data destined for the physical disk will have actually been written. By waiting an adaptive period of time (perhaps until the server workload is lighter) before flushing the cache, more data is written to the disk at one time. This technique is NT Server's lazy commit. Lazy commit improves the disk subsystem's efficiency and helps write the data in a more contiguous fashion onto the physical disk.

Similar to the UNIX sync command that forces a flush of the disk buffers to disk, Mark Russinovich developed a tool for NT Server named "NTSync.exe" to accomplish this same task. When you run "NTSync" from the command line, the file system cache is immediately flushed. This freeware tool is available from https://www.Ntinternals.com. There is no reason why you have to use this tool, as NT Server's file system cache flushing algorithm works fine. If, however, you are working with some very important data and you want to make sure the data is committed to the actual physical disk, such as making sure a few hundred pages of a book you are working on are committed to physical disk, it is a helpful tool.
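If all you need is to guarantee that one particular file has reached the physical disk, rather than flushing the entire file system cache, most programming environments expose a per-file flush. The Python sketch below is only an illustration of that idea (the file name is made up, and this is unrelated to the NTSync tool itself):

    # Illustrative sketch: force one file's buffered data to the physical disk.
    # flush() empties the runtime's own buffers; os.fsync() asks the operating
    # system to commit the file's cached data before returning.
    import os

    def write_and_commit(path, data):
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

    write_and_commit("important_chapter.tmp", b"a few hundred pages of a book ...")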

NT Server Device Drivers

A device driver is a specialized piece of code that allows NT Server and the specific hardware device to communicate, such as a SCSI host bus adapter. Device drivers are pieces of software that are commonly overlooked but have a great impact on the performance that is obtained by peripheral devices attached to the server.

There are several steps inside of NT Server that the SCSI driver code must traverse. Each one of these code path steps adds a certain amount of overhead to the I/O data path. Why is this important? The more steps that the code path must take, the slower the access is to an external device. As NT Server matures, so do the crucial device drivers that drive the disk subsystems. NT Server has a very defined structure for device driver developers to follow. For people not writing the device driver code, but just implementing solutions with them, it is imperative to keep abreast of the latest driver technologies. Other operating systems, such as Linux, publicly distribute their source code in the hopes that if something looks inefficient someone will take the time to improve it. For Linux, this process works surprisingly well. Information on Linux is available on a variety of websites throughout the Internet. A good place to start is https://www.linux.org.

In the NT Server realm, you are dependent on the independent vendor for device driver updates and improvements. From a performance perspective, vendors are always working to make their device drivers more efficient and faster. More efficient refers to the amount of overhead added to the server's CPUs to drive the external device. The more efficient the device driver is, the more work it can complete with the least amount of direct CPU cycles. From a speed perspective, when code paths are shortened, efficiency improves, which leads to obtaining data from the external device in less time.

It does not matter how fast the external device is; if the device driver is not written well, access to the peripheral device will be limited. One way to help guarantee your device driver runs as fast as possible is to ensure that the device driver in use is the latest and greatest from the device's manufacturer. Some smart devices such as hardware RAID host bus adapters also have their own BIOS or firmware. If a peripheral device does have its own firmware, then it may also require periodic updates. There is nothing wrong with the device drivers provided on the NT Server CD-ROM, but if the manufacture date of the CD-ROM is six months old, the device drivers it contains may have undergone two new iterations in that time span.

Updated NT Server Device Drivers Can Equal Better Performance

Once NT Server is installed, the device drivers are normally forgotten about. If the server runs reliably, who cares? If you want the best performing server, you should care. With a multitude of vendors producing devices for servers, it may be time consuming to check all of the vendors' websites on a regular basis to download new device driver releases. Fortunately, web technologies are moving fast, and browsers such as Netscape can automatically check websites for you and alert you of changes. NT Magazine recently ran an evaluation of SCSI host bus adapters comparing their performance. (You always want to choose the fastest and most reliable devices for your server.) Evaluations like the one NT Magazine completed help to provide insight on the differences that a new device driver can make. Table 6.9 is a chart of their results.

Table 6.9 NT Server device driver comparison

SCSI Host Bus Adapter            Device Driver Version        Bench32 Disk Mark Score   % Improvement
Adaptec AHA-2940U                Default NT CD-ROM Driver     150                       N/A
Adaptec AHA-2940U                Latest Driver from vendor    161                       +7.33%
Mylex/Bus Logic FlashPoint LT    Default NT CD-ROM Driver     162
Mylex/Bus Logic FlashPoint LT    Latest Driver from vendor    167                       +3.09%
Qlogic ALQ-1949                  Default NT CD-ROM Driver     171
Qlogic ALQ-1949                  Latest Driver from vendor    175                       +2.34%
Symbios Logic SYM8751SP          Default NT CD-ROM Driver     145
Symbios Logic SYM8751SP          Latest Driver from vendor    176                       +21.38%

(Windows NT Magazine, January 1998, page 190 by Sean Daily)

As shown in these results, updating your device driver influences the performance that you can achieve. The larger the NT Server solution that you deploy, the bigger the impact drivers have on your server's overall performance. Some vendors develop specialized drivers for their benchmark tests, which they subsequently allow limited public access to. In particular, some vendors have developed what are referred to as monolithic SCSI drivers. These monolithic drivers can provide higher levels of SCSI performance, yet utilize lower amounts of the server's CPU. To obtain these monolithic SCSI drivers, contact your SCSI host bus adapter vendor or server vendor for availability.

As with any new technology, test the new device driver in a controlled environment to ensure there aren't any problems with the revision of the devices you have deployed before fielding the new drivers. I personally have experienced huge performance improvements in disk I/O subsystems when updating a device's firmware and device drivers.

Sizing NT Server's Disk I/O Subsystem

Throughout this chapter, references to sizing the I/O subsystem for an NT Server solution are intertwined in the various sections presented. In this sizing section, the focus is on the two most common areas of concern that I have encountered when sizing the disk I/O subsystem:

  • Disk Storage Capacity versus Disk Performance Capacity

  • The Number of Disk Drives per SCSI Bus

Disk Storage Capacity versus Disk Performance Capacity

There is an important distinction between providing enough disk storage capacity for your requirements and providing enough disk performance capacity. We will focus on this topic, but when sizing the disk subsystem also consider the following selection factors:

  1. Cost

  2. Amount of Usable Disk Space Required

  3. Performance

  4. Fault Tolerance Level Required

  5. Available Memory

  6. Data Path Sanity Check

These are not the only factors involved when sizing and selecting your disk subsystem. However, they do cover the major areas of concern. As with all methodologies presented in this book, use them as a guide, not gospel. If you have more criteria you want to add, go for it and have fun!

Let's look at each of these storage selection factors a little closer.

  1. Cost

    If the initial hardware cost is the only factor, then the highest density disk drives will provide the most cost effective solution. A few years ago this was not true, but today, with the commoditization of entry level, PC-based NT servers, one 9 GB disk drive is less expensive than two 4 GB disk drives. Making the decision totally on cost is sometimes the only option. However, if you want a good long-term solution that meets your performance and management objectives, choosing the minimum number of disk drives based on cost alone will rarely meet your needs.

  2. Amount of Usable Disk Space Required

    Perhaps cost is not the driving factor. If this is the case, consider the amount of usable disk space required. This becomes a factor for two reasons. First, if you only have five disk bays available in your server and require more, an external array solution is required, which adds to your cost considerations. Second, the higher the disk drive density, the more data that is lost to fault tolerance RAID features. If you have a RAID 5 array and implement it with four 9 GB disk drives, you would lose 9 GB of disk space to parity storage, versus only 4 GB of storage space lost if you implemented a seven-disk array of 4 GB drives. When working with RAID devices, remember to complete the math for the RAID levels chosen to ensure that you end up with the amount of usable disk space you desire.

  3. Performance

    The next factor to consider is performance. Always note that the workload level, in I/Os per second and throughput, that the disk subsystem will be required to support is separate from storage capacity. Plain and simple, the more disks and SCSI buses implemented, the more disk subsystem performance is available. This holds true up to the point when either the aggregate throughput of the I/O buses is overwhelmed or there is not enough CPU horsepower to drive the extensive disk arrays and the application at the same time.

    Remember that overall server performance will not improve unless the disk subsystem is the bottleneck throttling the server's overall performance. Even though a disk drive's capacity increases, its relative performance stays about the same. This statement is applicable for disk drives in the same vendor drive family (for example, they operate at the same RPM levels, seek times, etc.). A 9 GB 7200 RPM Ultra Fast Wide SCSI disk drive provides roughly the same level of performance as a 4 GB 7200 RPM Ultra Fast Wide SCSI disk drive. For example, in a random environment, either drive will still only support an approximate random application workload of 100 I/Os per second and a throughput level of 0.6 Mbytes/sec. Higher performance levels are possible if the file system cache is avoided and very large block sizes are used when working with the disk drives.

  4. Fault Tolerance

    Choose the appropriate RAID level that will provide the level of data protection you require. It is easier to configure and manage an NT Server that has one large single RAID 5 disk array than some combination of RAID levels. This premise aligns with obtaining the highest density drives and placing them into a RAID 5 array. If, however, some of your data is truly critical and the workload disk characteristics are varied, having additional disk drives not only provides more fault tolerance options, but tuning options as well. If seven 4 GB disk drives were selected (let's leave the usable disk space requirement out of this for a moment) and you had a requirement to house particularly important data that could survive a dual disk drive crash, you have options. With the seven 4 GB disk drives, you could implement two RAID 1 mirrors consisting of two disk drives each, and a RAID 5 array with the other three disk drives. In this manner, the critical data placed on the RAID 1 mirrors could withstand a dual drive failure. Using a combination of RAID levels to implement the fault tolerance required can help lower the cost of your disk subsystem solution by not having to group all noncritical data with critical data.

  5. Available Memory

    With NT Server's dynamically sized file system cache, the addition of physical memory to the server can improve your disk subsystem performance. When more disk operations are cached, the disk I/Os per second that actually make it to the disk subsystem are lowered. If this occurs, fewer disk drives are required to meet the workload (I/Os per second) requirement, since there are fewer I/Os per second to support. Higher throughputs are also achieved when disk requests are fulfilled in cache.

    There is no steadfast rule of thumb on disk storage to available memory ratios that always works. For this decision, your environment really dictates what is best. As a starting point, configure your server with the amount of memory outlined in Chapter 5. Then observe the Logical Disk: Transfers/sec and Memory: Cache Bytes counters to determine if adding more memory is helping. If the Logical Disk: Transfers/sec goes down while the Memory: Cache Bytes goes up, without a memory bottleneck forming when more memory is added to the server, the additional RAM is beneficial in relieving some of the disk I/O workload. Of course, this is only a valid observation if the workload placed onto the server is similar before and after the additional RAM is added. Cache memory placed onto hardware based RAID solutions can also increase the amount of work a disk subsystem can withstand.

  6. Data Path Sanity Check

    Configuring huge data paths that overwhelm the SCSI channel, HBA, or I/O channel is not desirable. After developing the disk drive or array solution, follow the data path. Utilize the performance charts in this chapter to aid in making an initial sanity check that none of the data paths are overloaded before you field the solution.

Disk Subsystem Sizing Example

To illustrate the difference between storage and performance capacity, let's walk through an example and then use the disk storage selection factors above to help in making the selection. If you wanted to deploy a generic application that required 27 Gbytes of usable disk space, how should you purchase your disk drives? Before we make a decision, let's consider three options to meet this basic disk storage requirement.

  • Option 1: obtain four 9 GB disk drives

  • Option 2: obtain seven 4 GB disk drives

  • Option 3: obtain 15 2 GB disk drives

Which option you select directly influences how you answer the disk storage selection factors. Let's add some requirements information based on the disk storage selection factors and make a selection.

  1. Cost

    A consideration, but it is more important to provide a solution that works well.

  2. Amount of usable disk space required

    Stays the same at 27 Gbytes.

  3. Performance

    Limited Perfmon log information is available from another server running this generic application, but supporting only one half of the projected load for this server. The old server supported 300 Transfers/sec (I/Os per second) on a sustained basis. Throughput information is not available at this time. With an almost doubled projected workload, the new server's I/O subsystem will need to support at least 500 I/Os per second.

  4. Fault tolerance level required

    The data is important but not critical. Continued operation after losing a single drive is acceptable.

  5. Available Memory

    There are two Pentium II 300 MHZ/512 K cache CPUs configured with 256 MB of RAM.

  6. Data Path Sanity Check

    A single Fast and Wide SCSI-2 channel (20 Mbytes/sec) and a hardware based RAID controller that supports over 500 I/Os per second are supplied. The server has a single PCI bus.

And The Solution Is….

Based on the above information, each option is revisited. Option 1: a four 9 GB disk RAID 5 solution will meet the available disk space requirement of 27 GB, but will have trouble meeting the 500 I/Os per second requirement. Option 3: fifteen 2 GB disk drives meet both the usable disk space and performance requirements, but the option is deemed too costly. Option 2: seven 4 GB disk drives fulfill both the usable disk space and performance requirements. Option 2 is neither the most expensive nor the cheapest, but it is a solution that meets the requirements and provides more tuning and availability options for the future.

The current memory configuration looks as if it is large enough to feed the CPUs and dynamically buffer data headed to the disk. It will be watched closely. A single Fast and Wide SCSI-2 channel provides enough bandwidth to support a seven disk RAID 5 array, with room to grow. A RAID disk controller is selected to offload parity calculations from the two server CPUs and is reported by the vendor to support more than the 500 I/Os per second requirement.
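A quick way to double-check the workload side of this decision is to script the arithmetic. The sketch below is illustrative only; it assumes the roughly 100 random I/Os per second per drive rule of thumb quoted earlier in this chapter, so substitute your own measured per-drive figures.

    # Illustrative sketch: compare each option against the 500 I/Os per second target.
    PER_DRIVE_IOS = 100   # assumed random I/Os per second per drive (rule of thumb)
    REQUIRED_IOS = 500    # projected sustained workload for the new server

    options = {
        "Option 1: four 9 GB drives": 4,
        "Option 2: seven 4 GB drives": 7,
        "Option 3: fifteen 2 GB drives": 15,
    }
    for name, drives in options.items():
        capacity = drives * PER_DRIVE_IOS
        verdict = "meets the target" if capacity >= REQUIRED_IOS else "falls short"
        print(name, "-> roughly", capacity, "I/Os per second;", verdict)

As expected, only the four-drive option falls short of the workload target.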

Detecting Disk I/O Bottlenecks

Recall from Chapter 2 that it is important to consider the entire server performance picture when trying to locate an NT Server bottleneck. There can be more than one server resource area that is contributing to the throttling of the server's overall performance. Once all of the major server resource areas are evaluated, focus in on the resource that is farthest to the left in the Performance Resource Chart introduced in Chapter 1 and repeated here in Figure 6–11 for reference. This strategy will provide the greatest immediate gain to the server's overall performance.

Cc767920.ntser11(en-us,TechNet.10).gif

Figure 6-11: Performance Resource Chart depicting a disk resource bottleneck

When a disk drive is unable to keep pace with the requested workload, the response time that the disk drive provides begins to degrade as it becomes overworked. If the requested workload increases to an even higher degree than the disk drive is capable of servicing, the disk I/O requests begin to back up. When this occurs, disk queues form. When disk queues form, the NT process that requested a disk I/O operation must wait in line for service from the disk drive. When was the last time you enjoyed waiting in line? This worst case disk I/O queuing is easier to detect than the slowing of disk response times. Table 6.10 outlines the counters to observe when sleuthing for disk bottlenecks.

Table 6.10 Primary counters for disk bottleneck detection.

Logical Disk: % Disk Time

Definition: % Disk Time is the percentage of elapsed time that the selected disk drive is busy servicing read or write requests.

Rule of Thumb for Bottleneck Detection: If this value increases above 60-80 percent, the response time from the disk drive may become unacceptable. This is the red flag to begin investigating the other disk counters listed below.

Logical Disk: Average Disk Queue Length

Definition: Average Disk Queue Length is the average number of both read and write requests that were queued for the selected logical disk during the sample interval.

Rule of Thumb for Bottleneck Detection: If this value is greater than two for a single disk drive and the % Disk Time is high for a sustained period of time, the selected disk drive is becoming a bottleneck. This value is an average calculated during the Perfmon sample period. Use this counter to determine if there is a disk bottleneck and the Current Disk Queue Length counter to understand the actual workload distribution.

Logical Disk: Current Disk Queue Length

Definition: Current Disk Queue Length is the number of requests outstanding on the disk at the time the performance data is collected.

Rule of Thumb for Bottleneck Detection: If this value is greater than two for a single disk drive over a sustained period of time and the % Disk Time is high, the selected disk drive is becoming a bottleneck. This value is instantaneous. Collect granular statistics over time to ensure that there is a sustained problem, not an instantaneous workload increase.

Logical Disk: Disk Transfers/sec

Definition: Disk Transfers/sec is the rate of read and write operations on the disk. This counter provides a good indication of the workload level the disk drive is supporting.

Rule of Thumb for Bottleneck Detection: If this value rises consistently above 80 for a single physical disk drive, observe whether the Average Disk sec/Transfer counter is reporting values higher than your baseline or what you consider acceptable. If it is, then the disk drive is slowing down the overall server's performance.

Logical Disk: Average Disk sec/Transfer

Definition: Average Disk sec/Transfer is the time, in seconds, of the average disk transfer.

Rule of Thumb for Bottleneck Detection: When the Disk Transfers/sec counter is consistently above 80 for a single disk drive or the % Disk Time is above the 60-80 percent range, the Average Disk sec/Transfer should be observed to determine if it is rising above your baseline. A value greater than 0.3 seconds indicates that the selected disk drive's response time is uncommonly slow.

Logical Disk: Disk Bytes/sec

Definition: Disk Bytes/sec is the rate at which bytes are transferred to or from the disk during write or read operations.

Rule of Thumb for Bottleneck Detection: Sum this counter's value for each disk drive attached to the same SCSI channel and compare the total to 80 percent of the theoretical throughput of the SCSI technology in use. If the summed Disk Bytes/sec is close to this 80 percent value, it is the SCSI bus itself that is becoming the disk subsystem's bottleneck. Use this data and some math to review the complete disk subsystem data path.
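If you export counter samples from Perfmon, the rules of thumb in Table 6.10 can be applied mechanically. The following Python sketch is an illustration of that idea; the thresholds mirror the table, the sample values are placeholders, and the queue value is assumed to already be normalized per drive (see the following sections).

    # Illustrative sketch: apply the Table 6.10 rules of thumb to sampled values.
    def disk_flags(pct_disk_time, queue_per_drive, transfers_per_sec, sec_per_transfer):
        flags = []
        if pct_disk_time > 60:
            flags.append("% Disk Time above 60 percent - investigate the other counters")
        if queue_per_drive > 2:
            flags.append("sustained queue above two per drive - disk is becoming a bottleneck")
        if transfers_per_sec > 80 and sec_per_transfer > 0.3:
            flags.append("heavy workload with slow response times")
        return flags or ["no disk bottleneck indicated by these samples"]

    sample = dict(pct_disk_time=85, queue_per_drive=3.1,
                  transfers_per_sec=120, sec_per_transfer=0.35)
    for flag in disk_flags(**sample):
        print(flag)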

Detecting the Obvious Disk Subsystem Bottleneck

Trying to observe an inordinate number of disk counters at the same time may become confusing. The two most important counters to keep an eye on are Logical Disk: % Disk Time and Average Disk Queue Length. Whenever the % Disk Time rises above 60 to 80 percent, the selected disk drive is earning its keep by working hard. Once this counter moves into this range, it is like a flashing red light indicating more disk problems may be occurring and should be investigated. If this high % Disk Time is associated with an Average Disk Queue Length greater than two, the disk has become a bottleneck. To verify that the queue length is not being affected by only spikes of heavy disk workload activity, observe the Current Disk Queue Length counter and change the Perfmon sampling period to a finer granularity to glean a better understanding of the workload patterns.

Determining the Disk Queue Length for a Disk Array

When determining the Average Disk Queue Length for a disk array, values greater than two for the entire disk array are acceptable. A disk array is composed of multiple disk drives that can support concurrent disk operations. To determine if the disk array's queue length is above the acceptable level of two outstanding requests per disk drive, divide the Average Disk Queue Length by the number of disk drives in the array. For example, if you had a 10-disk RAID 0 array and the reported Average Disk Queue Length was 11, the normalized disk queue length would be about one. One outstanding disk queue operation per disk drive is acceptable. However, if the reported Average Disk Queue Length is 24, the normalized disk queue length would be greater than two per drive, which is unacceptable. Without knowing exactly what is configured in your server, the values reported by Perfmon are practically useless.
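The normalization itself is trivial to script; this short sketch simply reproduces the two examples in the paragraph above for a 10-disk array.

    # Illustrative sketch: normalize the Average Disk Queue Length per drive.
    def normalized_queue(avg_disk_queue_length, drives_in_array):
        return avg_disk_queue_length / drives_in_array

    for reported in (11, 24):
        per_drive = normalized_queue(reported, 10)   # 10-disk array
        verdict = "acceptable" if per_drive <= 2 else "unacceptable"
        print("Reported queue of", reported, "->", per_drive, "per drive:", verdict)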

Detecting the Not So Obvious Slow Disk

The % Disk Time climbing into the 60 to 80 percent range is a good indicator to begin an investigation into other areas of disk performance. Even if the disk queue is not greater than two, the disk drive could still be hindering overall server performance. There can be times when the % Disk Time is actually at 100 percent, but the disk queue is not greater than two; consequently, there is not a definitive disk bottleneck. Figure 6–12 is an example where a 10 disk RAID 5 array is experiencing 100 percent Disk Time, but the disk queue is not yet indicating a sustained bottleneck (a sustained value above 20):

Cc767920.ntser12(en-us,TechNet.10).gif

Figure 6-12: 100% Disk Time but no bottleneck.

When this condition occurs, the disk drive is working hard and you are obtaining a good return on your disk investment. However, it can also indicate that the performance levels being provided, although not yet a bottleneck, are becoming increasingly slow. If you see this type of condition occurring, review the Transfers/sec counter, which indicates the disk I/O per second workload that the disk device is experiencing. As illustrated earlier in this chapter, if the per disk I/Os per second are in the range of 80–100 and rising in a random environment, review the Average Disk sec/Transfer. This counter will show how long the disk device is taking to service both read and write requests. If this value steadily rises above your baseline, or what you consider acceptable, it is a prelude to a disk bottleneck occurrence. As a general rule of thumb, if this value rises above 30 milliseconds, you should consider some of the tuning methods outlined in the next section. This concept of slow disk response times was also illustrated in the Electronic Mail Case Study in Chapter 8. In the case study, it was the disk write times that were getting slower. For the case study environment, some of the write intensive log file activity was relocated from the RAID 5 array to a RAID 1 array to alleviate the situation. By proactively managing a performance slowdown, you can tune around it before it ever becomes an actual bottleneck.

Remember the Data Path

If the disk drive is not a bottleneck or even experiencing high workload levels, but the disk subsystem feels sluggish from your application perspective, what could be wrong? Before even configuring your server, follow the disk sizing guidelines presented in this chapter, but remember that computer performance is part art as well (there, I slid it into this chapter, too). Occasionally, sum up the Disk Bytes/sec value of all of the logical disk devices on a particular SCSI chain as well as the Transfers/sec. Once you have this information in hand, work from the beginning of the I/O data path (the disk drive), then continue working your way up through the SCSI channel, and start doing your math.

If the summation of the Disk Bytes/sec is closing in on 80 percent of the theoretical bandwidth of the SCSI technology in use, the SCSI bus itself may be slowing down the disk subsystem operations. To solve this, either less work must traverse the SCSI channel or another SCSI channel must be added. To lower the workload on the specific disk subsystem in question, additional physical memory or SCSI channels can be added to the server. If a new SCSI channel and perhaps disks are added to the server, the workload should be evenly spread across the new hardware. Also, compare the summation values of the Transfers/sec counter of all of the SCSI channel devices feeding the HBA. Is the HBA rated to support such a workload? To remove such a bottleneck, an HBA with increased workload capacity could be used or additional HBAs can be implemented, again distributing the disk I/O workload across the new hardware.
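The SCSI channel portion of this math can also be scripted. The sketch below sums example Disk Bytes/sec readings for the disks on one channel and compares the total against 80 percent of the channel's theoretical bandwidth; the sample readings and the bandwidth table entries are assumptions you would adjust for your own hardware.

    # Illustrative sketch: check whether a SCSI channel is nearing saturation.
    SCSI_BANDWIDTH_BYTES = {
        "Fast/Wide SCSI-2": 20e6,        # 20 Mbytes/sec
        "Ultra Fast/Wide SCSI": 40e6,    # 40 Mbytes/sec
    }

    def channel_load(disk_bytes_per_sec_samples, technology):
        total = sum(disk_bytes_per_sec_samples)          # summed Disk Bytes/sec
        limit = 0.8 * SCSI_BANDWIDTH_BYTES[technology]   # 80 percent rule of thumb
        return total, limit, total >= limit

    total, limit, saturated = channel_load([6.1e6, 5.4e6, 4.8e6], "Fast/Wide SCSI-2")
    print("Channel carrying %.1f Mbytes/sec against a %.1f Mbytes/sec guideline; near saturation: %s"
          % (total / 1e6, limit / 1e6, saturated))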

Calculating I/Os per second for RAID Devices

Calculating the I/Os per second for a single disk drive is straightforward: review the Disk Transfers/sec counter. When determining the I/Os per second for a RAID array, Perfmon's Logical Disk object only reports on the logical drive, not each individual disk in the array. To determine the I/Os per second for each drive in the array, use the following guidelines:

  • RAID 0 Stripe I/Os per second equals:

    (Disk Transfers/sec) / (number of drives in the RAID 0 array)

  • RAID 1 Mirror I/Os per second equals:

    (Disk Reads/sec + 2 x Disk Writes/sec) / (number of drives in the RAID 1 array)

  • RAID 5 Stripe with Parity I/Os per second equals:

    (Disk Reads/sec + 4 x Disk Writes/sec) / (number of drives in the RAID 5 array)

    In RAID 5 arrays, each single write operation consists of four disk transfers: read data block, read parity block, write data block, and write parity block.
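Expressed as code, the three guidelines look like the following sketch; the counter values in the example call are the numbers from the scaling test earlier in the chapter and would normally come from your own Perfmon readings.

    # Illustrative sketch of the per-drive I/O formulas for RAID 0, 1, and 5.
    def raid0_ios_per_drive(transfers_per_sec, drives):
        return transfers_per_sec / drives

    def raid1_ios_per_drive(reads_per_sec, writes_per_sec, drives):
        # Every logical write must land on each member of the mirror.
        return (reads_per_sec + 2 * writes_per_sec) / drives

    def raid5_ios_per_drive(reads_per_sec, writes_per_sec, drives):
        # Every logical write costs four transfers: read data, read parity,
        # write data, write parity.
        return (reads_per_sec + 4 * writes_per_sec) / drives

    # Example from the 10-disk RAID 5 scaling test: (690 + 4 x 40) / 10 = 85
    print(raid5_ios_per_drive(reads_per_sec=690, writes_per_sec=40, drives=10))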

Tuning Disk Resources

Now that we have all of this wonderful information on how the server's disk subsystem operates, how do we make it go faster or, at least, become more efficient? Being properly armed with an understanding of how NT Server and the server hardware work is actually the first step in tuning the disk subsystem. The primary strategies to implement when tuning the disk subsystem are:

  1. Evenly distribute the disk workloads

  2. Appropriately size and tune the server's memory subsystem

  3. Lower the overhead associated with the disk operations

  4. Try to improve the disk operations efficiency

  5. Understand the workload characteristics and implement RAID technology accordingly

General NT Server Tuning Considerations and Recommendations

Evenly Distribute File System Activity

Review the Perfmon logs regularly to ensure that you have evenly distributed the workload across the entire disk subsystem. If % Disk Time or Disk Transfers/sec approach the thresholds mentioned above, consider spreading the data of the affected logical drives to other physical devices that are in less demand. A common source of contention is having all applications loaded and running on the root NT Server disk. The root disk is where WinNT (%SystemRoot%) resides, most commonly "C:". The root disk can quickly become a bottleneck.

To distribute the disk subsystem workload, you need to understand where the workload is emanating from. When you know where the workload is emanating from, you can work with the particular application generating the workload and tune it to place its data or logs in another location. A tool that is helpful in determining which processes are accessing which disk subsystems on your server is ntfilmon.exe. This tool is freeware and is available at https://www.ntinternals.com. Figure 6–13 depicts a short sample of the output you might expect from ntfilmon.exe.

Another tool that can aid in determining which processes are using which logical disk is Nthandle.exe. NTHandle is a utility that displays information about open handles (files) for any process in the system. This tool is also by Mark Russinovich and is available from https://www.NTinternals.com. Use this tool in conjunction with ntfilmon.exe to help in understanding where to distribute the disk I/O workload among the entire disk subsystem.

Cc767920.ntser13(en-us,TechNet.10).gif

Figure 6-13: Using ntfilmon.exe to determine disk activity and the guilty processes.

Tuning NT Server File System Cache Usage

The most common way to control how NT Server uses the available RAM for disk caching purposes is outlined in Chapter 5's "Tuning Memory Resources: Selecting NT Server Memory Strategy" options. In particular, the selection of either option 1, Maximize Throughput for File Sharing, or option 2, Maximize Throughput for Network Applications, drastically affects your server's disk I/O performance. Although it depends on your environment, adding RAM will generally improve disk I/O performance.

If the server is functioning as a file server and option 1 is chosen, nothing more than a reboot of the server is required to take advantage of the additional RAM. If option 2 is selected, just rebooting the server will improve some disk I/O performance, as more RAM is available for NT Server's dynamic cache management, but tuning any applications that can be adjusted internally should also be revisited. For example, Microsoft Exchange allows the amount of memory it is allowed to use to be easily set via the Exchange Optimizer program. So if more memory is added to the server, tune the applications accordingly to take full advantage of your investment.

Control Access to Network Shared Disk Subsystem Resources

It is normal to find disk resources shared to the network under NT Server. If a specific popular network disk subsystem resource becomes a bottleneck, but it is not currently possible to tune around the bottleneck, what can you do? One technique is to control the number of concurrent users who access the disk resource until the bottleneck can be removed. To control the concurrent user access for a specific shared resource, use the following command from the NT command prompt:

c:> net share SharedDiskResource=F:\DirectoryShared /users:40 /remark:"For a limited time only 40 concurrent users can use this resource."

Now that the shared resource access is limited, let's not allow people to tie up this popular shared resource. To control the amount of time a shared resource connection can sit idle before the user is disconnected, use the following command from the command prompt:

c:> net config server /autodisconnect:5

The net config server command sets the maximum number of minutes a user session can be inactive before it is disconnected. The default value for this command is 15 minutes. This command takes effect immediately and is permanent; to change the value, rerun the command. For information on the status of a network share, perhaps to determine whether sessions have timed out, run the following command from the command prompt:

c:> net statistics server | more

Low Level Format Disk Drives Before You Format Them Under NT Server

When using a SCSI disk drive on a different SCSI host bus adapter, always use the host bus adapter's provided tools to low level format the disk drive before attempting to format the drive under NT Server. The geometry translation in use varies between each host bus adapter and disk combination. When SCSI disk drives are placed onto a new host bus adapter, you want to ensure the correct translation information is being used. Low level formatting the disk drive with the tools that the host bus adapter vendor provides confirms that the translation in use is correct. This will provide the proper functionality, reported capacity, and performance when using the disk drive with NT Server.

Allocate Only One Logical Drive per Physical Drive

A technique that helps isolate disk performance problems and improves disk drive performance is to format only one logical drive per physical drive. This technique can lower the amount of head movement (seek) over the disk drive. For example, if you have three disk drives, create only three logical drives, C, D, and E.

To illustrate the goal behind this tuning technique, consider a single disk drive formatted with two logical NTFS file systems E: and D:. Suppose application 1 has data files on E:, and application 2 has data files on D:. When application 1 requires data, the disk head moves to the E: partition to locate the data. For this example (see Figure 6–14), the data is on the outside edge of the platter. If the data for application 2, which is on D:, is on the inside edge of the drive platter, there is a significant amount of head movement which results in additional time added to accessing the disk data.

Figure 6-14: Data on a platter.

With seek time being the costliest portion of a disk operation, it makes sense to attempt to lower it. If only a single NTFS file system is placed on the entire drive, the data has a better chance of being written together instead of forcing it to different areas of the disk based on logical partitioning. Seek time has a greater chance of being lowered when data being read is grouped closer to one another. Lowering the disk drive's seek times significantly improves overall disk drive response time. The concept of seek times was explored in the disk drive technology and SCSI technology sections of this chapter.

Select the Appropriate Allocation Unit Size (ALU)

Consider matching the file system allocation unit size (ALU) to the block size of the application you are using. For example, if SQL Server uses a 4 KB block size, then when you format a file system on a new disk drive, launch Disk Administrator, create the partition, commit the partition changes, select Format, and set the ALU to 4096 bytes. Matching the file system block size to the server's workload characteristics can improve the efficiency of disk transfers when you use the application. For more ALU size options, use the format command from the command prompt instead of the Disk Administrator tool. An example of using the format command to set up an NTFS file system with a 64 Kbyte ALU is: "format h: /FS:NTFS /A:64K." For all of the format command options, type "format /?" at the command prompt.
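As a quick sketch of the SQL Server example above (the drive letter H: and the chkdsk verification step are illustrative, not from a specific case study):

    rem Format the new volume with a 4096 byte ALU to match the application's 4 KB block size
    format H: /FS:NTFS /A:4096
    rem Verify the result; chkdsk's summary reports the bytes in each allocation unit
    chkdsk H: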

The NTFS file system uses clusters as the fundamental unit of disk allocation; a cluster is composed of disk sectors. In the format command and in Disk Administrator, cluster size is referred to as the allocation unit size (ALU). For NTFS, the default ALU depends on the volume size, but when formatting an NTFS volume with the format command from the command line you can specify any supported ALU. The default cluster sizes for the NTFS file system are shown in Table 6.11.

Table 6.11 Automatic NT Server format options default ALUs.

Partition size                 Sectors per ALU (cluster)    Default ALU (cluster) size
512 MB or less                 1                            512 bytes
513 MB–1024 MB (1 GB)          2                            1 K
1025 MB–2048 MB (2 GB)         4                            2 K
2049 MB–4096 MB (4 GB)         8                            4 K
4097 MB–8192 MB (8 GB)         16                           8 K
8193 MB–16,384 MB (16 GB)      32                           16 K
16,385 MB–32,768 MB (32 GB)    64                           32 K
> 32,768 MB                    128                          64 K

This technique of setting the proper ALU is applied when tuning the server in Chapter 8's file server case study. There is always a tradeoff somewhere when tuning your server: some capacity is lost when using larger ALUs. Unless you are really low on disk space or value every possible byte of storage, the slightly lower usable capacity associated with larger ALUs is a nonissue.

To determine the size of your server's disk I/O operations, consider reading the manuals provided by the vendor or researching how the application operates. Sizes of disk I/O operations and workload characteristics can also be determined with Perfmon; observe the Logical Disk counters % Disk Read Time, % Disk Write Time, Avg. Disk Bytes/Read, and Avg. Disk Bytes/Write.

Select the appropriate file system to meet the solution requirements. Which file system to choose has traditionally been a heavily debated topic, and for a moment I will stray from performance. Under NT Server, there are two file system options: NTFS or FAT. If security is a consideration at all, there is only one choice: NTFS. If FAT file systems are in operation, try to limit their use to disk drives or partitions smaller than 500 MB. For smaller file systems characterized by many small files, FAT can actually outperform NTFS. However, FAT file systems tend to become fragmented in a shorter period of time and degrade in overall performance as the file system grows. Thus, for all file systems larger than 500 MB, use NTFS.
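If an existing FAT volume has already outgrown this guideline, it can be converted to NTFS in place with the convert utility; a minimal sketch, assuming the volume is D: (back up the data first, and note the conversion is one way):

    rem Convert an existing FAT volume to NTFS in place (one-way operation)
    convert D: /FS:NTFS
    rem If the volume is in use, the conversion is scheduled for the next reboot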

Disable Short Name Generation

Disabling short name generation on an NTFS partition can significantly increase directory performance if a high number of non-8.3 filenames (that is, long file names) are in use. This is becoming increasingly common as legacy clients and applications are updated. NT Server must calculate a short, 8.3-compliant filename every time a file with a long name is created, which increases CPU overhead and causes additional writes to the disk subsystem. To disable short name generation, use REGEDT32.exe to set the NtfsDisable8dot3NameCreation REG_DWORD value to 1 under the following registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem. It is very important to remember that this will cause problems if legacy 16-bit MS-DOS and MS-Windows based applications are still in use on the NT Server where you apply this technique.

Disabling Last Access Updates

Another registry tunable that can improve file system performance by lowering overhead is the NtfsDisableLastAccessUpdate REG_DWORD value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem. Changing its default value from 0 to 1 stops NT from updating the last access time/date stamp on directories as directory trees are traversed.
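For convenience, both registry tunables just described can be captured in a small registry script; this is a sketch only, the file name is arbitrary, and as always back up the registry before making changes:

    REGEDIT4

    ; ntfs-tuning.reg -- disable 8.3 short name generation and last access updates
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
    "NtfsDisable8dot3NameCreation"=dword:00000001
    "NtfsDisableLastAccessUpdate"=dword:00000001

The script can be imported silently from the command prompt with "regedit /s ntfs-tuning.reg"; reboot the server afterward, and keep the 16-bit application caveat above in mind.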

Defragmenting NTFS

Like any other file system, NTFS can become fragmented over time on heavily used disks. Commercial products are available to defragment disk drives, which can improve file system performance. Some of these defragmentation tools claim not only to defragment the file system, but also to place files that have not been touched for longer periods of time together on slower areas of the disk drive, while grouping recently used files together on faster, more commonly traveled areas of the disk. These techniques, if successful, improve disk access time by lowering seek times and freeing up contiguous areas of disk for NT Server's use. If such defragmentation tools are in use on your server, you may not want to apply the "Disabling Last Access Updates" tuning technique mentioned above, as it may conflict with the defragmentation tool's operation.

Currently there is no industry standard benchmark for evaluating these disk defragmentation tools, and the performance gains achieved will vary greatly from one environment to another. Custom server baseline tests, such as the ones mentioned in Chapter 2, are helpful in determining the value these tools have in your environment. Your Perfmon baseline is also helpful when evaluating the effectiveness of defragmentation: compare the Avg. Disk sec/Transfer counter before and after the disks are defragmented under similar conditions (same time of day, similar workload activity, etc.) to determine whether the tool is providing a beneficial service.

To lessen the impact of fragmentation and take advantage of the physical characteristics of disk drives without a defragmentation tool, try to keep file systems below 60-80 percent of capacity. The free space allows NTFS to compensate for fragmentation by spending less time finding additional free space when it is needed.

An evaluation copy of Diskeeper by Executive Software is included on this book's CD-ROM so you can take it for a test drive and determine whether this technology is helpful in your environment. The tool is also available for download from https://www.sunbelt-software.com.

Avoid Compression

When NT Server's file system compression is activated, it taxes all of the server's resources: the CPU must perform the compression calculations, which consume memory, and during this time the process is left waiting for the disk I/O to complete. Unless absolutely necessary, do not use disk compression. If you must use it, relegate its use to archiving less frequently used data.
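If you do need compression for an archive area, NTFS compression can be applied selectively from the command line with compact.exe rather than compressing a whole volume; a minimal sketch, with D:\Archive being a hypothetical location for rarely used data:

    rem Report the current compression state of the archive area
    compact /s:D:\Archive
    rem Compress only the rarely used archive directory and its subdirectories
    compact /c /s:D:\Archive /i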

SCSI Channel, Disk Subsystem, and Host Bus Adapter Tuning Considerations

Group Similar Physical Devices on the Same SCSI Channels

To maximize the performance and efficiency of your SCSI channels, group similar devices on the same SCSI bus. If an active CD-ROM is placed on the same SCSI bus as a 10 disk RAID 0 array, the CD-ROM can effectively slow down access to the faster disk array: when SCSI commands are sent to request data from the CD-ROM, which is a slower device than a disk drive, the other devices on the SCSI bus must wait until the CD-ROM (or another slower device) completes its transfer before transferring their own data. A good rule of thumb is to place CD-ROMs on their own SCSI channel, place each tape unit on its own SCSI channel, and group disk drives with similar features (size, rpm, throughput) on their own SCSI channels. Also, avoid configuring SCSI devices that use different levels of the SCSI standard on the same SCSI channel, as they will be limited to the speed of the slowest SCSI standard on that channel. For example, operating Fast and Wide SCSI-2 (20 Mbytes/sec) and Ultra Fast and Wide SCSI-3 (40 Mbytes/sec) devices on the same SCSI channel limits the channel to the slower Fast and Wide SCSI-2 speed.

SCSI Command Queuing

Some drivers for SCSI adapters have registry settings for SCSI command queuing. Increasing this value allows more SCSI commands to be outstanding in the disk device queue, which can improve the performance of the attached disk subsystem. This technique is particularly helpful in disk array environments: because of their multiple disk drive nature, disk arrays are capable of coalescing multiple SCSI requests in the most efficient manner to achieve higher levels of performance. Use this technique cautiously and test your performance before and after editing the registry values. For most large disk array (>10 disks) environments, doubling the driver's default value improves disk performance. Contact the disk adapter vendor for assistance in finding the SCSI command queuing entry in the registry. For example, the SCSI command queue entry for Symbios SCSI adapters (whose default is 32) is located at: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\symc8xx\Parameters\Device\NumberOfRequests (REG_DWORD, default 32).
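Following the doubling rule of thumb, the Symbios entry above could be raised from 32 to 64 with a registry script such as this sketch; the key applies only to the symc8xx driver, so confirm the correct entry with your adapter vendor and measure performance before and after the change:

    REGEDIT4

    ; scsi-queue.reg -- double the Symbios SCSI command queue depth from 32 to 64 (0x40)
    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\symc8xx\Parameters\Device]
    "NumberOfRequests"=dword:00000040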

Monolithic SCSI Drivers

In an effort to improve SCSI host bus adapter performance, some vendors have implemented what are referred to as monolithic device drivers. These specialized drivers appear to have been born out of the need to improve NT Server's performance for the TPC-C benchmark. Monolithic device drivers incorporate the NT Server SCSI port driver and miniport driver into one large driver, reducing the number of code path steps a disk I/O request must traverse. The benefits are twofold: these drivers tend to improve disk I/O performance on heavily utilized disk subsystems, and they require fewer CPU cycles to carry out the actual disk I/O activity, which is particularly beneficial in a CPU intensive server environment. Vendors such as Compaq, Dell, NCR, and others offer monolithic device drivers for some of the SCSI host bus adapters that they use. Check the respective server or host bus adapter vendor's website or contact them directly for information on how to obtain this technology.

Properly Configure the Host Bus Adapter (HBA)

The host bus adapter setup determines the performance level at which the SCSI channels connected to it will operate. Because technology moves so fast, an HBA that is rated for Ultra Fast and Wide SCSI speeds is not guaranteed to be configured for them. As the server boots, or using the tools provided by the host bus adapter vendor, make sure the SCSI channel speeds are set to operate at the desired performance level. Some of the HBA settings to double-check and enable are: SCSI bus speed, tagged command queuing, disconnect, and wide transfers. You have an investment in the technology in your server; make sure you get to use it.

Update Host Bus Adapter Device Drivers and BIOS (Firmware)

This is one of the easiest techniques to implement to improve disk I/O performance. Manufacturers of disk adapters are constantly removing bugs and improving the performance of their drivers, and the latest drivers are typically available from the manufacturer's website. Even before installing NT Server for the first time, check that you have the latest, best performing stable disk adapter device driver available, and continue to check the manufacturer's website periodically. It is amazing how much performance you can gain through improved device drivers. Also, don't forget to keep the HBA's embedded BIOS (firmware) current too.

RAID Tuning Considerations

RAID Array Background Services

When NT Server utilizes a disk array, whether implemented in software or hardware, the data consistency of the array itself is not checked during regular disk I/O operations. Parity or mirrored operations are completed, but the health of the disk array is not verified. Either through hardware or NT background services, vendors commonly implement routines to periodically check the disk array's health. This is an important activity that should occur on a regular basis.

For example, the SMART-2 Array HBAs from Compaq default to checking the disk array's health every hour. On very busy servers, consider lowering the frequency of these health checks or postponing them to periods of lower workload. By controlling the frequency of the health checks, you control the overhead they place on array performance while still ensuring the health of the array. Array checking techniques differ between vendors, so review the vendor's documentation closely on this one.

Tuning RAID Controller Cache

If the RAID host bus adapter has a built-in cache and a battery backup unit, the general rule of thumb is to configure it with write back caching turned on; the default setting for most adapter caches is write through. Enabling the write back cache is particularly helpful in write intensive environments implemented with RAID Level 5 disk arrays where there are pauses between periods of heavy disk activity. When your environment is characterized by heavy disk write activity followed by a lull, the write back cache takes advantage of the slowdown to write the cached data to disk.

There are, however, some instances when write back cache is not as helpful. One is when log files are continuously being written: when data is constantly being written to a disk array, the write back cache's effectiveness is diminished because there is never a workload slowdown in which to flush the cache. For these continuous activity environments, write through caching can provide a higher level of performance. Many hardware RAID vendors allow you to selectively control how the cache is used on each array under the controller's control. For example, if you have a RAID 1 mirror configured for the application logs and a RAID 5 array configured for a transaction based database, you can set write through caching for the RAID 1 mirror and write back caching for the RAID 5 array. Customizing cache settings to match the workload's characteristics helps improve the performance of the overall disk subsystem. This technique can be quite effective but is application dependent, and it should always be tested with the specific application and RAID technology in use before fielding.

Setting RAID Stripe Sizes

Ideally, you want the typical server application I/O size to match the stripe size of the RAID array. This concept was investigated earlier in this chapter under RAID Stripes; use this condensed version as a reminder. Depending on the hardware RAID implementation, there are numerous options for setting the RAID array stripe size. When you do set the array's stripe size, the array will require a new low level format using the hardware RAID vendor's tools. Stripe size is not the same as the ALU, which can be set under NT Server with the format command. If you implement software based RAID arrays using NT Server, the stripe size cannot be tuned.

Short of precisely tuning the RAID array stripe size, consider at least setting it according to your general disk subsystem workload characteristics. In general, larger stripe sizes provide better performance for large sequential data transfers, while smaller, random data transfers benefit from smaller stripe sizes (chunks).

Group Similar Disk Drives into the Same Array for the Best Capacity and Performance

It is tempting to build a disk array from different disk drive makes and models; whatever is lying around the office or lab will do, right? If you place a lower capacity disk drive into an array with higher capacity disk drives, you lower the array's overall capacity. For example, placing a 2 GB disk drive into a RAID 0 (stripe) array with four other 4 GB disk drives limits the overall array capacity to 5 x 2 GB = 10 GB, not 4 x 4 GB + 2 GB = 18 GB. The same concept applies to performance: grouping one older 5,400 rpm drive into the same array as 10,000 rpm disk drives lowers the overall performance of the array.

Use More than One RAID Level in Your Solution: Group Similar Disk Characteristic Work Loads

You can group all of your disks into one large array and let the server applications have at it. With fast enough hardware in large enough quantities, this tactic actually can provide acceptable performance. Unfortunately, implementing this tactic can become expensive and may not meet other requirements, such as the required fault tolerance levels. To properly lay out the disk subsystem, you need to understand the technology you are implementing (hardware and software) and the server disk subsystem workload characteristics.

Either by reading those famous manuals (RTFM) that come with the software products or by using Perfmon and ntfilmon.exe (https://www.NTinternals.com), try to determine the characteristics of the disk I/O activities occurring on your server. Determine which applications exhibit sequential activity, which exhibit random activity, and which are read or write intensive. A little research before you begin provides insight into how the server applications actually behave.

Once a server is fielded, user access patterns influence which areas of the disk subsystem are used more heavily than others. To help determine which disk resources are under siege, use the Perfmon counters listed in Figure 6–10 (Primary Counters for Disk Bottleneck Detection) to determine which disk resources are used more heavily than others. For an even more granular view of what is going on in your disk subsystem, review the Perfmon counters in Table 6.12.

Table 6.12 Perfmon counters.

Logical Disk: % Disk Write Time
Definition: The percentage of elapsed time that the selected disk drive is busy servicing write requests.
Rule of thumb for bottleneck detection: Crucial in determining disk I/O characteristics, which directly influence how to lay out and size the disk I/O subsystem.

Logical Disk: % Disk Read Time
Definition: The percentage of elapsed time that the selected disk drive is busy servicing read requests.
Rule of thumb for bottleneck detection: Crucial in determining disk I/O characteristics, which directly influence how to lay out and size the disk I/O subsystem.

Logical Disk: Disk Read Bytes/sec; Disk Write Bytes/sec
Definition: The rate at which bytes are transferred from (or to) the disk during read (or write) operations.
Rule of thumb for bottleneck detection: These counters provide insight into the throughput being used for read or write operations, which is helpful when spreading disk subsystem activity across multiple host bus adapters, SCSI channels, disk drives, and RAID levels.

It is sometimes helpful to draw a logical picture of the server's disk subsystem or build a table of the disk usage for each resource. With this data, you can then determine how and where to rearrange or add additional disk subsystem hardware.

Once you have determined the characteristics of your disk activities, group similar activities onto the same disk drives or arrays; this is a corollary to the Evenly Distribute File System Activity rule. For example, place sequential, write intensive log files on a separate disk drive and file system rather than on a general user database area characterized by a random, read intensive workload. Once the workload characteristics are evenly distributed, group disk activities across the disk subsystem using the various RAID levels, based on the performance guidelines provided in the next section. For example, to improve the performance of sequential, write intensive log files, place them on RAID 0, 1, or 0+1 arrays and avoid RAID 5. For a predominantly random, read intensive environment, RAID 0 and 5 are good selections. The more clearly you understand your server's environment, the better tuned your NT Server will be.

Data Placement and RAID Selection

With the information presented above fresh in your mind, you are in an educated position to decide which RAID level, or combination of RAID levels, to implement, where to add disk drives, and perhaps where to take them away. In Chapter 8's electronic mail case study, the disk subsystem workload characteristics for Microsoft Exchange are reviewed very closely. Table 6.13 repeats that high performance and availability example of distributing the disk workload among various RAID levels to meet the performance and availability objectives.

Table 6.13 Distributing Disk Subsystem Workloads Across Multiple RAID Arrays.

C: (SCSI bus 1, 2 disk RAID 1 mirror) holds the NT operating system and Exchange binaries.
Motivation: Critical resource; if NT Server does not boot, the server is useless. A RAID 1 mirror allows the loss of a drive without loss of operation.

E: (SCSI bus 1, 2 disk RAID 1 mirror) holds the Information Store logs.
Motivation: Log data is sequential and write intensive. RAID 1 does not provide the best performance, but it provides much better write performance than RAID 5 and still provides increased data availability.

F: (SCSI bus 1, 2 disk RAID 1 mirror) holds the Directory Service logs.
Motivation: Same as above.

D: (SCSI bus 2, 1 disk, no RAID, JBOD) holds the NT paging file. Note: a dedicated single drive is used; heavy paging is not in the design plan.
Motivation: NT Server paging activity is sequential in nature and requires good read and write performance. This drive is not as critical as the other data drives, but it provides good performance and overall value.

G: (SCSI bus 2, 2 disk RAID 1 mirror) holds the Message Transfer Agent.
Motivation: General read/write database characteristic workload that is deemed critical for Exchange operation. Same motivation as the other RAID 1 implementations above.

I: (SCSI bus 2, 3 disk RAID 5 array) holds the Directory Service.
Motivation: Random read/write database characteristic workload. RAID 5 provides excellent read performance with cost effective fault tolerance.

E: (SCSI bus 3, external array, 10 disk RAID 5 array) holds the Private Information Store.
Motivation: Intense random read/write database characteristic workload; the heaviest used disk subsystem in the case study environment. A RAID 5 array composed of a large number of disk drives provides excellent random read performance and a sound level of fault tolerance.

E: (SCSI bus 3, external array, 3 disk RAID 5 array) holds the Public Information Store (if used).
Motivation: Potentially intense random read/write database characteristic workload. RAID 5 provides excellent random read performance and a sound level of fault tolerance.

Disk Capacity Storage and Disk Capacity Performance

This concept is reviewed in-depth throughout this chapter. Because it is so important but commonly misunderstood, I'll review the condensed version again.

Should you choose three 9 GB disk drives in a RAID 5 array, which provides 18 GB of usable storage, to meet an 18 GB disk requirement, or six 4 GB disk drives in a RAID 5 array, which provides 20 GB of usable storage? Both solutions meet the 18 GB usable storage requirement, but the three 9 GB solution provides a lower level of performance. Which to choose depends on the level of performance required and the economics of your situation.

If economics is a greater concern than disk performance scalability, the three 9 GB disk drives meet the requirement. Alternatively, if performance is the more important factor, choose the six 4 GB (or even twelve 2 GB) disk drive solution. Mileage varies with every environment, but in general, adding disk drives (spindles) can greatly improve your disk subsystem's throughput and the number of I/Os per second it supports. The three 9 GB disk drive solution can support approximately 300 I/Os per second and a sustained physical drive sequential read throughput of 6 Mbytes/sec, while the six 4 GB disk drive solution supports approximately 600 I/Os per second and a sustained sequential read throughput of 12 Mbytes/sec. This comparison assumes disk drives from the same family (SCSI connection, rpm level, etc.) at different capacities.

Additional Disk Subsystem Hardware—The Last Resort

As with any resource that has become a bottleneck, additional resources can normally be added if all other efforts to improve performance do not meet your requirements, or if you just need more disk storage capacity. Always select the fastest components that have a positive life cycle (remember VESA versus PCI years back).

For the disk subsystem, this involves selecting the fastest disk drives and disk adapter technology available and inserting them into the fastest I/O bus available. For example, if there were a need to add three additional disk drives, selecting a PCI based SCSI-3 Ultra Fast/Wide disk adapter and SCSI-3 Ultra Fast/Wide disk drives rotating at 10,000 rpm would be a good place to start.

Thinking Outside of the Box

RAM Disks

No matter how well you have sized and tuned your NT Server disk subsystem, it is still composed of physical devices that are slow compared to RAM. Regardless of how well you have load balanced and tuned the disk subsystem, there may be times when "hot spots" still exist. Hot spots are defined here as areas of the disk subsystem from which applications consistently request services, to the point that the applications' performance suffers. Consider another technique to alleviate hot spot situations: implementing RAM disks.

RAM disk technology is available in two forms. The first form is also referred to as a solid state disk drive. These devices look similar to normal disk drives but are composed of nonvolatile or battery backed RAM that appears to NT Server as a regular disk drive. Solid state disks connect to an NT Server over a SCSI connection but are much faster than traditional disk drives. The second type of RAM disk is software based and allows you to configure part of NT Server's memory subsystem to appear as a disk drive. This is what we will investigate here.

RAM disk support is not native to NT Server, so I obtained a RAM disk package from EEC Systems (https://www.eecsys.com) named SuperDisk-NT. An evaluation copy of this software is included on the CD-ROM that accompanies this book and is also available from https://www.SunBelt-Software.com. This software actually provides two RAM disk options. The first is a traditional RAM disk in which you allocate some of NT Server's main memory to act as a disk drive. The second allows for a disk drive backed RAM disk implementation; in other words, the RAM disk is mirrored to a physical disk drive.

Installing the SuperDisk-NT product is similar to installing other NT Server applications. To configure the RAM disk, bring up the SuperDisk configuration tool, set the RAM disk size as needed, and reboot the server. Server RAM designated for the RAM drive is removed from use by NT Server's Virtual Memory Manager; for example, if you create a 200 MB RAM disk on a server configured with 1 GB of RAM, only 800 Mbytes remain available for normal NT Server RAM operations. Once the server is rebooted, another drive letter (S: in this case) is available to use as your RAM disk. Files can be placed into this area either manually or via a startup script, and when the applications that use these files are launched, they will operate at much higher performance levels while interacting with the RAM disk. This is a great way to speed up your disk subsystem, but there is a drawback: when power is removed from the server, or if the system crashes, any files located in the RAM disk area are lost. Critical files probably should not be placed on a RAM disk; if you do place them there, you must copy them back to a normal disk drive before powering down NT Server. As with all tuning, there are tradeoffs for improved performance.

Another option provided by SuperDisk-NT is to back the RAM disk with a normal disk drive. This option delivers the higher performance levels of the normal (non disk drive backed) RAM disk, but saves the data written to the RAM disk using a lazy write process that periodically copies the data from the RAM disk to a designated NT Server partition (the same lazy write concept was reviewed earlier in this chapter). The goal of this combination is to provide improved performance through the RAM disk along with nonvolatile data storage; data is not lost when NT Server is powered down. When NT Server boots, a SuperDisk-NT driver again creates the RAM disk and then populates it with the data from the RAM drive backup partition.

RAM Disk Performance

So, how much performance improvement might you expect when using RAM disk technology? To address this question, the Neal Nelson Business Benchmark for NT was again employed. Test 18 (Sequential Reads of 1024 Kbyte Records from NT Files) and Test 19 (Sequential Writes of 1024 Kbyte Records to NT Files) were selected and executed at the 20-copy workload level. A 20-copy workload level exercises the disk subsystem under a load of 400 simulated users using a combined work file size of 700 Mbytes. The system under test was an NCR 4300 configured with NT Server 4.0 (Service Pack 3), 1 GB of RAM, 3 Ultra Fast/Wide SCSI 10,000 rpm disk drives, and 4 Pentium Pro 200 MHz/512 K cache CPUs.

To accommodate the 700 Mbyte benchmark work file, an 800 MB RAM disk was created using SuperDisk-NT. So that the relative performance provided by the RAM disk could be more easily related to, the same benchmark was also run against a single Ultra Fast/Wide SCSI 10,000 rpm disk drive. Table 6.14 reports the benchmark results.

Table 6.14 RAM Disk benchmark results

                                         Ultra Fast/Wide SCSI        Disk Backed RAM Disk    RAM Disk Only
                                         10,000 rpm disk drive       (Mbytes/sec)            (Mbytes/sec)
                                         (Mbytes/sec)
Test 18: Sequential 1024 Kbyte reads     2.3                         18.6                    23.5
Test 19: Sequential 1024 Kbyte writes    0.76                        11.59                   15.7

RAM Disk Performance Results

The performance improvements provided by the RAM disk are dramatic. More important than the specific throughput values reported are the relative performance differences between the RAM disk and a traditional disk drive: in the read intensive test the RAM disk provides over 10 times the performance of a normal disk drive, and in the write intensive test over 15 times the performance. Even when the RAM disk is backed by a normal disk drive, the improvement is drastic. With this information you can decide for yourself whether the additional cost of extra RAM for use as a RAM disk is justified by the increase in performance.

Using a RAM disk to improve disk subsystem performance is not a replacement for utilizing sound tuning and sizing techniques for the rest of your disk subsystem. It does, however, provide a strong ally to add to your arsenal of tuning techniques. The benchmark used in this test is a particularly intense multiuser benchmark. The performance you achieve is influenced by the various elements of your specific environment such as: user workloads, application in use, CPU performance, type of memory, and memory bandwidth (system bus) of your server.

Other Possibilities for RAM Disk Technology

Now that you are familiar with some of the positive and negative aspects of using a RAM disk product like SuperDisk-NT, there are numerous areas you may consider when implementing this technology to improve the disk subsystem performance of your NT Server solution. The most common justification for using a RAM disk is to remove disk subsystem hot spots, but don't let this one example limit your imagination. Other areas where you might employ RAM disk technology are:

  • TEMP directory/file system replacement

    This technique is similar to what is done in some UNIX implementations such as Solaris. Use a RAM disk to replace the TEMP directory. Some applications use the TEMP directory to store frequently accessed but temporary files, making it a potential hot spot (a brief sketch of this approach follows the list below).

  • Database Enhancements

    If you have a database and enough memory, it is possible to place the entire database on a SuperDisk-NT partition. If you don't have a lot of memory, but the index to your database can be separated from the rest of the database, you could place it on RAM Disk mirrored to a disk drive.

  • Web Site Enhancements

    If your server is hosting a web site, you can improve the access speed of your web pages by placing the web site, or frequently accessed web pages, on RAM Disk mirrored to a disk drive.

  • Pagefile replacement

    Even if your NT Server is not paging heavily, small areas of the Pagefile are commonly accessed on a regular basis. For some environments, replacing the Pagefile with a disk backed RAM disk solution can improve end user response times.
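As an illustration of the TEMP directory and file staging ideas above, a startup script along these lines could populate the RAM disk each time the server boots; this is a minimal sketch that assumes the RAM disk received drive letter S: as in the earlier example, and the staged directory names are hypothetical:

    @echo off
    rem stage-ramdisk.cmd -- hypothetical startup script run after the S: RAM disk is created
    rem Create a TEMP area on the RAM disk and point this session's environment at it
    if not exist S:\Temp mkdir S:\Temp
    set TEMP=S:\Temp
    set TMP=S:\Temp
    rem Stage frequently read, noncritical files onto the RAM disk to relieve a disk hot spot
    xcopy D:\AppData\Lookup\*.* S:\Lookup\ /E /I /Q

To make the TEMP redirection permanent rather than per session, change the TEMP and TMP variables in the System control panel instead; and remember that anything written only to S: is lost if the server loses power or crashes.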

You may think of even more options. Deciding when it is best to implement these options is based on your environment and requirements.

Summary

Review each link in the disk subsystem data path; it is the weakest link that throttles overall disk subsystem performance. In this chapter, the importance of understanding the entire path that disk subsystem data follows was examined closely, along with how NT Server tries to maximize the performance of the disk subsystem. With this information it becomes easier to identify potential bottlenecks that can throttle your server's overall performance. Building on this foundation, specific, practical examples of how to tune and size the disk subsystem were provided to help you get the maximum performance from NT Server's flexible disk subsystem.

About the Author

Curt Aubley, formerly the Senior System Architect and MCSE for NCR, is Chief of Technology at OAO Corporation and author of many published articles on Windows NT.

Copyright © 1998 by Prentice Hall PTR

We at Microsoft Corporation hope that the information in this work is valuable to you. Your use of the information contained in this work, however, is at your sole risk. All information in this work is provided "as-is", without any warranty, whether express or implied, of its accuracy, completeness, fitness for a particular purpose, title or non-infringement, and none of the third-party products or information mentioned in the work are authored, recommended, supported or guaranteed by Microsoft Corporation. Microsoft Corporation shall not be liable for any damages you may sustain by using this information, whether direct, indirect, special, incidental or consequential, even if it has been advised of the possibility of such damages. All prices for products mentioned in this document are subject to change without notice. International rights = English only.


1 Paraphrased from NT File System Internals, p. 353
