Winsock Direct: The Value of System Area Networks

Article
12/09/2009

Abstract

This paper provides a general overview of system area networks (SANs) and how organizations can benefit from deploying SANs using Winsock Direct. This paper outlines the value of SANs; provides details on Microsoft’s Winsock Direct technology for use with SANs; contrasts the performance of Winsock Direct with Transmission Control Protocol/Internet Protocol (TCP/IP)-based networks; and discusses the future of SAN and Winsock Direct technologies in the upcoming Whistler Server operating system.

Introduction

Winsock Direct

Today, most organizations use their existing Transmission Control Protocol/Internet Protocol (TCP/IP)-based networks for communications between servers in their data centers. For instance, an application running on a server or farm of servers that needs to interact with a back-end database may use an existing TCP/IP-based network. The use of TCP/IP-based networks is often perfectly acceptable, but in some situations, particularly high-traffic data centers, this approach wastes valuable server resources on network communications.

System Area Networks (SANs) are designed to free up valuable server resources—such as central processing unit (CPU) cycles—which in turn provides more resources to the application running on the server. This ultimately results in:

Increased capacity of servers and applications.
Improved scale-out options due to reduced overhead in scale-out strategies.
Faster communications between servers in a multi-tier environment.

Most SANs require applications specifically designed to support the underlying hardware. As a result, very few applications are currently available for deployment in SAN environments. To address this issue, Microsoft has devised a new approach. Rather than developing new application programming interfaces (APIs) and getting independent software vendors (ISVs) to modify their applications, Microsoft has created Winsock Direct, a new protocol that fits underneath the Winsock API and integrates existing server applications into SAN environments. Winsock Direct bypasses the kernel networking layers and communicates directly with the SAN hardware. Winsock Direct was introduced with Microsoft Windows 2000 Datacenter Server and is now available in Windows 2000 Advanced Server with the delivery of Service Pack 2. It is also available in the Windows Embedded family of operating systems. The Winsock Direct features and performance outlined here are the same on all three versions of the operating system. For brevity, this white paper discusses Winsock Direct in relation to Windows 2000 Advanced Server only.

The availability of Winsock Direct gives IT departments a broad choice of server applications to deploy in their SANs. Organizations can now deploy their existing applications—without modification—in a SAN environment and immediately benefit from increased capacity, more scale-out options, and faster communication.

Testing Winsock Direct Performance

To quantify the performance benefits of using Winsock Direct, Microsoft performed several tests comparing the technology to TCP/IP. This paper details the results of those tests. Three main application categories were tested:

Front-end Internet Information Services (IIS) and application server to back-end database server
Application-to-application communication
Backups and content distribution

Thus, the objectives of this paper are to:

Outline the benefits of SANs.
Explain how Winsock Direct works with SANs.
Demonstrate the performance benefits of using a SAN in a Windows 2000 Advanced Server- or Windows 2000 Datacenter Server-based environment.

Overview of Winsock Direct

The Tiers of the Data Center

The data center has evolved considerably in the last decade. Today’s data center may be an e-commerce site such as an on-line retail store or airline reservation system, or it can be a private corporate data center hosting an inventory control or payroll system. A typical data center comprises multiple application-specific tiers.

The volume of networking traffic within a tier and between tiers is dependent on the type of applications being run on each tier. The number of tiers maps to individual customer requirements. Typical Internet service provider (ISP) and application service provider (ASP) models are generally logically divided into a three-tier architecture. The first tier is the interface to the wide area network (WAN) and hosts a Web server. The second tier is typically the application server tier, where customer-specific applications are run. The third tier is typically the backend database.

Additional tiers are often added to either balance the load within a given tier or provide firewall protection against attacks. Network load balancing can be installed in front of the Web tier to balance connections across the tier. For the application tier, some form of a component load balancing can be deployed to balance the execution of application components across the tier.

Each tier usually has fault tolerance capabilities. The Web server tier typically runs on less powerful computers and fault tolerance is achieved through massive redundancy of computing resources. Because there is little to no shared state between the servers, fault tolerance is achieved by simply routing the traffic around failed nodes. The application tier typically runs on more powerful computers. State is often shared between systems, so fault tolerance requires that state be failed over to the backup system and that the load be reallocated. The same issues apply to the database tier. Winsock Direct works well with existing fault tolerant technologies, but for brevity this white paper does not examine failover or firewall deployment scenarios.

Although the three-tier architecture is typical, many data centers are actually deployed in two tiers, with the Web and application tiers collapsed into a single tier. These layers can be collapsed for several reasons: the application doesn’t scale well if remote from the Web server, the application does not need the additional resources of a separate application tier, or the organization is trying to reduce the number of systems to administer. In any case, this paper views data centers as using a logical three-tier architecture that is physically implemented as either two or three tiers.

Several systems that configure and manage infrastructure interface with the tiers. Some of the most common are backup systems, which back up state on the second and third tiers, and content servers, which serve content to the Web caches on the first or second tier. These applications can potentially send terabytes of data, and in today’s 7x24 data center environment, it is imperative that these operations do not significantly affect the performance of the data center.

Introduction to SANs

The Microsoft architectural model for SAN is to extend bus-oriented semantics across the network. This includes the following characteristics:

Reliable, in-order delivery with deterministic error models
Two transfer modes:
- Message semantics (also called send semantics): packets are sent without specifying the destination buffer location.
- Remote Direct Memory Access (RDMA) semantics: packets are sent specifying the destination buffer location. RDMA semantics can include an RDMA Write, an RDMA Read, or both.
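
The difference between the two transfer modes can be illustrated with a small in-memory model. This is a conceptual sketch only, written in Python rather than against any SAN provider interface; the `Endpoint` class, its method names, and the buffer sizes are hypothetical and exist purely for illustration.

```python
class Endpoint:
    """Toy receiver: a block of registered memory plus a queue of posted receive buffers."""

    def __init__(self, size):
        self.memory = bytearray(size)   # stands in for memory registered with the NIC
        self.posted = []                # (offset, length) slots posted by the receiver
        self.completed = []             # (offset, length) of messages delivered by send

    def post_receive(self, offset, length):
        # The receiver, not the sender, decides where the next message will land.
        self.posted.append((offset, length))

    def deliver_send(self, payload):
        # Message (send) semantics: the sender names no destination location.
        if not self.posted:
            raise RuntimeError("no receive buffer posted")
        offset, length = self.posted.pop(0)
        if len(payload) > length:
            raise ValueError("payload larger than posted buffer")
        self.memory[offset:offset + len(payload)] = payload
        self.completed.append((offset, len(payload)))

    def rdma_write(self, offset, payload):
        # RDMA Write semantics: the sender specifies the destination location.
        if offset < 0 or offset + len(payload) > len(self.memory):
            raise ValueError("write outside registered memory")
        self.memory[offset:offset + len(payload)] = payload

    def rdma_read(self, offset, length):
        # RDMA Read semantics: the initiator pulls data from a location it specifies.
        if offset < 0 or offset + length > len(self.memory):
            raise ValueError("read outside registered memory")
        return bytes(self.memory[offset:offset + length])


receiver = Endpoint(64)
receiver.post_receive(offset=0, length=16)
receiver.deliver_send(b"hello")      # lands wherever the receiver posted a buffer
receiver.rdma_write(32, b"world")    # lands at the offset the sender chose
```

In this model the only difference between the modes is who chooses the destination: the receiver (by posting a buffer) for a send, or the initiator (by naming an offset) for an RDMA operation.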

This SAN abstraction has been mapped to a wide variety of SAN fabrics, including several proprietary SAN interconnects (such as VIA fabrics) and custom shared memory interconnects. In the near future, third parties will also ship support for SAN capabilities on Fibre Channel and Ethernet. InfiniBand support will follow later.

SAN hardware usually implements most of its data transfer capabilities in hardware through methods that include:

Segmentation and reassembly (SAR) of buffers in hardware. Typically an application posts a buffer that is larger than the packet size for the network. SANs can slice the buffer into packet-size chunks and reassemble it at the destination, offloading this from the host CPU.
Kernel bypass. SANs typically enable data to be sent and received directly into the user application, bypassing the kernel. All the protection mechanisms enforced by the kernel are then placed into the network interface card (NIC).
Interrupt moderation. If the application is not currently processing network packets, it must be interrupted to operate on the incoming or outgoing data stream at periodic intervals. This interruption can involve significant overhead; SANs have sophisticated mechanisms to minimize this overhead.
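
The segmentation and reassembly step described in the first item above can be sketched in software to show the work that SAN hardware takes off the host CPU. This is an illustration only: the function names, the sequence-number scheme, and the 1,500-byte packet size are assumptions, not details of any particular SAN.

```python
def segment(buffer, packet_size):
    """Slice an application buffer into (sequence, chunk) packets."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    return [
        (seq, buffer[start:start + packet_size])
        for seq, start in enumerate(range(0, len(buffer), packet_size))
    ]


def reassemble(packets):
    """Rebuild the original buffer from (sequence, chunk) packets.

    Sorting by sequence number is only a safeguard in this sketch; a SAN
    that guarantees in-order delivery would not need it.
    """
    return b"".join(chunk for _, chunk in sorted(packets, key=lambda p: p[0]))


payload = bytes(range(256)) * 40            # a 10,240-byte application buffer
packets = segment(payload, packet_size=1500)
restored = reassemble(reversed(packets))    # hand the packets over out of order on purpose
```

When the NIC performs this slicing and joining, the host posts one large buffer and never touches the individual packets.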

Windows 2000 Advanced Server with Service Pack 2, Windows 2000 Datacenter Server, and Windows Embedded directly support these capabilities.

Microsoft’s Approach to SANs

Most companies have chosen either to create new application programming interfaces (APIs) to enable SAN capabilities or to provide simple sockets-like interfaces, which require applications to change their networking interface. Microsoft has created the only binary-compatible, SAN-enabled sockets protocol available today. Existing applications that use SOCK_STREAM sockets through the Winsock 2 API can run without modification over a SAN, with performance gains similar to those possible by programming directly to the hardware. Applications therefore do not need to program to SAN-specific, and often SAN-vendor-specific, APIs. Instead, organizations can simply install a SAN and the Winsock Direct drivers to obtain substantial performance improvements for their existing applications.

Figure 1 below outlines the differences between the traditional networking model and Winsock Direct. The “switch” is the key to enabling Winsock Direct: it redirects the application's Winsock calls to Winsock Direct rather than sending them down the conventional TCP/IP path. Winsock Direct then calls down to the SAN provider to manage fabric-specific and hardware-specific issues.

Figure 1: Winsock Direct and SAN Architectural Model

Winsock Direct has been designed to be SAN independent. The SAN Provider Interface (SPI) enables many varieties of SAN to be implemented, including Giganet cLAN, Compaq Servernet II, Fibre Channel, and eventually InfiniBand.

The Benefits of Winsock Direct for the Enterprise

Many customers face the problem of how to enable corporate data centers to handle increased workload. Customers need to deploy faster CPU/memory architectures or examine CPU off-load mechanisms. Off-load mechanisms work well for software modules that consume large portions of CPU or memory bandwidth and are modular. Networking protocol stacks and applications that are written at the logical second tier meet these criteria.

A data center’s capability to handle increased workload can be improved by deploying SAN for several reasons:

Offloading network protocol overhead allows existing CPU and memory bandwidth resources to be redeployed to the application, thus improving application performance.
Dividing the enterprise into three physical tiers allows Web serving and application serving to be divided between different physical resources. Winsock Direct enables SAN to be deployed in this scenario by seamlessly working underneath existing libraries for distributing the workload. Winsock Direct reduces the library overhead, allowing application distribution to scale better than with conventional network infrastructures.
Offloading network protocol overhead can substantially reduce system demand due to backups or content distribution to the caching servers, enabling new online backup and content distribution scenarios.

Winsock Direct is key in deploying SAN technology because it enables all the above scale-out strategies without requiring custom applications. The system administrator can tune existing application scenarios by moving components around the SAN without having to create custom versions of each application for each scenario.

Protocol Offload

Protocol stack overhead can directly affect the performance of server applications that are constrained by CPU or memory bandwidth. In effect, reducing protocol stack overhead frees up CPUs in a multi-processor system for the application. Put in simple terms, an eight-processor system can perform like a 10- or 12-processor system by adding a SAN adapter that supports protocol offload.
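
One way to read the eight-processor comparison is as a simple capacity model. The sketch below assumes, purely for illustration, that a fixed fraction of every CPU's cycles goes to protocol processing and that offload removes all of it. The function name and the sample fractions are assumptions, not measured values.

```python
def effective_processors(cpus, network_fraction):
    """Processor-equivalents of application capacity once the network share is offloaded.

    Idealized model: each CPU spends `network_fraction` of its cycles on
    protocol work, and offload removes all of that work.
    """
    if not 0 <= network_fraction < 1:
        raise ValueError("network_fraction must be in [0, 1)")
    return cpus / (1.0 - network_fraction)


for fraction in (0.20, 0.30, 1 / 3):
    print(f"{fraction:.0%} network share -> "
          f"{effective_processors(8, fraction):.1f} processor-equivalents")
```

Under this model, an eight-processor system behaves like a 10-processor system when one-fifth of its cycles were going to the network, and like a 12-processor system when one-third were.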

Current first-generation network protocol offload algorithms provide TCP checksum offload and large send offload. Many vendors are pursuing a second-generation offload mechanism that is centered on offloading the transport stack (typically TCP) to the network adapter. Microsoft has focused on not only offloading the transport protocol stack, but also offloading or avoiding the work that occurs between the application and the protocol stack. This approach provides superior offload characteristics compared to second-generation offload techniques. Note that this is not a research project—this capability is available with Windows 2000 Datacenter Server and Windows 2000 Advanced Server with Service Pack 2, and existing application binaries work without modification.

Application Offload

In addition to protocol offload, data centers can also handle increased workload through scale-out, the addition of more servers to the data center to handle increased load. Scale-out can be done in several ways:

Add additional servers within a tier. For the second and third tiers, there is significant shared state that must be synchronized within the tier. If network overhead is substantial, this synchronization overhead can limit the range of scalability.
Convert from a physical two-tier solution to a physical three-tier solution. However, there is a possible roadblock to physically separating the two tiers—the overhead associated with running the application remotely rather than on the same machine. Winsock Direct substantially reduces this overhead, enabling scale-out in this fashion where it was not appropriate with TCP-based technologies.

In either case, if network communication overhead is high (which is typical for conventional networks), scale-out results can be substantially less than expected due to a large percentage of the added resources being consumed by networking overhead rather than being applied to the application. In some cases, as much as 30 percent of the CPU cycles may be spent on network communication within the data center. Winsock Direct substantially reduces this overhead, making the scale-out solution more compelling by eliminating the wasted CPU cycles traditionally associated with scale-out solutions.

The Benefits to ISVs

Independent software vendors (ISVs) also benefit from Winsock Direct. They no longer need to ship custom binaries to enable each instantiation of a SAN for customers using Windows 2000 Datacenter Server or Windows 2000 Advanced Server with Service Pack 2. Instead, ISVs simply program to the Winsock API. If a SAN is installed, the operating system will automatically detect it and steer the traffic over the SAN using Winsock Direct.
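
To make "programming to the sockets API" concrete, here is a minimal stream-socket echo exchange. It is a sketch in Python rather than C; on Windows, CPython's `socket` module is built on Winsock, so the calls below correspond to ordinary SOCK_STREAM usage. The loopback address, the port choice, and the `serve_once` helper are assumptions made for illustration. Loopback traffic would not cross a SAN, and whether a given connection is carried by Winsock Direct depends on the installed SAN provider, which the application code cannot see.

```python
import socket
import threading


def serve_once(listener):
    """Accept one connection and echo back whatever arrives until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


# An ordinary stream-socket server; nothing here names a transport or a fabric.
listener = socket.create_server(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
port = listener.getsockname()[1]
worker = threading.Thread(target=serve_once, args=(listener,), daemon=True)
worker.start()

with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"ping")
    client.shutdown(socket.SHUT_WR)     # signal the end of the request
    reply = b""
    while True:
        chunk = client.recv(4096)
        if not chunk:
            break
        reply += chunk

worker.join(timeout=5)
listener.close()
print(reply)
```

The point of the sketch is what is absent: the application makes no SAN-specific call, which is the property that lets the operating system choose the path underneath it.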

Organizations’ use of Winsock Direct to solve scale-out problems assures ISVs that their customers can solve CPU bottlenecks, thereby allowing software vendors to develop innovative distributed computational capabilities that enable new scale-out scenarios and data center capabilities. Additional modularization of ISV applications will also provide for enhanced scale-out scenarios. This work can be done knowing that Winsock Direct allows the same application to be deployed on single systems and on highly distributed systems.

Performance Capabilities of Winsock Direct

Winsock Direct Performance Overview

Winsock Direct provides network acceleration for a broad range of applications, and is most useful for applications that have a high percentage of network usage. This white paper provides performance measurements for three main application categories:

Front-end Internet Information Services (IIS) and application server to back-end database server
Application-to-application communication
Backups and content distribution

This section shows that Winsock Direct can provide significant performance gains between tiers no matter what the physical configuration.

Winsock Direct Performance for Databases

To test the performance of Winsock Direct, Microsoft selected the Doculabs @Bench Nile benchmark application. Nile is a good example of a two-tier architecture in which the Web server and application server run on the same host and communicate to a back-end database. Modeled after an online bookstore, Nile is a simple Web-based application.

All benchmark applications are simple applications meant to test the ability of an underlying technical infrastructure to handle the load placed on real-world applications. Real-world applications are often much more complex, of course, and typically incorporate communication with internal legacy systems and much more complex logic than found in Nile. Yet Nile serves a purpose—it highlights how well an application server can handle the core tasks of database communication, dynamic page generation, and transactions. These are the bread and butter tasks of any application server product.

Functionally, the Nile application is loosely based on TPC-W, an emerging standard from the Transaction Processing Performance Council (TPC) for e-commerce benchmarks. The Nile application is designed to highlight the use of an application server for product catalogs, ad hoc product searches, product browsing, customer account management, shopping carts, and order transactions. Of the pages in the benchmark, 90 percent are dynamically generated from database content.

Doculabs specified the application, which meant that Microsoft had to adhere to many rules that limited testers’ ability to use the more advanced techniques that might be used for a real application. For example, no database information is ever cached in the Nile benchmark test, even though certain pages call for some level of caching. In addition, state had to be managed on the logical middle tier or data tier, with a single cookie used to map client sessions back to their individual shopping carts. Finally, security was not a consideration. All users come over anonymously with no data encryption. In a real world application, the customer account page and the transaction page would certainly use secure hypertext transfer protocol.

Results for the Nile benchmark test were first published in a July 1999 PC Week article, "Application Server Shootout." In the original Doculabs/PC Week benchmark, Microsoft Windows NT 4.0 on Compaq Proliant servers beat seven different UNIX-based application servers, including Sun Microsystems' iPlanet running on high-end Sun servers. The updated Doculabs report on Windows 2000, @Bench Test Report: Performance and Scalability of Windows 2000, by Marianne Pendleton and Gautam Desai, was published in August 2000.

Note: To download the Doculabs report, please go to https://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnbda/html/DoculabsWS.asp.

The Nile applications and installation instructions are available from the Microsoft MSDN Online Code Center at https://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/manprooracperf.asp.

The Test Platform

The test platform consisted of the following hardware interconnected as shown in Figure 2 below:

One Compaq Proliant 8500 with eight 550 megahertz (MHz) Intel Pentium III processors with 2 megabytes (MB) Level 2 (L2) cache, three gigabytes (GB) of random access memory (RAM), running Windows 2000 Advanced Server with Service Pack 2 and SQL Server 2000.
Four Compaq 8500 Web/application servers, each with eight 550 MHz Pentium III processors with 2 MB L2 cache and 512 MB RAM, running Windows 2000 Advanced Server with Service Pack 2.
Cisco gigabit network backbone.
100 Dell client workstations to generate load. The workstations were single-processor machines with 500 MHz processors and 128 MB RAM running Windows 2000 Server.

Figure 2: Nile Hardware Configuration

The Tests

Benchmark Factory from Quest Software was used to generate the load, measure the performance results, and track the error rates during the reliability tests.

The load characteristics of Benchmark Factory are similar to those of Microsoft's Web Application Stress tool, meaning that a small number of clients can put enormous stress on the system. To be safe, Microsoft used 100 client machines to ensure that the clients were never the bottleneck during the tests. To maximize the stress placed on the systems, Microsoft ran all tests with no “think time,” as did Doculabs in its original Windows NT 4.0 tests as published in PC Week, the follow-up August 2000 retest of Windows 2000 Advanced Server, and a later revised test. This means that one virtual "user" in the test equals many "real" users hitting the server. Doculabs estimates that one user with no think time equates to at least ten real users with a normal user-delay between page requests.

The Results

The primary benchmark metric for the Nile test is the number of Web pages served per second. For this white paper, the tests were run using Microsoft SQL Server™ 2000, Windows 2000 Advanced Server with Service Pack 2, and Internet Information Services 5.0. Figure 3 below shows the results for Microsoft SQL Server 2000 running over Winsock Direct and over TCP/IP on gigabit Ethernet.

The December update to the Nile benchmark used Windows 2000 Advanced Server and Microsoft SQL Server 7.0. The test was rerun with the same client-server topology and the same computers; the only hardware change was the addition of a SAN NIC. Software was updated from Windows 2000 to Windows 2000 Advanced Server with Service Pack 2 and from Microsoft SQL Server 7.0 to SQL Server 2000. This update improved performance for the TCP/IP case: pages served per second increased from 8,452 to 8,813. With Winsock Direct, the peak number of pages served per second was 9,940, or 13 percent more than with TCP/IP.
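
The percentages quoted for the Nile results follow directly from the pages-per-second figures. The short calculation below reproduces them; the helper name `percent_gain` is an illustrative choice.

```python
def percent_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


tcp_before = 8452       # pages/second, earlier TCP/IP result
tcp_after = 8813        # pages/second, TCP/IP after the software update
winsock_direct = 9940   # pages/second, peak with Winsock Direct

software_gain = percent_gain(tcp_before, tcp_after)
wsd_gain = percent_gain(tcp_after, winsock_direct)
print(f"software update: {software_gain:.1f}%  Winsock Direct over TCP/IP: {wsd_gain:.1f}%")
```

The software update alone accounts for a gain of roughly 4 percent; the 12.8 percent gain from Winsock Direct is measured against the updated TCP/IP result and rounds to the 13 percent quoted.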

Figure 3: Nile Benchmark Results

Winsock Direct Performance for Application-to-Application Networking

To assess scale-out, Microsoft also analyzed the effect of distributing the application tier on a tier separate from the Web server tier. Note that in a physical three-tier data center, the benchmark results above for database acceleration still apply for tier two to tier three communication. The use of three tiers provides data center administrators with another method of improving performance.

The goal in this section is to evaluate the overhead associated with remoting an object. Because objects can vary substantially in terms of how many CPU cycles they require to execute, this section focuses on benchmarking the performance of two of the most common network libraries used by applications: COM+ and remote procedure call (RPC). Both libraries enable the application to send a command across the fabric, execute it remotely, and return the results to the calling node. Understanding total system performance also requires examining the object being placed on the remote system. If the object requires substantial CPU cycles, then the relative cost of remoting the object is not significant. However, if the CPU requirements for an individual object are generally moderate, the overhead of remoting the object becomes significant unless Winsock Direct and SAN are deployed.

For either COM+ or RPC, the benchmark run calls into the library to execute a procedure on the remote node called a “Null” procedure, which is a procedure that does no work and simply returns the results to the local node.

Remote Procedure Call Library Performance

The test involved 16 client threads making NULL RPC calls to the server. For more test details please see Appendix B.

Table 1: Procedure Call Library Performance

| Test | RPC Calls/Second | Client CPU Load (percent) | Server CPU Load (percent) |
| --- | --- | --- | --- |
| Winsock Direct | 28,860 | 98.6 | 93.7 |
| TCP/IP | 21,310 | 95.6 | 82.5 |

The data shows that Winsock Direct provides a 35 percent increase in the number of RPC calls per second.

COM+ Library Performance

The COM+ library performance test involves a NULL COM method being invoked from an ASP page. The ASP page uses Visual Basic to perform the following operations:

Instantiate the COM object.
Invoke NULL method.
Render the return value from the method to generate the page to be served out to the HTTP client.

The architecture involves IIS running on one machine hosting the ASP page. A Web client running the Microsoft Web Application Stress Tool continuously issues 50 concurrent requests for the ASP page from the IIS server. The connection from the Web client machine to the IIS server machine is a 100 megabit per second (Mbps) Ethernet connection. For more test details please see Appendix B.

Microsoft ran three variations of the COM+ library performance test:

1. COM object installed on an IIS server. IIS was configured with “Application Protection” at Medium, which causes the COM object to reside in a separate process from the IIS process. This configuration is typical for running Web applications because it protects the IIS server from any faults in the application.
2. COM object remoted to a second server using COM+. The object’s COM+ proxy executes in the IIS process because the proxy is automatically generated code and there’s no danger of it failing and corrupting the IIS server process. The IIS server communicates with the COM server using Winsock Direct.
3. Same as variation 2 above, except that communication is through TCP/IP.

Please see Appendix B for more details of the test configuration.

Table 2: COM+ Library Performance

| Case | Pages per Second | IIS CPU Load (percent) | Remote CPU Load (percent) |
| --- | --- | --- | --- |
| Case 1 (local COM) | 138.8 | 92.8 | N/A |
| Case 2 (Winsock Direct) | 224.9 | 99.2 | 69.4 |
| Case 3 (TCP) | 176.9 | 95.0 | 68.6 |

Winsock Direct reduces CPU overhead significantly: with the object remoted over Winsock Direct, the system served 62 percent more pages per second than the local COM configuration and 27 percent more than the COM+ configuration over TCP/IP on gigabit Ethernet.

Another issue that arises when distributing objects through COM+ is whether the total CPU resources in the system are adequately used. For example, in the COM case, the test was run with two CPUs. For the TCP and Winsock Direct cases, the tests were run with two systems, each with two CPUs, for a total of four CPUs. An alternative way of examining the results is to normalize them to show how well each unit of CPU resource is being used. This is done by taking the results shown above and dividing them by the amount of CPU resources used: the percent CPU load multiplied by the number of CPUs, summed over the machines involved.

For the local COM case, the server served 138.8 pages per second at 92.8 percent CPU usage for two CPUs. This translates into 0.748 pages per CPU usage unit [138.8/(92.8*2)].
For remote object using TCP, this factor is 0.541 pages per CPU usage unit [176.9/(95.0*2 + 68.6*2)]—28 percent less than the local COM case.
For remote object using Winsock Direct, this factor is 0.667 pages per CPU usage unit [224.9/(99.2*2 + 69.4*2)]— only 11 percent less than the local COM case.

These results show that Winsock Direct enables efficient scale-out: doubling the number of CPUs in the system increased capacity by 62 percent, and the pages served per unit of CPU usage came within 11 percent of the local COM case, which is close to linear scaling of resource capacity.
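
The normalization above can be reproduced with a few lines of arithmetic. The sketch below uses the pages-per-second and CPU load figures from Table 2 and assumes two CPUs per machine, as stated for the test; the function name is illustrative.

```python
def pages_per_cpu_unit(pages_per_second, cpu_loads, cpus_per_machine=2):
    """Pages per second divided by total CPU usage.

    Total usage is the percent load times the CPU count, summed over machines.
    """
    total_usage = sum(load * cpus_per_machine for load in cpu_loads)
    return pages_per_second / total_usage


local_com = pages_per_cpu_unit(138.8, [92.8])
remote_tcp = pages_per_cpu_unit(176.9, [95.0, 68.6])
remote_wsd = pages_per_cpu_unit(224.9, [99.2, 69.4])

for name, value in [("local COM", local_com),
                    ("TCP/IP", remote_tcp),
                    ("Winsock Direct", remote_wsd)]:
    shortfall = (1 - value / local_com) * 100
    print(f"{name:15s} {value:.3f} pages per CPU usage unit, "
          f"{shortfall:.0f} percent below local COM")
```

The Winsock Direct case gives up about 11 percent of per-CPU efficiency in exchange for running on twice the hardware, while the TCP/IP case gives up about 28 percent.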

Winsock Direct Performance for Backup and Content Distribution

Several applications in the data center can be modeled as a streaming application, where large amounts of data are shoveled in one direction. Typical customer scenarios include backing up disks, restoring from a disk, and content distribution.

Backup and Restore

Backing up either the application server or the database server typically involves moving large amounts of block-oriented traffic from disk across the network to the backup device. Organizations need to be able to perform this backup while the system is in use. Backups can be scheduled during lower stress periods if the additional disk, network and CPU usage required for backing up the system do not require more resources than can be provided without affecting the normal Web traffic.

Depending on an organization’s needs, backups can occur either on the same network as the data center application traffic or, if the backup traffic would impose too much load, on a separate network. In either case, the application server or database server CPUs must multiplex between application work and backups. With Winsock Direct, the CPU usage for the backup application’s network load drops substantially, freeing CPU cycles for the application. Winsock Direct therefore enables backups to be performed online where previously the application server had to be taken offline, or it enables the application or database server to be scaled back in CPU count because the combined peak load of Web serving and backup is lower.

Because Winsock Direct does not require copying data from private network buffers into the backup application buffers, the CPU usage is substantially less than with gigabit Ethernet using TCP/IP.

Content Distribution

Web serving content must often be distributed to the Web servers’ caches. The distribution of new content often involves streaming the data into the content cache (either in the application tier or the Web server tier). In either case, the benefits are similar to the backup analysis scenario above, with the caveat that instead of reading from the application disk, the disk is being written to.

NTTCP Streaming Results

TTCP was used to represent the streaming requirements of backup and content distribution. The tests used 64 KB user buffers and asynchronous Winsock calls. Data flowed in one direction. The TCP network interface was configured with jumbo frames and a 9,000-byte MTU.

Table 3: Single-Socket Streaming Results

| Test | Throughput (Megabits Per Second) | Sender CPU Load (percent) | Receiver CPU Load (percent) |
| --- | --- | --- | --- |
| Winsock Direct | 660.5 | 7.3 | 8.4 |
| TCP/IP | 605.4 | 22.0 | 55.1 (256 kilobyte window size) |

Table 4: Multi-Socket (Four Socket) Streaming Results

| Test | Throughput (Megabits Per Second) | Sender CPU Load (percent) | Receiver CPU Load (percent) |
| --- | --- | --- | --- |
| Winsock Direct | 891.2 | 11.0 | 11.2 |
| TCP/IP | 769.3 | 25.4 | 56.6 (64 kilobyte window size) |

When a single socket is used, Winsock Direct improves the throughput by 9.1 percent while reducing the sender CPU usage to one-third of the TCP value and the receiver CPU usage to about one-sixth of the TCP value.

When multiple sockets are used—which is typical in a data center—Winsock Direct improves throughput by 16 percent while reducing the sender CPU usage to 43 percent of the TCP value and the receiver CPU usage to 20 percent of the TCP value over gigabit Ethernet. The lower CPU usage is due to Winsock Direct’s support for zero-copy algorithms.
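
The throughput gains and CPU ratios quoted for the two streaming tests can be checked against the table values. The calculation below is a sketch using the figures from Tables 3 and 4; the dictionary layout and key names are illustrative choices.

```python
# (throughput in Mbps, sender CPU percent, receiver CPU percent)
results = {
    "single socket": {"wsd": (660.5, 7.3, 8.4), "tcp": (605.4, 22.0, 55.1)},
    "four sockets": {"wsd": (891.2, 11.0, 11.2), "tcp": (769.3, 25.4, 56.6)},
}

summary = {}
for test, rows in results.items():
    wsd_mbps, wsd_send, wsd_recv = rows["wsd"]
    tcp_mbps, tcp_send, tcp_recv = rows["tcp"]
    summary[test] = {
        "throughput_gain_pct": (wsd_mbps / tcp_mbps - 1) * 100,
        "sender_cpu_ratio": wsd_send / tcp_send,      # Winsock Direct load as a share of TCP load
        "receiver_cpu_ratio": wsd_recv / tcp_recv,
    }

for test, s in summary.items():
    print(f"{test}: +{s['throughput_gain_pct']:.1f}% throughput, "
          f"sender CPU at {s['sender_cpu_ratio']:.0%} of TCP, "
          f"receiver CPU at {s['receiver_cpu_ratio']:.0%} of TCP")
```

The receiver side shows the larger reduction in both tests.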

Note that the above tests are a best-case scenario for TCP over gigabit Ethernet compared to a typical scenario for Winsock Direct. The TCP case used the best-in-class gigabit Ethernet NIC, configured with a 9 kilobyte maximum transmission unit (MTU). Typical customer installations do not use 9 KB MTUs; rather, they use 1,500-byte MTUs, which means that six times as many packets are generated. This increase results in substantially higher CPU usage and lower throughput, and thereby makes Winsock Direct even more competitive.
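
The effect of MTU size on packet count can be estimated with a short calculation. The sketch below assumes 40 bytes of IPv4 and TCP header per packet with no options, and a hypothetical 100 MB transfer; both are assumptions chosen for illustration, not details of the tests above.

```python
import math

TCP_IP_HEADER_BYTES = 40  # assumption: 20-byte IPv4 header + 20-byte TCP header, no options


def packets_needed(total_bytes, mtu):
    """Packets required to carry total_bytes when each packet holds mtu minus headers."""
    payload = mtu - TCP_IP_HEADER_BYTES
    return math.ceil(total_bytes / payload)


transfer = 100 * 1024 * 1024            # a hypothetical 100 MB backup stream
jumbo = packets_needed(transfer, 9000)
standard = packets_needed(transfer, 1500)
ratio = standard / jumbo
print(f"9000-byte MTU: {jumbo} packets, 1500-byte MTU: {standard} packets, ratio {ratio:.2f}")
```

With header overhead included, the ratio comes out slightly above six, consistent with the six-fold figure obtained from the MTU sizes alone.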

Future Directions

Future Investments in Winsock Direct

Winsock Direct and SAN offload is a strategic networking direction for Microsoft. There is significant ongoing work to continue to improve the capabilities and performance of Winsock Direct. This paper documents the real-world performance benefits achievable by deploying SAN technology using Winsock Direct on Windows 2000 today. As good as the performance described in this paper is, there will be a significant performance boost for Winsock Direct in Whistler Server. Additional improvements in Whistler Server include:

Caching of the pinning of pages for I/O. Currently, every application I/O buffer is pinned and unpinned on a per-call basis. In Whistler Server, the pinning of application I/O buffers will be cached. Because an application typically reuses its I/O buffers, this can be a large performance benefit. The feature is being implemented in a robust fashion, which requires close interaction with the operating system to ensure that the cache is flushed correctly when an application frees memory.
Implementation of a “fast-path” for small buffers. After Winsock Direct support was released, a second round of performance analysis showed that a fast-path for small buffers can significantly improve performance. This capability will be included in Whistler Server.
Implementation of RDMA Read semantics. Currently, Winsock Direct can only push data from the transmitter to the receiver (RDMA Write). In Whistler Server, either RDMA Reads or RDMA Writes will be used, automatically adapting to application behavior to ensure the highest performance.
New optimizations for command-oriented traffic. Whistler Server will also provide new optimizations for applications that send a command and then wait for data to be returned. This pattern is typical of many applications, including databases.
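To illustrate why caching the pinning of I/O buffers helps when buffers are reused, here is a toy Python model. It is only a sketch of the idea and not the Whistler Server implementation: the `PinCache` class, its methods, and the buffer names are hypothetical, and real pinning is done by the operating system on memory pages rather than on string identifiers.

```python
class PinCache:
    """Toy model of caching buffer pinning across I/O calls.

    pin_count tracks how many (expensive) pin operations were performed.
    """

    def __init__(self):
        self._pinned = set()  # identifiers of buffers currently pinned
        self.pin_count = 0

    def send(self, buffer_id):
        """Issue one I/O on buffer_id, pinning it only if not already pinned."""
        if buffer_id not in self._pinned:
            self._pinned.add(buffer_id)
            self.pin_count += 1

    def free(self, buffer_id):
        """The application frees the buffer, so the cached pin is dropped."""
        self._pinned.discard(buffer_id)


def pins_without_cache(io_calls):
    """Per-call pinning: every I/O call pins (and later unpins) its buffer."""
    return len(io_calls)


calls = ["buf_a", "buf_b"] * 500  # an application reusing two buffers
cache = PinCache()
for buffer_id in calls:
    cache.send(buffer_id)

print(f"Pin operations without caching: {pins_without_cache(calls)}")
print(f"Pin operations with caching: {cache.pin_count}")

# Freeing a buffer must invalidate its cache entry, so reuse pins it again.
cache.free("buf_a")
cache.send("buf_a")
print(f"Pin operations after free and reuse: {cache.pin_count}")
```

In this model, 1,000 I/O calls on two reused buffers need two pin operations instead of 1,000. The `free` method stands in for the requirement that the cache be flushed correctly when the application frees memory.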

Additionally, work is being done to further improve I/O completion ports and other interactions with the operating system. These new features will work on the existing SAN hardware being deployed today.

Infiniband and Winsock Direct

Winsock Direct will be supported on Infiniband shortly after production hardware is available. Winsock Direct is uniquely suited for Infiniband because the Winsock Direct transfer primitives are RDMA and Sends, which map directly to Infiniband hardware protocols.

Gigabit Ethernet and Winsock Direct

Gigabit Ethernet-based Winsock Direct solutions do not currently exist, but several vendors are examining providing solutions in this area. Microsoft is strongly committed to enabling Winsock Direct over Ethernet, and is working closely with the Internet Engineering Task Force (IETF) to create protocols that turn an Ethernet into a SAN. This work has just begun and is consequently further out than Infiniband.

Appendix A: Test Configuration for Nile

The full test configuration is available at https://msdn.microsoft.com/windows2000/. The test compared TCP/IP over gigabit Ethernet and SAN over Compaq Servernet II.

The December 2000 update to the original benchmark was written to showcase various Microsoft technologies for deploying applications. Because the focus for this benchmark was to showcase the performance enhancements achievable with Winsock Direct, the fastest version of the benchmark was used. This benchmark was written with Visual Studio 6.0 development tools. These tools are being superseded by the Visual C++ .NET toolset called Active Template Library (ATL), which should significantly simplify the process of developing an application using this type of architecture.

The application server uses a thin Internet Server Application Program Interface (ISAPI) layer to activate Visual C++ COM+ components that do all the work and data access. The data access does not use ActiveX Data Objects (ADO); instead, it uses the Open Database Connectivity (ODBC) API from SQL.

The Visual C++ implementation is an adaptation of the original C++ Nile application benchmark in PC Week by Doculabs. In the original implementation, all the code was contained within a single ISAPI DLL—a monolithic two-tier structure. The new COM+ application is more elegant in that it is a logical three-tier architecture that is more easily maintained and offers more deployment options, such as the use of component load balancing and COM+ transactions. However, due to time constraints, the COM+ version was tested with just a physical two-tier configuration.

Appendix B: Test Configuration for COM+ and RPC tests

The RPC and COM+ tests were conducted on two dual-processor 600 MHz Pentium III machines with 256 KB cache and 256 MB RAM. The SAN network used for Winsock Direct communication was Giganet cLAN and the Gigabit Ethernet network used for TCP/IP communication was 3Com’s Gigabit EtherLink, both in back-to-back configurations.

RPC Tests

The RPC measurements were obtained using a test application that starts 16 threads on the client side. Each thread repeatedly invokes the NULL RPC procedure on the server machine. These RPC calls are made over persistent connections. The server simply receives the client calls, executes the NULL procedure, and returns the result to the client.

Sixteen client-side threads were used in order to reach 100 percent CPU usage when using TCP/IP. Communication over TCP/IP has longer latency than over Winsock Direct. As such, each RPC call takes longer to complete when using TCP/IP, which means more calls need to be outstanding at a time to keep the machines busy. Thus, when using Winsock Direct, it takes only eight threads (that is, eight concurrent RPC calls) to achieve 100 percent usage, while with TCP/IP it takes 16.
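The link between latency and the number of outstanding calls can be sketched with Little's law: the number of calls in flight equals the call rate multiplied by the time each call takes. The Python example below is a minimal illustration. The call rate and latency figures are hypothetical, chosen only so that the arithmetic reproduces the thread counts of 8 and 16; they are not measurements from these tests.

```python
import math


def calls_in_flight(calls_per_second, latency_microseconds):
    """Little's law: outstanding calls = arrival rate x time per call."""
    return math.ceil(calls_per_second * latency_microseconds / 1_000_000)


# Hypothetical figures for illustration only (not measured values).
target_rate = 20_000         # calls per second needed to keep the CPUs busy
low_latency_us = 400         # assumed per-call latency on the faster path
high_latency_us = 800        # assumed per-call latency on the slower path

low_latency_threads = calls_in_flight(target_rate, low_latency_us)
high_latency_threads = calls_in_flight(target_rate, high_latency_us)

print(f"Threads needed at {low_latency_us} microseconds: {low_latency_threads}")
print(f"Threads needed at {high_latency_us} microseconds: {high_latency_threads}")
```

At a fixed call rate, doubling the per-call latency doubles the number of threads that must each have a call outstanding.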

COM+ Tests

Web Client Configuration: Microsoft’s Web Application Stress Tool (https://www.microsoft.com/technet/archive/itsolutions/intranet/downloads/webstres.mspx) was used as the Web client. The Web client was run on a 1-gigahertz single-processor computer connected to the IIS server through gigabit Ethernet. The Web Application Stress Tool was configured to use five threads, 10 sockets per thread (for a total of 50 concurrent requests to the server), and no delay between requests.

IIS Configuration: The IIS server was configured to handle more than 100,000 hits per day. For the local COM object test (COM object running on the same machine as the IIS server), IIS was configured to run the COM application using “Medium (Pooled)” application protection. For the remote COM object case, IIS was configured to execute the COM+ proxy with “Low (IIS Process)” application protection.

Instructions for adjusting these settings can be found at https://www.microsoft.com/technet/prodtechnol/windows2000serv/technologies/iis/maintain/optimize/iis5tune.mspx.

COM Object: Microsoft used the IBank COM object distributed as part of the Windows DNA Performance Kit (https://www.microsoft.com/com/resources/WinDNAperf.asp). The DoCommitOnly method of this object is a NULL method that simply returns SUCCESS upon invocation. This is the method invoked by the ASP page served by the IIS server.

ASP Code: The ASP page consists of VBScript code that instantiates the COM object (using Server.CreateObject), invokes the above-mentioned method on this object, and sends the result returned by the method to the Web client using Response.Write. These VBScript calls are documented in MSDN.

COM+ Configuration: COM+ was configured to perform access checks only at the process level. Just-in-time activation was enabled.

Appendix C: Test Configuration for TTCP and NTTCP

The streaming performance tests were conducted on two dual-processor 600 MHz Pentium III machines with 256 KB cache and 256 MB RAM. The SAN network used for Winsock Direct communication was Giganet cLAN. The Gigabit Ethernet network used for TCP/IP communication was 3Com’s Gigabit EtherLink. Both networks were in back-to-back configurations. The EtherLink NIC had jumbo frames enabled and MTU set to 9000 bytes.

TTCP was used for the single-socket streaming test. It was configured to use asynchronous Winsock calls (WSASend and WSARecv) for data transfer.

NTTCP was used for the multi-socket streaming test. This is a multi-threaded version of TTCP. NTTCP starts one thread per socket (that is, it was using four threads in the four-socket test) and uses asynchronous Winsock calls for data transfer.
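The one-thread-per-socket structure described above can be sketched in Python as follows. This is not NTTCP: it uses blocking calls on local socket pairs instead of the asynchronous WSASend and WSARecv calls used in the tests, it measures in-memory loopback and not a network link, and the chunk size and chunk count are assumed values. It shows only the shape of a multi-socket streaming measurement.

```python
import socket
import threading
import time

CHUNK = 64 * 1024        # bytes per send call (assumed value)
CHUNKS_PER_SOCKET = 64   # chunks each sender pushes (assumed value)


def sender(sock):
    """Stream a fixed amount of data, then signal end-of-stream."""
    payload = b"\0" * CHUNK
    for _ in range(CHUNKS_PER_SOCKET):
        sock.sendall(payload)
    sock.shutdown(socket.SHUT_WR)


def receiver(sock, totals, index):
    """Drain the socket until the peer shuts down, recording the byte count."""
    received = 0
    while True:
        data = sock.recv(CHUNK)
        if not data:
            break
        received += len(data)
    totals[index] = received


def run_streams(socket_count):
    """Run one sender and one receiver thread per socket.

    Returns a (total_bytes, elapsed_seconds) tuple.
    """
    totals = [0] * socket_count
    pairs = [socket.socketpair() for _ in range(socket_count)]
    threads = []
    for i, (tx, rx) in enumerate(pairs):
        threads.append(threading.Thread(target=receiver, args=(rx, totals, i)))
        threads.append(threading.Thread(target=sender, args=(tx,)))
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    for tx, rx in pairs:
        tx.close()
        rx.close()
    return sum(totals), elapsed


total_bytes, seconds = run_streams(4)
print(f"{total_bytes * 8 / seconds / 1e6:.1f} megabits per second over 4 sockets")
```

Calling `run_streams(1)` gives the single-socket shape of the same measurement. The throughput printed depends entirely on the local machine and says nothing about the results in this paper.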

Perfmon was used for CPU usage measurements because it is more accurate than TTCP/NTTCP for low CPU usage cases.

For More Information

For the latest information on Windows 2000, visit our Web site at https://www.microsoft.com/windows2000.

For the latest information on Winsock Direct, visit https://www.microsoft.com/whdc/default.mspx.
