The Cable Guy TCP Receive Window Auto-Tuning
Joseph Davies
Welcome to the first installment of The Cable Guy in TechNet Magazine. Fans of the column on the TechNet Web site already know we cover all manner of networking issues, and we'll continue that tradition here each month. If you're new and looking for an archive of previous columns, head over to the Cable Guy site.
Now let's get started with our first topic here: the TCP Receive Window.
Throughput over TCP connections can be limited by sending and receiving applications, sending and receiving implementations of TCP, and the transmission path between the TCP peers. In this column I'll describe the TCP receive window and its impact on TCP throughput, the use of TCP window scaling, and the new Receive Window Auto-Tuning feature in Windows Vista™ and Windows Server® 2008 that optimizes TCP throughput for received data.
The TCP Receive Window
TCP connections have a number of important characteristics. First, they are a logical point-to-point circuit between two Application Layer protocols. TCP does not supply a one-to-many delivery service; it provides only one-to-one delivery.
Second, TCP connections are connection-oriented. Before data can be transferred, two Application Layer processes must formally negotiate a TCP connection using the TCP connection establishment process. Similarly, TCP connections are formally closed after negotiation using the TCP connection termination process.
Third, reliable data sent on a TCP connection is sequenced and a positive acknowledgment is expected from the receiver. If a positive acknowledgment is not received, the segment is retransmitted. At the receiver, duplicate segments are discarded and segments arriving out of sequence are placed back in the proper order.
Fourth, TCP connections are full-duplex. For each TCP peer, the TCP connection consists of two logical pipes: an outgoing pipe and an incoming pipe. The TCP header contains both the sequence number of the outgoing data and an acknowledgment (ACK) of the incoming data.
In addition, TCP views the data sent over the incoming and outgoing logical pipes as a continuous stream of bytes. The sequence number and acknowledgment number in each TCP header are defined along byte boundaries. TCP is not aware of record or message boundaries within the byte stream. The Application Layer protocol must provide the proper parsing of the incoming byte stream.
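Because TCP hands the application an undifferentiated byte stream, the Application Layer protocol has to mark its own message boundaries. The following is a minimal sketch of one common approach, assuming a hypothetical 4-byte length prefix in front of each message. The framing format and the function names are illustrative assumptions, not part of any particular protocol.

```python
import struct

HEADER = struct.Struct("!I")  # assumed framing: 4-byte big-endian length prefix


def frame(message: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find where it ends."""
    return HEADER.pack(len(message)) + message


def extract_messages(buffer: bytearray) -> list:
    """Remove and return every complete message at the front of buffer.

    An incomplete trailing message stays in buffer until more bytes arrive.
    """
    messages = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, 0)
        end = HEADER.size + length
        if len(buffer) < end:
            break  # the rest of this message has not arrived yet
        messages.append(bytes(buffer[HEADER.size:end]))
        del buffer[:end]
    return messages


# TCP may deliver the bytes in chunks that ignore message boundaries.
stream = frame(b"hello") + frame(b"world!")
received = bytearray()
parsed = []
for chunk in (stream[:3], stream[3:12], stream[12:]):
    received.extend(chunk)
    parsed.extend(extract_messages(received))
print(parsed)
```

The chunk boundaries here are deliberately awkward (one falls inside a length prefix) to show that the parser depends only on the bytes, not on how they were segmented in transit.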
To limit the amount of data that can be sent at any one time and to provide receiver-side flow control, TCP peers use a window. The window is the span of data on the byte stream that the receiver permits the sender to send. The sender can send only the bytes of the byte stream that lie within the window. The window slides along the sender's outbound byte stream and the receiver's inbound byte stream.
For a given logical pipe (one direction of the full-duplex TCP connection) the sender maintains a send window and the receiver maintains a receive window. When there are no data or ACK segments in transit, a logical pipe's send and receive windows are matched. In other words, the span of data in the outbound byte stream that the sender is allowed to send is matched to the span of data in the inbound byte stream that the receiver is able to receive. Figure 1 illustrates this send and receive relationship.
Figure 1 Matching Send and Receive Windows
To indicate the size of the receive window, the TCP header contains a 16-bit Window field. When the receiver gets data, it sends ACKs back to the sender indicating the successfully received bytes. In each ACK, the Window field notes the number of bytes remaining in the receive window. When data is sent, acknowledged, and retrieved by the application, both the send and receive windows slide to the right. The receive window is the window that controls how much unacknowledged data can be in flight from the sender to the receiver.
Because there can be data in the receive window that has not been retrieved by the app and data that has been received but not acknowledged, the TCP receive window has additional structure, as Figure 2 shows.
Figure 2 Types of Data in the TCP Receive Window
Notice the difference between the maximum and current receive windows. The maximum receive window is a fixed size. The current receive window is of variable size and corresponds to the remaining amount of data that the receiver is allowing the sender to send. The current receive window's size is the value of the Window field advertised in ACKs sent back to the sender, and is the difference between the maximum receive window size and the amount of data that has been received and acknowledged but not retrieved by the application.
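The relationship between the maximum and current receive windows is simple arithmetic. Here is a small sketch of it, with a hypothetical function name and illustrative byte counts:

```python
def current_receive_window(max_window: int, acked_not_retrieved: int) -> int:
    """Value advertised in the Window field of an ACK, in bytes.

    acked_not_retrieved is data that has been received and acknowledged but
    is still waiting for the application to retrieve it.
    """
    if not 0 <= acked_not_retrieved <= max_window:
        raise ValueError("buffered data must fit within the maximum window")
    return max_window - acked_not_retrieved


MAX_WINDOW = 65_535  # fixed maximum receive window for this example

for buffered in (0, 16_000, 65_535):
    advertised = current_receive_window(MAX_WINDOW, buffered)
    print(f"{buffered:>6} bytes waiting for the application -> advertise {advertised}")
```

When the application falls behind and the buffered data reaches the maximum, the advertised window drops to 0 and the sender has to pause.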
The TCP Receive Window and TCP Throughput
To optimize TCP throughput (assuming a reasonably error-free transmission path), the sender should send enough packets to fill the logical pipe between the sender and receiver. The capacity of the logical pipe can be calculated by the following formula:
Capacity in bits = path bandwidth in bits per second * round-trip time (RTT) in seconds
The capacity is known as the bandwidth-delay product (BDP). The pipe can be fat (high bandwidth) or thin (low bandwidth), and short (low RTT) or long (high RTT). Pipes that are fat and long have the highest BDP. Examples of high BDP transmission paths are those across satellites or enterprise wide area networks (WANs) that include intercontinental optical fiber links.
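To get a feel for the numbers, here is a short sketch that applies the capacity formula to four example paths. The bandwidth and RTT figures are illustrative assumptions, chosen only to cover the fat/thin and short/long combinations.

```python
def bdp_bits(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bits that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_seconds


# Illustrative paths; these bandwidth and RTT values are assumptions.
paths = {
    "thin and short (10 Mbps, 5 ms)": (10e6, 0.005),
    "fat and short (1 Gbps, 1 ms)": (1e9, 0.001),
    "thin and long (2 Mbps, 600 ms)": (2e6, 0.600),
    "fat and long (1 Gbps, 100 ms)": (1e9, 0.100),
}

for name, (bandwidth, rtt) in paths.items():
    bits = bdp_bits(bandwidth, rtt)
    print(f"{name}: {bits:,.0f} bits ({bits / 8:,.0f} bytes)")
```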
Increasing Sender-Side Performance for High-BDP Transmission
The new Receive Window Auto-Tuning feature provides enhanced performance for receiving data over high-BDP links, but what about sender performance?
The existing algorithms that prevent a sending TCP peer from overwhelming the network are known as slow start and congestion avoidance. These algorithms increase the send window (the number of segments that the sender can send) when initially sending data on the connection and when recovering from a lost segment.
Slow start increases the send window by one full TCP segment for either each acknowledgment segment received (for TCP in Windows XP and Windows Server 2003) or for each segment acknowledged (for TCP in Windows Vista and Windows Server 2008). Congestion avoidance increases the send window by one full TCP segment for each full window of data that is acknowledged.
These algorithms work well for small BDPs and smaller receive window sizes. However, when you have a TCP connection with a large receive window size and a large BDP, such as replicating data between two servers located across a high-speed WAN link with a 100ms round-trip time, these algorithms do not increase the send window fast enough to fully utilize the bandwidth of the connection.
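A rough model shows why the growth is too slow. The sketch below counts round trips until the send window reaches a target size, treating slow start as doubling the window each RTT and congestion avoidance as adding one segment each RTT. The 1,460-byte segment size, the 1Gbps bandwidth, and the halving of the window after a loss are assumptions made for the example; real stacks are more nuanced.

```python
import math


def rtts_to_reach(target_segments: int, start_segments: int, ssthresh_segments: int) -> int:
    """Count round trips until the send window reaches target_segments.

    Simplified model: below ssthresh the window doubles each RTT (slow start);
    at or above ssthresh it grows by one segment per RTT (congestion avoidance).
    """
    window = start_segments
    rtts = 0
    while window < target_segments:
        if window < ssthresh_segments:
            window = min(window * 2, ssthresh_segments)
        else:
            window += 1
        rtts += 1
    return rtts


MSS = 1460                      # assumed segment size in bytes
RTT_MS = 100                    # round-trip time from the example in the text
BANDWIDTH_BPS = 1_000_000_000   # assumed 1Gbps WAN link

bdp_bytes = BANDWIDTH_BPS * RTT_MS // 1000 // 8
bdp_segments = math.ceil(bdp_bytes / MSS)

# Assume a single loss halves the window; congestion avoidance must then
# climb back to the full BDP one segment per round trip.
after_loss = bdp_segments // 2
rtts = rtts_to_reach(bdp_segments, after_loss, after_loss)
print(f"BDP is {bdp_segments} segments; recovery takes {rtts} RTTs "
      f"({rtts * RTT_MS / 1000:.0f} seconds)")
```

Under these assumptions a single lost segment costs several minutes of reduced throughput, which is the gap that a more aggressive window increase is meant to close.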
To better utilize the bandwidth of TCP connections in these situations, the Next Generation TCP/IP stack includes Compound TCP (CTCP). CTCP more aggressively increases the send window for connections with large receive window sizes and BDPs. CTCP attempts to maximize throughput on these types of connections by monitoring delay variations and losses. In addition, CTCP ensures that its behavior does not negatively impact other TCP connections.
In testing performed internally at Microsoft, large file backup times were reduced by almost half for a 1Gbps connection with a 50ms RTT. Connections with a larger BDP can have even better performance. CTCP and Receive Window Auto-Tuning work together for increased link utilization and can result in substantial performance gains for connections with large BDPs.
CTCP is enabled by default in computers running Windows Server 2008 and disabled by default in computers running Windows Vista. You can enable CTCP with the "netsh interface tcp set global congestionprovider=ctcp" command. You can disable CTCP with the "netsh interface tcp set global congestionprovider=none" command.
The size of the Window field in the TCP header is 16 bits, allowing a TCP peer to advertise a maximum receive window size of 65,535 bytes. You can calculate the approximate throughput for a given TCP window size from the following formula:
Throughput = TCP maximum receive window size / RTT
For example, with a 65,535 byte receive window you can only achieve an approximate throughput of 5.24 megabits per second (Mbps) on a path with a 100ms RTT, regardless of the transmission path's actual bandwidth. With today's high-BDP transmission paths, the originally designed TCP window size, even at its maximum value, becomes a throughput bottleneck.
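The following sketch applies the throughput formula to the 65,535-byte window at a few round-trip times. The function name and the RTT values other than 100ms are assumptions chosen for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Approximate throughput ceiling when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds


WINDOW = 65_535  # largest window the 16-bit Window field can advertise

for rtt_ms in (10, 100, 500):
    mbps = max_throughput_bps(WINDOW, rtt_ms / 1000) / 1e6
    print(f"{rtt_ms:>3} ms RTT: {mbps:.2f} Mbps")
```

The ceiling depends only on the window size and the RTT, so adding bandwidth to the path does nothing for a single connection once this limit is reached.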
TCP Window Scaling
For larger window sizes to accommodate high-speed transmission paths, RFC 1323 (ietf.org/rfc/rfc1323.txt) defines window scaling that allows a receiver to advertise a window size larger than 65,535 bytes. A TCP Window Scale option includes a window scaling factor that, when combined with the 16-bit Window field in the TCP header, can increase the receive window size to a maximum of approximately 1GB. The Window Scale option is sent only in synchronize (SYN) segments during the connection establishment process. Both TCP peers can indicate different window scaling factors to use for their receive window sizes. By allowing a sender to send more data on a connection, TCP window scaling allows TCP nodes to better utilize some types of transmission paths with high BDPs.
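The scaling factor works as a left shift applied to the 16-bit Window field. Here is a minimal sketch of the arithmetic, assuming the maximum shift count of 14 that RFC 1323 specifies; the function names are hypothetical.

```python
MAX_SHIFT = 14  # largest shift count that RFC 1323 permits


def scaled_window(window_field: int, shift: int) -> int:
    """Receive window in bytes for a given Window field and scale shift count."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("the Window field is 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


def smallest_shift_for(window_bytes: int) -> int:
    """Smallest shift count whose scaled window can cover window_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (0xFFFF << shift) >= window_bytes:
            return shift
    raise ValueError("window is larger than window scaling can express")


print(f"largest scaled window: {scaled_window(0xFFFF, MAX_SHIFT):,} bytes")
print(f"shift needed for a 16MB window: {smallest_shift_for(16 * 1024 * 1024)}")
```

Because the shift count is fixed when the SYN segments are exchanged, the receiver has to pick a value large enough for the biggest window it might want to advertise later in the connection.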
Although the receive window size is important for TCP throughput, another important factor for determining the optimal TCP throughput is how fast the application retrieves the accumulated data in the receive window (the application retrieve rate). If the application does not retrieve the data, the receive window can begin to fill, causing the receiver to advertise a smaller current window size. In the extreme case, the entire maximum receive window is filled, causing the receiver to advertise a window size of 0 bytes. In this case, the sender must stop sending data until the receive window has been cleared. Therefore, to optimize TCP throughput, the TCP receive window for a connection should be set to a value that reflects both the BDP of the connection's transmission path and the application retrieve rate.
Even if you could correctly determine both the BDP and the application retrieve rate, they can change over time. The BDP can vary based on the congestion in the transmission path and the app retrieve rate can vary based on the number of connections on which the app is receiving data.
The Receive Window in Windows XP
For the TCP/IP stack in Windows XP (and Windows Server® 2003), the maximum receive window size has a number of significant attributes. First, the default value is based on the link speed of the sending interface. The actual value automatically adjusts to even increments of the maximum segment size (MSS) negotiated during TCP connection establishment.
Second, the maximum receive window size can be manually configured. The HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\TCPWindowSize and HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\InterfaceGUID\TCPWindowSize registry values can be set to a maximum of 65,535 bytes (without window scaling) or 1,073,741,823 bytes (with window scaling).
Third, the maximum receive window size can use window scaling. You can enable window scaling by setting the HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\Tcp1323Opts registry value to 1 or 3. By default, window scaling is only used on a connection if the received SYN segment happens to contain the Window Scale option.
Finally, the maximum receive window size can be specified by an application by using the SO_RCVBUF Windows Sockets option when a connection is initiated. For window scaling, the application must specify a window size larger than 65,535 bytes.
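As a sketch of how an application might request a receive buffer size, the following uses Python's standard socket module, which exposes the same SO_RCVBUF option. The 256KB value is an arbitrary illustration, and the operating system is free to round, cap, or otherwise adjust the size it actually grants.

```python
import socket

REQUESTED_WINDOW = 256 * 1024  # illustrative value larger than 65,535 bytes

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Set the option before connect() so the size is known when the
    # connection is initiated.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_WINDOW)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print(f"requested {REQUESTED_WINDOW} bytes, OS reports {granted} bytes")
finally:
    sock.close()
```

Reading the option back is worthwhile because the reported size can differ from the requested one.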
Despite the support for scalable windows, the maximum receive window size in Windows XP can still limit throughput because it is a fixed maximum size for all TCP connections (unless specified by the application), which can increase throughput for some connections and decrease throughput for others. Additionally, the fixed maximum receive window size for a TCP connection does not vary with changes in the application retrieve rate or congestion in the transmission path.
Receive Window Auto-Tuning in Windows Vista
To optimize TCP throughput, especially for transmission paths with a high BDP, the Next Generation TCP/IP stack in Windows Vista (and Windows Server 2008) supports Receive Window Auto-Tuning. This feature determines the optimal receive window size by measuring the BDP and the application retrieve rate and adapting the window size for ongoing transmission path and application conditions.
Receive Window Auto-Tuning enables TCP window scaling by default, allowing up to a 16MB maximum receive window size. As the data flows over the connection, the Next Generation TCP/IP stack monitors the connection, measures its current BDP and application retrieve rate, and adjusts the receive window size to optimize throughput. The Next Generation TCP/IP stack no longer uses the TCPWindowSize registry value.
Receive Window Auto-Tuning has a number of benefits. It automatically determines the optimal receive window size on a per-connection basis. In Windows XP, the TCPWindowSize registry value applies to all connections. Applications no longer need to specify TCP window sizes through Windows Sockets options. And IT administrators no longer need to manually configure a TCP receive window size for specific computers.
With Receive Window Auto-Tuning, a Windows Vista-based TCP peer will typically advertise much larger receive window sizes than a Windows XP-based TCP peer. This allows the other TCP peer to fill the pipe to the Windows Vista-based TCP peer by sending more TCP data segments without having to wait for an ACK (subject to TCP congestion control). For typical client-based networking traffic such as Web pages or e-mail, the Web server or e-mail server will be able to send more TCP data more quickly to the client computer, resulting in an overall increase in network performance. The higher the BDP and application retrieve rate for the connection, the better the performance increase.
The impact on the network is that a stream of TCP data packets that would normally be sent out at a lower, measured pace is sent much faster, resulting in a larger spike of network utilization during the data transfer. For Windows XP and Windows Vista-based computers performing the same data transfer over a long, fat pipe, the same amount of data is transferred. However, the data transfer for the Windows Vista-based client computer is faster due to the larger receive window size and the server's ability to fill the pipe from the server to the client.
Because Receive Window Auto-Tuning will increase network utilization of high-BDP transmission paths, the use of Quality of Service (QoS) or application send rate throttling might become important for transmission paths that are operating at or near capacity. To address this possible need, Windows Vista supports Group Policy-based QoS settings that allow you to define throttling rates for sent traffic on an IP address or TCP port basis. For more information, see the resources on policy-based QoS.
Increasing TCP Throughput for High-Loss Networks
High-loss networks can dramatically decrease TCP throughput because of frequent timeouts and retransmissions. Examples of high-loss networks are wireless networks—such as those based on IEEE 802.11, General Packet Radio Service (GPRS), or Universal Mobile Telecommunications System (UMTS)—that can have high packet losses depending on network conditions, signal attenuation, electromagnetic interference, and the changing location of the computer.
The Next Generation TCP/IP stack supports the following four RFCs in order to optimize throughput in high-loss environments.
RFC 2582: The NewReno Modification to TCP's Fast Recovery Algorithm
The Fast Recovery algorithm, defined in RFC 2001, is based on the Reno algorithm, which increases the amount of data that a sender can send when a segment is retransmitted due to a fast retransmit event. Although the Reno algorithm works well for single lost segments, it does not perform as well when there are multiple lost segments. The NewReno algorithm provides faster throughput by changing the way that a sender can increase its sending rate during fast recovery when multiple segments in a window of data are lost and the sender receives a partial acknowledgment (an acknowledgment for only part of the data that has been successfully received).
RFC 2883: An Extension to the Selective Acknowledgment (SACK) Option for TCP
SACK, defined in RFC 2018, allows a receiver to indicate up to four noncontiguous blocks of received data by using a SACK TCP option. RFC 2883 defines an additional use of the fields in the SACK TCP option to acknowledge duplicate packets. By doing this the sender is able to determine when it has retransmitted a segment unnecessarily and adjust its behavior to prevent future unnecessary retransmissions. The fewer retransmissions that are sent, the better the overall throughput.
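To illustrate what a receiver reports with SACK, here is a simplified sketch that turns a set of received byte ranges into noncontiguous blocks beyond the cumulative acknowledgment. It is an assumption-laden model: it ignores sequence number wraparound, returns blocks in sequence order rather than the most-recent-first order a real stack uses, and does not model the duplicate-reporting extension in RFC 2883.

```python
MAX_SACK_BLOCKS = 4  # the SACK option can report up to four blocks


def sack_blocks(cumulative_ack, received):
    """Return (left_edge, right_edge) pairs for data held beyond cumulative_ack.

    cumulative_ack is the next byte the receiver expects in sequence.
    received is a list of (start, end) byte ranges, end exclusive, in any order.
    """
    merged = []
    for start, end in sorted(received):
        if end <= cumulative_ack:
            continue  # already covered by the cumulative acknowledgment
        start = max(start, cumulative_ack)
        if merged and start <= merged[-1][1]:
            # Adjacent or overlapping ranges collapse into one block.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged[:MAX_SACK_BLOCKS]


# Bytes 1000-1999 and 4000-4999 were lost; everything else arrived.
blocks = sack_blocks(1000, [(0, 1000), (2000, 3000), (3000, 4000), (5000, 6000)])
print(blocks)
```

From the gaps between the cumulative acknowledgment and the reported blocks, the sender can work out exactly which ranges to retransmit instead of resending everything after the first loss.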
RFC 3517: A Conservative Selective Acknowledgment-based Loss Recovery Algorithm for TCP
The current implementation of TCP/IP in Windows Server 2003 and Windows XP uses SACK information only to determine which TCP segments have not arrived at the destination. RFC 3517 defines a method of using SACK information to perform loss recovery when duplicate acknowledgments have been received, replacing the older fast recovery algorithm when SACK is enabled on a connection. The Next Generation TCP/IP stack keeps track of SACK information on a per-connection basis and monitors incoming acknowledgments as well as duplicate acknowledgments to more quickly recover when multiple segments are not received at the destination.
RFC 4138: Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP and the Stream Control Transmission Protocol (SCTP)
Spurious retransmissions of TCP segments can occur with a sudden increase in RTT, leading the retransmission timeouts (RTOs) of previously sent segments to begin to expire and TCP to start retransmitting them. If the increase occurs just before sending a full window of data, a sender can retransmit the entire window of data. The F-RTO algorithm prevents spurious retransmission of TCP segments through the following behavior.
When the RTO expires for multiple segments, TCP retransmits just the first segment. When the first acknowledgment is received, TCP begins sending new segments (if allowed by the advertised window size). If the next acknowledgment confirms the other segments that have timed out but have not been retransmitted, TCP determines that the timeout was spurious and does not retransmit the other segments that have timed out.
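The decision described above can be sketched as a small function that looks at the first two acknowledgments arriving after the timeout. This follows the steps only as the column describes them; RFC 4138 specifies additional checks that are omitted here, and the names used are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    SPURIOUS = "spurious timeout: do not retransmit the remaining segments"
    GENUINE = "genuine loss: fall back to conventional retransmission"


def classify_timeout(snd_una: int, first_ack: int, second_ack: int) -> Verdict:
    """Classify a retransmission timeout from the next two cumulative ACKs.

    snd_una is the lowest unacknowledged sequence number when the RTO expired.
    Only the segment starting at snd_una is assumed to have been retransmitted.
    """
    if first_ack <= snd_una:
        # A duplicate ACK: nothing new was acknowledged.
        return Verdict.GENUINE
    # The first ACK advanced, so the sender transmits new segments here.
    if second_ack > first_ack:
        # Data that was never retransmitted has been acknowledged, so the
        # original segments were delayed rather than lost.
        return Verdict.SPURIOUS
    return Verdict.GENUINE


# Segments of 1,000 bytes; the RTO expires with byte 1000 unacknowledged.
print(classify_timeout(1000, 2000, 3000).name)  # originals were only delayed
print(classify_timeout(1000, 2000, 2000).name)  # later segments really were lost
```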
The result is that for environments with sudden and temporary increases in the RTT, such as when a wireless client roams from one access point to another, F-RTO prevents unnecessary retransmission of segments and more quickly returns to its normal sending rate. SACK-based loss recovery and F-RTO are best suited for connections that use GPRS links.
Joseph Davies is a technical writer with Microsoft and has been teaching and writing about Windows networking topics since 1992. He has written eight books for Microsoft Press and is the author of the monthly TechNet Cable Guy column.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.