Winsock2 - TCP Resends packet with increased sequence

Eggi 21 Reputation points
2021-08-05T09:39:12+00:00

Win21H1 (19043.1110) Win10 Pro 64Bit, and seen on 20H2, office PCs.

While sending data greater than the MTU with Winsock2 and TCP sometimes i experience that the last two packets of the stream are "retransmitted". Only that the resend packet contains the data of the 2nd last packet combined with the last packet and with a data offset of the last packet and an increased sequence number of the last packet - so it is not wrong data, but no typical retransmission where the same packet with the same data and sequence is sent.

I have not found any pattern to make this happen. It occures about every 5 to 15 times when sending the first data packet.
What i can say is, that after i connect to the peer and when i get this behaviour, i get it for a few (maybe even all the time, not tested) packets. A disconnect and reconnect, or sending data with a different size eliminates this behaviour.
Heres the sequence of what I've done and tried. I mixed all variants, the programm executes everything - except WSAStartup - in the same thread.

Initialization:

  1. WSAStartup (2,2)
  2. Winsock2.socket(AF_Inet,SOCK_STREAM,IPPROTO_TCP) and IPPROTO_IP
  3. WSAIOCTL with FIONBIO to set it non-blocking
  4. setsockopt to SOL_SOCKET, SO_REUSADDR FALSE
  5. setsockopt IPPROTO_TCP TCP_NODELAY to enable and disable Nagle
  6. bind with and without
  7. connect - in non blocking with a select loop to check if valid
  8. setsockopt SOL_SOCKET SO_SNDTIMEO to check if it's some RTO thingy
  9. setsockopt SOL_SOCKET SO_RCVTIMEO
  10. Here my programm loops through select fd_rcv to check if I got data

Data Send:

  1. setsockopt SOL_SOCKET SO_SNDBUF - made no difference, it's said that the newer version should not need it
  2. some select to see if fd_write is ok
  3. send - Tried WSASend too, no lappend
  4. The programm continues polling select to see if rcv data is ready

120792-retrseqincremented.png

Windows 10 Network
Windows 10 Network
Windows 10: A Microsoft operating system that runs on personal computers and tablets.Network: A group of devices that communicate either wirelessly or via a physical connection.
2,286 questions
0 comments No comments
{count} votes

Accepted answer
  1. Gary Nebbett 5,721 Reputation points
    2021-08-09T10:07:13.94+00:00

    Hello @Eggi ,

    Thanks for the .etl file - it was satisfying to see how clearly it shows "tail loss probe" behaviour.

    In the .pcapng file that you made available, there were four retransmissions and one can measure the time from sending the last small segment to the retransmission of that data. For example:

    15:17:21.239458 192.168.127.10.57117 > 192.168.127.200.50101: P 16129:16385(256) ack 1 win 64544 (DF)  
    15:17:21.270311 192.168.127.10.57117 > 192.168.127.200.50101: P 12353:16385(4032) ack 1 win 64544 (DF)  
    

    Here, the interval is 30.85 milliseconds and the subsequent intervals are 38, 61 and 106 milliseconds.

    The default Tail Loss Probe (TLP) timeout is 2*SRTT (SRTT = smoothed round trip time); the .etl file shows how the SRTT develops during the transfer:

    121623-image.png

    The .etl file also contains explicit events for TLP - they look like this:

    121661-image.png

    The TLP mechanism "probes" with the next packet (rather than old (retransmitted)) data if it is available and allowed by the various "window" limitations (e.g. send window, congestion window). The .etl file shows how the congestion window develops over time:

    121636-image.png

    It does not look as though the send/congestion windows are the problem and this is confirmed by the connection summary:

    121701-image.png

    This basically says that the reasons for all the delays in sending data was lack of new data (the "R" before "Snd" in the last two lines is probably a typo in the ETW manifest).

    It seems that your application is only sending a new 16 kibibyte "message" (block of data) when the previous block has been "flushed" (sent and acknowledged), so the only data available for a TLP is old data (hence the retransmissions).

    The Windows side is behaving perfectly correctly and appropriately - it is the long delay before data is acknowledged that is causing the retransmissions.

    Gary

    1 person found this answer helpful.

2 additional answers

Sort by: Most helpful
  1. Gary Nebbett 5,721 Reputation points
    2021-08-05T16:44:02.38+00:00

    Hello @Eggi ,

    I am not sure if it is just a simple error or different use of terminology - I would call 192.168.127.10 the "client" since it sends the original "SYN" and 192.168.127.200 the "server" (sending SYN,ACK).

    15:17:20.736945 192.168.127.10.57117 > 192.168.127.200.50101: S 3381389475:3381389475(0) win 64544 <mss 4034,nop,wscale 8,nop,nop,sackOK> (DF)  
    15:17:20.751673 192.168.127.200.50101 > 192.168.127.10.57117: S 19990761:19990761(0) ack 3381389476 win 24192 <mss 4032>  
    15:17:20.751722 192.168.127.10.57117 > 192.168.127.200.50101: . ack 1 win 64544 (DF)  
    

    It would certainly be interesting to see the .etl file (assuming that it was created on the "client", the system that retransmits the data). I am now fairly confident that this is "tail loss probe" behaviour - the .etl file should confirm this.

    The "filtering on capture" possibilities are limited in this case - I would suggest just capturing everything. The events in the trace are not tagged with a network interface; one could imagine developing a program that "approximately" filters the captured data but there are no standard tools (that I am aware of) that do anything like this.

    Gary

    1 person found this answer helpful.

  2. Gary Nebbett 5,721 Reputation points
    2021-08-05T12:28:41.68+00:00

    Hello @Eggi ,

    Making a trace of the Microsoft-Windows-TCPIP ETW provider when performing your tests might be helpful - the events reported by the TCP/IP stack in the trace might give some insight into the retransmission decisions.

    There are a number of features of the trace that seem interesting:

    • MSS of 4034 (what technology is used for the connection?).
    • Client and server appear to be on the same network (192.168.127.200 and 192.168.127.20) and the round trip time might be approaching 15 milliseconds (quite high). The interval between frames 960 and 961, and also 967 and 969 are the basis for my RTT estimate.
    • Frame 968 might be acknowledging the data sent 45 milliseconds earlier; the fact that it arrives 125 microseconds after the retransmit means that it is unlikely to be an acknowledgement for that packet; the duplicate acknowledgement received 17 milliseconds later is probably an acknowledgment triggered by the arrival of the retransmitted data.

    Gary