Intermittent failure to open UDP socket

Phil B 1 Reputation point
2022-06-03T18:31:53.403+00:00

Summary of Problem:
We have a Win 10 application (hereafter, “the application/app”) that communicates with an external embedded Linux system (hereafter “the device”) via Wi-Fi. The Wi-Fi connection to the device is provided by an external AP – the PC is connected to that AP via a local ethernet cable from a PCIe NIC.
The app opens several TCP sockets and one UDP socket to the device.
Occasionally, after suffering an unexpected disconnect from the device (e.g., due to poor Wi-Fi signal strength), the application is unable to open the UDP socket to the device (TCP sockets are not affected). We’ve not made much progress on finding root cause, are unable to recreate the problem internally, and the only “fix” we’ve found is to reboot the PC. After that, the UDP socket can again connect.
Note that a full restart of the application has no effect – once the problem has occurred, it remains until the PC is rebooted. A reboot of the device also does not resolve the issue. For these and other reasons discussed below, we believe the problem is on the PC, not on the device.

Details:
Socket code in both the app and device has been able to reliably maintain and handle asynchronous disconnects (i.e., disconnects don’t prevent reconnection) without issue for years until this recent UDP socket problem. And we’ve been unable to attribute this problem to code changes - none of the relevant code has been modified for at least 6 years (per VCS history).
Our application has a DLL back end that communicates with the device. These DLLs are written in C++ and rely on the winsock API for networking.
The problem of being unable to open the UDP socket has a specific error signature and always occurs in the same way. What follows is an overview of the code path that leads to the error. Note that the C++ back end performs all available error checking on return values from calls into the winsock library.

  • The socket is created by calling the winsock socket() function and assigning its return value to a variable of type SOCKET.
  • The SO_REUSEADDR option is then set on the socket at the SOL_SOCKET level (via a single call to winsock’s setsockopt() function).
  • Winsock’s bind() is then called on the socket.
  • When this problem occurs, it is always the bind() function that fails, and the failure is always the same: bind() fails with winsock error code 10013.
    o This is the WSAEACCES error, and its documented description indicates some sort of permission problem.
    o However, we have eliminated things like the permission level of the application (i.e., the problem occurs even when running the application as an administrator). As described below, we do not believe firewall rules are the issue, but we have been experimenting with disabling the firewall on the network used by the app.
    To reiterate, once the 10013 error occurs, it continues to occur until the PC is rebooted – application restarts do not help.
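
For readers who want to follow the code path above, here is a minimal sketch of the socket() / setsockopt() / bind() sequence with the error code captured at each step. It is not the application's actual code: the helper name try_bind_udp, the use of IPv4 with INADDR_ANY, and the port argument are assumptions made for illustration, and the non-Windows branch exists only so the sketch builds elsewhere (error values differ there).

```cpp
#include <cstdint>

#ifdef _WIN32
  #include <winsock2.h>
  #include <ws2tcpip.h>
  #pragma comment(lib, "Ws2_32.lib")
  using socket_t = SOCKET;
  static const socket_t kInvalidSocket = INVALID_SOCKET;
  static int last_error() { return WSAGetLastError(); }
  static void close_socket(socket_t s) { closesocket(s); }
#else
  // POSIX fallback so the sketch also builds off Windows (errno, not WSA codes).
  #include <arpa/inet.h>
  #include <cerrno>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  using socket_t = int;
  static const socket_t kInvalidSocket = -1;
  static int last_error() { return errno; }
  static void close_socket(socket_t s) { close(s); }
#endif

// Hypothetical helper: returns 0 on success, otherwise the platform error code
// of the first failing call (on Windows, 10013 is WSAEACCES).
int try_bind_udp(std::uint16_t port) {
#ifdef _WIN32
    WSADATA wsa;
    int rc = WSAStartup(MAKEWORD(2, 2), &wsa);
    if (rc != 0) return rc;  // WSAStartup returns its error code directly
#endif
    int err = 0;

    // Step 1: create the UDP socket.
    socket_t s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (s == kInvalidSocket) {
        err = last_error();
    } else {
        // Step 2: set SO_REUSEADDR at the SOL_SOCKET level.
        int reuse = 1;
        if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR,
                       reinterpret_cast<const char*>(&reuse), sizeof(reuse)) != 0) {
            err = last_error();
        } else {
            // Step 3: bind. Assumption: any local IPv4 address, caller-chosen port.
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(port);
            if (bind(s, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) != 0) {
                err = last_error();  // the step reported to fail with 10013
            }
        }
        close_socket(s);  // error code was captured before closing
    }
#ifdef _WIN32
    WSACleanup();
#endif
    return err;
}
```

Calling try_bind_udp with the application's UDP port from a small standalone executable on an affected PC might help show whether the 10013 failure reproduces outside the application; passing port 0 asks the OS for an ephemeral port, which could indicate whether the failure is specific to one port.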

Version Information:
Our products use Windows 10’s Long-Term Servicing Channel, and we had no issues when running Windows 10 IoT Enterprise LTSB 2016 (10.0.14393). We first began hearing reports of this issue shortly after we introduced Windows 10 IoT Enterprise LTSC 2019 (10.0.17763) – specifically, 10.0.17763.1757, with ws2_32.dll version 10.0.17763.771.

Misc. Information:

  • This problem is recent; as far as we know, we’ve never seen it before.
  • Current data indicates the problem is relatively rare.
    o This system has an install base of several thousand units across hundreds of different sites but the problem has been observed only at a few sites in a few systems.
    o When the problem does occur on a system, it can occur repeatedly.
     We have resolved the problem at a few sites by making Wi-Fi configuration changes that lead to improved signal strength, thereby eliminating the connection drops that lead to this problem. But that doesn’t address root cause.
    o It seems likely that most sites would have occasional signal drops, yet we don’t yet have evidence of this problem being widespread.
     It’s unclear why it would occur at some sites for some connection drops but not others, though the system is complex and configurable; thus, there could be configuration differences that contribute to this problem (we just haven’t been able to identify any).
  • We have found one document online that may be applicable. Link below in case you may be able to comment on whether it’s relevant (it states it’s for Win Server 2012).
    UDP communication is blocked by the Windows Firewall rule in WSFC - Windows Server | Microsoft Learn
  • We have firewall rules in place to specifically allow this UDP traffic which have not changed recently, and in most cases communication is successful. However, we have not ruled it out as a root cause and have experimented with disabling the firewall at affected sites. Anecdotal reports have been positive, but data is inconclusive (small sample sizes/intermittent issue).
Tags: Windows 10, Windows for IoT, C++, Windows 10 Network, Windows 10 Compatibility

3 answers

  1. risolis 8,701 Reputation points
    2022-06-03T19:44:21.063+00:00

    Hello @Phil B

    Thank you for that excellent explanation given previously.

    I cannot deny that this caught my attention, and I would like to add some relevant questions and comments.

    • By any chance, do you have a packet capture of this?
    • Have you tested whether there is any asymmetric routing issue?
    • Have you checked for any equal-cost routing paths along the way?
    • For this two-way traffic, is there any NAT device in between, or a QoS policy that marks this traffic into a forwarding class?
    • Is any SSL/TLS certificate in use, with a symmetric or asymmetric key exchange method?
    • Have you used the iperf tool to send streaming data to the specific listening port?

    Looking forward to your feedback.

    Cheers,

    Please "Accept the answer" if the information helped you. This will help us and others in the community as well.


  2. risolis 8,701 Reputation points
    2022-06-04T08:19:40.363+00:00

    @Phil B

    Or did you mean a hub device, which is different from a dumb switch?

    Looking forward to your feedback.

    Cheers,



  3. risolis 8,701 Reputation points
    2022-06-04T20:33:49.447+00:00

    Hi @Phil B

    I just wanted to know if further assistance might be required on this : )

    Looking forward to your feedback,

