Summary of Problem:
We have a Win 10 application (hereafter, “the application/app”) that communicates with an external embedded Linux system (hereafter, “the device”) via Wi-Fi. The Wi-Fi connection to the device is provided by an external AP; the PC is connected to that AP by a local Ethernet cable from a PCIe NIC.
The app opens several TCP sockets and one UDP socket to the device.
Occasionally, after suffering an unexpected disconnect from the device (e.g., due to poor Wi-Fi signal strength), the application is unable to open the UDP socket to the device (TCP sockets are not affected). We have made little progress toward a root cause and cannot reproduce the problem internally; the only “fix” we have found is to reboot the PC, after which the UDP socket can be opened again.
Note that a full restart of the application has no effect – once the problem has occurred, it remains until the PC is rebooted. A reboot of the device also does not resolve the issue. For these and other reasons discussed below, we believe the problem is on the PC, not on the device.
The socket code in both the app and the device has reliably handled asynchronous disconnects (i.e., disconnects don’t prevent reconnection) for years, until this recent UDP socket problem. We have also been unable to attribute the problem to code changes: none of the relevant code has been modified for at least 6 years (per VCS history).
Our application has a DLL back end that communicates with the device. These DLLs are written in C++ and rely on the winsock API for networking.
The failure to open the UDP socket has a specific error signature and always occurs in the same way. What follows is an overview of the code path that leads to the error. Note that the C++ back end performs all available error checking on return values from calls into the winsock library.
- The socket is created by calling the winsock socket() function and assigning its return value to a variable of type SOCKET.
- The SO_REUSEADDR option is then set on the socket at the SOL_SOCKET level (via a single call to winsock’s setsockopt() function).
- Winsock’s bind() is then called on the socket.
- When this problem occurs, it is always the bind() call that fails, and the failure is always the same: bind() fails with winsock error code 10013.
o This is the WSAEACCES error, and descriptions of it online indicate some sort of permission problem.
o However, we have ruled out the permission level of the application as a factor (i.e., the problem occurs even when the application is run as an administrator). As described below, firewall rules do not appear to be the cause, although we have not fully ruled them out and have been experimenting with disabling the firewall on the network used by the app.
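The code path described above can be sketched roughly as follows. This is a minimal illustration, not the production back end: the names open_udp_socket, OpenResult and FailedStep are hypothetical, binding to INADDR_ANY and the choice of local port are assumptions (the actual bind address and port are not stated here), and the POSIX branch exists only so that the sketch also builds off Windows.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
  #include <winsock2.h>
  #include <ws2tcpip.h>
  #pragma comment(lib, "ws2_32.lib")
  using socket_t = SOCKET;
  inline int last_socket_error() { return WSAGetLastError(); }
  inline void close_socket(socket_t s) { closesocket(s); }
#else
  // POSIX stand-ins (an assumption of this sketch) so it also builds off Windows.
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <cerrno>
  using socket_t = int;
  constexpr socket_t INVALID_SOCKET = -1;
  constexpr int SOCKET_ERROR = -1;
  inline int last_socket_error() { return errno; }
  inline void close_socket(socket_t s) { close(s); }
#endif

// Hypothetical result type: which step failed, and with which error code.
enum class FailedStep { None, Startup, Socket, SetSockOpt, Bind };

struct OpenResult {
    FailedStep failed;
    int error;       // WSAGetLastError() value on Windows, errno elsewhere
    socket_t sock;   // valid only when failed == FailedStep::None
};

// socket() -> setsockopt(SOL_SOCKET, SO_REUSEADDR) -> bind(), checking each call.
OpenResult open_udp_socket(std::uint16_t local_port) {
#ifdef _WIN32
    // Winsock must be initialised once per process; a real application would
    // pair this with WSACleanup() at shutdown.
    WSADATA wsa;
    int rc = WSAStartup(MAKEWORD(2, 2), &wsa);
    if (rc != 0) return {FailedStep::Startup, rc, INVALID_SOCKET};
#endif
    socket_t s = ::socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (s == INVALID_SOCKET)
        return {FailedStep::Socket, last_socket_error(), INVALID_SOCKET};

    int reuse = 1;
    if (::setsockopt(s, SOL_SOCKET, SO_REUSEADDR,
                     reinterpret_cast<const char*>(&reuse),
                     sizeof(reuse)) == SOCKET_ERROR) {
        int err = last_socket_error();
        close_socket(s);
        return {FailedStep::SetSockOpt, err, INVALID_SOCKET};
    }

    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);  // assumption: any local interface
    local.sin_port = htons(local_port);         // assumption: caller supplies the port

    // This is the call that fails with 10013 (WSAEACCES) in the reported problem.
    if (::bind(s, reinterpret_cast<const sockaddr*>(&local),
               sizeof(local)) == SOCKET_ERROR) {
        int err = last_socket_error();
        close_socket(s);
        return {FailedStep::Bind, err, INVALID_SOCKET};
    }
    return {FailedStep::None, 0, s};
}
```

In the failure case described here, a call such as open_udp_socket(port) would be expected to come back with failed == FailedStep::Bind and error == 10013 on the affected PC, while the socket() and setsockopt() steps succeed.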
To reiterate, once the 10013 error occurs, it continues to occur until the PC is rebooted – application restarts do not help.
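For field logs, it can help to record the symbolic name of the error alongside the number. The helper below is a small sketch under stated assumptions: winsock_error_name and describe_bind_failure are hypothetical names, and the table covers only a handful of the error codes that bind() is documented to return, using their standard Winsock numeric values.

```cpp
#include <cassert>
#include <string>

// Maps a few Winsock error codes relevant to bind() to their symbolic names.
// The numeric values are the standard Winsock definitions; the list is not exhaustive.
const char* winsock_error_name(int code) {
    switch (code) {
        case 10013: return "WSAEACCES";          // permission denied
        case 10014: return "WSAEFAULT";          // bad address
        case 10022: return "WSAEINVAL";          // invalid argument
        case 10038: return "WSAENOTSOCK";        // descriptor is not a socket
        case 10048: return "WSAEADDRINUSE";      // address already in use
        case 10049: return "WSAEADDRNOTAVAIL";   // address not valid on this host
        case 10050: return "WSAENETDOWN";        // network subsystem has failed
        case 10055: return "WSAENOBUFS";         // no buffer space available
        case 10093: return "WSANOTINITIALISED";  // WSAStartup() not yet called
        default:    return "unknown";
    }
}

// Builds a one-line log message for a failed bind() call.
std::string describe_bind_failure(int code) {
    return std::string("bind() failed: ") + winsock_error_name(code) +
           " (" + std::to_string(code) + ")";
}
```

For the error seen in this problem, describe_bind_failure(10013) produces the string "bind() failed: WSAEACCES (10013)".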
Our products use Windows 10’s Long-Term Servicing Channel, and we had no issues when running Windows 10 IoT Enterprise LTSB 2016 (10.0.14393). We first began hearing reports of this issue shortly after we introduced Windows 10 IoT Enterprise LTSC 2019 (10.0.17763) – specifically, 10.0.17763.1757, with ws2_32.dll version 10.0.17763.771.
- This problem is recent; as far as we know, we’ve never seen it before.
- Current data indicates the problem is relatively rare.
o This system has an install base of several thousand units across hundreds of different sites, but the problem has been observed on only a few systems at a few sites.
o When the problem does occur on a system, it can occur repeatedly.
- We have resolved the problem at a few sites by making Wi-Fi configuration changes that improve signal strength, thereby eliminating the connection drops that lead to this problem. But that doesn’t address the root cause.
o It seems likely that most sites would have occasional signal drops, but so far we have no evidence of this problem being widespread.
o It’s unclear why it would occur at some sites for some connection drops but not others, though the system is complex and configurable; thus, there could be configuration differences that contribute to this problem (we just haven’t been able to identify any).
- We have found one document online that may be applicable. It is linked below in case you are able to comment on whether it’s relevant (it states that it applies to Windows Server 2012).
UDP communication is blocked by the Windows Firewall rule in WSFC - Windows Server | Microsoft Learn
- Firewall rules that specifically allow this UDP traffic are in place and have not changed recently, and in most cases communication is successful. However, we have not ruled the firewall out as a root cause and have experimented with disabling it at affected sites. Anecdotal reports have been positive, but the data is inconclusive (small sample sizes, intermittent issue).