"Random" HTTPS issues, possibly related to DNS server

Cheap_Trick 1

Hi,

this is a rather strange problem and I'm stuck at this point.
Several users have issues with HTTPS connections either not working or only working after multiple attempts. Depending on the user, this may or may not be limited to certain applications, and it usually cannot be reproduced on other workstations.

Examples:
One user usually gets connection reset errors when connecting to bing.com, while everyone else can connect without problems.

Another user cannot access stackoverflow or deepl without refreshing multiple times. At some point the page will eventually load without CSS, and after a few more tries the CSS will work too. It then usually keeps working for the rest of the day, or at least for a longer period of time.

Our Jenkins server started having problems connecting to its (HTTPS) update servers, while the same URL can be opened from a browser on the same machine without issues. Switching the update servers to HTTP makes fetching the update information work, but the updates themselves are still downloaded via HTTPS and thus fail. (Java error: SSLHandshakeException)

For me, Office 365 Outlook decided to lose its connection to the Exchange servers after a while and was not able to reconnect. Additionally, when I sign out of my Office 365 account from Outlook (or any other Office 365 app), it fails to sign in again, simply closing the sign-in dialog after I enter the user name.
This behaviour started when the workstation was joined to the company domain and occurs with any user account, including local ones that didn't have a problem before.

The last issue does not appear when I switch to another network or change the DNS server from our domain controller to a public one (1.1.1.1 used for reference). Changing DNS and then reconnecting to the network (unplugging and re-plugging the cable) allows me to sign in without issues, even if everything else is still configured via DHCP.

Changing DNS did not help in the case of the Jenkins server, although reconnecting was not possible during my test, and on my machine reconnecting seemed to be required for the procedure to work.
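
To see whether handshake failures like the Jenkins one are intermittent outside of Java, a repeated probe can help. The following is only a minimal Python sketch, not something from the setup described here: the function name `probe_tls` and its `connect` parameter are made up for illustration, and it merely tallies how each of several handshake attempts ends.

```python
import socket
import ssl
from collections import Counter


def probe_tls(host, port=443, attempts=5, timeout=5.0,
              connect=socket.create_connection):
    """Attempt several TLS handshakes and tally the outcomes by type.

    `connect` is a hypothetical hook so the TCP step can be swapped out.
    """
    context = ssl.create_default_context()
    outcomes = Counter()
    for _ in range(attempts):
        try:
            with connect((host, port), timeout=timeout) as raw:
                # wrap_socket performs the handshake before returning.
                with context.wrap_socket(raw, server_hostname=host) as tls:
                    outcomes[f"ok {tls.version()}"] += 1
        except OSError as exc:
            # ssl.SSLError, connection resets, timeouts and DNS lookup
            # failures are all OSError subclasses.
            outcomes[type(exc).__name__] += 1
    return outcomes
```

Running something like `probe_tls("bing.com")` once with the domain controller as DNS server and once with 1.1.1.1 would show whether resets or handshake errors cluster on one of the two configurations.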

I should also mention that many of the other users encountering issues are not yet domain members, as the domain is currently being rolled out, so the problem doesn't appear to be (directly) connected to that. This should also rule out faulty GPOs, as those workstations don't have any applied yet.

I initially suspected our SonicWall firewall, but it seems I can rule it out as the cause (at least for the Office 365 problem).

DHCP config is very basic and just assigns IP, netmask, gateway, DNS and DNS domain name.
DNS has three forward zones which haven't changed for a while and were in use when everything still worked as expected.
DNS also has 1.1.1.1 and 8.8.8.8 set as forwarders for all requests that cannot be resolved locally. It is set up to use root hints if no forwarders are available, which shouldn't occur.
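
Since the domain controller only forwards to 1.1.1.1 and 8.8.8.8, asking both the internal server and a forwarder the same question directly is one way to compare them. Below is a rough Python sketch under that assumption; the helper names (`build_query`, `parse_header`, `ask`) are invented for this example, and it only reads the reply header (response code, truncation flag, answer count) rather than decoding the records.

```python
import socket
import struct


def build_query(name, txid=0x1234, qtype=1):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: id, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def parse_header(packet):
    """Return (txid, rcode, truncated, answer_count) from a DNS reply."""
    txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    return txid, flags & 0x000F, bool(flags & 0x0200), ancount


def ask(server, name, timeout=3.0):
    """Send one UDP query to `server` and summarise the reply header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(4096)
    return parse_header(reply)
```

Comparing `ask("1.1.1.1", "bing.com")` with `ask("192.0.2.53", "bing.com")` (the second address is a placeholder for the domain controller, not a real value from this network) would show differences in response code, truncation or answer count between the two.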

I really need to get this fixed soon as it is starting to affect production systems, but I can't seem to pinpoint an actual cause.

Any help is much appreciated.

15 answers

Cheap_Trick 1 Reputation point

2020-09-11T08:47:24.96+00:00

The applications are definitely not at fault, as stated in my original description of the issue.

Anonymous

2020-09-11T09:46:40.133+00:00

"The applications are definitely not at fault"

Of course. You can start a case here with product support.
https://support.microsoft.com/en-us/hub/4343728/support-for-business

--please don't forget to Accept as answer if the reply is helpful--

Cheap_Trick 1 Reputation point

2020-09-14T07:42:18.3+00:00

How is the application supposed to be at fault if the same application works just fine in another network or on another computer? Plus, it's a variety of applications that can have the issues. For some users it's all browsers, for me it's Outlook, and it's also Jenkins on Tomcat.
It seems to be some combination of app, workstation and network configuration + DNS server. Maybe a configuration issue. I thought about timeouts, but name resolution is rather fast (<100ms including forwarding, usually around 50ms). I can try opening tickets for the applications, but it's like running in circles.
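
For reference, lookup times like the ones quoted above can be collected with a small script. This is a hedged Python sketch, not the method actually used for those numbers; `time_lookups` and its `resolve` parameter are hypothetical names. It goes through the system resolver, so repeated lookups may be served from the local cache, and the maximum is often more telling than the median.

```python
import socket
import statistics
import time


def time_lookups(hostname, runs=10, resolve=socket.getaddrinfo):
    """Time repeated lookups through the system resolver, in milliseconds."""
    samples, failures = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        try:
            resolve(hostname, 443)
        except socket.gaierror:
            # Resolution failed outright; count it instead of timing it.
            failures += 1
            continue
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "failures": failures,
        "median_ms": statistics.median(samples) if samples else None,
        "max_ms": max(samples) if samples else None,
    }
```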

Cheap_Trick 1 Reputation point

2020-09-23T07:54:16.87+00:00

Since I wasn't getting anywhere with the previous approach, I set up a Linux-based BIND DNS server for testing purposes. It forwards my internal lookup zones to the Windows-based DNS (the domain controller) and everything else directly to 1.1.1.1. So technically it does what the Windows DNS is supposed to do.
Difference is: now the issues are gone.
So I stand by my statement: the applications are definitely not at fault.

While the Linux DNS is a viable workaround, I still need to find out what causes the Windows-based DNS to produce these issues in the first place. Is there anything a Windows DNS server does in the background that is not as obvious as you may think?

I hope this thread is not dead yet. It would be ironic if the final solution to a Windows problem in a Microsoft forum would be to use Linux instead.

Abdul Gafoor 1 Reputation point

2020-09-24T13:03:51.13+00:00

Hello,

I had a somewhat similar issue. It all started after installing updates on September 19, 2020. We usually update servers every 3 months. I struggled with it for almost a week before I could find the source of the issue. Digging further, I came to the conclusion that security update KB4577066 was causing this. I uninstalled that update and it's working OK so far, but I'm still monitoring.

The behavior was that most SSL sites would take a long time to resolve and some never would; some users didn't have this issue with the same set of SSL sites that others had problems with. The DNS cache could be playing a role here. Our internal DNS servers forward external name resolution to DNS servers in the DMZ, which in turn forward requests to our ISP's DNS servers. The DNS in the DMZ had the issue. Every time we started the server, the issue would go away for 10 minutes and then come back. After uninstalling the above security update, the issue has not returned, even after almost 2 hours now.

So, you might want to check that as well.

Regards,
