question

53287413 asked 53287413 published

Server drops ping traffic in certain conditions

I'd like to pose a problem I'm looking at, for which I have no real idea of the root cause. We're trying to run a backup from one server (node A) on one network, across a firewall, to a backup server (node B) on another network. That started failing for some reason.

Long story short: ping traffic from node A to node B dies as soon as node B pings back to node A.

I was under the impression the firewall appliance was to blame (we replaced it with a newer model and thus had to rebuild the config), but testing seems to rule that out.

Basic understanding of the situation:

 NODE A ---------------- FW ---------------- NODE B
          |                              |
 NODE C --                                -- NODE D

All nodes are Windows machines. B needs to connect to A.
Nodes A and C are in the same network, and so are nodes B and D.
FW is the firewall appliance.
The pings below are all done by IP address, so DNS plays no part in this.

When I initiate a perpetual ping from A to B, responses come normally.
If I ping from B to A, this also works as expected, but as soon as machine B receives the first response, the ping from A to B dies (request timed out). Only after a while without pings does some kind of 'reset' apparently happen somewhere, and the ping from A to B becomes possible again. During this time the ping from B to A keeps working.

I've disabled the Windows firewalls on both A and B, with no change in behavior.

As a test I'm also trying to ping from A to D. This works normally, and when pinging from D to A no change in the ping from A to D happens. Either side can ping the other without any issue. So the problem does not seem to be on the side of A.

Node C is a laptop, which I've tried the same with... But any ping to A, B or D works, and does not timeout if I try the ping from A, B or D to C (which also work as expected).

In order to further troubleshoot this, I moved C from the left side to the right side of the firewall, and set it up with the same IP address as B. Disconnected B from the network, plugged in C, and redid the test. Ping from A to C and C to A remains in working order either way. So that kind of narrows it down to an issue on node B, seeing the firewall and rules remain the same for the communication.

If I add a secondary IP to node B, the problem remains with the primary IP address, but the secondary IP address works as expected. Changing the IP of node B would probably make the whole problem go away, but this would entail a lot of reconfiguration in the backup setup, with replica partners, source servers and the like. So if possible I'd like to leave the IP as-is.

Something on node B with regards to the IP stack seems to go wrong, but I can't really see anything I might be able to do to further narrow down the cause. Does anyone have any ideas here on what I might try or check to resolve this issue?

windows-server-2019

DSPatrick answered DSPatrick commented

tracert 192.168.10.99
may reveal something.

--please don't forget to upvote and Accept as answer if the reply is helpful--




Just checking if there's any progress or updates?

--please don't forget to upvote and Accept as answer if the reply is helpful--



53287413 answered

We did test that... But seeing it's Saturday I don't have the test situation handy at the moment, and I'll need to get back to you on those results on Monday.

That being said, I do recall the tracert from node B to A completed normally, whereas the tracert from node A to B died at the gateway (firewall). So no traffic got beyond the firewall.

In the diagnosis with the firewall vendor it was stated that the traffic isn't following the same path either way. That might shed some light on something here.

Node A talks to the firewall which has a direct connection to the same network as node B.
Node B has a default gateway at the WAN firewall within that network, which has a route that points to the firewall pictured above. And that one has a path to node A.

One of the things we did note (and I did point it out in my OP) was that pinging between node A and node D (both ways) isn't hampered. So that seems to indicate the paths are solid and the firewall is doing its job.

As stated, node B is able to ping node A, but once that's done, node A loses the ability to ping node B.
And when I replace node B with node C (the laptop), there is no issue between A and C or C to A whatever I do.

So it seems node B somehow alters an ARP table somewhere, allowing itself to keep pinging, while apparently telling the firewall that there is no node with its IP address on the network for anything traversing the firewall. That might seem far-fetched, and it gets stranger: node C, when located next to node A, is fully capable of pinging nodes A and D, and both of those nodes remain able to ping node C. No issues whatsoever. So that tells me the ARP table on the firewall is sound.


53287413 answered 53287413 published

So, for simplicity's sake, if we're talking about the routing too, I need to include the default gateway for the network on the node B side as well:

 Node A --- FWa --- Node B
                 |
                 -- FWb --- WAN

Basically everything in the network of node B has the default gateway of FWb, which talks to the WAN infrastructure. So any ping going from B to A passes by FWb first (which has a route telling the network A is on is reachable via FWa, so traffic is redirected to FWa). Any ping from A to B is passed directly from FWa to node B since they're on the same network.

That said...

Initiated a perpetual ping on node A to node B. No issues, no timeouts.
At that time I also ran a tracert on A to the IP destination of B:

 Tracing route to Backup.domain.local [x.x.x.x]
 over a maximum of 30 hops:
    
   1    <1 ms    <1 ms    <1 ms  y.y.y.y
   2    <1 ms    <1 ms    <1 ms  Backup.domain.local [x.x.x.x]
    
 Trace complete.

y.y.y.y is the IP address configured on the side of node A on FWa, and x.x.x.x is the destination address of node B.

So, node A talks to FWa, which talks directly to node B.

When initiating a tracert on node B to node A, I get the following result (at hop 1, node A reports its ping resulting in a 'request timed out').

 Tracing route to Server [z.z.z.z]
 over a maximum of 30 hops:
    
   1     *        *        *     Request timed out.
   2    <1 ms    <1 ms    <1 ms  w.w.w.w
   3    <1 ms    <1 ms    <1 ms  Server [z.z.z.z]
    
 Trace complete.

The first hop is FWb, which doesn't respond to the trace but does forward the traffic to FWa.
w.w.w.w is the gateway address on the side of node B on FWa.
z.z.z.z is the destination IP of node A.

The routing table on node A has NO knowledge of the network node B resides in, so it sends its traffic to the default gateway (y.y.y.y in the above results), which is FWa.

The routing table on node B has NO knowledge of the network node A resides in, so it also uses the default gateway for its own network, which is FWb.

We did try to alter the routing table on node B to point directly at FWa for the network of node A. This cuts out the additional hop via the default gateway when going from node B to node A.

 route add z.z.z.0 mask 255.255.255.0 w.w.w.w
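
The effect of that static route can be sketched as a longest-prefix-match lookup. This is only an illustration: the real addresses are masked in this thread, so the ones below are hypothetical documentation addresses, and `next_hop` is an illustrative helper rather than anything Windows provides.

```python
import ipaddress

def next_hop(routes, destination):
    """Return the gateway of the most specific route that matches destination."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if dest in net]
    # The longest prefix wins, which is why a static /24 overrides the default route.
    _, gateway = max(matches, key=lambda item: item[0].prefixlen)
    return gateway

# Hypothetical stand-ins for the masked addresses.
NODE_A = "192.0.2.10"          # stands in for z.z.z.z
FWA_B_SIDE = "198.51.100.1"    # stands in for w.w.w.w (FWa, node B side)
FWB = "198.51.100.254"         # stands in for node B's default gateway (FWb)

# Node B initially only has a default route via FWb.
routes_b = [(ipaddress.ip_network("0.0.0.0/0"), FWB)]
before = next_hop(routes_b, NODE_A)

# Rough equivalent of: route add z.z.z.0 mask 255.255.255.0 w.w.w.w
routes_b.append((ipaddress.ip_network("192.0.2.0/24"), FWA_B_SIDE))
after = next_hop(routes_b, NODE_A)

print("next hop before static route:", before)
print("next hop after static route: ", after)
```

Before the static route, traffic from B to A leaves via FWb while the return traffic from A arrives straight from FWa, which is the asymmetric path the firewall vendor pointed out. After it, both directions go through FWa.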

The tracert on node A remains the same (once the ping resumed). The tracert on node B to node A now shows:

 Tracing route to Server [z.z.z.z]
 over a maximum of 30 hops:
    
   1    <1 ms    <1 ms    <1 ms  w.w.w.w
   2    <1 ms    <1 ms    <1 ms  Server [z.z.z.z]
    
 Trace complete.

So it cuts out the FWb firewall as expected, but again on the first hop here, node A starts telling me the request timed out on the ping.


53287413 answered 53287413 edited

As an addendum: node B is normally reachable over the WAN (behind FWb) and from all servers in its own network. So it's just the connection with node A that causes headaches.


53287413 answered 53287413 edited

Does nobody have any insight into why this issue might be occurring? :(


Hello,

If you are prepared to share trace data, then I would look at it. The commands to start and stop a trace are:

pktmon start --capture --comp nics --flags 0x10 --trace --provider Microsoft-Windows-WFP --provider Microsoft-Windows-TCPIP --keywords 0x3FFFFFFFFFFF --level 17 --file-name why.etl

pktmon stop

The trace should be made on node B and cover the period from some initially successful pings from A to B and a few pings from B to A. The data that could be shared is why.etl.

Gary


Since I'm really walking into a wall here, I'm open to any kind of help to see what is causing this behavior.

What I did:

1) Set up a perpetual ping from A to B.
2) Verified that the PKTMON application was available on node B.
3) Started the application, and let it sit for a couple of seconds.
4) Initiated a ping (just 4 pings) on node B to node A.
5) Saw the ping traffic on node A die.
6) Stopped the PKTMON application.

Grabbed the why.etl file (500 MB). Zipped the file (password protected) resulting in a ZIP just under 5 MB.

Uploaded it to OneDrive. This site doesn't seem to have a private messaging feature through which I can provide the link to the ZIP and the password, though (I'm a bit reluctant to post the ETL in the open).

You can send me an email at -myemailaddress-, and I should be able to provide you with the link, and the password. Once you get the file, I could then remove it from OneDrive.

black407325 answered 53287413 commented

Why not check the firewall settings? I believe it will help.
Firewall - Advanced settings - Outbound rules


Even when I completely disable the Windows firewall on either node, even at the same time, the result is the same. So it's not the Windows firewall.

I'm having a mail conversation with GaryNebbett about this, and it seems (for now) that something on NodeB is somehow malforming the ping reply packets, but only after NodeB tries to ping NodeA.

As it stands it seems NodeB is somehow doing something it's not supposed to do, and tracing/logging may provide some insight.

Of course, once a cause is found, I'll update this thread so others can make use of it :)

53287413 answered 53287413 published

TL;DR: RRAS (which we don't even use on that machine) being difficult.

After I provided Gary with a trace log from the pktmon application (an ETL file) for both servers, the following was found:

When pinging from NodeA to NodeB, the following showed on the trace on NodeB:

 request  id=0x0001, seq=54370
 reply  id=0x0001, seq=54370

So that's expected.

Once the ping on NodeB was initiated, however, the id in the reply changed:

 request  id=0x0001, seq=54376
 reply  id=0x0100, seq=54376

And that changed id (0x0001 became 0x0100, i.e. its two bytes were swapped) was causing the issue for some reason. The firewall vendor also noticed it happening, but couldn't really indicate a reason for it, or what specifically it entailed. Gary, however, suspected a third-party application on NodeB as the potential root cause.
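
To make the id change concrete, here is a minimal sketch showing that 0x0001 and 0x0100 are the same two bytes read in opposite byte order. The `reply_matches` check is purely hypothetical: it assumes a stateful device pairs echo replies with requests on (id, seq), which is one plausible reading of why the reply was dropped, not something the traces in this thread prove.

```python
import struct

def swap16(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)

request_id = 0x0001   # id seen in the echo request
reply_id = 0x0100     # id seen in the echo reply once NodeB had pinged NodeA
seq = 54376

# Pack the id in network (big-endian) order, then read it back little-endian.
wire_bytes = struct.pack("!H", request_id)
misread_id = struct.unpack("<H", wire_bytes)[0]

# Hypothetical stateful check: accept a reply only if it matches a pending request.
pending_requests = {(request_id, seq)}

def reply_matches(icmp_id, icmp_seq):
    return (icmp_id, icmp_seq) in pending_requests

print(f"swap16(0x{request_id:04x}) = 0x{swap16(request_id):04x}")
print(f"misread id          = 0x{misread_id:04x}")
print("unaltered reply accepted:", reply_matches(request_id, seq))
print("rewritten reply accepted:", reply_matches(reply_id, seq))
```

Under that assumption, a reply whose id no longer equals the request's id would not be recognised as belonging to the original ping and would be discarded, which would look like a timeout on node A.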

Based on his hunch, I ran through the installs on the machine (both Windows Update and applications), and constructed an exact timeline of when the problems occurred and which program was installed when. Unfortunately this did not yield anything conclusive.

Next up were the NDIS drivers (pktmon list -all). A possible cause might have been a third-party NDIS driver or something similar, but this turned up only the standard Microsoft NDIS filter drivers:

 Id Driver      Name
 -- ------      ----
 21 wfplwfs.sys WFP Native Filter
 20 pacer.sys   QoS Packet Scheduler
 19 wfplwfs.sys WFP 802.3 Filter

So that seemed like a dead end too...

Next up were the WFP callout drivers. On VPN servers this output would include the shared secrets (maybe that will come in handy sometime down the line :)), but since NodeB is a backup machine with no VPN, it shouldn't expose anything sensitive: netsh wfp show state file=nodeB.xml

The output in this file pointed at the odd 'RRAS NAT Driver' entries being in place, which is kind of weird for a backup server. My guess is that Microsoft Hyper-V is present for the backup application to use.

As an aside, the backup application can do three things with Hyper-V, of which we only use the first:
- Dump the backed-up data to a vdisk, mount that and see if the data is readable (verification of the backup)
- Dump the backed-up data to a virtual machine, boot it and see if the machine is bootable (verification of the backup, or when restoring data, to gain access to the machine in order to retrieve specific data from it)
- Dump the backed-up data to a virtual machine, and boot that machine in the 'live' environment, allowing access to the machine despite it running off data in the backup database.

As stated, we only do the vdisk verification, so the network part isn't needed, but since Hyper-V is installed, it's a reasonable assumption that RRAS is part of that install.

To further test this, we ran this command on a couple of servers: netsh wfp show state file=- | findstr RRAS.NAT

This yields no results if RRAS isn't installed. On 5 servers (including both NodeA and NodeD from my opening post) the command produced no output, indicating RRAS was not installed. On NodeB, however:

 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
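
For anyone who wants to run the same check against a saved state file rather than through findstr, a minimal sketch follows. The function name `count_rras_nat_entries` and the "clean" sample line are illustrative assumptions; only the `<name>RRAS NAT Driver</name>` lines come from the actual output above.

```python
import re

def count_rras_nat_entries(wfp_state_text):
    """Count <name> elements that mention the RRAS NAT driver."""
    pattern = r"<name>\s*RRAS NAT Driver\s*</name>"
    return len(re.findall(pattern, wfp_state_text))

# The five lines reported on NodeB.
node_b_output = """
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
 <name>RRAS NAT Driver</name>
"""

# Hypothetical stand-in for a server without RRAS (no matching entries).
other_server_output = " <name>Some other callout</name>\n"

node_b_count = count_rras_nat_entries(node_b_output)
other_count = count_rras_nat_entries(other_server_output)

print("NodeB RRAS NAT entries:", node_b_count)
print("Other server RRAS NAT entries:", other_count)
```

To use it on real data, pass in the text of the file written by netsh wfp show state file=nodeB.xml. A count of zero corresponds to the empty findstr result seen on the other servers.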

When checking the Server Manager, it was found that RRAS was installed on the machine.

As a further aside, the machine is a Windows machine, but was delivered as an 'appliance', so fully installed. We did not specifically add or remove roles from it. So the RRAS role was added by the backup vendor.

When opening the RRAS manager and looking under IPv4 > General, I can open the properties of the affected network card with the IP address of NodeB.

There is a nice checkbox here, 'Enable IP Router Manager', which was enabled, and on a whim I unchecked it.

When we then retried the ping, there was NO timeout anymore. So this is resolved, with many thanks to Gary.
