Server drops ping traffic in certain conditions

Question

I'd like to pose a problem I'm looking at, for which I have no real idea of the root cause. We're trying to run a backup from one server (node A) in one network, through a firewall, to a backup server (node B) in another network, and that started failing for no apparent reason.

Long story short: ping traffic from node A to node B dies as soon as node B pings back to node A.

I was under the impression the firewall appliance was to blame (we replaced it with a newer model and thus had to rebuild the config), but testing seems to rule that out.

Basic understanding of the situation:

NODE A ---------------- FW ---------------- NODE B
         |                              |
NODE C --                                -- NODE D

All nodes are Windows machines. B needs to connect to A.
Nodes A and C are in the same network, and so are nodes B and D.
FW is the firewall appliance.
All pings below are done by IP address, so DNS plays no part in this.

When I initiate a perpetual ping from A to B, responses come normally.
If I then ping from B to A, this also works as expected, but the moment machine B receives its first response, the ping from A to B dies (request timed out). Only after a while without pings does some 'reset' apparently happen somewhere, and the ping from A to B is possible again. During this time the ping from B to A keeps working.
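To pin down exactly when the A-to-B ping dies and when it recovers, the output of `ping -t` on node A could be saved to a file and scanned for runs of timeouts. Below is a minimal sketch in Python, assuming the standard English Windows ping output; the address 10.0.2.20 in the sample is a made-up stand-in for node B, and the function names are only illustrative.

```python
import re

# Windows ping prints one result line per echo request, for example:
#   "Reply from 10.0.2.20: bytes=32 time<1ms TTL=127"
#   "Request timed out."
REPLY = re.compile(r"^Reply from \S+: bytes=\d+")


def classify(line):
    """Return 'ok', 'timeout', or None for lines that are not ping results."""
    line = line.strip()
    if REPLY.match(line):
        return "ok"
    if line == "Request timed out.":
        return "timeout"
    return None


def outage_windows(lines):
    """Return (first, last) index pairs for each run of consecutive timeouts.

    Only result lines are counted, so index 0 is the first echo request.
    """
    windows, start, idx = [], None, -1
    for line in lines:
        state = classify(line)
        if state is None:
            continue  # header, blank line, or some other message
        idx += 1
        if state == "timeout" and start is None:
            start = idx
        elif state == "ok" and start is not None:
            windows.append((start, idx - 1))
            start = None
    if start is not None:  # log ended while the ping was still failing
        windows.append((start, idx))
    return windows


# Hypothetical log captured on node A (10.0.2.20 is a made-up address).
sample = """Pinging 10.0.2.20 with 32 bytes of data:
Reply from 10.0.2.20: bytes=32 time<1ms TTL=127
Reply from 10.0.2.20: bytes=32 time<1ms TTL=127
Request timed out.
Request timed out.
Request timed out.
Reply from 10.0.2.20: bytes=32 time<1ms TTL=127
Request timed out.
""".splitlines()

print(outage_windows(sample))  # [(2, 4), (6, 6)]
```

Comparing the start of each window with the moment the ping on node B was started should show whether the two events really coincide every time.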

I've disabled the Windows firewalls on both A and B, with no change in behavior.

As a test I'm also trying to ping from A to D. This works normally, and when pinging from D to A no change in the ping from A to D happens. Either side can ping the other without any issue. So the problem does not seem to be on the side of A.

Node C is a laptop, with which I've tried the same... But any ping to A, B or D works, and does not time out if I try the ping from A, B or D to C (which also works as expected).

In order to further troubleshoot this, I moved C from the left side to the right side of the firewall, and set it up with the same IP address as B. Disconnected B from the network, plugged in C, and redid the test. Ping from A to C and C to A remains in working order either way. So that kind of narrows it down to an issue on node B, seeing the firewall and rules remain the same for the communication.

If I add a secondary IP to node B, the problem remains with the primary IP address, but the secondary IP address keeps working as expected. If I changed the IP of node B the whole problem would probably go away, but this would entail a lot of reconfiguration in the backup with replica-partners and source servers and the like... So if possible I'd like to leave the IP as-is.

Something on node B with regards to the IP stack seems to go wrong, but I can't really see anything I might be able to do to further narrow down the cause. Does anyone have any ideas here on what I might try or check to resolve this issue?

Answer

tracert 192.168.10.99
may reveal something.


Answer

We did test that... But seeing it's Saturday I don't have the test situation handy at the moment, and I'll need to get back to you on those results on Monday.

That being said, I do recall the tracert from node B to A completed normally, whereas the tracert from node A to B died at the gateway (firewall)... So no traffic beyond the firewall.

In the diagnosis with the firewall vendor it was stated that the traffic isn't following the same path either way. That might shed some light on something here.

Node A talks to the firewall which has a direct connection to the same network as node B.
Node B has a default gateway at the WAN firewall within that network, which has a route that points to the firewall pictured above. And that one has a path to node A.

One of the things we did note (and I did point it out in my OP) was that pinging between node A and node D (both ways) isn't hampered... So that seems to indicate the paths are solid and the firewall is doing its job.

As stated, node B is able to ping node A, but once that's done, node A loses the capacity to ping node B.
And when I replace node B with node C (the laptop), there is no issue between A and C or C to A whatever I do.

So it seems node B somehow alters some ARP table somewhere, allowing itself to keep pinging, but apparently telling the firewall that there is no node with its IP address on the network for anything traversing the firewall. Now that might seem far-fetched, but it gets worse in so far that node C, when located next to node A, is fully capable of pinging node A and D, and both those nodes remain able to ping node C. No issues whatsoever. So that kind of tells me the ARP table on the firewall is sound.
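One way to check the ARP theory would be to take an `arp -a` snapshot before and after node B starts pinging and compare the two. Below is a minimal sketch in Python, assuming the snapshots come from a Windows machine in node B's network (node D, for instance); the firewall's own ARP table will be formatted differently, and all addresses and MACs in the sample are made up.

```python
import re

# Matches an entry line of Windows `arp -a` output, for example:
#   "  10.0.2.20             00-11-22-33-44-55     dynamic"
ENTRY = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+(\w+)\s*$"
)


def parse_arp(text):
    """Map IP address -> MAC address (lowercase) from `arp -a` output.

    If several interfaces list the same IP, the last one wins.
    """
    table = {}
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            table[m.group(1)] = m.group(2).lower()
    return table


def changed_entries(before, after):
    """Return {ip: (old_mac, new_mac)} for entries that changed or vanished."""
    old, new = parse_arp(before), parse_arp(after)
    return {ip: (mac, new.get(ip)) for ip, mac in old.items() if new.get(ip) != mac}


# Hypothetical snapshots; none of these values come from the real network.
before = """
Interface: 10.0.2.30 --- 0xb
  Internet Address      Physical Address      Type
  10.0.2.1              aa-aa-aa-aa-aa-01     dynamic
  10.0.2.20             00-11-22-33-44-55     dynamic
"""
after = """
Interface: 10.0.2.30 --- 0xb
  Internet Address      Physical Address      Type
  10.0.2.1              aa-aa-aa-aa-aa-01     dynamic
  10.0.2.20             66-77-88-99-AA-BB     dynamic
"""

print(changed_entries(before, after))
# {'10.0.2.20': ('00-11-22-33-44-55', '66-77-88-99-aa-bb')}
```

An empty result would suggest the ARP entries on that machine are stable and the cause lies elsewhere.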

Answer

So, for simplicity's sake, if we're talking about the routing too, I need to include the default gateway for the network on the node B side as well:

Node A --- FWa --- Node B
                |
                -- FWb --- WAN

Basically everything in the network of node B has the default gateway of FWb, which talks to the WAN infrastructure. So any ping going from B to A passes by FWb first (which has a route telling the network A is on is reachable via FWa, so traffic is redirected to FWa). Any ping from A to B is passed directly from FWa to node B since they're on the same network.

That said...

Initiated a perpetual ping on node A to node B. No issues, no timeouts.
At that time I also ran a tracert on A to the IP destination of B:

Tracing route to Backup.domain.local [x.x.x.x]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  y.y.y.y
  2    <1 ms    <1 ms    <1 ms  Backup.domain.local [x.x.x.x]

Trace complete.

y.y.y.y is the IP address configured on the side of node A on FWa, and x.x.x.x is the destination address of node B.

So, node A talks to FWa, which talks directly to node B.

When initiating a tracert on node B to node A, I get the following result (at hop 1 node A reports its ping resulting in a 'request timed out').

Tracing route to Server [z.z.z.z]
over a maximum of 30 hops:

  1     *        *        *     Request timed out.
  2    <1 ms    <1 ms    <1 ms  w.w.w.w
  3    <1 ms    <1 ms    <1 ms  Server [z.z.z.z]

Trace complete.

The first hop is pointed at FWb, which isn't responding, but is able to point at FWa.
w.w.w.w is the gateway address on the side of node B on FWa.
z.z.z.z is the destination IP of node A.

The routing table on node A has NO knowledge of the network node B resides in, so it uses the default gateway (y.y.y.y in the above results) to point its traffic at. Which is FWa.

The routing table on node B has NO knowledge of the network node A resides in, so this also uses the default gateway for its known network... Which is FWb.

We did try to alter the routing table on node B to point directly at FWa for the network of node A. This cuts out the additional hop on the default gateway when going from node B to node A.

route add z.z.z.0 mask 255.255.255.0 w.w.w.w
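The effect of that static route on node B's next-hop choice can be sketched as a longest-prefix-match lookup. This is only an illustration in Python: the real addresses are masked above, so every address below is a made-up stand-in (10.0.1.0/24 for node A's network, 10.0.2.1 for w.w.w.w, 10.0.2.254 for FWb), and `next_hop` is a hypothetical helper, not how Windows implements its routing table.

```python
import ipaddress


def next_hop(routes, destination):
    """Pick the gateway of the most specific matching route (longest prefix)."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in routes if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)[1]


# Made-up stand-ins for the masked addresses in this thread.
NODE_A = "10.0.1.10"       # stands in for z.z.z.z
FWA_B_SIDE = "10.0.2.1"    # stands in for w.w.w.w (FWa on node B's side)
FWB = "10.0.2.254"         # default gateway of node B's network

# Node B before the change: only its own network and the default route.
routes_b = [
    (ipaddress.ip_network("0.0.0.0/0"), FWB),
    (ipaddress.ip_network("10.0.2.0/24"), "on-link"),
]
print(next_hop(routes_b, NODE_A))  # 10.0.2.254 (via FWb)

# Equivalent of: route add z.z.z.0 mask 255.255.255.0 w.w.w.w
routes_b.append((ipaddress.ip_network("10.0.1.0/24"), FWA_B_SIDE))
print(next_hop(routes_b, NODE_A))  # 10.0.2.1 (straight to FWa)
```

The /24 route wins over the default route because it is more specific, which is why the extra hop through FWb disappears from the trace.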

The tracert on node A remains the same (once the ping resumed). The tracert on node B to node A now reads:

Tracing route to Server [z.z.z.z]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  w.w.w.w
  2    <1 ms    <1 ms    <1 ms  Server [z.z.z.z]

Trace complete.

So it cuts out the FWb firewall as expected, but again on the first hop here, node A starts telling me the request timed out on the ping.

Answer

As an addendum... Node B is normally reachable over the WAN (behind FWb) and by all servers in its own network. So it's just the connection with node A that causes headaches.

Answer

Does nobody have any insight into why this issue might be occurring? :(
