Exchange 2016 CU22 - MS Exchange Replication service terminating constantly

henryl 31 Reputation points
2022-01-07T15:29:49.063+00:00

Hello, we have a problem on both DAG members: the Microsoft Exchange Replication service crashes about 2-3 times per hour with event ID 4399:

The Microsoft Exchange Replication service is terminating without collecting a Watson dump. Error The Microsoft Exchange Replication service is using too much memory (2704.921875 MiB) and will be terminated without collecting a Watson dump. This exceeds the maximum expected value of 2176 MiB. To take a dump along with the Watson report, set registry key 'EnableWatsonDumpOnTooMuchMemory' to 1. Default can be overridden by setting registry key MaximumProcessPrivateMemoryMB, MemoryLimitBaseInMB, MemoryLimitPerDBInMB. EffectiveLimit=Min(MaximumProcessPrivateMemoryMB, MemoryLimitBaseInMB + nCopies * MemoryLimitPerDBInMB).
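
To see how often the service is actually being terminated, the 4399 events can be counted per hour. This is only a minimal sketch: it assumes the events are written to the Application log, and the 24-hour window is an arbitrary choice, neither of which is stated in the event text.

```powershell
# Assumption: event ID 4399 is logged in the Application log on each DAG member.
$since = (Get-Date).AddHours(-24)   # arbitrary look-back window

Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 4399; StartTime = $since } -ErrorAction SilentlyContinue |
    Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH:00') } |   # bucket events by hour
    Sort-Object Name |
    Select-Object Name, Count
```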

The reported "too much memory" value is usually in the 2180-2800 MiB range. We found that our issue matches this Exchange 2013 TechNet article (where it was then solved by the next CU):

https://social.technet.microsoft.com/Forums/windowsserver/en-US/59a9de69-01e9-436d-95bb-ff1126c0dd6e/ms-exchange-replication-service-terminating-constantly?forum=exchangesvravailabilityandisasterrecovery

Both the TechNet article and the event description itself suggest the same thing: add the new registry values MaximumProcessPrivateMemoryMB and MemoryLimitBaseInMB to raise the default limit. The event description also implies that MemoryLimitPerDBInMB may be relevant.

So we first set the MaximumProcessPrivateMemoryMB and MemoryLimitBaseInMB limits: no change.
Then I tried to work out the default/current MemoryLimitPerDBInMB limit, but the quoted formula "EffectiveLimit=Min(MaximumProcessPrivateMemoryMB, MemoryLimitBaseInMB + nCopies * MemoryLimitPerDBInMB)" is probably wrong and should in fact be "EffectiveLimit=Min(MaximumProcessPrivateMemoryMB, MemoryLimitBaseInMB, nCopies * MemoryLimitPerDBInMB)" (?); otherwise it yields a negative value for the per-DB parameter.

I think our situation was:

MaximumProcessPrivateMemoryMB 8192 (extra added as suggested in E2k13 KB)
MemoryLimitBaseInMB 4096 (extra added as suggested in E2k13 KB)
MemoryLimitPerDBInMB ? (unknown/not set up yet)
EffectiveLimit 2176 MiB = 2281.701376 MB (as event said)
nCopies 4 (we have 4x DBs)

So the modified formula says: 2281.701376 = Min(8192, 4096, 4 * MemoryLimitPerDBInMB), which means MemoryLimitPerDBInMB is 570.425344 MB (544 MiB).
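
The arithmetic can be re-checked with a few lines of PowerShell. This is only a sketch: the values are the ones listed in this post, the MiB-to-MB conversion follows the post, and the "modified" formula is my hypothesis rather than documented behaviour. The variable names are made up for illustration.

```powershell
# Values from the post (registry values added by hand, limit taken from the event text).
$maximumProcessPrivateMemoryMB = 8192
$memoryLimitBaseInMB           = 4096
$nCopies                       = 4
$effectiveLimitMiB             = 2176

# Convert MiB (2^20 bytes) to MB (10^6 bytes), as done in the post.
$effectiveLimitMB = [double]$effectiveLimitMiB * 1048576 / 1000000

# Formula as printed in the event: Min(Max, Base + nCopies * PerDB).
# The limit is below Max, so the second term must equal it; solving for PerDB goes negative.
$perDbPrinted = ($effectiveLimitMB - $memoryLimitBaseInMB) / $nCopies

# Hypothetical modified formula: Min(Max, Base, nCopies * PerDB).
# The limit is below both Max and Base, so the third term must equal it.
$perDbModifiedMB  = $effectiveLimitMB / $nCopies
$perDbModifiedMiB = $effectiveLimitMiB / $nCopies

"Printed formula  -> PerDB = $perDbPrinted MB"
"Modified formula -> PerDB = $perDbModifiedMB MB ($perDbModifiedMiB MiB)"
```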

So we also added a further (hopefully higher) DWORD registry value, MemoryLimitPerDBInMB = 768, and then restarted the service (and the server, to be sure).

But nothing has actually changed: the replication service still crashes with the "This exceeds the maximum expected value of 2176 MiB" error message...
Can anyone help, please? Do those registry settings really control the "Microsoft Exchange Replication" service's parameters in Exchange 2016 as they did in Exchange 2013? Of course, we may be on the wrong track entirely... In general, the first problem is why the service requires so much memory, and the second is how to fix it.

FYI: We are currently facing a bigger issue than just an occasionally crashing replication service. Because of the crashes we cannot create new DAG database copies for the larger databases: the process simply cannot seed xxx GB before the service's next crash, which happens on both the sending and the receiving DAG member. Only 2 of the 4 databases, the smaller ones, have been transferred within those short 20-30 minute windows; they now work normally in synchronized active/passive mode in the DAG.

Thank you!

Exchange | Exchange Server | Management

Accepted answer
  1. Andy David - MVP 157.9K Reputation points MVP Volunteer Moderator
    2022-01-07T17:14:16.367+00:00

2 additional answers

  1. henryl 31 Reputation points
    2022-01-10T01:26:21.023+00:00

    Hello, many thanks for your quick and relevant tip. Errors and warnings "Probe result (Name=ClusterNetworkProbe/MSExchangeRepl)" are (were) indeed being logged.
    Please note that I have just resolved this issue.

    In the meantime I found that:

    • Windows cluster manager reports the cluster as "failed" and it cannot be repaired
    • the "Get-DatabaseAvailabilityGroupNetwork" command hangs every time (i.e. it fails after a timeout)
    • the DAG configuration in ECP is not available either
    • "Get-HealthReport -Identity <server> | Where-Object {$_.AlertValue -match "Unhealthy"}" also marks Clustering as "Unhealthy"
    • finally, "Get-ClusterNetwork" returns 3 cluster networks, all with only the "Cluster" role and none with "ClusterAndClient"

    So something was clearly wrong with the DAG/cluster networking.
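
The cluster-side checks from the list above can be sketched roughly as follows. The server name is a hypothetical placeholder, and the sketch assumes it is run on a DAG member with the FailoverClusters module available and the Exchange cmdlets loaded (Exchange Management Shell).

```powershell
# Hypothetical server name; replace with one of your DAG members.
$server = 'EX01'

Import-Module FailoverClusters

# List cluster networks with their roles. In the broken state described above,
# all three networks showed the "Cluster" role and none showed "ClusterAndClient".
Get-ClusterNetwork | Format-Table Name, State, Role, Address, AddressMask -AutoSize

# Show only the health sets currently reported as unhealthy.
Get-HealthReport -Identity $server | Where-Object { $_.AlertValue -match 'Unhealthy' }
```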

    What I did to resolve the issue completely:

    1. Verified that we use the MAPI, replication and loopback interfaces in a standard configuration = OK
    2. However, there is an additional "loopback" interface used to receive load-balanced traffic based on MAC address. It is connected to the same VLAN and uses an IP address from the same range as the MAPI interface, but it had a different (lower) subnet mask. As I found out, the automatic cluster network setup used since Exchange 2013 really does not like this.
    3. So I changed the loopback interface's network mask to match the MAPI interface's wide mask. The output of "Get-ClusterNetwork" was automatically corrected to only 2 cluster networks, with the MAPI (and loopback) one assigned the ClusterAndClient role. But nothing else was OK at this point.
    4. After I temporarily changed the DAG IP to 255.255.255.255 and then back to the valid fixed management IP, the cluster was brought online in Windows cluster manager. The DAG configuration was still unavailable and the MS Exchange Replication service was still crashing at least every 30 minutes with event ID
    5. I finally found that the "Set-DatabaseAvailabilityGroup <DAGname> -DiscoverNetworks" command is required to re-read the changed network configuration.
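
Step 5, followed by a quick check of the result, might look like the sketch below. The DAG name is a hypothetical placeholder; the commands are assumed to run in the Exchange Management Shell on a DAG member with the FailoverClusters module available.

```powershell
# Hypothetical DAG name; replace with your own.
$dagName = 'DAG1'

# Force the DAG to re-read the members' network configuration.
Set-DatabaseAvailabilityGroup -Identity $dagName -DiscoverNetworks

# Check the cluster side afterwards: per step 3, two networks are expected,
# with the MAPI one assigned the ClusterAndClient role.
Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask -AutoSize
```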

    Since then I can see the DAG configuration and the MS Exchange Replication service no longer crashes :-)
    I am finally able to fully update the failed copies of our xxx GB databases.
    I just have to remove that loopback interface completely from the cluster network later.

    Thank you again

  2. henryl 31 Reputation points
    2022-01-17T14:16:21.373+00:00

    Hi, sure, I'm going to close this issue. I just could not find how to mark my own answer as the resolution :-)

    To be complete: as I wrote, replication was working well without replication service crashes; only clustering reported a "misconfigured" MAPI interface. That was because the loopback and MAPI interfaces had been registered under the same network during the emergency a week ago. So I made the changes listed below online, and everything is now working 100% well:

    1. Turn off automatic network configuration in the DAG: Set-DatabaseAvailabilityGroup <DAG name> -ManualDagNetworkConfiguration $true
    2. Set the final /32 network mask on the third server interface (loopback), which is unwanted in cluster networking
    3. Refresh the DAG network configuration to reflect the changes made to networking in the OS: Set-DatabaseAvailabilityGroup <DAG name> -DiscoverNetworks
    4. Finally, based on the Get-DatabaseAvailabilityGroupNetwork result, everything is OK:
        • only two cluster networks are registered for the DAG, used for MAPI (1x) and Replication
        • no "misconfigured" interface
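
Put together, steps 1, 3 and 4 could be run roughly like this. It is a sketch only: the DAG name is a hypothetical placeholder, step 2 (the /32 mask on the loopback interface) is done in the OS and is not shown, and the selected output properties are just one reasonable choice for spotting a misconfigured subnet.

```powershell
# Hypothetical DAG name; replace with your own.
$dagName = 'DAG1'

# Step 1: switch the DAG from automatic to manual network configuration.
Set-DatabaseAvailabilityGroup -Identity $dagName -ManualDagNetworkConfiguration $true

# Step 2 (set the /32 mask on the loopback interface) is performed in the OS beforehand.

# Step 3: re-read the network configuration after the OS-level change.
Set-DatabaseAvailabilityGroup -Identity $dagName -DiscoverNetworks

# Step 4: review the DAG networks, their subnets and what each network is used for.
Get-DatabaseAvailabilityGroupNetwork |
    Format-List Identity, Subnets, MapiAccessEnabled, ReplicationEnabled
```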

    Thank you again
