Expert Review Requested: Root Cause Analysis of AD Replication & LDAP Failure in Split-Brain DNS Environment

Catherine Reilly 40 Reputation points
2026-04-05T03:33:05.38+00:00

The Incident Summary

Our environment uses a Split-Brain DNS setup (AD Domain: corp.com; Public Website: corp.com). We experienced an outage where internal users could not access the public website or perform LDAP password resets. I am seeking a technical verdict on two conflicting theories of the root cause.

Theory A (The Infrastructure Theory):

  1. SOA Corruption: A Wireshark capture showed the SOA Primary Server was pointing to a Pure Storage IP (10.X.X.X) instead of a Domain Controller.
  2. Scavenging Timer: This occurred exactly 7 days before the outage. Because the "Master" was a storage array, DCs could not refresh their records. After the 7-day "No-Refresh" window, the DNS Scavenger deleted the SRV and _msdcs CNAME records.
  3. The "Fix": Toggling the "Register DNS" setting on a NIC appeared to fix the issue only because it triggered a Netlogon service restart, which manually "kicked" the records back into the table temporarily.

Theory B (The NIC/GPO Theory):

  1. NIC Settings: Unchecking "Register this connection's addresses in DNS" on a server NIC caused the LDAP/SRV records to vanish.
  2. Public A-Record: Adding a manual A-record at the zone root (corp.com -> Public IP) broke AD replication and SRV registration.
  3. Firewall: The switch from SSL_TLS1.3 to SSL in the FortiGate logs proves the firewall was resetting connections.

Questions for Microsoft MVPs & Engineers:

  • Question 1 (The SOA): If the SOA Primary Server points to a non-DNS IP (10.X.X.X), will AD-integrated DNS servers fail to replicate and refresh their own SRV/CNAME records?
  • Question 2 (The 7-Day Window): Does a 7-day delay between an SOA change and an LDAP failure align with the standard Windows DNS Scavenging/Aging cycle?
  • Question 3 (The NIC Toggle): Does the "Register this connection's addresses" checkbox on a member server's NIC have any technical authority over the Netlogon service's ability to register forest-wide SRV records?
  • Question 4 (The A-Record): In a split-brain corp.com environment, does adding a root A-record for the public website interfere with or "delete" the _ldap._tcp SRV records?
  • Question 5 (TCP Resets): If a client hits a Storage Array IP on Port 443 (due to bad DNS), would the resulting SYN-ACK-RST behavior and a "downgrade" to a generic SSL log in a firewall be expected?

Conclusion:

I contend that Theory A is the only technically sound explanation and that Theory B confuses symptoms (scavenging, service restarts, and handshake failures) with the root cause. I need an expert opinion to confirm if the SOA corruption is the "Smoking Gun."

The Incident Summary

Our environment uses a Split-Brain DNS setup (AD Domain: corp.com; Public Website: corp.com). We experienced a 2-hour outage where internal users could not access the public website or perform LDAP password resets. I am seeking a technical verdict on two conflicting theories of the root cause.

Theory A (The Infrastructure Theory):

  1. SOA Corruption: A Wireshark capture showed the SOA Primary Server was pointing to a Pure Storage IP (10.60.1.80) instead of a Domain Controller.
  2. Scavenging Timer: This occurred exactly 7 days before the outage. Because the "Master" was a storage array, DCs could not refresh their records. After the 7-day "No-Refresh" window, the DNS Scavenger deleted the SRV and _msdcs CNAME records.
  3. The "Fix": Toggling the "Register DNS" setting on a NIC appeared to fix the issue only because it triggered a Netlogon service restart, which manually "kicked" the records back into the table temporarily.

Theory B (The NIC/GPO Theory):

  1. NIC Settings: Unchecking "Register this connection's addresses in DNS" on a server NIC caused the LDAP/SRV records to vanish.
  2. Public A-Record: Adding a manual A-record at the zone root (corp.com -> Public IP) broke AD replication and SRV registration.
  3. Firewall: The switch from SSL_TLS1.3 to SSL in the FortiGate logs proves the firewall was resetting connections.


Answer accepted by question author
  1. Tracy Le 5,685 Reputation points Independent Advisor
    2026-04-05T12:52:39.4133333+00:00

    Hi Catherine Reilly,

    Thank you for providing this detailed timeline. Here are the direct answers to your four questions for your incident report:

    1. What caused the initial "Connection Refused"?

    Before any changes were made, internal users typing corp.com were resolving the name to the internal IP addresses of your Domain Controllers (because Active Directory automatically registers DC IPs at the root of the domain). Since your DCs do not host your public website, hitting their IPs on port 80 or 443 naturally resulted in a "Connection Refused" or an unconfigured default IIS page.

    2. Were the DNS changes to restore the website a technically sound step?

    No, they were fundamentally destructive to your infrastructure. In an Active Directory environment, the root domain (corp.com) must resolve to the IP addresses of the Domain Controllers. Manually deleting the DC A-records at the root and replacing them with a Public IP breaks the AD locator process. Furthermore, disabling the "Register this connection's addresses in DNS" checkbox on the DCs explicitly prevented them from attempting to self-heal and rebuild their missing records.

    3. Did these changes cause the LDAP issue?

    Yes, absolutely. By deleting the root A-records and disabling the DNS registration toggle, you started a ticking clock. Because AD replication was broken by the missing root records, the DCs could no longer successfully refresh their timestamps. Exactly 7 days later, the default Windows DNS Scavenging cycle saw all your critical LDAP and SRV records as "stale" and aggressively deleted them. The moment the scavenger wiped the table, all LDAP authentication and password resets collapsed.

    4. Are the A-records and "Register DNS" configurations needed?

    Yes. Domain Controllers strictly require the "Register in DNS" toggle to be enabled, and they must own the A-records at the zone root (the entries DNS Manager displays as "(same as parent folder)") to facilitate Kerberos, LDAP, and overall domain health.

    The Solution: To solve the "naked domain" web routing issue (corp.com to www.corp.com) without breaking Active Directory, you cannot alter the root DNS A-records. The industry-standard workaround for this split-brain scenario is to install a lightweight web service (like IIS) directly on your Domain Controllers. You configure this IIS service solely to issue an HTTP 301 Permanent Redirect, forwarding any user who hits corp.com directly to www.corp.com (which then resolves to your public web server).
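
The redirect rule described above can be sketched independently of IIS. The following Python is only an illustration of the decision logic, not an IIS configuration: the function name `redirect_for`, the constants, and the choice of an `https://` target are assumptions that do not come from the thread.

```python
# Hypothetical constants: the thread only says corp.com should forward to www.corp.com.
BARE_DOMAIN = "corp.com"
WWW_HOST = "www.corp.com"


def redirect_for(host_header, path="/"):
    """Return (status, headers) for a request to the bare domain, else None."""
    # Drop an optional ":port" suffix and normalise case and trailing dot.
    host = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    if host != BARE_DOMAIN:
        return None  # Not the naked domain: no redirect applies.
    if not path.startswith("/"):
        path = "/" + path
    # 301 Moved Permanently, preserving the requested path (https is an assumption).
    return 301, {"Location": f"https://{WWW_HOST}{path}"}
```

For example, `redirect_for("corp.com", "/about")` yields a 301 whose `Location` header is `https://www.corp.com/about`, while a request that already targets `www.corp.com` is left alone.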

    I hope this definitively connects the dots between the initial DNS tampering and the resulting LDAP failure for your root cause analysis. If this explanation provides the clarity you needed, please click "Accept Answer". Should you have any questions on how to safely set up the IIS redirect, feel free to leave a comment!


3 additional answers

  1. Tracy Le 5,685 Reputation points Independent Advisor
    2026-04-05T17:30:25.0966667+00:00

    Hi Catherine Reilly,

    If your IT team is absolutely certain there was no IIS redirect configured on the Domain Controllers, then by the strict laws of Active Directory, internal queries for the naked domain corp.com have always resolved to the IP addresses of your Domain Controllers. Therefore, the website "working" previously was actually a fragile illusion. Here are the three most likely technical reasons why users were able to reach the site historically, and why it suddenly resulted in a mass "Connection Refused" event:

    1. Browser Auto-Fallback (The Most Likely Culprit)

    For years, modern browsers (like Chrome and Edge) have employed invisible fallback mechanisms. If a user typed corp.com, the DNS resolved to your DCs, and the connection silently timed out on port 80/443, the browser would automatically append www. and try again in the background, successfully loading the site. If a recent network change, an endpoint security update, or a Windows Firewall policy started actively rejecting the connection (sending immediate TCP RST packets) instead of letting it silently drop, it would instantly break the browser's auto-fallback mechanism, throwing a hard "Connection Refused" error to the user.
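
The difference between an active rejection (a TCP RST) and a silent drop can be checked from any client. Below is a minimal Python sketch using only the standard library; the function name `classify_connect` is illustrative and is not part of any tool mentioned in this thread. It says nothing about how a particular browser reacts to either outcome.

```python
import socket


def classify_connect(host, port, timeout=2.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # Three-way handshake completed.
    except ConnectionRefusedError:
        # The SYN was answered with a RST (by the host or a device in the path).
        return "refused"
    except socket.timeout:
        # No answer within the timeout: the SYN was silently dropped.
        return "timeout"
```

Other failures, such as an unreachable network, are deliberately not caught here and surface as `OSError`. Running the check against a DC IP on port 443 before and after a firewall policy change would show whether the behavior moved from "timeout" to "refused".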

    2. DNS Round Robin (The "Ghost" Record)

    It is highly possible there was previously a static A-record for corp.com pointing to your public web server sitting alongside the dynamic Domain Controller IPs in your DNS zone. In this DNS Round Robin scenario, clients would randomly get either the web server IP (and connect successfully) or the DC IP (and fail or silently retry). If that single static web record was accidentally deleted, expired, or scavenged, 100% of user traffic suddenly began hitting the DCs, triggering the immediate wave of helpdesk tickets.
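
The round-robin effect can be put into numbers with a small sketch. It assumes clients pick uniformly among the addresses returned for the root name, which real resolvers and browsers may not do; all IP addresses are hypothetical placeholders.

```python
from fractions import Fraction

# Hypothetical addresses, for illustration only.
WEB_SERVER = "203.0.113.10"
DC_ADDRESSES = ["10.0.0.11", "10.0.0.12"]


def web_hit_fraction(root_records, web_ip=WEB_SERVER):
    """Fraction of root A-record answers that point at the web server,
    assuming a uniform pick among the returned addresses."""
    if not root_records:
        return Fraction(0)
    return Fraction(root_records.count(web_ip), len(root_records))
```

With one static web record next to two DC records, `web_hit_fraction([WEB_SERVER] + DC_ADDRESSES)` is 1/3; once the static record is gone, `web_hit_fraction(DC_ADDRESSES)` is 0 and every attempt lands on a DC.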

    3. Proxy or VPN Routing Changes

    If your internal users previously routed their web traffic through a proxy server, a PAC file, or a specific security tunnel (like Zscaler), that proxy was likely resolving corp.com using external public DNS, bypassing your internal DCs entirely. If a network engineer made a routing change or added corp.com to a proxy bypass list, internal clients were suddenly forced to resolve the name against your internal AD DNS. They hit the DCs directly, resulting in the immediate "Connection Refused."

    The Verdict:

    The "initial issue" was not a new failure, but rather the sudden exposure of a long-standing architectural flaw in your split-brain setup. When a background safety net (like browser fallback or proxy routing) finally broke, the naked domain traffic hit your DCs as it always theoretically should have. The IT team's reaction to "fix" this exposed flaw by deleting the root DC records is what started the 7-day countdown to your LDAP collapse.

    I hope this final piece of the puzzle gives you the complete, end-to-end picture for your incident report.


  2. Tracy Le 5,685 Reputation points Independent Advisor
    2026-04-05T06:29:47.68+00:00

    Hi Catherine Reilly.

    Tackling a split-brain DNS outage intertwined with Active Directory replication failures is notoriously complex, as symptoms often masquerade as root causes. I have reviewed both of your proposed theories. My verdict: neither Theory A nor Theory B is entirely correct, because both contain fundamental misunderstandings of how Windows Server handles Active Directory-integrated DNS.

    Here is the technical breakdown of why the "SOA Corruption" is not your smoking gun, and what actually caused your outage.

    1. The SOA Myth (Refuting Theory A)

    In a traditional, file-backed DNS zone, the SOA (Start of Authority) Primary Server is critical because it dictates where dynamic updates are sent and manages zone transfers (AXFR/IXFR). However, in an AD-integrated DNS zone, this architectural rule changes completely. Domain Controllers do not rely on the SOA MNAME to update their records. Instead, the Netlogon service on each DC writes its DC locator records (_ldap._tcp, _msdcs SRV, and CNAME records) directly into the Active Directory partition database, which then replicates via AD RPC topology. Therefore, an SOA pointing to a Pure Storage IP (10.X.X.X) will absolutely not prevent DCs from registering or refreshing their own SRV records.

    2. The NIC Checkbox & Netlogon (Refuting Theory B)

    Unchecking "Register this connection's addresses in DNS" on a server's NIC only disables the DNS Client service from registering the server's basic Host (A) record. It has absolutely zero technical authority over the Netlogon service, which is solely responsible for registering the forest-wide SRV and CNAME records required for LDAP and AD replication. Toggling this checkbox did not fix your issue; it was merely a coincidence if you rebooted the server or restarted services during the process, which forced Netlogon to re-register the missing SRV records.

    3. The Root A-Record & TCP Resets (The Real Catalyst)

    In a split-brain environment (corp.com), manually adding an A-record at the zone root pointing to a Public IP does not delete existing SRV records, but it does break domain location for any process that relies on querying the domain name directly. When clients or other servers queried corp.com expecting a Domain Controller, they were redirected to your Public IP (or the Storage Array). Because they hit a web server or storage appliance on Port 443 instead of an Active Directory Domain Controller, the TLS handshakes failed, resulting in the SYN-ACK-RST behavior and SSL downgrades you saw in your FortiGate firewall logs.

    4. The 7-Day Scavenging Window (The Consequence)

    The 7-day interval perfectly aligns with the default Windows DNS Scavenging "No-Refresh" interval. Because your split-brain DNS was misconfigured (likely due to the root A-record redirecting traffic away from the DCs), the Domain Controllers experienced DNS lookup failures (often logged as Event ID 2087 or Replication Error 8524). Because AD replication broke down, the DCs could not successfully replicate their record timestamps. Once the 7-day no-refresh window expired, the DNS Scavenger aggressively deleted the stale _msdcs and SRV records, transforming a hidden replication issue into a complete LDAP outage.

    The verdict: The SOA pointing to a storage array was a misconfiguration, but it was a symptom, not the cause. The root cause was a mismanaged split-brain DNS setup where root A-records intercepted domain traffic. This broke internal AD replication, which subsequently prevented DCs from refreshing their timestamps, ultimately leading the DNS Scavenger to systematically delete your LDAP and SRV records after 7 days.

    To prevent this from recurring, you should ideally avoid split-brain DNS by using a dedicated internal subdomain (like internal.corp.com) for Active Directory, completely separating your public website namespace from your AD namespace.

    I hope this deep-dive provides the definitive technical clarity you need for your Incident Report. If this analysis answered your questions, please click "Accept Answer". Should you need further clarification on AD-integrated DNS mechanics, feel free to leave a comment!

    Tracy.


  3. Q&A Assist
    2026-04-05T03:33:42.3766667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    Question 1 (The SOA)

    An incorrect SOA primary server value (for example, pointing to a non‑DNS IP such as a storage array) does not, by itself, prevent AD‑integrated DNS zones from replicating or prevent domain controllers from registering or refreshing their SRV/CNAME records.

    Key points from the context:

    • AD DC locator and replication rely on:
      • CNAME records in the _msdcs.<forest root> zone (GUID-based records for DCs).
      • Host (A) records for DCs.
      • LDAP/GC/PDC SRV records.
    • These records are:
      • Registered by the Netlogon service at OS startup, Netlogon startup, and periodically.
      • Stored in AD‑integrated DNS zones and replicated by AD replication.
    • Failures of replication and DC location are tied to:
      • DNS record registration failures (Netlogon / DNS client issues).
      • DNS replication failures between DNS servers / DCs.
      • Misconfigured DNS scopes, delegations, or forwarders.

    The SOA primary server field is a DNS zone metadata field used for zone transfers and some administrative semantics. In an AD‑integrated zone, DCs hosting the zone do not depend on that SOA MNAME value to:

    • Register their own CNAME/SRV/A records.
    • Replicate the zone via AD replication.

    The context attributes DNS replication and record loss to issues such as:

    • Replication failures between DNS servers.
    • Duplicate zones, wrong replication scopes, or aggressive scavenging.
    • Misconfigured delegations or forwarders.

    It does not state that an incorrect SOA primary server IP alone will block DCs from registering or refreshing SRV/CNAME records.

    So, Theory A’s claim that “because the SOA master was a storage array, DCs could not refresh their records” is not supported by the provided material.
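
Whatever its role in the outage, an SOA primary server that is not a DNS-hosting DC is worth flagging during a health check. Here is a small sketch of that comparison, assuming the SOA MNAME and the list of DC host names have already been collected by other means; the host names in the usage note are hypothetical.

```python
def normalize(name):
    """Lower-case a DNS name and drop surrounding whitespace and the trailing dot."""
    return name.strip().rstrip(".").lower()


def soa_mname_is_known_dc(mname, dc_hostnames):
    """True if the SOA primary server name matches one of the known DC host names."""
    return normalize(mname) in {normalize(h) for h in dc_hostnames}
```

For instance, `soa_mname_is_known_dc("DC01.corp.com.", ["dc01.corp.com"])` is true, while a storage array name such as `"pure01.corp.com"` would be reported as unexpected.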


    Question 2 (The 7‑Day Window)

    The context confirms that:

    • DNS scavenging and aging can remove critical DC locator records (CNAME, SRV, A) when they are not refreshed.
    • When such records disappear, AD replication and DC location fail with DNS lookup errors (for example, replication error 8524, Event ID 2087).

    However, the context does not specify a 7‑day default or standard Windows DNS aging/scavenging interval. It only notes:

    • Tombstone lifetime for AD objects (for example, 14 days for NTDS Settings object revival) in the context of AD replication, not DNS scavenging.
    • That overly aggressive DNS scavenging can cause DC CNAME records to disappear and lead to replication failures.

    Therefore, the general mechanism “records not refreshed → scavenging removes them → replication/LDAP failures” is supported, but a specific 7‑day timing correlation between an SOA change and LDAP failure is not documented in the provided material.
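
The timing question can be worked through with simple date arithmetic. The sketch below relies on two assumptions that are not taken from the provided context and should be checked against the actual zone settings: that the no-refresh and refresh intervals are both 7 days, and that a record only becomes eligible for scavenging once both intervals have elapsed since its timestamp. It also ignores that deletion happens on the next scavenging run, not at the instant of eligibility.

```python
from datetime import datetime, timedelta

# Assumed interval values; verify the real aging settings on the zone and server.
NO_REFRESH = timedelta(days=7)
REFRESH = timedelta(days=7)


def scavenge_eligible_at(last_timestamp, no_refresh=NO_REFRESH, refresh=REFRESH):
    """Earliest time a dynamic record becomes eligible for scavenging."""
    return last_timestamp + no_refresh + refresh


def is_stale(last_timestamp, now, no_refresh=NO_REFRESH, refresh=REFRESH):
    """True once both aging intervals have elapsed since the record's timestamp."""
    return now >= scavenge_eligible_at(last_timestamp, no_refresh, refresh)
```

Under these assumptions a record last stamped on 2026-03-01 is not yet stale 7 days later and becomes eligible on 2026-03-15, so a 7-day gap would only match if the zone uses shorter intervals or the records had already gone unrefreshed before the SOA change.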


    Question 3 (The NIC Toggle)

    The context clearly separates two registration mechanisms:

    • DNS Client service:
      • Registers host (A) records for the machine’s IP.
      • Can be forced with ipconfig /registerdns.
    • Netlogon service on a domain controller:
      • Registers DC locator records (SRV, CNAME, etc.) required for domain controller location.
      • Can be forced with net stop netlogon / net start netlogon.

    From the context:

    • “The Net Logon service on a domain controller registers the DNS resource records that are required for the domain controller to be located on the network. The DNS Client service registers the host (A) resource record that the alias (CNAME) record points to.”
    • Restarting Netlogon is the supported way to force registration of DC locator records.

    The “Register this connection’s addresses in DNS” checkbox controls whether the DNS Client service registers the NIC’s A record. It does not control Netlogon’s ability to register forest‑wide SRV/CNAME records for domain controllers.

    So, the technically accurate mapping is:

    • NIC checkbox → affects DNS Client A‑record registration.
    • Netlogon service → registers SRV/CNAME/DC locator records.

    If toggling the NIC setting coincided with a Netlogon restart, that could explain temporary restoration of DC locator records, but the checkbox itself has no documented “authority” over Netlogon’s SRV registration.


    Question 4 (The A‑Record at Zone Root in Split‑Brain corp.com)

    The context describes several DNS record types that are critical for AD:

    • _msdcs.<forest root> CNAME records for DCs.
    • _ldap._tcp and other SRV records for DC/GC/PDC location.
    • Host (A) records for DCs.

    It also notes:

    • Non‑SRV‑aware clients may query the domain’s A record (DomainDNSName) to locate a DC, but SRV‑aware clients use SRV records.
    • Missing DomainDNSName A records can cause certain lookups to fall back to SOA responses, but these A records are not strictly required for SRV‑aware clients.

    The context does not state that adding a root A record for the zone (for example, corp.com → public IP) will:

    • Delete _ldap._tcp SRV records.
    • Prevent their registration.

    SRV records live in their own subdomains (for example, _ldap._tcp.corp.com), and _msdcs CNAMEs live under _msdcs.<forest root>. A root A record at corp.com coexists with these and does not overwrite them.
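
To make the coexistence point concrete, the sketch below builds a few of the well-known DC locator owner names for a domain and checks whether any of them shares an owner name with the zone root. The list is a partial illustration, not the full set Netlogon registers, and the function names are hypothetical.

```python
from typing import List, Optional


def dc_locator_names(domain: str, forest_root: Optional[str] = None) -> List[str]:
    """Return a few well-known DC locator SRV owner names for a domain."""
    forest = forest_root or domain  # Single-domain forest unless told otherwise.
    return [
        f"_ldap._tcp.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_ldap._tcp.pdc._msdcs.{domain}",
        f"_ldap._tcp.gc._msdcs.{forest}",
    ]


def shares_owner_name_with_root(domain: str) -> bool:
    """True if any locator record has the same owner name as the zone root."""
    return domain in dc_locator_names(domain)
```

For `corp.com` every generated name sits below the root (for example `_ldap._tcp.corp.com`), so `shares_owner_name_with_root("corp.com")` is false: an A record at `corp.com` occupies a different owner name than the SRV records.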

    What a root A record can affect is:

    • How non‑SRV‑aware clients resolve corp.com when trying to locate a DC.
    • Internal name resolution behavior in split‑brain scenarios if internal and external views are misconfigured.

    But there is no indication in the context that a root A record inherently breaks AD replication or causes SRV records to vanish.


    Question 5 (TCP Resets to a Storage Array IP)

    The context focuses on DNS and AD replication/locator issues and does not describe TCP handshake behavior or firewall logging semantics.

    It does, however, state that when DNS is misconfigured:

    • Clients or DCs may query the wrong DNS servers or zones.
    • DNS lookup failures lead to replication errors (for example, 8524, Event ID 2087).
    • Network traces can show unexpected DNS responses (for example, SOA instead of A records).

    There is no explicit description of:

    • How a storage array would respond on TCP 443.
    • Whether SYN‑ACK‑RST patterns or SSL/TLS “downgrades” in firewall logs are expected.

    Therefore, from the provided material, no conclusion can be drawn about the specific TCP reset and SSL logging behavior.


    Overall alignment with Theory A vs. Theory B

    Supported by the context:

    • DNS misconfiguration and missing DC locator records (CNAME, SRV, A) are primary causes of AD replication and LDAP/DC‑location failures.
    • Netlogon restarts and forced DNS registrations (net stop netlogon / net start netlogon, ipconfig /registerdns) can temporarily restore functionality.
    • Overly aggressive or misconfigured DNS scavenging, or DNS replication issues, can cause critical records to disappear.
    • Split‑brain DNS misconfiguration can break SSO/AD FS behavior and internal vs. external name resolution.

    Not supported by the context:

    • An incorrect SOA primary server IP, by itself, blocking DCs from refreshing SRV/CNAME records or causing AD‑integrated DNS replication to fail.
    • A root A record at corp.com inherently deleting or blocking _ldap._tcp SRV records.
    • The NIC “Register this connection’s addresses in DNS” checkbox directly controlling Netlogon’s SRV registration.
    • Any specific 7‑day timing guarantee tied to SOA changes.

    The context points to DNS record registration and DNS/AD replication health (including scavenging and replication scope issues) as the real technical levers. SOA corruption may be a symptom of broader DNS misconfiguration, but it is not documented here as the direct “smoking gun” that prevents SRV/CNAME registration.

