Hello there,

Have you run any hardware diagnostics on your servers?

WHEA stands for Windows Hardware Error Architecture. Some of the main hardware problems that cause machine check exceptions include:

- System bus errors (errors in communication between the processor and the motherboard).
- Memory errors, which may include parity and error correction code (ECC) problems. Error checking ensures that data is stored correctly in RAM; if data is corrupted, random errors occur.
- Cache errors in the processor. The cache stores important data and code; if it is corrupted, errors often occur.

A detailed description is available here: https://support.microsoft.com/en-us/windows/how-to-fix-whea-uncorrectable-error-7c49d78a-2792-96cf-2268-abbe9d9eb29f

Hope this resolves your query!

--If the reply is helpful, please Upvote and Accept it as an answer--
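To narrow down which kind of machine check was reported, the two MCi_STATUS halves shown as Arg3 and Arg4 in the bugcheck output can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout and the compound cache-hierarchy error-code encoding from the Intel Software Developer's Manual. The function and table names are hypothetical and are not part of WinDbg or any Windows API, and the model-specific bits are left undecoded.

```python
# Sketch: decode the MCi_STATUS value reported by bugcheck 0x124.
# Assumption: architectural IA32_MCi_STATUS layout (Intel SDM, machine-check
# architecture chapter). All names below are illustrative only.

ARG3_HIGH = 0xF2000000  # Arg3: high-order 32 bits of MCi_STATUS
ARG4_LOW = 0x00300189   # Arg4: low-order 32 bits of MCi_STATUS

# Architectural flag bits in the upper part of the register.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Sub-fields of the compound cache-hierarchy code: 000F 0001 RRRR TTLL
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}


def decode_mci_status(high, low):
    """Combine both halves and decode the architectural fields."""
    status = (high << 32) | low
    mca_code = status & 0xFFFF              # bits 15:0, MCA error code
    result = {
        "status": status,
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": mca_code,
        "model_code": (status >> 16) & 0xFFFF,  # bits 31:16, model-specific
    }
    # Ignore the filter bit (bit 12) when matching the cache-hierarchy pattern.
    if mca_code & 0xEF00 == 0x0100:
        result["kind"] = "cache hierarchy"
        result["request"] = REQUEST.get((mca_code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTION.get((mca_code >> 2) & 0x3, "unknown")
        result["level"] = LEVEL.get(mca_code & 0x3, "unknown")
    else:
        result["kind"] = "other (not decoded in this sketch)"
    return result


if __name__ == "__main__":
    for key, value in decode_mci_status(ARG3_HIGH, ARG4_LOW).items():
        print(key, "=", hex(value) if isinstance(value, int) else value)
```

For the values in this dump, the sketch reports the flags VAL, OVER, UC, EN and PCC, and an MCA error code of 0x189, which decodes as a cache hierarchy error (SNOOP request, Generic transaction, level L1). That agrees with the "Cache Hierarchy Error" in WHEA-Logger Event ID 18 and with the PROCESSOR_CACHE failure bucket. ADDRV and MISCV are clear, so this record carries no address. Running `!errrec ffff9805e4b5b028` against a dump that contains the error record should give the authoritative decoding.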
Server Crashes Randomly showing the following errors with WHEA-Logger ID:18 & ID:46
Khaled El Sewedy 20 Reputation points
I have an issue where the server crashes randomly every once in a while, showing the following message:
<UEFI0079: One or more uncorrectable Memory errors occurred in the previous boot.
Check the system Event log (SEL) To identify the non functional DIMM and then replace the DIMM.
UEFI0078: One or more Machine check errors occurred in the previous boot.
Check the system Event log (SEL) To identify the source of the machine check error and resolve the issue.>
I then choose <F1: to continue and retry boot order> and the server starts normally.
------------------------------------------------------------------------------------------------------------------------
When I check the event log I find the following errors:
WHEA-Logger Event ID: 18
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 2
------------------------------------------------------------------------------------------------------------------------
WHEA-Logger Event ID: 46
A fatal hardware error has occurred.
Component: Memory
Error Source: BOOT
------------------------------------------------------------------------------------------------------------------------
My OS: [Microsoft Windows Server 2016 Standard]
Model: [PowerEdge R740xd]
Processor: [Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz, 1796 Mhz, 8 Core(s), 16 Logical Processor(s)]
RAM: 32 GB
------------------------------------------------------------------------------------------------------------------------
Here is a copy of my latest Dump file too:
************* Preparing the environment for Debugger Extensions Gallery repositories **************
ExtensionRepository : Implicit
UseExperimentalFeatureForNugetShare : false
AllowNugetExeUpdate : false
- Configuring repositories
----> Repository : LocalInstalled, Enabled: true
----> Repository : UserExtensions, Enabled: true
************* Waiting for Debugger Extensions Gallery to Initialize **************
.
----> Repository : UserExtensions, Enabled: true, Packages count: 0
----> Repository : LocalInstalled, Enabled: true, Packages count: 36
Microsoft (R) Windows Debugger Version 10.0.25324.1001 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Users\XXX\Desktop\042423-20406-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available
************* Path validation summary **************
Response Time (ms) Location
Deferred srv*
Symbol search path is: srv*
Executable search path is:
Windows 10 Kernel Version 14393 MP (32 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Edition build lab: 14393.1794.amd64fre.rs1_release.171008-1615
Kernel base = 0xfffff802`67476000 PsLoadedModuleList = 0xfffff802`67774040
Debug session time: Tue Apr 4 19:34:16.759 2023 (UTC + 2:00)
System Uptime: 3 days 5:47:31.718
Loading Kernel Symbols
...............................................................
................................................................
.................
Loading User Symbols
Loading unloaded module list
............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff802`675c5790 48894c2408 mov qword ptr [rsp+8],rcx ss:0018:ffffc180`933e2500=0000000000000124
2: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
nt!_WHEA_ERROR_RECORD structure that describes the error condition. Try !errrec Address of the nt!_WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: ffff9805e4b5b028, Address of the nt!_WHEA_ERROR_RECORD structure.
Arg3: 00000000f2000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000300189, Low order 32-bits of the MCi_STATUS value.
Debugging Details:
------------------
KEY_VALUES_STRING: 1
Key : Analysis.CPU.mSec
Value: 2702
Key : Analysis.Elapsed.mSec
Value: 7960
Key : Analysis.IO.Other.Mb
Value: 0
Key : Analysis.IO.Read.Mb
Value: 0
Key : Analysis.IO.Write.Mb
Value: 0
Key : Analysis.Init.CPU.mSec
Value: 1156
Key : Analysis.Init.Elapsed.mSec
Value: 13196
Key : Analysis.Memory.CommitPeak.Mb
Value: 79
Key : Bugcheck.Code.LegacyAPI
Value: 0x124
Key : Failure.Bucket
Value: 0x124_0_GenuineIntel_PROCESSOR_CACHE_IMAGE_GenuineIntel.sys
Key : Failure.Hash
Value: {b70a049a-4a17-5749-b5df-df070316ca7d}
Key : WER.OS.Branch
Value: rs1_release
Key : WER.OS.Version
Value: 10.0.14393.1794
BUGCHECK_CODE: 124
BUGCHECK_P1: 0
BUGCHECK_P2: ffff9805e4b5b028
BUGCHECK_P3: f2000000
BUGCHECK_P4: 300189
FILE_IN_CAB: 042423-20406-01.dmp
CUSTOMER_CRASH_COUNT: 1
PROCESS_NAME: System
STACK_TEXT:
ffffc180`933e24f8 fffff802`6743727f : 00000000`00000124 00000000`00000000 ffff9805`e4b5b028 00000000`f2000000 : nt!KeBugCheckEx
ffffc180`933e2500 fffff802`6769c800 : ffff9805`e4b5b028 ffff9805`e42e27a0 ffff9805`e42e27a0 ffff9805`e42e27a0 : hal!HalBugCheckSystem+0xcf
ffffc180`933e2540 fffff802`6743776c : 00000000`00000728 00000000`00000002 ffffc180`933e2930 00000000`00000000 : nt!WheaReportHwError+0x258
ffffc180`933e25a0 fffff802`67437ac4 : ffff9805`00000010 ffff9805`e42e27a0 ffffc180`933e2748 ffff9805`e42e27a0 : hal!HalpMcaReportError+0x50
ffffc180`933e26f0 fffff802`674379ae : ffff9805`e3303160 00000000`00000001 00000000`00000002 00000000`00000000 : hal!HalpMceHandlerCore+0xe8
ffffc180`933e2740 fffff802`67437bee : 00000000`00000020 00000000`00000001 00000000`00000000 00000000`00000000 : hal!HalpMceHandler+0xda
ffffc180`933e2780 fffff802`67437d70 : ffff9805`e3303160 ffffc180`933e29b0 00000000`00000000 00000000`00000000 : hal!HalpMceHandlerWithRendezvous+0xce
ffffc180`933e27b0 fffff802`675cf6fb : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalHandleMcheck+0x40
ffffc180`933e27e0 fffff802`675cf484 : 00000000`00000000 fffff802`675cf403 00000000`00000000 00000000`00000000 : nt!KxMcheckAbort+0x7b
ffffc180`933e2920 fffff808`fdc91348 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiMcheckAbort+0x184
ffffc180`93507198 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : intelppm!MWaitIdle+0x18
MODULE_NAME: GenuineIntel
IMAGE_NAME: GenuineIntel.sys
STACK_COMMAND: .cxr; .ecxr ; kb
FAILURE_BUCKET_ID: 0x124_0_GenuineIntel_PROCESSOR_CACHE_IMAGE_GenuineIntel.sys
OS_VERSION: 10.0.14393.1794
BUILDLAB_STR: rs1_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {b70a049a-4a17-5749-b5df-df070316ca7d}
Followup: MachineOwner
---------
Unfortunately, I don't understand much of the debugging tool's output. I hope someone can help me identify the issue and fix it.
Thank you.
Docs 15,491 Reputation points
2023-05-02T07:13:10.6+00:00
The thread was marked as answered.
What did you find and how did you fix it?