Server crashes randomly with WHEA-Logger Event ID 18 and Event ID 46 errors

Khaled El Sewedy 20 Reputation points
2023-04-24T16:01:46.73+00:00
I have an issue where the server crashes randomly every once in a while and shows the following message on the next boot:
<UEFI0079: One or more uncorrectable Memory errors occurred in the previous boot.
Check the system Event log (SEL) To identify the non functional DIMM and then replace the DIMM.
UEFI0078: One or more Machine check errors occurred in the previous boot. 
Check the system Event log (SEL) To identify the source of the machine check error and resolve the issue.>

I then choose <F1: to continue and retry boot order> and the server starts normally.
------------------------------------------------------------------------------------------------------------------------
When I check the event log I find the following errors:

WHEA-Logger Event ID: 18
A fatal hardware error has occurred.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 2
------------------------------------------------------------------------------------------------------------------------
WHEA-Logger Event ID: 46
A fatal hardware error has occurred.

Component: Memory
Error Source: BOOT
------------------------------------------------------------------------------------------------------------------------
My OS: [Microsoft Windows Server 2016 Standard]
Model: [PowerEdge R740xd]
Processor: [Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz, 1796 Mhz, 8 Core(s), 16 Logical Processor(s)]
RAM: 32 GB
------------------------------------------------------------------------------------------------------------------------
Here is a copy of my latest Dump file too:


************* Preparing the environment for Debugger Extensions Gallery repositories **************
   ExtensionRepository : Implicit
   UseExperimentalFeatureForNugetShare : false
   AllowNugetExeUpdate : false

   - Configuring repositories
      ----> Repository : LocalInstalled, Enabled: true
      ----> Repository : UserExtensions, Enabled: true


************* Waiting for Debugger Extensions Gallery to Initialize **************
.
   ----> Repository : UserExtensions, Enabled: true, Packages count: 0
   ----> Repository : LocalInstalled, Enabled: true, Packages count: 36

Microsoft (R) Windows Debugger Version 10.0.25324.1001 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Users\XXX\Desktop\042423-20406-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available


************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*
Symbol search path is: srv*
Executable search path is: 
Windows 10 Kernel Version 14393 MP (32 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Edition build lab: 14393.1794.amd64fre.rs1_release.171008-1615
Kernel base = 0xfffff802`67476000 PsLoadedModuleList = 0xfffff802`67774040
Debug session time: Tue Apr  4 19:34:16.759 2023 (UTC + 2:00)
System Uptime: 3 days 5:47:31.718
Loading Kernel Symbols
...............................................................
................................................................
.................
Loading User Symbols
Loading unloaded module list
............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff802`675c5790 48894c2408      mov     qword ptr [rsp+8],rcx ss:0018:ffffc180`933e2500=0000000000000124
2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
nt!_WHEA_ERROR_RECORD structure that describes the error condition. Try !errrec Address of the nt!_WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: ffff9805e4b5b028, Address of the nt!_WHEA_ERROR_RECORD structure.
Arg3: 00000000f2000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000300189, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 2702

    Key  : Analysis.Elapsed.mSec
    Value: 7960

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 0

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.CPU.mSec
    Value: 1156

    Key  : Analysis.Init.Elapsed.mSec
    Value: 13196

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 79

    Key  : Bugcheck.Code.LegacyAPI
    Value: 0x124

    Key  : Failure.Bucket
    Value: 0x124_0_GenuineIntel_PROCESSOR_CACHE_IMAGE_GenuineIntel.sys

    Key  : Failure.Hash
    Value: {b70a049a-4a17-5749-b5df-df070316ca7d}

    Key  : WER.OS.Branch
    Value: rs1_release

    Key  : WER.OS.Version
    Value: 10.0.14393.1794


BUGCHECK_CODE:  124

BUGCHECK_P1: 0

BUGCHECK_P2: ffff9805e4b5b028

BUGCHECK_P3: f2000000

BUGCHECK_P4: 300189

FILE_IN_CAB:  042423-20406-01.dmp

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  System

STACK_TEXT:  
ffffc180`933e24f8 fffff802`6743727f     : 00000000`00000124 00000000`00000000 ffff9805`e4b5b028 00000000`f2000000 : nt!KeBugCheckEx
ffffc180`933e2500 fffff802`6769c800     : ffff9805`e4b5b028 ffff9805`e42e27a0 ffff9805`e42e27a0 ffff9805`e42e27a0 : hal!HalBugCheckSystem+0xcf
ffffc180`933e2540 fffff802`6743776c     : 00000000`00000728 00000000`00000002 ffffc180`933e2930 00000000`00000000 : nt!WheaReportHwError+0x258
ffffc180`933e25a0 fffff802`67437ac4     : ffff9805`00000010 ffff9805`e42e27a0 ffffc180`933e2748 ffff9805`e42e27a0 : hal!HalpMcaReportError+0x50
ffffc180`933e26f0 fffff802`674379ae     : ffff9805`e3303160 00000000`00000001 00000000`00000002 00000000`00000000 : hal!HalpMceHandlerCore+0xe8
ffffc180`933e2740 fffff802`67437bee     : 00000000`00000020 00000000`00000001 00000000`00000000 00000000`00000000 : hal!HalpMceHandler+0xda
ffffc180`933e2780 fffff802`67437d70     : ffff9805`e3303160 ffffc180`933e29b0 00000000`00000000 00000000`00000000 : hal!HalpMceHandlerWithRendezvous+0xce
ffffc180`933e27b0 fffff802`675cf6fb     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalHandleMcheck+0x40
ffffc180`933e27e0 fffff802`675cf484     : 00000000`00000000 fffff802`675cf403 00000000`00000000 00000000`00000000 : nt!KxMcheckAbort+0x7b
ffffc180`933e2920 fffff808`fdc91348     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiMcheckAbort+0x184
ffffc180`93507198 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : intelppm!MWaitIdle+0x18


MODULE_NAME: GenuineIntel

IMAGE_NAME:  GenuineIntel.sys

STACK_COMMAND:  .cxr; .ecxr ; kb

FAILURE_BUCKET_ID:  0x124_0_GenuineIntel_PROCESSOR_CACHE_IMAGE_GenuineIntel.sys

OS_VERSION:  10.0.14393.1794

BUILDLAB_STR:  rs1_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {b70a049a-4a17-5749-b5df-df070316ca7d}

Followup:     MachineOwner
---------


Unfortunately, I don't understand much of the debugger output. I hope someone can help me identify the issue and fix it.

Thank you.
Windows Server 2016
Windows Server
Windows Hardware Performance

Accepted answer
  1. Limitless Technology 44,121 Reputation points
    2023-04-25T14:18:11.8266667+00:00

    Hello there,

    Have you run any hardware diagnostics on your server? WHEA stands for Windows Hardware Error Architecture. Some of the main hardware problems that cause machine check exceptions include:

    - System bus errors (errors in communication between the processor and the motherboard).
    - Memory errors, which may include parity and error correction code (ECC) problems. Error checking ensures that data is stored correctly in RAM; if the data is corrupted, random errors occur.
    - Cache errors in the processor. The cache stores important data and code; if it is corrupted, errors often occur.

    Detailed description here: https://support.microsoft.com/en-us/windows/how-to-fix-whea-uncorrectable-error-7c49d78a-2792-96cf-2268-abbe9d9eb29f

    Hope this resolves your query.

    --If the reply is helpful, please Upvote and Accept it as an answer--
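To see which of the categories above this particular crash falls into, the MCi_STATUS value from the dump (Arg3 and Arg4 of bugcheck 0x124) can be decoded by hand. The following is only a rough sketch: the function name `decode_mci_status` is made up for illustration, and the bit positions and the cache-hierarchy code pattern are taken from Intel's architectural machine-check layout as I understand it, so please verify them against the Intel Software Developer's Manual before relying on the result.

```python
# Sketch: decode the two MCi_STATUS halves shown in bugcheck 0x124.
# Assumption: architectural IA32_MCi_STATUS layout (flag bits 63..57,
# model-specific code in bits 31:16, MCA error code in bits 15:0).

FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

# Field encodings for the cache hierarchy compound code 000F 0001 RRRR TTLL.
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["instruction", "data", "generic"]
LEVELS = ["L0", "L1", "L2", "generic"]


def lookup(table, index):
    """Return the table entry, or 'reserved' for encodings not listed."""
    return table[index] if index < len(table) else "reserved"


def decode_mci_status(high, low):
    """Combine the high/low 32-bit halves and pick out the main fields."""
    status = (high << 32) | low
    mca_code = status & 0xFFFF
    result = {
        "status": status,
        "flags": [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1],
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "cache": None,
    }
    # Bits 15:8 == 000F 0001 (bit 12, F, is ignored) marks a cache hierarchy error.
    if (mca_code & 0xEF00) == 0x0100:
        result["cache"] = {
            "request": lookup(REQUESTS, (mca_code >> 4) & 0xF),
            "type": lookup(TYPES, (mca_code >> 2) & 0x3),
            "level": lookup(LEVELS, mca_code & 0x3),
        }
    return result


if __name__ == "__main__":
    # Arg3 and Arg4 from the dump in the question.
    decoded = decode_mci_status(0xF2000000, 0x00300189)
    print(hex(decoded["status"]), decoded["flags"], decoded["cache"])
```

With the values from the question, the UC and PCC flags are set (an uncorrected error with processor context corrupt) and the MCA error code 0x0189 matches the cache hierarchy pattern, which agrees with the "Cache Hierarchy Error" reported by WHEA-Logger Event ID 18 and the PROCESSOR_CACHE failure bucket. This only classifies the error; it does not identify a failing part, so the hardware diagnostics and the system event log (SEL) are still needed.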


1 additional answer

  1. Docs 15,491 Reputation points
    2023-05-02T07:13:10.6+00:00

    The thread was marked as answered.

    What did you find and how did you fix it?
