Random crashes, no dumps/minidumps - WHEA error logged, but no H/W faults recorded... what could it be?
Hi there - I have a new HP Elite Mini 800 G9 (i7-12700T) that randomly crashes multiple times a day. This started on my first machine, and after 2 x motherboard replacements, 1 x processor replacement and a replacement machine, I am still getting the same symptoms. Interestingly, it happens when I run testing from the BIOS, during performance testing, and sometimes when just clicking on a menu item. I can reliably trigger a crash by running Google Earth (either the installed or the web version), but I never get a dump or a minidump (tried both).
I've also used different installs of Windows 11 on 2 separate SSDs, and checked that all drivers and the BIOS are up to date. I also booted Ubuntu from a USB stick and the system crashed there as well.
HP don't seem to have any ideas. I have the WHEA events, but no idea how to interpret the data string - is anyone able to assist?
Windows 10 Hardware Performance
Windows 11
-
Chris 0 Reputation points
2023-10-08T22:48:13.1366667+00:00 Log Name: System
Source: Microsoft-Windows-WHEA-Logger
Date: 9/10/2023 6:51:30 AM
Event ID: 1
Task Category: None
Level: Error
Keywords: WHEA Error Event Logs
User: LOCAL SERVICE
Computer: Voyager
Description: A fatal hardware error has occurred. A record describing the condition is contained in the data section of this event.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{c26c4f3c-3f66-4e99-8f8a-39405cfed220}" />
<EventID>1</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000002</Keywords>
<TimeCreated SystemTime="2023-10-08T19:51:30.9442778Z" />
<EventRecordID>542137</EventRecordID>
<Correlation ActivityID="{03439592-783b-429f-baa0-ce274735d8e5}" />
<Execution ProcessID="6728" ThreadID="7508" />
<Channel>System</Channel>
<Computer>Voyager</Computer>
<Security UserID="S-1-5-19" />
</System>
<EventData>
<Data Name="Length">3552</Data>
<Data
Name="RawData">435045521002FFFFFFFF03000100000002000000E00D000012331300080A17140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB5713166A4613D40AB9A40A698F362D464B38F221A14CD20FAD90102000000000000000000000000000000000000000000000058010000200A00000003000000000000962A2181ED09964994718D729C8E69ED00000000000000000000000000000000010000000000000000000000000000000000000000000000780B0000200200000003000000000000962A2181ED09964994718D729C8E69ED00000000000000000000000000000000010000000000000000000000000000000000000000000000980D0000480000000003000000000000962A2181ED09964994718D729C8E69ED000000000000000000000000000000000100000000000000000000000000000000000000000000000202000000000000000000000000000011F3878F98C99E4DA0C46065518C4F6D05730341730200008CC49F060000000000002A0005000000FF37008062FC000058F60100707D020000000000000000080000C7047F6F80F1FF5B680AB30000E08F00001120002000004000010000000000000000812850001ACF0E005D0300010000000000000031F85B0100080E117FFC018008E0FF31FE7F81170FCE0F7FFFFFFF8918E0FF31F87D1002080F70C9030031F85B0100000006010000000000E0C7B81401020000000000000000000000000000000031F85B8100080F77490300E0FF0E00003C004000000000000000E000000000400000000000000000004000000000004203000000000000000000000000000000000000B060050203FE057F03040008000000000000000000001BA3A0010100000200000000010000001109000000800000000040420400006000B081FFFF00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000
000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F0000000
00000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F00000000000000000000000000000000000000000000007201000000000000000000000000000000000000000000000000000000000000002B0038332A002A000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F00000000000000000000000000000000000000000000007201000000000000000000000000000000000000000000000000000000000000002B0038332A002A000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000000001F001F000000000000000000000000000000000000000000000072010000000000000000000000000000000000000000000000000000000000000000000001FF0F00000000000000000F0F383300000040000001ADAAAABAADAAAABAADAAAABAADAAAABA0500008006011F303000000000400000031F0402CB0360C4000000000000000000008B94444400000000EFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDEEFBEADDE0202000000000000000000000000000011F3878F98C99E4DA0C46065518C4F6D017103118000000000000000483901FE0000000000230000F
1E2B82601230000ECE9B826042300007DEDB826022300008DEEB826032300003EF0B8260523000013F9B826002300000914EC2701230000041BEC2704230000951EEC2702230000A51FEC27032300005621EC27052300002B2AEC2700230000C1AA1E2901230000BCB11E29042300004DB51E29022300005DB61E29032300000EB81E2905230000E3C01E2900230000D19D512A01230000C9A4512A042300005AA8512A022300006AA9512A0323000017AB512A05230000ECB3512A00230000498D842B012300004494842B04230000D597842B02230000E598842B03230000969A842B052300006BA3842B00230000A96CB72C01230000A473B72C042300003577B72C022300004578B72C03230000F679B72C05230000CB82B72C0200000000200001F255252D142000006157252D01230000C680ED21042300005784ED21022300006785ED21032300001887ED21052300007590ED2100230000B21E202301230000AA252023042300003A292023022300004A2A202303230000F72B202305230000983520230023000029FF5224012300002406532404230000B509532402230000C50A532403230000760C5324052300004B1553240023000071FB8525012300006C02862504230000FD058625022300000D07862503230000BE08862505230000931186250202000000000000000000000000000011F3878F98C99E4DA0C46065518C4F6D017103170A0000000000000000000000000000000800000000000000EFBEADDEEFBEADDEEFBEADDE</Data> </EventData> </Event>
-
Gary Nebbett 6,086 Reputation points
2023-10-11T12:01:09.72+00:00 Hello Chris,
Is there anything in common between the hardware and software that first incurred this error and the hardware and software now (that is still incurring the error)?
"[...] a replacement machine" indicates that there is no hardware commonality and similar behaviour under Windows and Ubuntu indicates that there is no software commonality.
Dumping your "RawData" with a small amount of interpretation/formatting gives:
===============================================================================
Common Platform Error Record @ ffffb2072434d2d0
-------------------------------------------------------------------------------
Record Id     : 01d9fa20cd141a22
Severity      : Fatal (1)
Length        : 3552
Creator       : Microsoft
Notify Type   : BOOT Error Record
Timestamp     : 10/8/2023 19:51:18 (UTC)
Flags         : 0x00000002 PreviousError
===============================================================================
Section 0     : Firmware Error Record Reference
-------------------------------------------------------------------------------
Descriptor    @ ffffb2072434d350
Section       @ ffffb2072434d428
Offset        : 344
Length        : 2592
Flags         : 0x00000000
Severity      : Fatal
===============================================================================
Section 1     : Firmware Error Record Reference
-------------------------------------------------------------------------------
Descriptor    @ ffffb2072434d398
Section       @ ffffb2072434de48
Offset        : 2936
Length        : 544
Flags         : 0x00000000
Severity      : Fatal
===============================================================================
Section 2     : Firmware Error Record Reference
-------------------------------------------------------------------------------
Descriptor    @ ffffb2072434d3e0
Section       @ ffffb2072434e068
Offset        : 3480
Length        : 72
Flags         : 0x00000000
Severity      : Fatal
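For anyone who wants to reproduce the header part of the dump above without a debugger, here is a minimal Python sketch. It assumes the 128-byte record header layout from the UEFI specification's Common Platform Error Record appendix, and it reads the timestamp bytes as plain binary integers, which is an assumption that happens to reproduce the timestamp shown in the dump. The names `HEADER_HEX` and `parse_cper_header` are made up for this illustration; the hex is the first 128 bytes of the posted RawData, split by field.

```python
import struct
import uuid

# First 128 bytes of the posted RawData, split by header field.
HEADER_HEX = (
    "43504552"            # signature "CPER"
    "1002"                # revision
    "FFFFFFFF"            # signature end
    "0300"                # section count
    "01000000"            # error severity
    "02000000"            # validation bits
    "E00D0000"            # record length
    "12331300080A1714"    # timestamp
    + "00" * 16           # platform ID
    + "00" * 16           # partition ID
    + "BDC407CF89B7184EB3C41F732CB57131"  # creator ID
    + "66A4613D40AB9A40A698F362D464B38F"  # notify type
    + "221A14CD20FAD901"  # record ID
    + "02000000"          # flags
    + "00" * 8            # persistence info
    + "00" * 12           # reserved
)

# Little-endian, no padding: 128 bytes in total.
HEADER = struct.Struct("<4sHIHIII8s16s16s16s16sQI8s12s")

SEVERITIES = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}
# Only the GUID seen in this record is listed here.
NOTIFY_TYPES = {"3d61a466-ab40-409a-a698-f362d464b38f": "BOOT"}


def parse_cper_header(raw: bytes) -> dict:
    """Decode the fixed-size header at the start of a CPER record."""
    (sig, revision, _sig_end, section_count, severity, _valid, length, ts,
     _platform, _partition, creator, notify, record_id, flags,
     _persistence, _reserved) = HEADER.unpack_from(raw)
    if sig != b"CPER":
        raise ValueError("not a CPER record")
    # Assumption: each timestamp byte is a plain binary value.
    sec, minute, hour, _precise, day, month, year, century = ts
    notify_guid = str(uuid.UUID(bytes_le=notify))
    return {
        "revision": f"0x{revision:04x}",
        "section_count": section_count,
        "severity": SEVERITIES.get(severity, str(severity)),
        "record_length": length,
        "timestamp": f"{century:02d}{year:02d}-{month:02d}-{day:02d} "
                     f"{hour:02d}:{minute:02d}:{sec:02d}",
        "creator": str(uuid.UUID(bytes_le=creator)),
        "notify_type": NOTIFY_TYPES.get(notify_guid, notify_guid),
        "record_id": f"{record_id:016x}",
        "flags": flags,
    }


if __name__ == "__main__":
    for key, value in parse_cper_header(bytes.fromhex(HEADER_HEX)).items():
        print(f"{key:14}: {value}")
```

To decode a complete event, pass `bytes.fromhex(...)` of the full RawData string instead of `HEADER_HEX`; only the first 128 bytes are read.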
The interesting thing for me is the "Notify Type" ("BOOT Error Record"). This indicates that the system restart was not initiated by the operating system (e.g. Windows) but rather by another system component (perhaps the PCH). The operating system plays no part in the crash/restart; the UEFI specification says:
The BOOT Notification Type represents error conditions which are unhandled by system software and which result in a system shutdown/reset. System software retrieves a BOOT error record during boot by querying the platform for existing BOOT records. As an example, consider an x64 platform which implements a service processor. In some scenarios, the service processor may detect that the system is either hung or is in such a state that it cannot safely proceed without risking data corruption. In such a scenario the service processor may record some minimal error information in its system event log (SEL) and unilaterally reset the machine without notifying the OS or other system software. In such scenarios, system software is unaware of the condition that caused the system reset. A BOOT error record would contain information that describes the error condition that led to the reset so system software can log the information and use it for health monitoring.
The "RawData" is an entry from the SEL. I am not aware of any publicly (and freely) available specification that describes the format of the data in the sections.
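Although the contents of the sections are not publicly documented, the section descriptors that locate them can still be decoded. The following is a minimal Python sketch, assuming the 72-byte section descriptor layout from the UEFI specification's Common Platform Error Record appendix. The names `descriptor_hex`, `DESCRIPTORS_HEX` and `parse_section_descriptors` are invented for this illustration; the offset and length bytes are the ones that follow the 128-byte header in the posted RawData.

```python
import struct
import uuid

# offset, length, revision, validation bits, reserved, flags,
# section type GUID, FRU ID, severity, FRU text: 72 bytes, little-endian.
DESCRIPTOR = struct.Struct("<IIHBBI16s16sI20s")

# Only the section type seen in this record is listed here.
SECTION_TYPES = {
    "81212a96-09ed-4996-9471-8d729c8e69ed": "Firmware Error Record Reference",
}


def descriptor_hex(offset_hex: str, length_hex: str) -> str:
    """Rebuild one descriptor of the posted record from its offset/length bytes."""
    return (
        offset_hex + length_hex
        + "0003" + "00" + "00" + "00000000"     # revision, valid bits, reserved, flags
        + "962A2181ED09964994718D729C8E69ED"    # section type GUID
        + "00" * 16                             # FRU ID
        + "01000000"                            # section severity
        + "00" * 20                             # FRU text
    )


# The three descriptors as they appear in the posted RawData.
DESCRIPTORS_HEX = "".join(
    descriptor_hex(offset, length)
    for offset, length in [
        ("58010000", "200A0000"),
        ("780B0000", "20020000"),
        ("980D0000", "48000000"),
    ]
)


def parse_section_descriptors(blob: bytes, count: int, start: int = 0) -> list:
    """Decode `count` consecutive section descriptors beginning at `start`."""
    sections = []
    for index in range(count):
        (offset, length, _revision, _valid, _reserved, flags,
         section_type, _fru_id, severity, _fru_text) = DESCRIPTOR.unpack_from(
            blob, start + index * DESCRIPTOR.size)
        guid = str(uuid.UUID(bytes_le=section_type))
        sections.append({
            "index": index,
            "offset": offset,
            "length": length,
            "flags": flags,
            "severity": severity,
            "type": SECTION_TYPES.get(guid, guid),
        })
    return sections


if __name__ == "__main__":
    for section in parse_section_descriptors(bytes.fromhex(DESCRIPTORS_HEX), 3):
        print(section)
```

When working on the full RawData rather than this excerpt, pass `start=128` so that parsing begins right after the record header. The first section offset (344) equals 128 + 3 * 72, and the last offset plus length (3480 + 72) equals the record length of 3552, which is a useful consistency check.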
Gary
-
Arnaud N 5 Reputation points
2023-10-11T15:04:55.8766667+00:00 When I read the initial message, my first question was:
Were the RAM sticks replaced along with both motherboards and/or the OS?
You seem to experience the issue under any installed OS, the motherboard's built-in diagnostic tool and a USB live OS. It smells like bad or incompatible hardware, but I might be wrong.
If you cannot swap/replace the RAM, try testing it with MemTest86, for example.
If the test fails: it is likely to be bad RAM stick(s), but you need to confirm by replacing them with known-good RAM sticks and running the same test on the same motherboard and CPU combination.
If it fails again: then it is likely a motherboard and/or CPU problem.
If both tests pass, then the RAM stick(s) are OK.
Then, reading Gary's reply, some more questions came up:
Are any motherboard firmware/BIOS updates available?
About your 2 SSDs: are they the same model and firmware version, or do they differ in some way? Is any SSD firmware update available?
-
Chris 0 Reputation points
2023-11-20T21:02:40.43+00:00 Hi Gary and Arnaud,
Thanks for your responses... Apologies for the delay in responding.
~Gary - your evaluation of the RawData makes sense to me, as it indicates that when the system freezes, the service processor detects this and reboots the machine.
~Arnaud - in response to your questions: yes, I have the updates installed for the SSD and the M/B.
The machine was swapped out, but I did have an additional stick of RAM (same specifications, but a different brand: I picked up a Kingston 16 GB stick to increase total memory to 32 GB). I also added an additional Samsung 2 TB SSD (PCIe 4) that I'm using as the boot drive. Note that I've tested with single sticks of RAM (both the Hynix and the Kingston, independently) as well as with the SSDs (removing one at a time to see whether it is the cause).
I wonder if what I'm seeing is due to a power issue with the power supply... not sure how to test that, though.
regards
-
Gary Nebbett 6,086 Reputation points
2023-11-21T10:14:36.0033333+00:00 Hello Chris,
A "power" issue seems to be a quite likely cause. I have seen some discussion of failures of this type being associated with low compute activity - some cores enter a low power mode and sometimes the voltage to the CPU is also lowered (too far), resulting in a CPU hang; disabling the core low power mode sometimes seemed to have helped.
I think (but don't know) that these "BOOT Error Records" are "implied" by the Intel Corporation patent "SUPPORTING HANG DETECTION AND DATA RECOVERY IN MICROPROCESSOR SYSTEMS".
In my (limited) understanding, a microcode bug could also be the cause of this type of crash.
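Since the crash also occurs under Ubuntu, one way to look into the core low-power modes mentioned above is the Linux cpuidle sysfs interface. Below is a minimal Python sketch, assuming a kernel that exposes `/sys/devices/system/cpu/cpuN/cpuidle/stateM/` with the `name`, `latency` and `disable` files; the function name `list_idle_states` is made up for this illustration, and which states exist depends on the platform and idle driver.

```python
from pathlib import Path


def list_idle_states(cpu_dir: Path) -> list:
    """Return name, exit latency and disable flag for each idle state of one CPU."""
    state_dirs = sorted(
        cpu_dir.glob("cpuidle/state*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric order: state0, state1, ...
    )
    states = []
    for state_dir in state_dirs:
        states.append({
            "state": state_dir.name,
            "name": (state_dir / "name").read_text().strip(),
            "latency_us": int((state_dir / "latency").read_text()),
            "disabled": (state_dir / "disable").read_text().strip() != "0",
        })
    return states


if __name__ == "__main__":
    # Prints nothing if the kernel does not expose cpuidle for cpu0.
    for state in list_idle_states(Path("/sys/devices/system/cpu/cpu0")):
        print(state)
```

Writing `1` to a state's `disable` file as root turns that state off for that CPU until the next reboot, which makes it possible to test whether the crashes stop when the deeper states are unavailable. This only probes the hypothesis; it does not confirm a cause.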
Gary