Multiple Random Windows Servers have freezing issue

Question

Ramagost, Paul 0

Since February 2026, around 11 of our servers running different Windows Server versions have gone into an unresponsive state. Most are running Server 2019, but two are running Server 2016/2025. The issue has not recurred on the same system. We have just over 1,000 virtual servers running different applications across multiple physical hosts, and only a very small portion have been affected. All servers run the same AV and other management software.

When the issue occurs, the server still responds to ping, but its services stop working. One symptom we use to detect the issue is the System Center Operations Manager agent alert 'Health Service Heartbeat Failure'.

Application services running on the affected servers also stop working. Remote Desktop connections fail, and the local console shows only a black screen. If we reboot the server, it always displays 'Getting Windows Ready | Do Not Turn Off Your Computer' as it boots back up.

We check for updates or installations that occurred on the same day as the issue but find nothing. The event logs show a gap between the time the issue started and the time it was resolved.

In one case we left the server running (didn't reboot it) and the issue eventually resolved itself hours later.

These are all VMware virtual machines. We've tried updating the ESXi host software and hardware drivers, as well as updating to the newest VMware Tools version. However, the issue still occurs randomly (one server did it this morning).

Ramagost, Paul 0 Reputation points

2026-05-18T15:03:24.7033333+00:00

Hello, I just posted a long, detailed write-up of the root cause here, and it was auto-moderated out with the reason "This comment has been deleted due to a violation of our Code of Conduct. The comment was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information."

No idea why but I'm not going to try and repost the whole thing again as it will probably just get moderated out again.

Below is a summary of root cause and possible fix:

We don't have confirmation yet, but VMware support has helped us identify a possible cause.

Please note that VMware support was contacted about this issue before I posted this thread, but they dismissed it as unrelated to VMware. We had to open a second ticket and press them to investigate more thoroughly, based on entries we saw in the VMware logs.

Issues with the IOFILTER Inter-Process Communication daemon were identified.

This relates to Pure Storage Active DR with SRM and the Pure VASA provider, which is registered on the VMware host.

We can correlate the daemon crash timing directly with Dell PowerProtect activity.

Possible fix: upgrade to a newer ESXi version.

The following are from the resolved issues section of this article: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/release-notes/esxi-update-and-patch-release-notes/vsphere-esxi-80u3i-release-notes.html

• PR 3502820: "A panic event at Dell TSDM during a PPDM-based snapshot synchronization can stop the task in the data protection daemon." The stopped synchronization subsequently affects PPDM backup operations and vSphere vMotion tasks.

• PR 3602560: "If a virtual machine with active Data Protection I/O Filter, such as VMware Live Recovery or Dell PowerProtect, is manually copied or restored from a snapshot outside of a supported workflow, the copy might fail to power on. If the bitmap state is inconsistent, the virtual machine is not allowed to power on."

2 answers

Answer 1

Scott Nguyen 1,470 Independent Advisor

Hello. Since the servers remain pingable but the UI and services hang, my guess is that TrustedInstaller.exe or TiWorker.exe has locked the file system or registry hives while performing cleanup, which could cause the black screen and the gap in the event logs.

Check the %windir%\Logs\CBS\CBS.log file for entries that align with the SCOM heartbeat failure, looking specifically for worker threads that hang during "Component Cleanup" or "Registry Hive" compaction.

To prevent further occurrences, I suggest temporarily disabling the "StartComponentCleanup" task under Task Scheduler Library\Microsoft\Windows\Servicing to see whether the issue continues to happen.
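
As a rough illustration of lining CBS.log entries up with a heartbeat failure, here is a minimal Python sketch that keeps only the log lines within a time window around the alert. It assumes each entry starts with a "YYYY-MM-DD HH:MM:SS" timestamp; the sample lines, the alert time, and the 30-minute window are made-up placeholders, not real log content.

```python
from datetime import datetime, timedelta

def lines_near(log_lines, event_time, window_minutes=30):
    """Return log lines whose leading timestamp falls within the window around event_time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        # Assumed entry format: "YYYY-MM-DD HH:MM:SS, Level  Component  message"
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # continuation line or an unexpected format
        if abs(stamp - event_time) <= window:
            hits.append(line)
    return hits

# Placeholder lines for illustration only; real CBS.log messages will differ.
sample = [
    "2026-04-29 02:10:05, Info                  CBS    placeholder entry A",
    "2026-04-29 09:58:41, Info                  CBS    placeholder entry B",
    "    continuation line without a timestamp",
    "2026-04-29 13:40:00, Info                  CBS    placeholder entry C",
]
heartbeat_failure = datetime(2026, 4, 29, 10, 15)  # hypothetical alert time

for line in lines_near(sample, heartbeat_failure):
    print(line)
```

If the SCOM alert time and the log timestamps are in different time zones, convert one of them before comparing.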

Ramagost, Paul 0 Reputation points

2026-04-29T13:46:53.83+00:00

Thanks. We would have to disable this scheduled task on all systems, since the issue hits random systems and has not recurred on the same one.

You say to disable this temporarily. If this is the cause, wouldn't we need to permanently disable this and if so, what issue would arise if we did permanently disable it?

EDIT: I just checked the Task Scheduler task you mentioned and it is already disabled.
Scott Nguyen 1,470 Reputation points Independent Advisor

2026-04-29T13:54:24.48+00:00

Permanently disabling the "StartComponentCleanup" task stops the automatic removal of superseded updates and old components from the C:\Windows\WinSxS directory, which may cause disk bloat (unnecessary files that fill up your disk). That's why I suggest disabling it only temporarily.

You can also run a manual cleanup with Dism.exe /online /Cleanup-Image /StartComponentCleanup /ResetBase every quarter to purge the files that the task would otherwise have handled.
Ramagost, Paul 0 Reputation points

2026-04-29T13:58:09.79+00:00

I edited my previous post, and you may not have seen it. I just checked the Task Scheduler task you mentioned and it is already disabled.

Answer 2

The behavior matches a generic Windows freeze/hang on virtualized servers. Use the Windows freeze troubleshooting guidance and, critically, collect a memory dump while a server is in the frozen state so the root cause can be identified.

Recommended approach:

Baseline and identify the pattern

Confirm for each affected VM:
- Hypervisor/host (ESXi version, host name) and whether multiple affected VMs share the same host.
- Windows version and patch level.
- What was happening at the time (backup, AV scan, patching, high I/O, etc.).
Note that the gap in Event Logs and the “Getting Windows ready” message after reboot are typical of a system that was hung while Windows was attempting to complete operations (for example, servicing or shutdown-related work) and then resumes/finishes after recovery.
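
One way to look for a shared factor across incidents is a simple tally per attribute. The sketch below is only an illustration: the records, field names, and values are all hypothetical placeholders to be replaced with your own incident notes.

```python
from collections import Counter

# Hypothetical incident records; every name and value here is a placeholder.
incidents = [
    {"vm": "vm-app01", "host": "esx-01", "os": "Server 2019", "activity": "backup"},
    {"vm": "vm-db02", "host": "esx-02", "os": "Server 2019", "activity": "backup"},
    {"vm": "vm-web03", "host": "esx-01", "os": "Server 2016", "activity": "unknown"},
]

def tally(records, field):
    """Count incidents per distinct value of one field, most common first."""
    return Counter(record[field] for record in records).most_common()

for field in ("host", "os", "activity"):
    print(field, tally(incidents, field))
```

With only around 11 incidents out of roughly 1,000 servers, a dominant value is a lead to investigate rather than proof of a cause.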

Use the official freeze troubleshooting workflow

Follow the “Advanced troubleshooting for Windows freezes” guidance:

Determine scope and frequency:
- Which computers are freezing (specific VMs/hosts or random)?
- What operation is happening when they freeze (backup, patching, heavy load)?
- How often and on how many computers this occurs.
This is explicitly recommended in the “Identify the problem” section.
While a VM is frozen but still running:
- Attempt RDP (already fails in this case, which is a data point).
- Test ping (already succeeds).
- Try administrative access (Computer Management, Server Manager, administrative shares like \\ServerName\c$).
- Try Ctrl+Alt+Delete on the console and note whether there is any response.
These steps help distinguish between full kernel hangs, user-mode deadlocks, or storage/network stalls.
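
To record which TCP services still answer while ping succeeds, a small probe can be run from another machine. This is a minimal sketch, not part of the referenced guidance; the default port list (3389 for RDP, 445 for SMB, 5985 for WinRM over HTTP) is an assumption about which services matter in your environment.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe(host, ports=(3389, 445, 5985)):
    """Map each port to whether a TCP connection could be established."""
    return {port: tcp_port_open(host, port) for port in ports}

# Example call (hypothetical host name):
# print(probe("server01.example.com"))
```

A successful connection only shows that the TCP handshake completed. A hung service may still accept connections, so treat the result as one data point alongside the RDP and console checks.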
After the VM recovers:
- Review System and Application logs around the time of the freeze for:
  - Application Error events (crashes of critical processes).
  - Service Control Manager errors for critical services.
  - Error Event IDs 2019/2020 with source Srv/Server (resource exhaustion).
- Run perfmon /report to generate a System Diagnostics report for each affected server and compare for common bottlenecks.
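
To pin down the freeze window from exported event timestamps, a gap finder like the one below may help. It is a sketch under assumptions: the timestamps are placeholders, and the 15-minute threshold is arbitrary and should be tuned to how chatty the server's logs normally are.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, min_gap=timedelta(minutes=15)):
    """Return (last_event_before, first_event_after) pairs separated by at least min_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= min_gap]

# Placeholder timestamps, not taken from any real log.
events = [datetime.fromisoformat(s) for s in (
    "2026-04-29T09:50:00",
    "2026-04-29T09:58:00",
    "2026-04-29T13:40:00",
    "2026-04-29T13:41:00",
)]

for start, end in find_gaps(events):
    print(f"no events between {start} and {end} ({end - start})")
```

Comparing the gap boundaries across affected servers can show whether the freezes start or end at similar times of day.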

Prepare to capture a complete memory dump

Because the issue is intermittent and not easily reproducible, configure the VMs now so that the next occurrence can produce a complete memory dump for analysis.

Configure complete memory dump on each affected VM:
1. Run sysdm.cpl, then go to Advanced → Performance → Settings → Advanced → Change to verify the virtual memory (paging file) settings.
2. In System Properties, go to Advanced → Startup and Recovery → Settings.
3. Under “Write debugging information”, select Complete memory dump.
4. Select Overwrite any existing file.
5. Ensure there is a paging file on the system drive at least 100 MB larger than installed RAM.
6. Ensure the system drive has more free space than the physical RAM size.
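
The two sizing rules in steps 5 and 6 can be checked with simple arithmetic. The sketch below only encodes those two rules as stated; the function name and the example sizes are hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB

def dump_prereqs(ram_bytes, pagefile_bytes, free_bytes):
    """Check the paging file and free-space rules for a complete memory dump."""
    return {
        # Paging file at least 100 MB larger than installed RAM (step 5).
        "pagefile_ok": pagefile_bytes >= ram_bytes + 100 * MB,
        # More free space on the system drive than RAM (step 6).
        "free_space_ok": free_bytes > ram_bytes,
    }

# Hypothetical VM: 16 GB RAM, 16 GB paging file, 40 GB free on the system drive.
print(dump_prereqs(16 * GB, 16 * GB, 40 * GB))
```

In this example the paging file check fails because 16 GB is not 100 MB above the installed RAM, even though free space is sufficient.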
Enable keyboard-triggered crash (CrashOnCtrlScroll) inside each VM so a dump can be forced when the hang occurs:
1. In Registry Editor, create CrashOnCtrlScroll (REG_DWORD = 1) under both:
  - HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
  - HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters
2. Restart the VM.
3. When the VM exhibits the freeze, hold Right Ctrl and press Scroll Lock twice to force a crash and generate %SystemRoot%\MEMORY.DMP.
This is the recommended method for capturing a dump on a frozen Windows system.
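
For rolling the CrashOnCtrlScroll value out to several VMs, the two registry entries can be expressed as a .reg file. The sketch below only builds the file text; the function name is hypothetical, and how the file is saved (including its encoding) and imported should be verified on a test machine first.

```python
KEYS = [
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters",
]

def build_reg_file(keys=KEYS, name="CrashOnCtrlScroll", value=1):
    """Build .reg file text that sets a REG_DWORD value under each key."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key in keys:
        lines.append(f"[{key}]")
        lines.append(f'"{name}"=dword:{value:08x}')  # DWORDs are written as 8 hex digits
        lines.append("")
    return "\r\n".join(lines)

print(build_reg_file())
```

The VM still needs a restart after the values are set, as in step 2.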
If system drive space is constrained on some servers, configure a dedicated dump file volume using the DedicatedDumpFile registry entry as described in “Configure the destination path for a memory dump” and “How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive”.

VM-specific data collection

Since all affected systems are VMware VMs, also prepare a hypervisor-based capture path:

Use VMware snapshots or suspend state and convert to a dump:
- Take a snapshot or suspend the VM while it is frozen.
- Use VMware’s Checkpoint To Core Tool (vmss2core) to convert .vmss or .vmsn to a Windows-compatible dump file.
- Analyze the dump with standard Windows debugging tools or provide it to Microsoft Support.

This is explicitly recommended for VMware environments and is useful if the in-guest CrashOnCtrlScroll cannot be triggered.

Analyze and remediate

Once a dump is captured from a frozen state, analyze it (or provide to Microsoft Support) to determine whether the root cause is:
- Kernel-mode driver deadlock (for example, storage, network, filter drivers, backup agents, AV, or VMware tools components).
- Resource exhaustion (nonpaged pool, handles, etc.).
- A specific Windows component or third-party service.
Based on the dump findings, typical next steps include:
- Updating or removing problematic drivers/agents.
- Adjusting backup or AV scanning schedules or methods.
- Addressing storage or network latency issues at the hypervisor or SAN level.

If the VM is frozen and cannot be interacted with

If the VM is completely unresponsive to keyboard input, use the hypervisor to trigger a dump:
- For Hyper-V, the guidance is to use Debug-VM -InjectNonMaskableInterrupt; in VMware, use the documented snapshot/suspend plus vmss2core method.

Engage support with collected data

With:
- A complete memory dump from a frozen state.
- System/Application event logs.
- perfmon /report output.
- VMware host logs around the same time.
open a support case with Microsoft and/or VMware so they can correlate guest and host behavior.

This workflow is designed specifically for intermittent freeze issues like those described and is applicable to Windows Server 2016/2019/2025 running on VMware.

References:

Advanced troubleshooting for Windows freezes
