AVD Hosts - Disk/vhdmp/ESENT errors and VM unavailable randomly

Nicholas King 0

Hi there,

We have a small AVD deployment with the following set-up:

Standard_D4as_v5 VMs with 128GB Premium SSDs as OS Disk (accelerated networking enabled)
Windows 11 AVD + M365 Apps Image
4 people per VM session limit
Uses FSLogix Cloud Cache (with Page Blobs) due to cloud-only deployment
Intune manages all settings, including AV exclusions for FSLogix/Teams and FSLogix redirect exclusions for caches

Approximately once a week, we have started to see errors related to vhdmp, disk, ESENT, and Ntfs in the Event Viewer logs that coincide with the AVD Agent and VM Agent becoming unavailable. Users who are already connected can keep working, but the VM stops reporting guest metrics to Azure (VM-level metrics such as CPU and network are still reported). Usually one of the sessions gets stuck and that person has to be force logged off via the portal.

This lasts for about 2-3 hours, then the VM recovers on its own and logs a further batch of disk errors in the Event Viewer.

An example occurred this morning:

At 8:01 the VM was started, by 8:05 4 people had logged in.

At 8:05 the VM hit 50% IOPS and 100% CPU on Azure metrics, by 8:08 disk metrics had returned to sub 10% and CPU was dropping to 50%. At 8:10 the VM became unresponsive and one of the sessions had crashed. https://imgur.com/Hh1mis8 shows the System Event Log as the VM became unresponsive (Application event log has nothing at all at this time until it is responsive again). The disk and vhdmp warnings continue every 10-20 minutes until 10:49 when the VM is responsive again.

At 10:49 the VM was responsive again and the AVD agent was available shortly after. https://imgur.com/efugath (application) and https://imgur.com/N3IbrUt (system) show the Event Logs. There is a burst of ESENT, Ntfs and disk errors in the few seconds after it starts responding again. There are no errors in the lead-up to the crash, only a bunch of ignorable errors immediately after booting.
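
As a rough check of the timeline above, here is a minimal Python sketch that works out how long the host was unresponsive and how many disk/vhdmp warning bursts to expect at one every 10-20 minutes. The times are the ones from this example; the helper name `at` is only for illustration.

```python
from datetime import datetime


def at(hhmm: str) -> datetime:
    """Parse a time of day; the date part does not matter for same-day differences."""
    return datetime.strptime(hhmm, "%H:%M")


unresponsive = at("08:10")  # VM became unresponsive
recovered = at("10:49")     # VM responsive again

outage_minutes = int((recovered - unresponsive).total_seconds() // 60)

# Warnings were logged every 10-20 minutes while the VM was unresponsive
min_warnings = outage_minutes // 20
max_warnings = outage_minutes // 10

print(f"Outage: {outage_minutes} minutes")
print(f"Expected warning bursts: roughly {min_warnings} to {max_warnings}")
```

For this incident that is 159 minutes, which fits the 2-3 hour pattern, and roughly 7 to 15 warning bursts. A count far outside that range in the System log would suggest a different failure pattern.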

Note: the VM has approximately 25% free space on the OS disk, and it never seems to hit network or IOPS/disk bandwidth limits. CPU regularly hits 100% when multiple people log in, but that only causes initial slowness, not crashes.

We can mitigate the issue by logging off users and deleting the VM, then remaking a new host, but would like a better understanding of what exactly is happening.

We suspect that disk or network latency is affecting the FSLogix process as it tries to fetch all the profiles at once, combined with AV scanning and similar load, causing it to hang until it times out. There are numerous Reddit and SU posts describing a similar vhdmp error 129, where disk or network latency temporarily hangs the host (on RDS, Citrix and AVD desktops).

Are there any settings we can change in FSLogix to reduce the impact of a mass influx of users at the start of the day?

Prrudram-MSFT 28,281 Reputation points Moderator

2023-11-14T15:28:14.7733333+00:00

@Nicholas King

Based on the information provided, it seems like you are experiencing issues with the AVD Agent and VM Agent becoming unavailable, and disk errors occurring in the Event Viewer logs. This issue occurs approximately once a week and coincides with a mass influx of users at the start of the day.

It is possible that the issue is related to disk or network latency affecting the FSLogix process as it tries to get all the profiles at once, combined with AV scanning and other processes, causing it to hang until it times out. This can result in the AVD Agent and VM Agent becoming unavailable and disk errors occurring in the Event Viewer logs.

To reduce the impact of a mass influx of users at the start of the day, you can try adjusting the FSLogix settings to optimize performance. Here are some settings you can consider changing:

Adjust the FSLogix profile container size: You can adjust the size of the FSLogix profile container to ensure that it has enough space to store all the user profiles. You can also consider using a larger cache size to improve performance.

Adjust the FSLogix profile container location: You can adjust the location of the FSLogix profile container to ensure that it is located closer to the users to reduce latency.

Adjust the FSLogix profile container replication settings: You can adjust the replication settings for the FSLogix profile container to ensure that it is replicated to multiple locations for redundancy and improved performance.

Adjust the AV scanning settings: You can adjust the AV scanning settings to reduce the impact on performance. For example, you can exclude certain files or folders from scanning, or adjust the scanning schedule to occur during off-peak hours.

Adjust the network settings: You can adjust the network settings to ensure that there is enough bandwidth available for the FSLogix process. For example, you can adjust the QoS settings to prioritize traffic for the FSLogix process.

It is also recommended to monitor the system performance and logs regularly to identify any issues and take appropriate actions to resolve them. You can use Azure Monitor to monitor the system performance and logs, and set up alerts to notify you of any issues.
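
One symptom from the question that lends itself to an alert is the VM going silent on guest metrics. The sketch below is plain Python, not an Azure Monitor API call: it assumes you have already exported the timestamps at which guest metrics were received, and the sample timestamps and 5-minute threshold are illustrative assumptions built around the 8:10 to 10:49 window described in the question.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs where consecutive samples are further apart than max_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def at(hhmm: str) -> datetime:
    """Parse a time of day (hypothetical helper; the date part is ignored)."""
    return datetime.strptime(hhmm, "%H:%M")


# Illustrative heartbeat times: samples stop at 08:10 and resume at 10:49
heartbeats = [at("08:08"), at("08:09"), at("08:10"), at("10:49"), at("10:50")]

gaps = find_gaps(heartbeats, max_gap=timedelta(minutes=5))
for start, end in gaps:
    print(f"No guest metrics from {start:%H:%M} to {end:%H:%M}")
```

Alerting on a gap like this, rather than on CPU or IOPS thresholds, matches the behaviour described in the question, where host-level metrics stay within limits while the guest stops reporting.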

Carlos Solís Salazar 18,191 Reputation points MVP Volunteer Moderator

2023-11-15T12:29:15.5766667+00:00

The symptoms you've described in your Azure Virtual Desktop (AVD) deployment point towards a few potential areas of concern, primarily revolving around disk performance and FSLogix profile management. Let's break down the possible causes and solutions:

High CPU and IOPS Usage: The spike in CPU and IOPS at login times could be contributing to the disk and system performance issues. This spike is likely due to the simultaneous loading of user profiles and initial application starts. When the CPU hits 100%, it can lead to system instability and performance degradation.

FSLogix Profiles and Disk Performance: FSLogix, especially when using Cloud Cache with Page Blobs, can be sensitive to disk performance issues. The Cloud Cache feature is designed to provide resiliency, but it can also introduce complexity and potential performance bottlenecks, particularly when many users are logging in simultaneously.

Disk Errors and Event Logs: The disk, ESENT, and NTFS errors in the Event Viewer logs suggest that there might be underlying issues with the disk subsystem. These errors can be indicative of problems with the virtual disk, the underlying Azure storage, or the way FSLogix is interacting with the disk.

AVD Agent and VM Agent Unavailability: The fact that these agents become unavailable, and the system eventually recovers, might indicate that the system is overwhelmed but not entirely failing. It suggests a resource bottleneck rather than a complete system failure.

Recommendations:

Optimize FSLogix Performance: Consider adjusting FSLogix profile container settings. This can include enabling concurrent sessions, adjusting the size of the VHDX files, and fine-tuning the cache settings. Microsoft provides guidance on optimizing FSLogix profiles on their documentation page.

Monitor and Analyze Performance Metrics: Keep a close eye on performance metrics, particularly during peak times. Azure Monitor can be useful for this. Look for patterns in CPU, disk I/O, and network usage.

Disk Sizing and Performance: Ensure that the disk sizing and performance are aligned with your needs. Sometimes, upgrading to a higher performance disk or adjusting the caching policies can mitigate these types of issues.

Network Latency and Bandwidth: Since FSLogix heavily relies on network performance for cloud-based profiles, make sure that network latency and bandwidth are not bottlenecks, especially during peak login times.

Review AV Configuration: Although you have AV exclusions for FSLogix, it might be worthwhile to review the antivirus configuration again. Sometimes, AV software can interfere with disk operations, particularly when a system is under heavy load.

Azure Support and Advanced Diagnostics: If these steps do not resolve the issue, consider reaching out to Azure support. They can provide more in-depth diagnostics and may be able to identify issues specific to your environment.

Staggered Logins: As a temporary measure, you might consider staggering logins during peak times to reduce the initial load on the system.
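
To make the staggered-login idea concrete, here is a minimal Python sketch that spreads a host's users over fixed intervals. The user names, the 08:05 start and the 3-minute spacing are all hypothetical; the four users simply mirror the 4-session limit mentioned in the question.

```python
from datetime import datetime, timedelta


def stagger(users, first_login, spacing):
    """Assign each user a login time, spaced evenly from first_login."""
    return {user: first_login + i * spacing for i, user in enumerate(users)}


# Hypothetical users for one session host with a 4-session limit
users = ["user1", "user2", "user3", "user4"]
first_login = datetime.strptime("08:05", "%H:%M")

schedule = stagger(users, first_login, spacing=timedelta(minutes=3))
for user, when in schedule.items():
    print(f"{user}: log in at {when:%H:%M}")
```

With these values the last user logs in at 08:14 instead of everyone arriving within the same few minutes, so each FSLogix profile attach has some time to settle before the next one starts. Whether 3 minutes is enough would need to be confirmed against the host's own metrics.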

It's important to note that diagnosing these issues can sometimes require a process of elimination and may involve multiple adjustments to find the right balance for your specific deployment. For more detailed information and guidance, I recommend consulting the Azure documentation and support resources.

For further exploration, you can visit the Azure documentation on FSLogix for AVD and Azure Monitor.

Samuel Joines 0 Reputation points

2024-11-12T17:31:56.77+00:00

Hi Nicholas, did you end up finding a solution to this? Our environment is doing word for word what yours is doing and we have been running around for a couple weeks trying to solve the problem.
