We have a small AVD deployment with the following set-up:
- Standard_D4as_v5 VMs with 128GB Premium SSDs as OS Disk (accelerated networking enabled)
- Windows 11 AVD + M365 Apps Image
- 4 people per VM session limit
- Uses FSLogix Cloud Cache (with Page Blobs) due to cloud-only deployment
- Intune manages all settings, including AV exclusions for FSLogix/Teams and FSLogix redirection exclusions for caches
Approximately once a week we have started to see errors from vhdmp, disk, ESENT, and Ntfs in the Event Viewer logs, coinciding with the AVD Agent and VM Agent becoming unavailable. Users who are already connected can still work normally, but the VM stops reporting guest metrics to Azure (VM-level metrics such as CPU and network are still available). Usually one of the sessions gets stuck and that user has to be forcibly logged off via the portal.
This lasts for about 2-3 hours, then the VM recovers by itself and logs a further batch of disk errors in the Event Viewer.
An example occurred this morning:
The VM was started at 8:01, and by 8:05 four people had logged in.
At 8:05 the VM hit 50% IOPS and 100% CPU in the Azure metrics. By 8:08 disk metrics had returned to below 10% and CPU was dropping towards 50%. At 8:10 the VM became unresponsive and one of the sessions crashed. https://imgur.com/Hh1mis8 shows the System event log as the VM became unresponsive (the Application event log has no entries at all from this point until the VM is responsive again). The disk and vhdmp warnings continued every 10-20 minutes until 10:49.
At 10:49 the VM was responsive again and the AVD agent became available shortly after. https://imgur.com/efugath (Application) and https://imgur.com/N3IbrUt (System) show the event logs. There is a burst of ESENT, Ntfs and disk errors in the few seconds after the VM starts responding again. There are no errors in the lead-up to the crash, only a handful of ignorable errors immediately after boot.
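For anyone wanting to line up the same kind of window on their own hosts, here is a rough sketch of how an exported event log could be summarised by source. It is only an illustration: it assumes the log was exported to CSV with `TimeCreated`, `ProviderName` and `Id` columns and a `YYYY-MM-DD HH:MM:SS` timestamp, and the rows in `sample` are made up (only error 129 comes from this post), not our real log.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Event sources named in this post. The CSV column names and the
# timestamp format below are assumptions about how the log was exported.
WATCHED = {"vhdmp", "disk", "esent", "ntfs"}


def summarise(csv_text, time_format="%Y-%m-%d %H:%M:%S"):
    """Return (event count per watched source, first event time, last event time)."""
    counts = Counter()
    times = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        provider = row["ProviderName"].strip().lower()
        if provider not in WATCHED:
            continue  # ignore sources unrelated to the hang
        counts[provider] += 1
        times.append(datetime.strptime(row["TimeCreated"], time_format))
    if not times:
        return counts, None, None
    return counts, min(times), max(times)


# Made-up rows purely for illustration; event IDs other than 129 are placeholders.
sample = """TimeCreated,ProviderName,Id
2024-01-01 08:10:00,vhdmp,129
2024-01-01 08:25:00,disk,153
2024-01-01 08:26:00,Service Control Manager,7036
2024-01-01 10:49:00,ESENT,508
2024-01-01 10:49:02,Ntfs,140
"""

counts, first, last = summarise(sample)
print(dict(counts))
print("window:", last - first)
```

The source names are matched exactly after lower-casing, so a host that logs under a longer provider name would need the `WATCHED` set adjusted.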
Note: the VM has approximately 25% free space on the OS disk, and it never appears to hit network, IOPS or disk bandwidth limits. CPU regularly hits 100% when multiple people log in, but that only causes initial slowness, not crashes.
We can mitigate the issue by logging off users, deleting the VM and creating a new host, but we would like a better understanding of what exactly is happening.
We suspect that disk or network latency is affecting the FSLogix process as it tries to load all the profiles at once, combined with AV scanning and similar activity, causing it to hang until it times out. There are numerous Reddit and SU posts reporting a similar vhdmp error 129, where disk or network latency was temporarily hanging things on their hosts (RDS, Citrix and AVD desktops).
Are there any settings we can change in FSLogix to reduce the impact of a mass influx of users at the start of the day?