Deployment of VMs intermittently times out after getting stuck at unattend/specialize phase

asked 2022-06-29T12:26:13.883+00:00
Tom 1 Reputation point

Since a couple months, when rolling out a custom image from a compute gallery about 5-10% of the deployments of these VMs fail. The other VMs in the batch finish within a couple of minutes and will finish deploying successfully. This happens with old versions of the image (that worked fine before) as well as with newly created versions of the image. The issue does not occur when we use a vanilla gallery image directly from Microsoft.

The custom image is based on Windows 10 multi-session and is meant for rollout to an AVD hostpool.

The issue occurs in bot large batches and small batches and always seems to have around a 5-10% failure rate. The issue is not unique to deployments to hostpools in AVD. When batch deploying VMs outside of the AVD wizard to "regular" VMs, the issue also occurs.

One such example of a (failed) deployment is shown below.

216076-image.png

When enabling boot diagnostics, the screen below shows in the screenshot. It may take the VM up to 45 minutes to get past that screen and will then boot normally afterwards. The agent status will show as "not ready" while the screen shows and go to "ready" once the VM has finished booting normally. However, after that time, the deployment will have timed out and show as failed for the VM.

216172-image.png

Going through the Panther logs of the affected machines does not reveal any error, except an exceptionally long specialize phase during the unattended install. It is during this phase that the time out occurs. The setuperr.log log is completely empty. The WaSetup.xml log shows the content below, revealing that the specialize phase took just over 40 minutes.

<Event time="2022-05-23T13:53:27.474Z" category="INFO" source="Unattend"><UnattendStart><Parameters ConfigurationPass="specialize"/></UnattendStart></Event>  
<Event time="2022-05-23T14:34:06.987Z" category="INFO" source="Unattend"><DeleteKvpRegistry><specialize Message="success"/></DeleteKvpRegistry></Event>  
<Event time="2022-05-23T14:34:07.002Z" category="INFO" source="Unattend"><DeleteKvpRegistry><provisioning Message="success"/></DeleteKvpRegistry></Event>  
<Event time="2022-05-23T14:34:07.221Z" category="INFO" source="Unattend"><DeleteKvpRegistry><oobeSystem Message="success"/></DeleteKvpRegistry></Event>  

The setupact.log log paints the same picture:

2022-05-23 13:53:26, Info                         [setup.exe] [Action Queue] : Executing command "C:\WINDOWS\SYSTEM32\RUNDLL32.EXE" cmifw.dll,unattend specialize  
2022-05-23 13:53:26, Info                         [setup.exe] [Action Queue] : process exit code = 0  
2022-05-23 13:53:26, Info                         [setup.exe] [Action Queue] : Unattend action returned operation complete.  
2022-05-23 13:53:26, Info                         [setup.exe] [Action Queue] : Executing command "C:\WINDOWS\SYSTEM32\SETUPUGC.EXE"  specialize  
2022-05-23 14:22:12, Info                  CBS    Trusted Installer is shutting down because: SHUTDOWN_REASON_AUTOSTOP  
2022-05-23 14:22:12, Info                  CBS    Ending the TrustedInstaller main loop.  
2022-05-23 14:22:12, Info                  CBS    Starting TrustedInstaller finalization.  
2022-05-23 14:22:12, Info                  CBS    Lock: Lock removed: WinlogonNotifyLock, level: 8, total lock:6  
2022-05-23 14:22:12, Info                  CBS    Ending TrustedInstaller finalization.  
2022-05-23 14:34:10, Info                         [setup.exe] [Action Queue] : process exit code = 0  
2022-05-23 14:34:10, Info                         [setup.exe] [Action Queue] : Unattend action returned operation complete.  
2022-05-23 14:34:10, Info                         [setup.exe] [Action Queue] : Executing command "C:\WINDOWS\SYSTEM32\RUNDLL32.EXE" shsetup.dll,SHUnattendedSetup specialize  
2022-05-23 14:34:10, Info                         [setup.exe] [Action Queue] : process exit code = 0  
2022-05-23 14:34:10, Info                         [setup.exe] [Action Queue] : Unattend action returned operation complete.  
2022-05-23 14:34:10, Info                         [setup.exe] [Action Queue] : ProcessActionQueue finish (hr = 0x0)  
2022-05-23 14:34:10, Info                         [setup.exe] Successfully completed execution of GC queue  

Any help or pointers would be of immense help, we are banging our heads against the wall here, especially since it used to work flawlessly and now even image versions that used to work, no longer work when trying to deploy them. Using a completely new compute gallery and uploading our image there yields the same result.

Not Monitored
Not Monitored
Tag not monitored by Microsoft.
23,654 questions
{count} votes