Troubleshooting Scenario 1 – Role Recycling
Continuing from the diagnostic information at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that fails to start. This particular scenario is fairly easy to troubleshoot, but serves well to show the beginning steps in troubleshooting in Azure. Future scenarios will be more complex and build on these troubleshooting techniques.
If you have been developing in Azure for very long, you have probably had a deployment that seemed to work fine on your local development machine but, after deploying to Azure, kept cycling between the Starting, Busy, Recycling, etc. states.
RDP To The VM
The first step to troubleshoot a recycling role is to RDP to the VM hosting that instance. If you turned on RDP before deploying the solution, you can just click the Connect button at the bottom of the Management Portal screen. If you have not, switch to the Configure tab and click Remote to turn on RDP access to the VM. Then switch back to the Instances tab and select the instance that is recycling; after a few minutes RDP access will be turned on and the Connect button will be enabled.
Get the Big Picture
I usually start out by trying to figure out roughly where in the role startup process the failure is occurring. The quick and easy way to do this is to reference the diagram at Windows Azure Role Architecture and compare it to the processes running in Task Manager. Open Task Manager, switch to the Details tab, and sort descending by process name. Assuming this is a web role, then per the role architecture diagram I would expect to see WaAppAgent.exe, WindowsAzureGuestAgent.exe, WaHostBootstrapper.exe, WaIISHost.exe, RemoteAccessAgent.exe, RemoteForwarderAgent.exe, and maybe w3wp.exe (depending on whether any HTTP requests have come in).
If I don’t see one or more of those processes running then I will watch Task Manager for a minute or two to see which processes start up. I would always expect WaAppAgent and WindowsAzureGuestAgent to be running since those are Azure-owned processes, and if they aren’t running then something pretty significant is wrong. But beyond the guest agent processes, I am looking to see which of these processes is the last to start up:
- WindowsAzureGuestAgent – If this and WaAppAgent are the only processes running, and I never see WaHostBootstrapper start up, then there must be a failure in the guest agent. Per the role architecture blog post we know that the guest agent is responsible for setting up things like the firewall, LocalStorage resources, etc. A common error here is when the guest agent is trying to delete LocalStorage resources (when CleanOnRoleRecycle=true) but fails due to a file lock. The guest agent logs are a good place to start troubleshooting.
- WaHostBootstrapper – We know that WaHostBootstrapper is responsible for startup tasks, so if this process starts but you don’t see the WaIISHost (or WaWorkerHost) process start, then it is most likely a startup task that is failing. The WaHostBootstrapper logs are a good place to start troubleshooting.
- WaIISHost/WaWorkerHost – The role host process runs your role entry point code (WebRole.cs or WorkerRole.cs), so if we see the role host process start and then exit (i.e., a recycling role) then we can be fairly certain that there is a bug in that code throwing an exception, or perhaps a missing dependency causing the process to fail on startup. The Azure and Application event logs are a good place to start troubleshooting.
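The triage in the list above can be sketched as a small lookup. This is only an illustration in Python, not a tool that ships with Azure: the function name `where_to_look` and its input (the set of process names you saw start while watching Task Manager) are hypothetical, and the returned strings simply restate the bullets.

```python
# Hypothetical triage helper that restates the bullets above as code.
# The process names come from the role architecture diagram; the function
# name and the returned strings are illustrative assumptions.

GUEST_AGENT = {"WaAppAgent.exe", "WindowsAzureGuestAgent.exe"}
ROLE_HOSTS = {"WaIISHost.exe", "WaWorkerHost.exe"}


def where_to_look(seen):
    """Suggest where to start troubleshooting a recycling role.

    `seen` is the set of process names observed starting at any point
    during a minute or two of watching Task Manager.
    """
    seen = set(seen)
    if not GUEST_AGENT <= seen:
        # The Azure-owned agent processes should always be present.
        return "guest agent processes missing: something significant is wrong"
    if "WaHostBootstrapper.exe" not in seen:
        # Nothing beyond the guest agent ever started.
        return "guest agent logs"
    if not (ROLE_HOSTS & seen):
        # Bootstrapper ran but never launched a role host process.
        return "WaHostBootstrapper logs (likely a failing startup task)"
    # The role host started (and, for a recycling role, exited again).
    return "Azure and Application event logs (role entry point code)"
```

For example, if WaAppAgent.exe, WindowsAzureGuestAgent.exe, WaHostBootstrapper.exe, and WaIISHost.exe were all seen starting on a recycling role, the sketch points at the event logs, which matches the last bullet.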
In this particular example, when I opened Task Manager I saw the following:
Notice that only WindowsAzureGuestAgent.exe is running. At this point I will usually watch for a minute or two to see what else happens. A few seconds later I saw the WaHostBootstrapper process, and then within a few seconds I saw the WaIISHost process. Then a few seconds after that the WaHostBootstrapper and WaIISHost processes were gone and the only thing left was WindowsAzureGuestAgent.exe.
At this point I can be fairly certain that something within my role entry point code is throwing an exception and causing WaIISHost.exe to crash.
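Watching Task Manager by eye amounts to comparing successive snapshots of the process list. The sketch below shows that idea on hand-written data. The `diff_snapshots` helper is hypothetical, and the snapshot contents are made up to mirror the sequence described above; they were not captured from a real VM.

```python
def diff_snapshots(snapshots):
    """Yield (index, started, exited) for each consecutive pair of snapshots.

    Each snapshot is a set of process names observed at one moment.
    """
    for i in range(1, len(snapshots)):
        prev, curr = snapshots[i - 1], snapshots[i]
        yield i, sorted(curr - prev), sorted(prev - curr)


# Illustrative data only, mirroring the sequence described in the text.
snapshots = [
    {"WindowsAzureGuestAgent.exe"},
    {"WindowsAzureGuestAgent.exe", "WaHostBootstrapper.exe"},
    {"WindowsAzureGuestAgent.exe", "WaHostBootstrapper.exe", "WaIISHost.exe"},
    {"WindowsAzureGuestAgent.exe"},
]

events = list(diff_snapshots(snapshots))
for index, started, exited in events:
    print(index, "started:", started, "exited:", exited)
```

The last event shows WaHostBootstrapper.exe and WaIISHost.exe exiting together shortly after the role host started, which is the pattern that points at the role entry point code.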
Check the Logs
Now that I know roughly where to begin my investigation, I can start looking at specific logs for error messages. From the information in Windows Azure PaaS Compute Diagnostics Data I know which logs should be of interest in this scenario, and I will start with the Windows Azure Event Logs.
Right away you can tell from this log exactly what the problem is and where the error is coming from:
I can see an Error from the Azure Runtime, with a System.Exception being thrown from my CrashInOnStart.WebRole.OnStart() code. At this point I would know to go review that code to see how it could be throwing that type of exception, which would hopefully lead me to the root cause. If you don’t immediately know what the root cause is then you need to do basic troubleshooting just like you would do on-prem – add tracing code, use IntelliTrace, attach a debugger, etc.