Troubleshooting Scenario 3 – Role Stuck in Busy
Continuing from the troubleshooting series at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that is stuck in the Busy state with the error message “Busy (Waiting for role to start... Application startup tasks are running.)”.
Symptom
You are deploying a service to Azure and your service never starts correctly and never gets to the Ready state. It is stuck forever in the Busy state.
Get the Big Picture
Just like in Scenario 1 we want to RDP onto the Azure VM and get a high level overview of what is happening on the VM. We see WaHostBootstrapper running, but we do not see WaWorkerHost.exe and watching task manager for a minute or two we don’t see any other changes. From the notes in Scenario 1 we know that this probably means a problem with a startup task.
Check the logs
We know from Windows Azure PaaS Compute Diagnostics Data that the WaHostBootstrapper logs are in C:\Resources\WaHostBootstrapper.log. Checking that folder we see only one WaHostBootstrapper.log file, and no instances of WaHostBootstarpper.log.old.<index>. This means that WaHostBootstrapper has only tried to start one time and it isn’t recycling due to an error.
WaHostBootstrapper log:
I am going to skip over all of the stuff that we expect to see in every host bootstrapper log (see Scenario 2 for more info) and go straight to the bottom of the log because I am interested in whatever host bootstrapper is doing right now, not what it did in the past. These are the last two lines in the file:
[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .
We can compare the timestamp in host bootstrapper (2013/09/05, 21:47:15.487) to the current system time (2013/09/05, 22:13:34.581) to see that the host bootstrapper has been trying to run E:\approot\StartupTasks\Startup.cmd for about 30 minutes. Notice that the task type is 0 (“Executing Startup Task type=0”) which is a simple startup task which causes the host bootstrapper to wait until the task finishes before continuing execution (see the taskType entry at https://msdn.microsoft.com/en-us/library/windowsazure/gg557552.aspx).
Note that we can also get this same information from the event viewer. We can see that the last thing that Azure is doing is trying to run the startup task:
Check the startup task and test the fix
Now that we know where we are failing lets go check the startup task. Here are the contents of e:\approot\StartupTasks\Startup.cmd:
ECHO Press Enter When Ready
PAUSE
REM Other commands here...
This one is pretty simple and obvious to tell what is wrong, but if this were more complicated logic then we could run the startup task manually, attach a debugger to it, etc. Now it would be nice to test a quick fix to the startup task to make sure it fixes the problem without having to completely redeploy the solution. Using the information at How to Modify a Running Azure Service we know that we can terminate WaHostBootstrapper, replace the Startup.cmd, and then wait.
I modified Startup.cmd to remove the PAUSE line, and terminated WaHostBootstrapper, and waited for a minute, and now I see WaWorkerHost.exe running:
And the WaHostBootstrapper.log shows that my startup task finished correctly and the role host process (WaWorkerHost) was started:
[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Program "E:\approot\StartupTasks\Startup.cmd" exited with 0. Working Directory = E:\approot\
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetRoleHostDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Executing base\x64\WaWorkerHost.exe .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Role host process PID: 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Registering client with PID 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client WaWorkerHost.exe (4036) registered.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client process 4036 is the role host.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Role host process registered.
Comments
Anonymous
February 05, 2015
The comment has been removedAnonymous
February 05, 2015
Andrew, you are 100% correct. There have been significant improvements in this area since Azure was released and these types of difficult to troubleshoot problems are far less frequent today than they used to be. We have also identified several more common failure paths and have some nice improvements in the release pipeline that should further help with this scenario.Anonymous
June 18, 2015
This is a very nice post especially the troubleshooting section , it has helped me a lot .Thanks Kevin Also kevin, as part of startup tasks does cloud services connect to db also?Anonymous
June 19, 2015
Thanks for the feedback Ismail. I am not sure what you mean when you ask if a cloud service connects to db in a startup task? Your startup task can be anything you want it to be (batch file, powershell, custom EXE, etc), and within the startup task you can do anything you want to do (connect to a DB, download files, install MSIs, etc).Anonymous
July 09, 2015
I agree with Andrew, this is ridiculous. The portal or studio should report the problem, what you have presented here is unacceptable.Anonymous
November 24, 2015
Which part of the solution (which files) might be "throwing unhandled exceptions"? Thanks!