Keep your Windows Azure applications running with custom health checks

Summary: Even though Windows Azure does a great job of keeping your VMs running, only you know exactly what it means for your own apps to be healthy. This post and sample code show a pattern for implementing custom health checks that can report the health of your application, recover from failures if possible, and bring an instance offline if not.

The big selling point of Platform as a Service (PaaS) is that managing things like the hardware, network and operating system is someone else’s problem. So if a Windows Azure server spontaneously combusts in the middle of the night, the fabric will automatically detect this and move any affected VMs onto servers not currently engulfed by flames. However, keeping your applications healthy is still your responsibility. While a simple web application is almost certain to keep running if the underlying OS and hardware are healthy, this isn’t necessarily true for a complex enterprise application. A complex app could depend on various Windows Services running, files and registry keys being in a certain state, and connectivity to local and remote services being in place. If any of these things break, it could be disastrous for your application, yet as far as the Windows Azure fabric is concerned, everything is running perfectly.

Fortunately, Windows Azure provides you with a couple of hooks that let you build health checks into your deployments. The first of these is Load Balancer Probes. These can be specified in your ServiceDefinition.csdef file, and instruct the Windows Azure load balancer to regularly ping URLs of your choice to check that your service is still alive. If the response is anything but a 200 OK, your VM is deemed unhealthy and it’s removed from the load balancer. This mechanism is useful as it’s completely declarative, works on both PaaS (cloud services) and IaaS (virtual machines), and can be hooked into existing production or health check pages built into a web app. On the downside, when a Load Balancer Probe detects your instance is unhealthy, it doesn’t appear to change the VM state or otherwise indicate to operations staff that something is wrong. Also, web pages are not always the best launching point for checking the health of things like Windows Services.
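As a rough illustration, a probe declaration in ServiceDefinition.csdef looks something like the fragment below. The service, role, probe and endpoint names, the /healthcheck.aspx path and the timing values are placeholders I've made up for this sketch, so check the element and attribute names against the schema for your SDK version before relying on it.

```xml
<ServiceDefinition name="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <LoadBalancerProbes>
    <!-- The load balancer requests this path at the given interval; anything but 200 OK counts as a failure -->
    <LoadBalancerProbe name="WebHealthProbe" protocol="http" path="/healthcheck.aspx"
                       intervalInSeconds="15" timeoutInSeconds="31" />
  </LoadBalancerProbes>
  <WebRole name="WebRole1">
    <!-- Sites, bindings and other role settings omitted -->
    <Endpoints>
      <!-- The endpoint opts in to the probe by name -->
      <InputEndpoint name="Endpoint1" protocol="http" port="80" loadBalancerProbe="WebHealthProbe" />
    </Endpoints>
  </WebRole>
</ServiceDefinition>
```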

The other option is to subscribe to the RoleEnvironment.StatusCheck event, which gets fired every time the Windows Azure Guest Agent needs to report its status to the fabric. In the event handler you can implement any logic you wish, and if you deem the instance unhealthy you can request the instance’s status be changed to Busy until the next status check, resulting in it being removed from the load balancer. Since this approach leverages the RoleEnvironment it’s PaaS only, but I found it more useful than Load Balancer Probes so it’s the approach I’ll use in this post and accompanying sample.
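In its simplest form, handling this event might look like the sketch below. It assumes a reference to the Microsoft.WindowsAzure.ServiceRuntime assembly from the SDK, and IsApplicationHealthy is a hypothetical placeholder for whatever checks matter to your application.

```csharp
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // StatusCheck is raised each time the instance is about to report its status
        RoleEnvironment.StatusCheck += OnStatusCheck;
        return base.OnStart();
    }

    private void OnStatusCheck(object sender, RoleInstanceStatusCheckEventArgs e)
    {
        if (!IsApplicationHealthy())
        {
            // Busy only lasts until the next status check, so this has to be
            // called again on every check for as long as the problem persists
            e.SetBusy();
        }
    }

    // Hypothetical placeholder: replace with real application checks
    private static bool IsApplicationHealthy()
    {
        return true;
    }
}
```

Because the Busy status is not sticky, the instance returns to the load balancer on its own as soon as the handler stops calling SetBusy().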

First, the mandatory disclaimer. I’ve made some source code available that shows the solution described in the post, but this is sample code, not production-quality code. It hasn’t been extensively tested and may well contain bugs. For enterprise use you should look at incorporating dependency injection and possibly a plug-in framework like MEF to provide more flexibility and testability. And the two health check classes I’ve supplied are designed as examples of what kind of thing is possible, rather than being complete implementations that will solve real problems. However, I’m confident the pattern is sound, so feel free to build upon this (with appropriate testing) in your own solutions.

With that out of the way, let’s talk about the solution. My code contains a project called AzureHealthCheck which contains both the framework classes and a couple of sample health checks. This can be wired up into any web or worker roles in your PaaS cloud services. You do this by creating an instance of HealthCheckController, starting it when the role starts and stopping it when the role stops:

public class WebRole : RoleEntryPoint
{
    private HealthCheckController _healthCheckController = new HealthCheckController();

    public override bool OnStart()
    {
        _healthCheckController.Start();
        return base.OnStart();
    }

    public override void OnStop()
    {
        _healthCheckController.Stop();
        base.OnStop();
    }       
}

I won’t show all the code for HealthCheckController here, but here’s what it does:

  • Retrieves the list of health checks and their settings from the ServiceConfiguration setting called “HealthCheckers”, instantiates the classes and stores them in a list
  • Subscribes to the RoleEnvironment.Changed and RoleEnvironment.StatusCheck events
  • In the RoleEnvironment_Changed event handler, reloads the configuration and refreshes the list of health checkers
  • In the RoleEnvironment_StatusCheck event handler:
    • Iterates through all configured health checks
      • Calls the CheckHealth() method
      • If the instance is deemed unhealthy, and the health checker wasn’t configured to keep the instance running anyway, calls SetBusy() to remove the instance from the load balancer
      • For any unhealthy or error conditions, logs the results to Windows Azure Diagnostics (which could be read by System Center Operations Manager or similar tools to perform further actions such as rebooting or reimaging the instance)

The health checkers are custom classes that implement the IHealthChecker interface:

public interface IHealthChecker
{
    void Initialize(string initData);
    HealthCheckResult CheckHealth();
}

public class HealthCheckResult
{
    public bool IsHealthy { get; set; }
    public bool RemediationAttempted { get; set; }
    public string StatusMessage { get; set; }
}

As you can see, this is a pretty simple contract. Each health checker can be initialised with a custom string, it can check health, and it can report the result: whether the check showed the instance as healthy, whether remediation was performed, and any status messages that should be logged.
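With the contract in hand, the status check loop described earlier can be sketched roughly as follows. This is not the code from the sample download: ConfiguredChecker and StatusCheckRunner are names I've invented for illustration, the setBusy callback stands in for the SetBusy() method on the StatusCheck event arguments, and log stands in for whatever diagnostics tracing you use. Treating a checker that throws as unhealthy is also my assumption.

```csharp
using System;
using System.Collections.Generic;

// Contract types repeated from the post so the sketch compiles on its own
public class HealthCheckResult
{
    public bool IsHealthy { get; set; }
    public bool RemediationAttempted { get; set; }
    public string StatusMessage { get; set; }
}

public interface IHealthChecker
{
    void Initialize(string initData);
    HealthCheckResult CheckHealth();
}

// Hypothetical pairing of a checker with its "keep running anyway" setting
public class ConfiguredChecker
{
    public IHealthChecker Checker { get; set; }
    public bool KeepRunningIfUnhealthy { get; set; }
}

public static class StatusCheckRunner
{
    public static void Run(IEnumerable<ConfiguredChecker> checkers, Action setBusy, Action<string> log)
    {
        foreach (var item in checkers)
        {
            HealthCheckResult result;
            try
            {
                result = item.Checker.CheckHealth();
            }
            catch (Exception ex)
            {
                // Assumption: a checker that throws counts as unhealthy
                result = new HealthCheckResult { IsHealthy = false, StatusMessage = ex.Message };
            }

            // Log failures, and also successful remediations, so operations staff can see them
            if (!result.IsHealthy || result.RemediationAttempted)
            {
                log(String.Format("{0}: healthy={1}, remediated={2}. {3}",
                    item.Checker.GetType().Name, result.IsHealthy,
                    result.RemediationAttempted, result.StatusMessage));
            }

            // Only take the instance out of rotation if this checker is allowed to
            if (!result.IsHealthy && !item.KeepRunningIfUnhealthy)
            {
                setBusy();
            }
        }
    }
}
```

Passing setBusy and log in as delegates keeps the loop free of any dependency on the RoleEnvironment, which makes it straightforward to exercise in a unit test.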

In my code I have two sample health checkers. The first, PingUrlHealthChecker, is a lot like a Load Balancer Probe in that it calls a URL in your main web site and reports unhealthy if the response is anything but 200 OK. This health checker does not attempt any remediation:

using System;
using System.Net;
using Microsoft.Web.Administration;
using Microsoft.WindowsAzure.ServiceRuntime;

public class PingUrlHealthChecker : IHealthChecker
{
    private Uri _pingUrl;

    private string GetWebUrlBase()
    {
        using (var serverManager = new ServerManager())
        {
            var siteName = RoleEnvironment.CurrentRoleInstance.Id + "_Web"; // TODO: allow user to specify which site to use
            var site = serverManager.Sites[siteName];
            var binding = site.Bindings[0]; // TODO: allow user to specify which binding to use
            return String.Format("{0}://{1}:{2}/", binding.Protocol, binding.EndPoint.Address, binding.EndPoint.Port);
        }
    }

    public void Initialize(string initData)
    {
        // initData is a relative URL to ping, e.g. "/foo/bar.aspx"
        var urlBase = new Uri(GetWebUrlBase());
        _pingUrl = new Uri(urlBase, initData);
    }

    public HealthCheckResult CheckHealth()
    {
        var result = new HealthCheckResult
        {
            IsHealthy = true,
        };

        HttpWebResponse response = null;
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(_pingUrl);
            try
            {
                response = (HttpWebResponse)request.GetResponse();
            }
            catch (WebException wex)
            {
                // Error status codes arrive as a WebException that carries the response.
                // Response is null when no HTTP response came back at all (e.g. connection refused or timeout).
                response = wex.Response as HttpWebResponse;
                if (response == null)
                {
                    result.IsHealthy = false;
                    result.StatusMessage = String.Format("No HTTP response from URL {0}: {1}", _pingUrl, wex.Message);
                    return result;
                }
            }

            if (response.StatusCode != HttpStatusCode.OK)
            {
                result.IsHealthy = false;
                result.StatusMessage += String.Format("HTTP Response '{0} {1}' returned from URL {2}.", (int)response.StatusCode, response.StatusDescription, _pingUrl);
            }
        }
        finally
        {
            // Release the underlying connection
            if (response != null)
            {
                response.Close();
            }
        }

        return result;
    }
}

The second sample health checker, WindowsServiceHealthChecker, does just what it says on the box: it checks if a specified Windows Service is running. However, this one supports remediation, in that it will attempt to restart the service if it isn’t running. If it’s successful, the instance will still be reported as healthy (although a warning is still logged). If not, the instance will be reported as unhealthy.

using System;
using System.ServiceProcess;

public class WindowsServiceHealthChecker : IHealthChecker
{
    private string _serviceName;

    public void Initialize(string initData)
    {
        // initData is the name of the Windows Service to check, e.g. "W3SVC"
        _serviceName = initData;
    }

    public HealthCheckResult CheckHealth()
    {
        var result = new HealthCheckResult();
        using (var sc = new ServiceController(_serviceName))
        {
            if (sc.Status == ServiceControllerStatus.Stopped)
            {
                result.RemediationAttempted = true;
                try
                {
                    sc.Start();
                    // Wait up to 10 seconds for the service to start, without spinning the CPU
                    sc.WaitForStatus(ServiceControllerStatus.Running, TimeSpan.FromSeconds(10));
                    result.StatusMessage = String.Format("Service '{0}' was restarted. ", _serviceName);
                }
                catch (Exception ex)
                {
                    // Covers both a failed Start() and a timeout while waiting for the service to run
                    result.StatusMessage = String.Format("Service '{0}' could not be restarted: '{1}'. ", _serviceName, ex.Message);
                }

                // Make sure the status read below is current
                sc.Refresh();
            }

            if (sc.Status == ServiceControllerStatus.Running)
            {
                result.IsHealthy = true;
            }
            else
            {
                result.IsHealthy = false;
                result.StatusMessage += String.Format("Service '{0}' is in state {1}.", _serviceName, sc.Status);
            }
        }

        return result;
    }
}

Hopefully these two sample health checkers will give you an idea of what’s possible, so you can come up with more interesting and sophisticated ones for your own application. Once your health checkers are defined, the last step is to configure your cloud service with a setting called “HealthCheckers” in ServiceConfiguration.cscfg. The value is a JSON-formatted array of health checkers, for each of which you must specify the typeName (assembly-qualified) and the initData, and can optionally specify a Boolean value keepRunningIfUnhealthy (default is false). An example setting is shown below, although keep in mind that if you edit it using Visual Studio’s role configuration dialog you don’t need to escape the quotes yourself.

 <Setting name="HealthCheckers" value="[{&quot;typeName&quot;: &quot;AzureHealthCheck.PingUrlHealthChecker, AzureHealthCheck&quot;, &quot;initData&quot;: &quot;FileProbe.txt&quot;},
{&quot;typeName&quot;: &quot;AzureHealthCheck.WindowsServiceHealthChecker, AzureHealthCheck&quot;, &quot;initData&quot;: &quot;W3SVC&quot;},
{&quot;typeName&quot;: &quot;AzureHealthCheck.WindowsServiceHealthChecker, AzureHealthCheck&quot;, &quot;initData&quot;: &quot;W32Time&quot;, &quot;keepRunningIfUnhealthy&quot; : &quot;true&quot;}]" />

Note that since this setting is in ServiceConfiguration.cscfg, you can change it after deployment, although of course any health check assemblies you use need to be deployed with your solution.
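For a feel of how a setting value like this could be turned into objects, here is a minimal sketch. It is not necessarily how the sample's HealthCheckController does it: HealthCheckerSetting and HealthCheckerSettings are names I've made up, and I've chosen DataContractJsonSerializer simply because it ships with the .NET Framework. Since the example above stores keepRunningIfUnhealthy as a quoted string, the sketch reads it as a string and converts it afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;
using System.Text;

// Hypothetical shape of one entry in the "HealthCheckers" JSON array
[DataContract]
public class HealthCheckerSetting
{
    [DataMember(Name = "typeName")]
    public string TypeName { get; set; }

    [DataMember(Name = "initData")]
    public string InitData { get; set; }

    // Kept as a string because the example setting quotes the value
    [DataMember(Name = "keepRunningIfUnhealthy")]
    public string KeepRunningIfUnhealthy { get; set; }

    // Missing or unparseable values fall back to false, matching the documented default
    public bool KeepRunning
    {
        get
        {
            bool parsed;
            return bool.TryParse(KeepRunningIfUnhealthy, out parsed) && parsed;
        }
    }
}

public static class HealthCheckerSettings
{
    // Parses the unescaped JSON value of the setting into a list of entries
    public static List<HealthCheckerSetting> Parse(string json)
    {
        var serializer = new DataContractJsonSerializer(typeof(List<HealthCheckerSetting>));
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
        {
            return (List<HealthCheckerSetting>)serializer.ReadObject(stream);
        }
    }
}
```

From there, each entry's TypeName could be resolved with Type.GetType and instantiated with Activator.CreateInstance before calling Initialize with the InitData value.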

Running the Sample

If you like the look of this and would like to try it out, please download the sample code. Here’s what you need to do to get it running and test it out:

  1. If you don’t already have a Windows Azure subscription, sign up now and download the .NET SDK for Visual Studio 2012
  2. Open the sample code solution in Visual Studio 2012
  3. Update the diagnostics connection string for your cloud storage account
  4. Publish the solution to the cloud, ensuring you enable remote desktop and configure your credentials
  5. Once the solution is up and running in the cloud, remote desktop into any one of the instances and try:
    1. Browsing to E:\sitesroot\0 and deleting/renaming the file called ProbeFile.txt. You should see the instance status change to Busy, the instance removed from the load balancer, and error logs written to the WADLogsTable. Once you’re done, restore the file to make the instance healthy again.
    2. Opening the Services console and stopping the “World Wide Web Publishing Service”. You should see the service automatically restart, and a warning log written to the WADLogsTable, but the instance will remain healthy as the issue was remediated.
    3. Opening the Services console, stopping and disabling the “Windows Time” service (disabling it prevents the remediation from succeeding). You should see logs in the WADLogsTable saying the instance is unhealthy, however it will not be set to Busy since this health checker has keepRunningIfUnhealthy set to true.

As I’ve said before, this is just a sample, but I hope that it will serve as a useful starting point that will help you build rock-solid PaaS solutions on top of Windows Azure.