App Service Health Check Alert

DavidL 11 Reputation points
2020-10-28T20:22:55.103+00:00

We have a number of App Services where we want to use health checks. As a first step we want to use them for alerts, and later for auto-healing, load balancing, etc. I have observed several things that do not appear to be correct.

To test this, I implemented a switch in one service that makes it start reporting bad health when a certain environment variable is set. We also tested by simply stopping the service.
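For illustration, a test switch like the one described could look roughly like the following. This is only a minimal sketch using the Python standard library, not the actual service: the variable name `FORCE_UNHEALTHY`, the accepted values, the response bodies, and the helper names `health_status` and `serve` are all assumptions.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical variable name; the post does not say which variable is used.
FORCE_UNHEALTHY_VAR = "FORCE_UNHEALTHY"


def health_status(environ=os.environ):
    """Return (status_code, body) for the health endpoint."""
    # Report bad health only while the test switch is set.
    if environ.get(FORCE_UNHEALTHY_VAR, "").lower() in ("1", "true", "yes"):
        return 503, b"unhealthy"
    return 200, b"healthy"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /health is served; everything else is a 404.
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = health_status()
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the test server until interrupted."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Because the switch is read on every request, it can be flipped inside a running process, but a variable set from outside (for example as an app setting) only takes effect after the process restarts.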

Steps:

  1. Set up a service with a health endpoint on /health
  2. Enable the health check under "Monitoring > Health Check" and set the correct path
  3. I can see the metrics when I click on the metrics button here
  4. I stop the service and leave for half an hour
  5. No alert is fired
  6. I start the service again
  7. Now an alert is triggered
  8. In a second, later test I set the above environment variable and verify that the endpoint starts reporting bad health (503)
  9. After a while an alert is triggered (which is ok)
  10. I leave the service in bad health state
  11. After 10 minutes the alert is deactivated automatically
  12. In the metrics graph mentioned above the value is again at 100 (which is incorrect)
  13. After a further 15 minutes the alert fires again
  14. Now I see 2 dips about 20 minutes apart and a plateau at 100 between them
  15. This behavior continues in a cycle (alert, automatic deactivation 10 minutes later, alert again 20 minutes later)
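To cross-check the metrics graph against what the endpoint really returns during steps 4 to 15, an independent poller can help. The following is a rough sketch, not part of the original test. The helper names `probe`, `describe`, and `watch` are illustrative, and it assumes that only status codes in the 200 to 299 range count as healthy.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=5.0):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as the 503 from the test switch land here.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


def describe(status):
    """Map a probe result to a coarse state (2xx = healthy is an assumption)."""
    if status is None:
        return "unreachable"
    return "healthy" if 200 <= status <= 299 else "unhealthy"


def watch(url, interval_s=60, samples=30):
    """Print one timestamped line per probe for later comparison."""
    for _ in range(samples):
        status = probe(url)
        now = datetime.now(timezone.utc).isoformat(timespec="seconds")
        print(f"{now} status={status} state={describe(status)}")
        time.sleep(interval_s)
```

Running `watch("https://<your-app>/health")` with the real host name filled in produces a UTC-timestamped log that can be laid next to the dips and plateaus in the metrics graph.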

Expected behaviour:
At point 5 I expect an alert to be sent when the service is stopped. While the service is stopped, there are no entries in the AzureMetrics log table.

I do not expect the alert to resolve (point 11) while the service is still unhealthy. The metric should not report good health in the intervening time (point 12).

Further question:
In the AzureMetrics Log Table the Metric HealthCheck Status has the following values:

Total      100
Count      1
Maximum    100
Minimum    100
Average    100
TimeGrain  PT1M
UnitName   Count

This doesn't make any sense to me. The value of Count is 1 but the Total is 100? Is it a percentage? Which of these values should I alert on? Minimum < 100?
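One reading that would be consistent with the row above is sketched below. It is an assumption, not documented behaviour: each row aggregates the samples that fall into one PT1M time grain, Count is the number of samples, and each sample is a value between 0 and 100. The function name `aggregate` is illustrative.

```python
def aggregate(samples):
    """Aggregate the samples of one time grain (assumed semantics)."""
    return {
        "Total": sum(samples),
        "Count": len(samples),
        "Maximum": max(samples),
        "Minimum": min(samples),
        "Average": sum(samples) / len(samples),
    }


# A single sample of 100 in the grain matches the row from the log table.
one_sample = aggregate([100])

# Hypothetical grain with one good and one bad sample, for comparison.
mixed = aggregate([100, 0])
```

Under that assumption a single sample of 100 gives Total 100 with Count 1, so Count would be the number of samples rather than the unit of the value. In the hypothetical mixed grain, Minimum drops to 0 and Average to 50 while Maximum and Total stay at 100.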

Azure Monitor