App Service Health Check Alert
We have a number of App Services for which we want to use health checks. As a first step we want to use them in alerts, and later for auto-healing, load balancing, etc. I have observed several behaviours that do not appear to be correct.
For testing, I implemented a switch in one service that starts reporting bad health when a certain environment variable is set. We also tested by simply stopping the service.
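For reference, the switch can be sketched roughly as below, using Python's standard-library HTTP server. The variable name `FORCE_UNHEALTHY` and the port are hypothetical; our actual service differs, but the mechanism is the same.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical switch variable; the real name in our service differs.
UNHEALTHY_VAR = "FORCE_UNHEALTHY"

def health_status() -> int:
    """Return 503 when the switch variable is set, 200 otherwise."""
    return 503 if os.environ.get(UNHEALTHY_VAR) else 200

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = health_status()
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"unhealthy" if status == 503 else b"healthy")
        else:
            self.send_response(404)
            self.end_headers()

def serve(port: int = 8080) -> None:
    """Start the test server (not called automatically)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```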
Steps:
1. Set up a service with a health endpoint at /health
2. Enabled health check under "Monitoring > Health check" and set the correct path
3. The metrics are visible when I click the metrics button there
4. I stop the service and leave it for half an hour
5. No alert is fired
6. I start the service again
7. Now an alert is triggered
8. In a second test I set the environment variable mentioned above and verify that the endpoint starts reporting bad health (HTTP 503)
9. After a while an alert is triggered (which is correct)
10. I leave the service in the bad-health state
11. After 10 minutes the alert is deactivated automatically
12. In the metrics graph mentioned above the value is back at 100 (which is incorrect)
13. After a further 15 minutes the alert fires again
14. Now I see two dips about 20 minutes apart with a plateau at 100 between them
15. This behaviour continues in a cycle (alert, automatic deactivation 10 minutes later, another alert 20 minutes after that)
Expected behaviour:
I expect an alert to be sent when the service is stopped (point 5). When the service is stopped, there are no entries at all in the AzureMetrics log table.
I also do not expect the alert to resolve while the service is still unhealthy (point 11), and the metric should not report good health in the intervening time (point 12).
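A minimal sketch of what we suspect is happening with the stopped service: if the alert rule only compares a threshold against minutes that actually have data, a stopped service (which writes no AzureMetrics rows at all) gives the rule nothing to evaluate. The timestamps and values below are made up for illustration; this is an assumption about the evaluation logic, not documented behaviour.

```python
# Sketch: a threshold alert over per-minute averages, where minutes with no
# samples simply contribute nothing. All data below is illustrative.

def evaluate_alert(series: dict, window: list, threshold: float = 100.0) -> bool:
    """Fire if the average over the minutes that HAVE data is below threshold.

    With no data at all in the window there is nothing to compare, so the
    rule stays silent -- matching the observed 'stopped service, no alert'.
    """
    values = [series[m] for m in window if m in series]
    if not values:          # service stopped: no AzureMetrics rows at all
        return False        # threshold rule cannot fire on absent data
    return sum(values) / len(values) < threshold

healthy = {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0}
unhealthy = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
stopped = {}  # no rows while the app is stopped

print(evaluate_alert(healthy, [0, 1, 2, 3, 4]))    # False - healthy
print(evaluate_alert(unhealthy, [0, 1, 2, 3, 4]))  # True - fires
print(evaluate_alert(stopped, [0, 1, 2, 3, 4]))    # False - no data, no alert
```

If this reading is right, catching the stopped case would need a rule that treats "no data" itself as unhealthy, not a lower threshold.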
Further question:
In the AzureMetrics log table, the metric HealthCheckStatus has the following values:
Total 100
Count 1
Maximum 100
Minimum 100
Average 100
TimeGrain PT1M
UnitName Count
This doesn't make any sense to me. Count is 1 but Total is 100? Is the value a percentage? Which of these values should I alert on? Minimum < 100?
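If the metric is a percentage sampled once per one-minute grain (which the values above would suggest, but I cannot confirm), then Count = number of samples in the grain, Total = their sum, and Average = Total / Count, so one healthy sample of 100 yields exactly the row shown. A minimal sketch of that aggregation, assuming this interpretation:

```python
# Assumed reading of the AzureMetrics row: HealthCheckStatus is a percentage
# (0-100), and each PT1M grain aggregates the samples received in that minute.

def aggregate(samples: list) -> dict:
    """Compute the per-grain aggregates as they appear in AzureMetrics."""
    return {
        "Total": sum(samples),
        "Count": len(samples),
        "Maximum": max(samples),
        "Minimum": min(samples),
        "Average": sum(samples) / len(samples),
    }

print(aggregate([100.0]))        # one healthy sample: Total 100, Count 1
print(aggregate([100.0, 0.0]))   # mixed minute: Average 50, Minimum 0
```

Under that assumption the row in question would simply be one 100% sample in the minute, but I would like confirmation of which aggregate the alert should use.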