App Service Health Check Alert
We have a number of App Services for which we want to use health checks. As a first step we want to use them in alerts, and later for auto-healing, load balancing, etc. I have observed several behaviours that do not appear to be correct.
For testing, I implemented a switch in one service that starts reporting bad health when a certain environment variable is set. We also tested by simply stopping the service.
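For reference, the switch can be sketched roughly as below, using Python's standard-library HTTP server. The variable name `FORCE_UNHEALTHY` and the port are hypothetical; our actual service differs, but the mechanism is the same.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical switch variable; the real name in our service differs.
UNHEALTHY_VAR = "FORCE_UNHEALTHY"

def health_status() -> int:
    """Return 503 when the switch variable is set, 200 otherwise."""
    return 503 if os.environ.get(UNHEALTHY_VAR) else 200

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = health_status()
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"unhealthy" if status == 503 else b"healthy")
        else:
            self.send_response(404)
            self.end_headers()

def serve(port: int = 8080) -> None:
    """Start the test server (not called automatically)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```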
Steps:
1. Set up a service with a health endpoint at /health
2. Enabled health check under "Monitoring > Health check" and set the correct path
3. The metrics are visible when I click the metrics button there
4. I stop the service and leave it for half an hour
5. No alert is fired
6. I start the service again
7. Now an alert is triggered
8. In a second test I set the environment variable mentioned above and verify that the endpoint starts reporting bad health (HTTP 503)
9. After a while an alert is triggered (which is correct)
10. I leave the service in the bad-health state
11. After 10 minutes the alert is deactivated automatically
12. In the metrics graph mentioned above the value is back at 100 (which is incorrect)
13. After a further 15 minutes the alert fires again
14. Now I see two dips about 20 minutes apart with a plateau at 100 between them
15. This behaviour continues in a cycle (alert, automatic deactivation 10 minutes later, another alert 20 minutes after that)
Expected behaviour:
I expect an alert to be sent when the service is stopped (point 5). When the service is stopped, there are no entries at all in the AzureMetrics log table.
I also do not expect the alert to resolve while the service is still unhealthy (point 11), and the metric should not report good health in the intervening time (point 12).
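A minimal sketch of what we suspect is happening with the stopped service: if the alert rule only compares a threshold against minutes that actually have data, a stopped service (which writes no AzureMetrics rows at all) gives the rule nothing to evaluate. The timestamps and values below are made up for illustration; this is an assumption about the evaluation logic, not documented behaviour.

```python
# Sketch: a threshold alert over per-minute averages, where minutes with no
# samples simply contribute nothing. All data below is illustrative.

def evaluate_alert(series: dict, window: list, threshold: float = 100.0) -> bool:
    """Fire if the average over the minutes that HAVE data is below threshold.

    With no data at all in the window there is nothing to compare, so the
    rule stays silent -- matching the observed 'stopped service, no alert'.
    """
    values = [series[m] for m in window if m in series]
    if not values:          # service stopped: no AzureMetrics rows at all
        return False        # threshold rule cannot fire on absent data
    return sum(values) / len(values) < threshold

healthy = {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0}
unhealthy = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
stopped = {}  # no rows while the app is stopped

print(evaluate_alert(healthy, [0, 1, 2, 3, 4]))    # False - healthy
print(evaluate_alert(unhealthy, [0, 1, 2, 3, 4]))  # True - fires
print(evaluate_alert(stopped, [0, 1, 2, 3, 4]))    # False - no data, no alert
```

If this reading is right, catching the stopped case would need a rule that treats "no data" itself as unhealthy, not a lower threshold.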
Further question:
In the AzureMetrics log table, the metric HealthCheckStatus has the following values:
Total 100
Count 1
Maximum 100
Minimum 100
Average 100
TimeGrain PT1M
UnitName Count
This doesn't make any sense to me. Count is 1 but Total is 100? Is the value a percentage? Which of these values should I alert on? Minimum < 100?
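If the metric is a percentage sampled once per one-minute grain (which the values above would suggest, but I cannot confirm), then Count = number of samples in the grain, Total = their sum, and Average = Total / Count, so one healthy sample of 100 yields exactly the row shown. A minimal sketch of that aggregation, assuming this interpretation:

```python
# Assumed reading of the AzureMetrics row: HealthCheckStatus is a percentage
# (0-100), and each PT1M grain aggregates the samples received in that minute.

def aggregate(samples: list) -> dict:
    """Compute the per-grain aggregates as they appear in AzureMetrics."""
    return {
        "Total": sum(samples),
        "Count": len(samples),
        "Maximum": max(samples),
        "Minimum": min(samples),
        "Average": sum(samples) / len(samples),
    }

print(aggregate([100.0]))        # one healthy sample: Total 100, Count 1
print(aggregate([100.0, 0.0]))   # mixed minute: Average 50, Minimum 0
```

Under that assumption the row in question would simply be one 100% sample in the minute, but I would like confirmation of which aggregate the alert should use.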