Alerts are not resolved when web tests turn green again
We have set up availability alert rules for our web tests. Very often, when an alert is fired because a web test has failed, the alert is not resolved when the web test starts running successfully again.
Our web test setup:
- We have a single App Service Plan that consists of 5 app services and 5 function apps.
- We have set up an availability test of type "URL ping test" that runs every 15 minutes for each app service and function app.
- We have set up two test locations for these tests and have also enabled retries (see the sketch of one test resource below).
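
For reference, a minimal sketch of what one of these ping tests looks like as an ARM resource, assuming the classic `Microsoft.Insights/webtests` schema; the names, subscription/resource group IDs, test locations and Application Insights component are placeholders, and the WebTest XML payload is shortened:

```json
{
  "type": "Microsoft.Insights/webtests",
  "apiVersion": "2015-05-01",
  "name": "ping-myappservice",
  "location": "westeurope",
  "tags": {
    "hidden-link:/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/components/<app-insights-name>": "Resource"
  },
  "properties": {
    "SyntheticMonitorId": "ping-myappservice",
    "Name": "ping-myappservice",
    "Enabled": true,
    "Frequency": 900,
    "Timeout": 120,
    "Kind": "ping",
    "RetryEnabled": true,
    "Locations": [
      { "Id": "emea-nl-ams-azr" },
      { "Id": "emea-gb-db3-azr" }
    ],
    "Configuration": {
      "WebTest": "<WebTest Name=\"ping-myappservice\" Frequency=\"900\" Timeout=\"120\" ...> (request to the app service URL omitted) </WebTest>"
    }
  }
}
```
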
Our alert setup:
- Each availability test has an alert rule with the following configuration (sketched below):
  - Alert location threshold: 1
  - Period (grain): 15 minutes (same as the test frequency)
  - Frequency: every 1 minute
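
And this is roughly what each alert rule looks like as an ARM resource when following the ARM approach referenced at the end of this post; again the IDs and names are placeholders, while the threshold, window size and evaluation frequency mirror the values listed above. `autoMitigate` is left at its default of `true`, which is what should move the alert to "Resolved" once the web test passes again:

```json
{
  "type": "Microsoft.Insights/metricAlerts",
  "apiVersion": "2018-03-01",
  "name": "availability-alert-myappservice",
  "location": "global",
  "tags": {
    "hidden-link:/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/components/<app-insights-name>": "Resource",
    "hidden-link:/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/webtests/ping-myappservice": "Resource"
  },
  "properties": {
    "description": "Availability alert for ping-myappservice",
    "severity": 1,
    "enabled": true,
    "scopes": [
      "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/webtests/ping-myappservice",
      "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/components/<app-insights-name>"
    ],
    "evaluationFrequency": "PT1M",
    "windowSize": "PT15M",
    "criteria": {
      "odata.type": "Microsoft.Azure.Monitor.WebtestLocationAvailabilityCriteria",
      "webTestId": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/webtests/ping-myappservice",
      "componentId": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/microsoft.insights/components/<app-insights-name>",
      "failedLocationCount": 1
    },
    "autoMitigate": true,
    "actions": []
  }
}
```
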
When one of the services/function apps goes down temporarily and the associated web test fails, an alert is raised as expected.
Our assumption is that as long as the service is down and its web test keeps failing on every 15-minute run, the alert remains in the "Fired" state.
When the service is back up and the subsequent web tests are green, we expect the alert to eventually move to the "Resolved" state.
In most cases this is what happens. But sometimes, and as far as we can tell randomly, the alert is not resolved: it "hangs" in the "Fired" state. If the app service then fails again afterwards, a new alert is raised.
We have tried with alert rules created automatically through the portal and with alert rules created via ARM templates as explained at https://learn.microsoft.com/en-us/azure/azure-monitor/platform/alerts-metric-create-templates#template-for-an-availability-test-along-with-a-metric-alert