For context: on Wednesday at 4:20 am, one of our Azure SQL databases was unavailable for a few minutes. I noticed this while reviewing Application Insights exceptions that stated: "'db-name-example' on server 'db-server-example' is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of 'Example-ID'."
Upon reviewing the portal Activity Log for this timeframe, I found a "Health Event" for the database (severity=informational):
"details": "Your database was moved to a different machine to ensure it has the resources required for its compute size. This is an occasional transient operation. Currently, Azure shows the downtime for your SQL database resource at a two-minute granularity. The actual downtime may be less than that. Please also note the outage window may be shifted by around 5 minutes.",
Earlier this year, we encountered a more critical situation. In response, I created an Action Group notification, SQL_Database_Alerts, that sends an SMS and an email, supposedly whenever any database in a particular resource group is unavailable for any reason, regardless of severity.
The obvious fix is to enable zone redundancy to avoid this issue, but the team would still like to receive the alert.
I have read up on Action Groups, metric alerts, and transient connection errors. While it is not explicitly stated, I am reading between the lines that transient connection errors are not considered abnormal, just very brief, and consequently will not trigger the SQL_Database_Alerts Action Group.
Is my suspicion correct? Or is there a way to get notified of these transient database errors?
As a workaround, I am considering setting up a metric alert on failed connections. It will not tell me that the database was down because of a transient error, but it should alert the team to failed connections, and investigating those would reveal the cause in each case.
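To make the idea concrete, here is a rough sketch of the kind of metric alert I have in mind, written as a Python dict in the shape of an ARM `Microsoft.Insights/metricAlerts` resource. Everything specific in it is an assumption on my part: the metric name `connection_failed`, the threshold, the evaluation windows, the alert name, and the placeholder resource IDs. I have not deployed this, so treat it as an illustration and not a verified configuration.

```python
import json

# Placeholder resource IDs (hypothetical); substitute real values before use.
DATABASE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Sql/servers/db-server-example/databases/db-name-example"
)
ACTION_GROUP_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Insights/actionGroups/SQL_Database_Alerts"
)


def failed_connection_alert(database_id, action_group_id, threshold=0):
    """Build an ARM-style metric alert that fires on failed connections."""
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": "sql-failed-connections",  # hypothetical alert name
        "location": "global",
        "properties": {
            "description": "Failed connections on the database",
            "severity": 2,
            "enabled": True,
            "scopes": [database_id],
            # Assumed windows: evaluate every minute over the last 5 minutes.
            "evaluationFrequency": "PT1M",
            "windowSize": "PT5M",
            "criteria": {
                "odata.type": (
                    "Microsoft.Azure.Monitor."
                    "SingleResourceMultipleMetricCriteria"
                ),
                "allOf": [
                    {
                        "name": "FailedConnections",
                        "criterionType": "StaticThresholdCriterion",
                        "metricNamespace": "Microsoft.Sql/servers/databases",
                        # Assumed metric name for failed connections.
                        "metricName": "connection_failed",
                        "operator": "GreaterThan",
                        "threshold": threshold,
                        "timeAggregation": "Total",
                    }
                ],
            },
            # Reuse the existing Action Group so SMS and email still go out.
            "actions": [{"actionGroupId": action_group_id}],
        },
    }


alert = failed_connection_alert(DATABASE_ID, ACTION_GROUP_ID)
print(json.dumps(alert, indent=2))
```

With a threshold of 0, any failed connection in the window would notify the team, which is likely noisy; raising the threshold would trade sensitivity for fewer false alarms.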