Monitor Azure SQL Database with metrics and alerts

Article
05/21/2024

You can use Azure Monitor metrics to monitor database and elastic pool resource consumption and health. You can use alerts to send notifications when metric values indicate a potential problem.

Metrics

A metric is a series of numeric value measured at regular time intervals, often using units such as count, percent, bytes, etc. Depending on the nature of the metric, you can use aggregations such as total, count, average, minimum, maximum to calculate metric values over a duration of time. You can split some metrics by dimensions. Each dimension provides an additional context to the numeric values.

Examples of available Azure SQL Database metrics are: CPU percentage, Data space used, Deadlocks, Tempdb Percent Log Used.

See database metrics and elastic pool metrics for all available metrics in Azure SQL Database.

Note

Some metrics apply only to specific types of databases or elastic pools. The description of each metric mentions if its use it limited to a specific database or elastic pool type, such as vCore, Hyperscale, serverless, etc.

In Azure SQL Database portal, several commonly used metrics are charted on the Monitoring tab of the Overview page. The metrics let you assess resource consumption and health of a database or an elastic pool at a glance.

Under Key metrics, select See all metrics or anywhere inside the chart to open metrics explorer. In the Metrics page, you can view all other available metrics for the database or elastic pool resource. In metrics explorer, you can change the time range, granularity, and aggregation type for the chart, change the type of chart, expand the scope to include metrics from other Azure resources, create alert rules, etc. You can also open metrics explorer by selecting Metrics menu item, under Monitoring in the resource menu.

Use metrics to monitor databases and elastic pools

You can use metrics to monitor database and elastic pool resource consumption and health. For example, you can:

Right-size the database or elastic pool to your application workload
Detect a gradual increase in resource consumption, and proactively scale up the database or elastic pool
Detect and troubleshoot a performance problem

The following table describes commonly used metrics in Azure SQL Database.

Metric name	Metric ID	Description
CPU percentage	`cpu_percent`	This metric shows CPU consumption toward the user workload limit of a database or an elastic pool, expressed as a percentage. For more information, see Resource consumption by user workloads and internal processes.
SQL instance CPU percent	`sql_instance_cpu_percent`	This metric shows the total CPU consumption by the user and system workloads, expressed as a percentage. Because this metric and the CPU percentage metric are measured on different scales, they are not directly comparable with each other. For more information, see Resource consumption by user workloads and internal processes.
Data IO percentage	`physical_data_read_percent`	This metric shows the data file IO consumption toward the user workload limit of a database or an elastic pool, expressed as a percentage. For more information, see Data IO governance.
Log IO percentage	`log_write_percent`	This metric shows the transaction log write throughput consumption toward the user workload limit of a database or an elastic pool, expressed as a percentage. For more information, see Transaction log rate governance.
Workers percentage	`workers_percent`	This metric shows the consumption of worker threads toward the user workload limit of a database or an elastic pool, expressed as a percentage.
DTU percentage	`dtu_consumption_percent`	This metric shows DTU consumption toward the user workload limit of a database or an elastic pool, expressed as a percentage. DTU percentage is derived from three other metrics: CPU percentage, Data IO percentage, and Log IO percentage. At any point in time, DTU percentage matches the highest value among these three metrics.
CPU used	`cpu_used`	This metric shows CPU consumption toward the user workload limit of a database or an elastic pool, expressed as the number of vCores. For more information, see Diagnose and troubleshoot high CPU on Azure SQL Database.
DTU used	`dtu_used`	This metric shows the number of DTUs used by a database or an elastic pool.
App CPU billed	`app_cpu_billed`	For serverless databases, this metric shows the amount of compute (CPU and memory) billed, expressed in vCore seconds. For more information, see Billing in the serverless compute tier.
App CPU percentage	`app_cpu_percent`	For serverless databases, this metric shows CPU consumption toward the app package maximum vCore limit, expressed as a percentage. For more information, see Monitoring in the serverless compute tier.
App memory percentage	`app_memory_percent`	For serverless databases, this metric shows memory consumption toward the app package maximum memory limit, expressed as a percentage. For more information, see Monitoring in the serverless compute tier.
Sessions count	`sessions_count`	This metric shows the number of established user sessions for a database or an elastic pool.
Data space used	`storage`	For databases, this metric shows the amount of storage space used in the data files of a database.
Data space used	`storage_used`	For elastic pools, this metric shows the amount of storage space used in the data files of all databases in an elastic pool.
Data space allocated	`allocated_data_storage`	This metric shows the amount of storage space occupied by the data files of a database, or by the data files of all databases in an elastic pool. Data files might contain empty space. Because of this, Data space allocated if often higher than Data space used for the same database or elastic pool. For more information, see Manage file space for databases in Azure SQL Database.
Data space used percent	`storage_percent`	For databases, this metric shows the amount of storage space used in the data files of a database toward the data size limit of a database. For elastic pools, it shows the amount of storage space used in the data files of all databases in an elastic pool, expressed as a percentage toward the data size limit of an elastic pool. The data size limit for a database or an elastic pool might be configured lower than the maximum data size limit. To find the maximum data size limit, see resource limits for vCore databases, vCore elastic pools, DTU databases, and DTU elastic pools.
Data space allocated percent	`allocated_data_storage_percent`	For elastic pools, this metric shows the amount of storage space occupied by the data files of all databases in an elastic pool toward the data size limit of the pool, expressed as a percentage.
Tempdb Percent Log Used	`tempdb_log_used_percent`	This metric shows the consumption of the transaction log space in the `tempdb` database toward the maximum log size, expressed as a percentage. For more information, see tempdb in Azure SQL Database.
Successful Connections	`connection_successful`	This metric shows the number of successfully established connections to a database. This metric can be split by two dimensions, `SslProtocol` and `ValidatedDriverNameAndVersion`, to see the number of connections using a specific encryption protocol version, or using a specific client driver.
Failed Connections : System Errors	`connection_failed`	This metric shows the number of connection attempts to a database that failed because of internal service errors. Most commonly, such errors are transient. This metric can be split by two dimensions, `Error` and `ValidatedDriverNameAndVersion`, to see the number of failed connection attempts due to a specific error, or from a specific client driver.
Failed Connections : User Errors	`connection_failed_user_error`	This metric shows the number of connection attempts to a database that failed because of user-correctable errors, such as an incorrect password or connection being blocked by firewall. This metric can be split by two dimensions, `Error` and `ValidatedDriverNameAndVersion`, to see the number of failed connection attempts due to a specific error, or from a specific client driver.
Deadlocks	`deadlock`	This metric shows the number of deadlocks in a database.
Availability	`availability`	Availability is determined based on the database being operational for connections. For each one-minute data point, the possible values are either `100%` or `0%`. For more information, see Availability metric.

Availability metric

The Availability metric tracks availability at individual Azure SQL Database level. This feature is currently in preview.

Availability is granular to one minute of connection outage. Availability is determined based on the database being operational for connections. A minute is considered as downtime or unavailable if all continuous attempts by users to establish connection to the database within the minute fail due to a service issue. If there is intermittent unavailability, the duration of continuous unavailability must cross the minute boundary to be considered as downtime. Typically, the latency to display availability is less than three minutes.

Here's the logic used for calculating Availability for every one-minute interval:

If there is at least one successful connection, then availability is 100%.
If all connections fail due to user errors, availability is 100%.
If there are no connection attempts, availability is 100%.
If all connections fail due to system errors, availability is 0%.
Currently, availability metric data is not yet supported for the serverless compute tier and will be displayed as 100%.

Availability metric is therefore a composite metric derived from following existing metrics:

Successful Connections
Failed Connections : User Errors
Blocked by Firewall
Failed Connections : System Errors

User errors include all connections that fail due to user configuration, workload, or management. System errors include all the connections that fail due to transient issues related to Azure SQL Database service.

Examples of errors caused by user configuration:
Examples of errors caused by user workload:
- Connection failure due to resource governance
Examples of errors caused by user management:
- Scaling up or down the database or elastic pool
- Geo replication planned or unplanned failover
- Failover group planned or unplanned failover
- Geo secondary database in seeding state
- Database that is in restoring state due to Point In Time Restore (PITR), Long Term Restore (LTR), or restore from a deleted database
- Database that is not yet finished being copied (Database Copy)

Alerts

You can create alert rules to notify you that the value of one metric or multiple metrics is outside of an expected range.

You can set the scope of an alert rule in multiple ways to suit your needs. For example, alert rule scope can be set to:

A single database
An elastic pool
All databases or elastic pools in a resource group
All databases or elastic pools in a subscription within an Azure region
All databases or elastic pools in a subscription within all regions

Alert rules periodically evaluate aggregated metric values over a lookback period, comparing them to a threshold value. You can configure the threshold value, evaluation frequency, and lookback period.

If an alert rule is triggered, you are notified according to your notification preferences, which you specify in the action group linked to the alert rule. For example, you can receive an email, an SMS, or a voice notification. An alert rule can also trigger actions such as webhooks, automation runbooks, functions, logic apps, etc. You can integrate alerts with supported IT service management products.

To learn more about Azure Monitor alerts, see Azure Monitor alerts overview. To get familiar with metric alerts, review Metric alerts, Manage alert rules, and Action groups.

Recommended alert rules

The metrics and optimal thresholds to use in alert rules vary across the wide spectrum of customer workloads in Azure SQL Database.

The recommended alerts in the following table are a starting point to help you define the optimal alerting configuration for your Azure SQL Database resources. Depending on your requirements, your configuration might differ from this example. You might use different thresholds, evaluation frequencies, or lookback periods. You might choose to create additional alerts, or use different alert rule configurations for different applications and environments.

Here are examples of typical alert rule configurations.

Alert rule name	Metric (signal)	Alert logic	When to evaluate	Suggested severity
High user CPU usage	CPU percentage	Threshold: `Static` Aggregation: `Average` Operator: `Greater than` Threshold value: `90`	Check every: `1 minute` Lookback period: `10 minutes`	2 - Warning
High total CPU usage	SQL instance CPU percent	Threshold: `Static` Aggregation: `Average` Operator: `Greater than` Threshold value: `90`	Check every: `1 minute` Lookback period: `10 minutes`	2 - Warning
High worker usage	Workers percentage	Threshold: `Static` Aggregation: `Minimum` Operator: `Greater than` Threshold value: `60`	Check every: `1 minute` Lookback period: `5 minutes`	1 - Error
High data IO usage	Data IO percentage	Threshold: `Static` Aggregation: `Average` Operator: `Greater than` Threshold value: `90`	Check every: `1 minute` Lookback period: `15 minutes`	3 - Informational
Low data space	Data space used percent	Threshold: `Static` Aggregation: `Minimum` Operator: `Greater than` Threshold value: `95`	Check every: `15 minute` Lookback period: `15 minutes`	1 - Error
Low `tempdb` log space	Tempdb Percent Log Used	Threshold: `Static` Aggregation: `Minimum` Operator: `Greater than` Threshold value: `60`	Check every: `1 minute` Lookback period: `5 minutes`	1 - Error
Deadlocks	Deadlocks	Threshold: `Dynamic` Aggregation: `Total` Operator: `Greater than` Threshold sensitivity: `Medium`	Check every: `15 minutes` Lookback period: `1 hour`	3 - Informational
Failed connections (user errors)	Failed Connections : User Errors	Threshold: `Dynamic` Aggregation: `Total` Operator: `Greater than` Threshold sensitivity: `Medium`	Check every: `5 minutes` Lookback period: `15 minutes`	2 - Warning
Failed connections (system errors)	Failed Connections : System Errors	Threshold: `Static` Aggregation: `Total` Operator: `Greater than` Unit: `Count` Threshold value: `10`	Check every: `1 minute` Lookback period: `5 minutes`	2 - Warning
Anomalous connection rate	Successful Connections	Threshold: `Dynamic` Aggregation: `Total` Operator: `Greater or Less than` Threshold sensitivity: `Low`	Check every: `5 minutes` Lookback period: `15 minutes`	2 - Warning

Some of the recommended alert rules use dynamic thresholds to detect anomalous metric patterns that might require attention. Alert rules based on dynamic thresholds do not trigger until sufficient historical data has been collected to establish normal patterns. For more information, see Dynamic thresholds in metric alerts.

By default, metric alerts are stateful. This means that once an alert rule is triggered, the alert is fired only once. The alert remains in the fired state until it is resolved, at which point a resolved notification is sent. An alert rule triggers a new alert only once the previous alert is resolved. Stateful alerts avoid frequent notifications about an ongoing condition. For more information about stateful and stateless alerts, see Alerts and state.

Share via

Monitor Azure SQL Database with metrics and alerts

Metrics

Use metrics to monitor databases and elastic pools

Availability metric

Alerts

Recommended alert rules

Feedback

Additional resources

Share via

Monitor Azure SQL Database with metrics and alerts

Metrics

Use metrics to monitor databases and elastic pools

Availability metric

Alerts

Recommended alert rules

Related content

Feedback

Additional resources