Troubleshoot high CPU utilization in Azure Database for PostgreSQL flexible server

2024-12-17

APPLIES TO: Azure Database for PostgreSQL - Flexible Server

This article describes how to identify the root cause of high CPU utilization. It also provides possible remedial actions to control CPU utilization when using Azure Database for PostgreSQL Flexible Server.

In this article, you can learn:

About troubleshooting guides to identify and get recommendations to mitigate root causes.
About tools to identify high CPU utilization such as Azure Metrics, query store, and pg_stat_statements.
How to identify root causes, such as long running queries and total connections.
How to resolve high CPU utilization by using EXPLAIN ANALYZE, connection pooling, and vacuuming tables.

Troubleshooting guides

Using the Troubleshooting guides you can identify the probable root cause of a high CPU scenario, and can read through recommendations to mitigate the problem found.

To learn how to set up and use the troubleshooting guides, follow setup troubleshooting guides.

Tools to identify high CPU utilization

Consider the use of the following list of tools to identify high CPU utilization.

Azure Metrics

Azure Metrics is a good starting point to check the CPU utilization for a specific period. Metrics provide information about the resources utilized during the period in which CPU utilization is high. Compare the graphs of Write IOPs, Read IOPs, Read Throughput Bytes/Sec, and Write Throughput Bytes/Sec with CPU percent, to find out times when the workload caused high CPU.

For proactive monitoring, you can configure alerts on the metrics. For step-by-step guidance, see Azure Metrics.

Query store

Query store automatically captures the history of queries and runtime statistics, and it retains them for your review. It slices the data by time, so that you can see temporal usage patterns. Data for all users, databases, and queries is stored in a database named azure_sys in the Azure Database for PostgreSQL flexible server instance.

Query store can correlate wait event information with query run time statistics. Use query store to identify queries that have high CPU consumption during the period of interest.

For more information, see query store.

pg_stat_statements

The pg_stat_statements extension helps identify queries that consume time on the server. For more information about this extension, see its documentation.

For Postgres versions 13 and above, use the following statement to view the top five SQL statements by mean or average execution time:

SELECT userid::regrole, dbid, query, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;

For Postgres version 12, use the following statement to view the top five SQL statements by mean or average execution time:

SELECT userid::regrole, dbid, query
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 5;

Total execution time

Execute the following statements to view the top five SQL statements by total execution time.

Postgres v13 & higher
Postgres v12

For Postgres versions 13 and above, use the following statement to view the top five SQL statements by total execution time:

SELECT userid::regrole, dbid, query
FROM pg_stat_statements
ORDER BY total_exec_time
DESC LIMIT 5;

For Postgres version 12, use the following statement to view the top five SQL statements by total execution time:

SELECT userid::regrole, dbid, query,
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;

Identify root causes

If CPU consumption levels are high in general, the following ones could be possible root causes:

Long-running transactions

Long-running transactions can consume CPU resources that can lead to high CPU utilization.

The following query helps identify connections running for the longest time:

SELECT pid, usename, datname, query, now() - xact_start as duration
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY duration DESC;

Total number of connections and number of connections by state

A large number of connections to the database might also lead to increased CPU and memory utilization.

The following query gives information about the number of connections by state:

SELECT state, count(*)
FROM  pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state
ORDER BY state ASC;

Resolve high CPU utilization

Use EXPLAIN ANALYZE, consider using the built-in PgBouncer connection pooler, and terminate long running transactions to resolve high CPU utilization.

Use EXPLAIN ANALYZE

Once you know the queries that are consuming more CPU, use EXPLAIN ANALYZE to further investigate and tune them.

For more information about the EXPLAIN ANALYZE command, review its documentation.

PgBouncer, a built-in connection pooler

In situations where there are many short-lived connections, or many connections that remain idle for most of their life, consider using a connection pooler like PgBouncer.

For more information about PgBouncer, see connection pooler and connection handling best practices with PostgreSQL

Azure Database for PostgreSQL flexible server offers PgBouncer as a built-in connection pooling solution. For more information, see PgBouncer.

Terminate long running transactions

You could consider killing a long running transaction as an option.

To terminate a session's PID, you need to find its PID by using the following query:

SELECT pid, usename, datname, query, now() - xact_start as duration
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND state IN ('idle in transaction', 'active')
ORDER BY duration DESC;

You can also filter by other properties like usename (user name), datname (database name), etc.

Once you have the session's PID, you can terminate it using the following query:

SELECT pg_terminate_backend(pid);

Monitor vacuum and table stats

Keeping table statistics up to date helps improve query performance. Monitor whether regular autovacuuming is being carried out.

The following query helps to identify the tables that need vacuuming:

SELECT schemaname,relname,n_dead_tup,n_live_tup,last_vacuum,last_analyze, last_autovacuum,last_autoanalyze
FROM pg_stat_all_tables
WHERE n_live_tup > 0;

last_autovacuum and last_autoanalyze columns give the date and time when the table was last autovacuumed or analyzed. If the tables aren't being vacuumed regularly, take steps to tune autovacuum.

For more information about autovacuum troubleshooting and tuning, see Autovacuum Troubleshooting.

A short-term solution would be to do a manual vacuum analyze of the tables where slow queries are seen:

VACUUM ANALYZE <table>;

Share via