Canceling statement due to conflict with recovery
APPLIES TO: Azure Database for PostgreSQL - Flexible Server
This article helps you solve a problem that occurs during executing queries against read replica.
Symptoms
- When attempting to execute a query on a read replica, the query is unexpectedly terminated.
- Error messages such as "Canceling statement due to conflict with recovery" appear in the logs or in the query output.
- There might be a noticeable delay or lag in replication from the primary to the read replica.
In the provided screenshot, on the left is the primary Azure Database for PostgreSQL flexible server instance, and on the right is the read replica.
- Read replica console (right side of the screenshot above)
- We can observe a lengthy
SELECT
statement in progress. A vital aspect to note about SQL is its consistent view of the data. When an SQL statement is executed, it essentially "freezes" its view of the data. Throughout its execution, the SQL statement always sees a consistent snapshot of the data, even if changes are occurring concurrently elsewhere.
- We can observe a lengthy
- Primary console (left side of the screenshot above)
- An
UPDATE
operation has been executed. While anUPDATE
by itself doesn't necessarily disrupt the behavior of the read replica, the subsequent operation does. After the update, aVACUUM
operation (in this case, manually triggered for demonstration purposes, but it's noteworthy that an autovacuum process could also be initiated automatically) is executed. - The
VACUUM
's role is to reclaim space by removing old versions of rows. Given that the read replica is running a lengthySELECT
statement, it's currently accessing some of these rows thatVACUUM
targets for removal. - These changes initiated by the
VACUUM
operation, which include the removal of rows, get logged into the Write-Ahead Log (WAL
). As Azure Database for PostgreSQL flexible server read replicas utilize native PostgreSQL physical replication, these changes are later sent to the read replica. - Here lies the crux of the issue: the
VACUUM
operation, unaware of the ongoingSELECT
statement on the read replica, removes rows that the read replica still needs. This scenario results in what's known as a replication conflict.
- An
The aftermath of this scenario is that the read replica experiences a replication conflict due to the rows removed by the VACUUM
operation. By default, the read replica attempts to resolve this conflict for a duration of 30 seconds, since the default value of max_standby_streaming_delay
is set to 30 seconds. After this duration, if the conflict remains unresolved, the query on the read replica is canceled.
Cause
The root cause of this issue is that the read replica in Azure Database for PostgreSQL flexible server is a continuously recovering system. This situation means that while the replica is catching up with the primary, it's essentially in a state of constant recovery. If a query on a read replica tries to read a row that is simultaneously being updated by the recovery process (because the primary has made a change), Azure Database for PostgreSQL flexible server might cancel the query to allow the recovery to proceed without interruption.
Resolution
- Adjust
max_standby_streaming_delay
: Increase themax_standby_streaming_delay
parameter on the read replica. Increasing the value of the setting allows the replica more time to resolve conflicts before it decides to cancel a query. However, this might also increase replication lag, so it's a trade-off. This parameter is dynamic, meaning changes take effect without requiring a server restart. - Monitor and Optimize Queries: Review the types and frequencies of queries run against the read replica. Long-running or complex queries might be more susceptible to conflicts. Optimizing or scheduling them differently can help.
- Off-Peak Query Execution: Consider running heavy or long-running queries during off-peak hours to reduce the chances of a conflict.
- Enable
hot_standby_feedback
: Consider settinghot_standby_feedback
toon
on the read replica. When enabled, it informs the primary server about the queries currently being executed by the replica. This prevents the primary from removing rows that are still needed by the replica, reducing the likelihood of a replication conflict. This parameter is dynamic, meaning changes take effect without requiring a server restart.
Caution
Enabling hot_standby_feedback
can lead to the following potential issues:
- This setting can prevent some necessary cleanup operations on the primary, potentially leading to table bloat (increased disk space usage due to unvacuumed old row versions).
- Regular monitoring of the primary's disk space and table sizes is essential. Learn more about monitoring for Azure Database for PostgreSQL flexible server here.
- Be prepared to manage potential table bloat manually if it becomes problematic. Consider enabling autovacuum tuning in Azure Database for PostgreSQL flexible server to help mitigate this issue.
- Adjust
max_standby_archive_delay
: Themax_standby_archive_delay
server parameter specifies the maximum delay that the server will allow when reading archivedWAL
data. If the replica of the Azure Database for PostgreSQL flexible server instance ever switches from streaming mode to file-based log shipping (though rare), tweaking this value can help resolve the query cancellation issue.