Replication lag in PostgreSQL Replay Lag is extremely high (236:41:55.093431

2025-11-04T14:34:03.2933333+00:00

User's image

Replica showing Active.One more repliucation slot showing delay PostgreSQL Replay Lag is extremely high (236:41:55.093431

Azure Database for PostgreSQL
{count} votes

Answer accepted by question author
  1. Kalyani Kondavaradala 4,595 Reputation points Microsoft External Staff Moderator
    2025-11-06T15:39:24.4833333+00:00

    Hi Sivangnanam, Kuraliniyan (AWM, IRL),

    Thank you for reaching on Microsoft Q&A! Welcome

    Here are some possible causes and suggestions to address the high PostgreSQL replay lag:

    1. Check Replication slot status:

    First, use the following command to check the status of your replication slots:

    SELECT * FROM pg_replication_slots;
    

    This will help identify any inactive replication slots that might be causing issues.

    2.Monitor workloads

    High replication lag can often result from an increased write workload on your primary server. If your primary server is experiencing heavy write activity:

    • Consider reducing this load if possible.
    • Alternatively, you may want to scale your server up to a higher SKU with more resources (vCores and Memory).

    3.Scale up: If you confirm high load, scaling up your primary server may help improve performance. You can check the scaling options in the Azure portal here.

    4.Try Re-creating Replica: If the lag persists after handling workloads, you can delete and recreate the read replica after the write-heavy workload is completed. This will reset it to a good state regarding lag.

    Consider setting up alerts for replication lag and storage usage. For instance, set alerts if replication lag exceeds 5 minutes or if storage usage exceeds 80%. This way, you can proactively manage the issue in future scenarios.

    • Ensure that the parameters max_prepared_transactions, max_locks_per_transaction, and max_worker_processes are set appropriately on both the primary and replica instances.
    • If conflicts during recovery are a concern, enabling the hot_standby_feedback parameter on the replica can help prevent cancelation of long-running queries due to replication conflicts.

    Please go through the following document for more details:

    I hope this helps, please let us know if you have further queries on this.

    Thanks!

    Kalyani

    0 comments No comments

2 additional answers

Sort by: Most helpful
  1. Sivangnanam, Kuraliniyan (AWM, IRL) 20 Reputation points
    2025-11-10T11:36:30.92+00:00

    Replica slot showing Active but Replay Lag still high on primary


  2. Sina Salam 26,661 Reputation points Volunteer Moderator
    2025-11-10T14:13:54.63+00:00

    Hello Sivangnanam, Kuraliniyan (AWM, IRL),

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand you are having extreme PostgreSQL replay lag despite the replication slot showing active.

    To confirm the lag source, start by inspecting pg_stat_replication on the primary to compare sent_lsn vs replay_lsn, and use pg_replication_slots to check if WAL is being retained excessively – https://www.postgresql.org/docs/current/monitoring-stats.html#PG-STAT-REPLICATION. On the standby, run SELECT now() - pg_last_xact_replay_timestamp() to measure actual clock delay. If replay lag is high but write/flush lag is low, the issue is likely on the standby’s apply path.

    Long-running read queries on the standby can block WAL replay. Use pg_stat_activity to identify and terminate such queries, and check pg_stat_database_conflicts for lock or snapshot conflicts. Enabling hot_standby_feedback = on can reduce these conflicts, though it may increase bloat on the primary – https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-HOT-STANDBY-FEEDBACK. Also, tune max_standby_streaming_delay to limit how long replay waits for queries to finish. If I/O is the bottleneck, monitor wait events and disk throughput using tools like iostat or pg_stat_io, and consider upgrading standby storage or increasing shared_buffers and effective_io_concurrency.

    In cases where WAL generation outpaces standby capacity often due to bulk operations or index rebuilds, throttle write-heavy jobs and enable wal_compression = on to reduce WAL size. Adjust checkpoint settings like checkpoint_timeout and max_wal_size to smooth out WAL bursts. If the standby is hopelessly behind, resync using pg_basebackup or a backup tool like pgBackRest, and drop the old slot only if it's abandoned – https://www.postgresql.org/docs/current/app-pgbasebackup.html.

    For ongoing observability, create views to monitor replication lag and replay delay, and integrate them into your monitoring stack.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.