PostgreSQL Flexible Server Performance Degradation After Azure Maintenance and HA Failover

Question

Dhanushkumar J 40

We are running PostgreSQL Flexible Server on version 15. Recently, during the scheduled Azure maintenance window, Azure performed a minor version upgrade and also executed an HA failover to minimize downtime.

After the maintenance activity (minor version upgrade 15.15 to 15.16 / HA failover), we observed significant performance degradation on the server. Initially, we waited for three days assuming PostgreSQL might require time to stabilize after the maintenance activity, but there was no improvement in performance.

As part of troubleshooting, we manually executed VACUUM ANALYZE on the databases to refresh statistics and also performed REINDEX operations on major application tables. However, these activities did not improve performance.

We then suspected that the issue might be related to the new HA zone where the server was running after failover. To validate this, we manually performed an HA failback to the original zone. After failback, we again executed VACUUM ANALYZE on the tables and immediately noticed a drastic improvement in server performance.

However, when we raised a support case, the Azure engineering team mentioned that everything appeared normal from their side and advised against performing HA failback to the old zone. Despite this recommendation, we proceeded with the failback, and the server has been working normally since then.

We would like to understand:

Why is the server performing well in the old zone but not in the new zone?

What could have caused this sudden performance degradation after the maintenance activity?

We have previously gone through multiple PostgreSQL minor version upgrades after the major version upgrade without facing any performance issues. During those earlier upgrades, we did not even perform VACUUM ANALYZE, yet the server performance remained stable. Why did this issue occur only now?

We are looking for clarification on whether this behavior could be related to the underlying infrastructure, storage latency, zone-specific issues, caching behavior, or any changes introduced during the HA failover/maintenance process.

2 answers

Answer 1

Hi Dhanushkumar J,
It sounds like you’ve pinpointed that the performance degradation only happens when your primary runs in the “new” zone after the maintenance failover, and vanishes as soon as you fail back to the original zone. That strongly hints at an infrastructure difference rather than a bug in PostgreSQL 15.16 itself. Here’s what’s most likely going on and some things you can try:

Underlying storage or network latency differences
- When you fail over to a new zone, Azure may place your primary on a different storage cluster or host VM family. Even within the same SKU, different zones can have different contention levels or hardware generations, which shows up as higher I/O latency or lower throughput.
- In your original zone, the storage and OS caches were already warm, and the host may have had more headroom. After failover, cold caches plus a different storage backend can yield slower reads and writes until things settle.

Cold caches and reset runtime stats
- As part of an HA failover, the in-memory stats (pg_stat_* views) and the OS file caches all reset. The optimizer statistics in pg_statistic persist, but a cold page cache means your first runs will hit disk harder.
- Running VACUUM ANALYZE refreshes the optimizer statistics but doesn’t refill the OS-level page cache for your production workload. It also won’t restore the runtime activity counters that autovacuum uses to decide when to run.

Why it didn’t show up on earlier upgrades
- In past minor-version maintenance windows, the standby you failed over to may have lived on a host with identical storage performance, or its caches may simply have been warmer. This particular maintenance window may have landed you on a more congested host in that zone.
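
To check the cold-cache explanation above, one option is to look at how many table block requests are served from PostgreSQL's buffer cache. The sketch below is only an illustration: the query uses the standard pg_statio_user_tables view, while the helper name `cache_hit_ratio` is made up for this example. Note that these counters are themselves reset by a failover, so the ratio reflects activity since the promotion, and a "read" may still be served from the OS cache rather than from disk.

```python
# Standard PostgreSQL view; run this with any client (psql, a driver, etc.).
CACHE_STATS_SQL = """
SELECT relname, heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;
"""


def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from shared_buffers.

    Returns None when there has been no activity, so an idle table
    is not mistaken for a table with a 0% hit ratio.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total
```

A ratio that stays well below what you see in the original zone, long after the failover, would point away from a simple cache warm-up effect and toward slower storage.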

What you can do next

- Compare Azure Monitor metrics (DiskReadLatency, DiskWriteLatency, IOPS, CPU%, memory pressure) before/during/after the failover in both zones to see where the bottleneck truly is.
- Check the exact compute/storage SKU and underlying host series in each zone (sometimes Azure will use a different hardware generation if capacity is tight).
- If you confirm higher storage latency in the new zone, you could:
  - Open a ticket to have Azure investigate underlying disk performance in that zone, or
  - Consider moving permanently to the zone with consistently better I/O characteristics, or
  - Scale up your tier (vCores/IOPS) to absorb the added latency.
- After any failover or maintenance, a full ANALYZE (and even a brief warm-up test load) can help restore optimal query plans and warm OS caches.
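
As one possible form of the warm-up step above, the pg_prewarm extension can load relations into the buffer cache explicitly. This is a minimal sketch, assuming pg_prewarm is available and allow-listed on your server; the helper `prewarm_statements` and the table names are hypothetical, and the generated SQL still has to be run with your own client.

```python
def prewarm_statements(relations, mode="buffer"):
    """Build one pg_prewarm call per relation name (tables or indexes).

    mode is one of pg_prewarm's documented modes:
    'buffer' loads into shared_buffers, 'read' and 'prefetch'
    pull blocks through the OS cache instead.
    """
    if mode not in ("prefetch", "read", "buffer"):
        raise ValueError(f"unsupported pg_prewarm mode: {mode}")
    stmts = ["CREATE EXTENSION IF NOT EXISTS pg_prewarm;"]
    for rel in relations:
        quoted = rel.replace("'", "''")  # escape single quotes in the literal
        stmts.append(f"SELECT pg_prewarm('{quoted}', '{mode}');")
    return stmts


# Hypothetical table names, for illustration only.
for stmt in prewarm_statements(["public.orders", "public.orders_pkey"]):
    print(stmt)
```

Prewarming only helps with the cold-cache part of the problem; it will not compensate for storage that is slower in one zone than in the other.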

Follow-up questions to nail down details

- Which compute and storage tier (GP, MO, number of vCores, disk size) are you on, and did it differ between the two zones?
- Can you share the Azure Monitor charts for storage latency/throughput and CPU% in both zones?
- Are you on a burstable (B-series) SKU that might have exhausted credits after failover?
- Did you notice any error or throttling alerts in the portal around the time of the maintenance?
- What’s your autovacuum configuration? Could the reset pg_stat_* counters have delayed autovacuum activity after you landed on the new node?
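
To make the autovacuum question concrete: PostgreSQL triggers autovacuum on a table once its dead-tuple counter exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, and auto-analyze uses the analogous analyze settings. The sketch below uses the stock PostgreSQL defaults (50 and 0.2 for vacuum, 50 and 0.1 for analyze), which is an assumption; your server parameters or per-table settings may differ. The function names are illustrative only.

```python
def autovacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum picks up the table."""
    return threshold + scale_factor * reltuples


def autoanalyze_trigger(reltuples, threshold=50, scale_factor=0.1):
    """Changed tuples needed before auto-analyze picks up the table."""
    return threshold + scale_factor * reltuples


def remaining_until_vacuum(n_dead_tup, reltuples):
    """How many more dead tuples must accumulate, given the current counter."""
    return max(0.0, autovacuum_trigger(reltuples) - n_dead_tup)


# Assumed example: a table with 1,000,000 rows whose counter restarted at 0.
print(autovacuum_trigger(1_000_000))
print(remaining_until_vacuum(0, 1_000_000))
```

If a large table was close to its trigger point before the failover and its counter restarted from zero on the new primary, it would have to accumulate the full amount again before autovacuum runs, which can leave bloat and stale statistics in place longer than usual.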

Hope this helps you narrow it down to zone-specific infrastructure characteristics. Let me know what you find!

Reference docs:

Manage scheduled maintenance settings: https://learn.microsoft.com/azure/postgresql/flexible-server/how-to-maintenance-portal
Failover support & why pg_stat_* resets: https://learn.microsoft.com/azure/postgresql/high-availability/concepts-high-availability#failover-support
Scheduled maintenance overview: https://learn.microsoft.com/azure/postgresql/flexible-server/concepts-maintenance

Hope this helps!

Answer 2

kagiyama yutaka 3,605

I think HA failover keeps the same server SKU and that performance validation is done through the published metrics (CPU, storage IO‑latency, connections).
