Hi Dhanushkumar J,
It sounds like you’ve pinpointed that the small performance hit only occurs when your primary runs in the “new” zone after the maintenance failover, and disappears as soon as you fail back to the original zone. That strongly suggests an infrastructure difference rather than a bug in Postgres 15.16 itself. Here’s what’s most likely going on, and some things you can try:
- Underlying storage or network latency differences
  - When you fail over to a new zone, Azure may place your primary on a different storage cluster or host VM family. Even within the same SKU, zones can differ in contention levels or hardware generations, which shows up as higher I/O latency or lower throughput.
  - In your original zone, the storage and OS caches were already warm, and the host may have had more headroom. After failover, cold caches plus a different storage backend can make reads and writes slower until things settle.
- Cold caches and reset runtime stats
  - As part of an HA failover, the in-memory stats (pg_stat_* views) and the OS file caches are reset. The optimizer statistics in pg_statistic persist, but a cold page cache means your first runs will hit disk harder.
  - Running VACUUM ANALYZE refreshes the optimizer statistics but doesn’t refill the OS-level page cache for your production workload. It also won’t repopulate runtime stats like the activity counters that drive autovacuum thresholds.
- Why it didn’t show up on earlier upgrades
  - In past minor-version maintenance windows, the standby you failed over to may have lived on a host with identical storage performance, or simply had warmer caches from previous failovers. This particular maintenance window may have landed you on a more congested host in that zone.
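To get a rough feel for how cold the cache is after a failover, you can compare the buffer cache hit ratio in each zone. Below is a minimal Python sketch of that arithmetic. It assumes you have already read the blks_hit and blks_read columns from pg_stat_database; the function name and the sample numbers are made up for illustration. Keep in mind that blks_read counts reads that missed Postgres shared buffers, so some of them may still have been served from the OS cache.

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from Postgres shared buffers.

    The two inputs correspond to the blks_hit and blks_read columns of
    pg_stat_database. Returns 0.0 when there has been no activity yet,
    which is what you would see right after the counters are reset.
    """
    total = blks_hit + blks_read
    if total == 0:
        return 0.0
    return blks_hit / total


# Hypothetical numbers, not taken from any real server.
warm_zone = cache_hit_ratio(blks_hit=990_000, blks_read=10_000)
cold_zone = cache_hit_ratio(blks_hit=600_000, blks_read=400_000)
print(f"original zone: {warm_zone:.2%}, new zone: {cold_zone:.2%}")
```

If the ratio in the new zone climbs toward the original zone's value after some time under load, the slowdown is more likely a cache warm-up effect than a lasting storage difference.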
What you can do next
- Compare Azure Monitor metrics (DiskReadLatency, DiskWriteLatency, IOPS, CPU%, memory pressure) before/during/after the failover in both zones to see where the bottleneck truly is.
- Check the exact compute/storage SKU and underlying host series in each zone (sometimes Azure will use a different hardware generation if capacity is tight).
- If you confirm higher storage latency in the new zone, you could:
  - Open a ticket to have Azure investigate underlying disk performance in that zone, or
  - Consider moving permanently to the zone with consistently better I/O characteristics, or
  - Scale up your tier (vCores/IOPS) to absorb the added latency.
- After any failover or maintenance, running ANALYZE across the database keeps the planner statistics current, and a brief warm-up test load helps repopulate the caches before production traffic depends on them.
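For the metric comparison above, it helps to reduce each zone's samples to a couple of numbers rather than eyeballing charts. Here is a small Python sketch, assuming you have exported latency samples (in milliseconds) for each zone into plain lists; the function names and the sample values are hypothetical, and how you export the data is up to you.

```python
from statistics import median, quantiles


def summarize(samples_ms):
    """Return (median, p95) for a list of latency samples in milliseconds.

    Needs at least two samples. The 95th percentile is the last of the
    19 cut points that split the data into 20 equal groups.
    """
    p95 = quantiles(samples_ms, n=20, method="inclusive")[-1]
    return median(samples_ms), p95


def relative_increase(baseline, candidate):
    """How much larger candidate is than baseline, as a fraction."""
    return (candidate - baseline) / baseline


# Hypothetical exported samples, not real measurements.
original_zone = [1.0, 1.2, 1.1, 1.3, 1.0, 1.2]
new_zone = [1.6, 1.9, 1.7, 2.4, 1.8, 1.6]

orig_median, orig_p95 = summarize(original_zone)
new_median, new_p95 = summarize(new_zone)
print(f"median latency change: {relative_increase(orig_median, new_median):+.0%}")
print(f"p95 latency change:    {relative_increase(orig_p95, new_p95):+.0%}")
```

A consistent gap in both the median and the p95 points to a steady storage or host difference, while a gap that shows only in the p95 is more typical of intermittent contention.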
Follow-up questions to nail down details
- Which compute and storage tier (General Purpose or Memory Optimized, number of vCores, disk size) are you on, and did it differ between the two zones?
- Can you share the Azure Monitor charts for storage latency/throughput and CPU% in both zones?
- Are you on a burstable (B-series) SKU that might have exhausted credits after failover?
- Did you notice any error or throttling alerts in the portal around the time of the maintenance?
- What’s your autovacuum configuration? Could the reset pg_stat_* counters have delayed autovacuum activity after you landed on the new node?
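On the autovacuum question, here is why a counter reset could matter. PostgreSQL vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (number of tuples). The Python sketch below works through that formula using the upstream PostgreSQL defaults (50 and 0.2), which is an assumption; your server or individual tables may be configured differently. The function names and table sizes are hypothetical.

```python
def autovacuum_trigger_point(reltuples, base_threshold=50, scale_factor=0.2):
    """Dead-tuple count above which autovacuum vacuums the table.

    Defaults are the upstream PostgreSQL defaults for
    autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor;
    check your own server parameters before relying on them.
    """
    return base_threshold + scale_factor * reltuples


def dead_tuples_still_needed(n_dead_tup, reltuples, **settings):
    """How many more dead tuples must accumulate before autovacuum fires."""
    return max(0.0, autovacuum_trigger_point(reltuples, **settings) - n_dead_tup)


# Hypothetical table with 1,000,000 rows.
before_reset = dead_tuples_still_needed(n_dead_tup=190_000, reltuples=1_000_000)
after_reset = dead_tuples_still_needed(n_dead_tup=0, reltuples=1_000_000)
print(f"before reset: {before_reset:,.0f} more dead tuples needed")
print(f"after reset:  {after_reset:,.0f} more dead tuples needed")
```

In this made-up case the table was close to being vacuumed, but once the tracked dead-tuple count goes back to zero it has to accumulate the full amount again, even though the actual bloat is still on disk.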
Hope this helps you narrow it down to zone-specific infrastructure characteristics. Let me know what you find!
Reference docs:
- Manage scheduled maintenance settings: https://learn.microsoft.com/azure/postgresql/flexible-server/how-to-maintenance-portal
- Failover support & why pg_stat_* resets: https://learn.microsoft.com/azure/postgresql/high-availability/concepts-high-availability#failover-support
- Scheduled maintenance overview: https://learn.microsoft.com/azure/postgresql/flexible-server/concepts-maintenance