PostgreSQL Flexible Server stuck in Updating state

Tino Merl 0

We have a PostgreSQL Flexible Server which has been stuck in the Updating state for about 7 hours now. I tried restarting and stopping it via the az CLI, and it didn't work. The Updating state was initiated by a runbook that scales down the database after a heavy data processing load.

Oury Ba-MSFT 16,636 Reputation points Microsoft Employee

2023-08-17T22:12:48.48+00:00

@Tino Merl Thank you for reaching out.

We recommend performing scaling operations when there is no load and there are no long-running transactions on the server. Long-running transactions need to be rolled back when the server restart initiates as part of the scaling operation.

Could you please open a support ticket so we can get this issue fixed at the back end? Please let us know if you don't have a support plan.

Regards,

Oury
Tino Merl 0 Reputation points

2023-08-18T06:08:25.7933333+00:00

@Oury Ba-MSFT thanks for reaching out. There were no long-running transactions, as the scale-down is triggered by our orchestration tool after all the data loading and processing has finished. Sadly, we do not have a support plan on this subscription. Is there any other way to create a support ticket? The Flexible Server is still stuck in the Updating state, which has now lasted more than 22 hours.
Oury Ba-MSFT 16,636 Reputation points Microsoft Employee

2023-08-18T12:06:25.2466667+00:00

Tino Merl

Please send an email to azcommunity@microsoft.com with your subscription name. Please mention in the subject line ATTN: Oury so I don't miss it.

Regards,

Oury
Tino Merl 0 Reputation points

2023-08-21T07:23:12.1666667+00:00

@Oury Ba-MSFT thanks again for reaching out. It seems that the error resolved itself over the weekend. This morning the Flexible Server was no longer stuck in the Updating state. I'll test whether everything is working again. If not, I'll write an email to the address you mentioned. Thanks a lot!
Oury Ba-MSFT 16,636 Reputation points Microsoft Employee

2023-08-21T16:13:21.8333333+00:00

Tino Merl

Thank you for letting us know about the status of this issue.

Regards,

Oury

2 answers

Rahul Randive 8,521 Reputation points Microsoft Employee

2023-08-17T22:27:30.88+00:00

Hi @Tino Merl

As you mentioned, the Updating state was initiated by a runbook that scales down the database after a heavy data processing load.

In that case, recovery time depends on how recent the last checkpoint was and on the amount of data in the log files written since then. The best practice is for application developers to avoid long-running transactions and to tune the checkpoint frequency so that recovery stays short.

What causes long recovery on Azure Database for PostgreSQL
Recent checkpoints are critical for fast server recovery. After a restart, whether on a new instance (failover to a healthy instance) or on the same instance (in-place restart), the server connects to the disk that holds all the logs. All WAL records written after the last successful checkpoint must be applied to the data pages before the server starts accepting connections again. These are the REDO logs, and they are applied by the recovery operation.
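As a rough illustration of the recovery explanation above, replay time can be approximated as the WAL volume written since the last checkpoint divided by the replay throughput. The sketch below is back-of-the-envelope arithmetic only: the function name, the 8 GiB WAL volume, and the 50 MiB/s replay rate are assumptions made up for illustration, not measured Azure figures. The query in the constant uses standard PostgreSQL functions (version 10 or later); whether the admin role on a Flexible Server is allowed to call pg_control_checkpoint() is something to verify.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Standard PostgreSQL (10+) query for the WAL bytes written since the redo
# point of the last checkpoint. Assumption: the connected role may call
# pg_control_checkpoint() on the managed server.
WAL_SINCE_CHECKPOINT_SQL = (
    "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn) "
    "FROM pg_control_checkpoint()"
)


def estimate_recovery_seconds(wal_bytes_since_checkpoint: int,
                              replay_bytes_per_second: float) -> float:
    """Estimate REDO time as WAL volume divided by replay throughput."""
    if replay_bytes_per_second <= 0:
        raise ValueError("replay rate must be positive")
    return wal_bytes_since_checkpoint / replay_bytes_per_second


if __name__ == "__main__":
    # Hypothetical numbers: 8 GiB of WAL replayed at an assumed 50 MiB/s.
    seconds = estimate_recovery_seconds(8 * GIB, 50 * MIB)
    print(f"estimated recovery: {seconds:.0f} s")
```

The point of the estimate is the proportionality: halving the WAL accumulated since the last checkpoint roughly halves the replay work, which is why checkpoint frequency matters for restart time.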

Before scaling up or down, please ensure that no long-running transactions are active on the server, and stop or reduce the application workload.
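One way to check for long-running transactions before scaling is to query the standard pg_stat_activity view. The following is a minimal sketch, assuming a DB-API 2.0 connection that uses the %s parameter style (for example psycopg2); the function names and the 300-second threshold are hypothetical choices for illustration.

```python
from typing import Any, List, Tuple

# Lists transactions that have been open longer than a threshold, excluding
# the session running this check. pg_stat_activity, pg_backend_pid() and
# make_interval() are standard PostgreSQL.
LONG_RUNNING_SQL = """
SELECT pid, now() - xact_start AS duration, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()
  AND now() - xact_start > make_interval(secs => %s)
ORDER BY xact_start
"""


def long_running_transactions(conn: Any,
                              threshold_seconds: float = 300.0) -> List[Tuple]:
    """Return one row per transaction open longer than the threshold."""
    cur = conn.cursor()
    try:
        cur.execute(LONG_RUNNING_SQL, (threshold_seconds,))
        return list(cur.fetchall())
    finally:
        cur.close()


def safe_to_scale(conn: Any, threshold_seconds: float = 300.0) -> bool:
    """True when no transaction exceeds the threshold (hypothetical gate)."""
    return not long_running_transactions(conn, threshold_seconds)
```

A runbook could call safe_to_scale(conn) immediately before requesting the new compute size and postpone the operation when it returns False.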

If your server is still stuck in the Updating state, as Oury mentioned, please open a support ticket and share the case number so we can get this issue fixed at the back end.

Thank you!
Tino Merl 0 Reputation points

2023-08-18T06:17:45.58+00:00

@Rahul Randive thanks for reaching out. Our setup has a Flexible Server and a VM at its core. An orchestrator runs on the VM; it pulls data from several APIs, loads it into the DB, and transforms it afterwards. The transformed data is then pushed to a production schema, but only if everything ran correctly. After the push to production, the orchestrator triggers the scale-down runbook via a webhook. So everything happens automatically, and only if the preceding steps were successful.
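For an automated pipeline like this one, a bounded wait on the server state makes a hang visible instead of silent. The sketch below is one possible approach, not the runbook used in this thread: it reads the state through the `az postgres flexible-server show` command and polls until the server reports Ready or a timeout expires. The function names, the 30-minute timeout, and the 30-second polling interval are assumptions for illustration, and the CLI call requires an authenticated `az` session.

```python
import subprocess
import time
from typing import Callable


def az_server_state(resource_group: str, server_name: str) -> str:
    """Read the server state (e.g. "Ready", "Updating") via the Azure CLI."""
    result = subprocess.run(
        ["az", "postgres", "flexible-server", "show",
         "--resource-group", resource_group,
         "--name", server_name,
         "--query", "state", "--output", "tsv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_ready(get_state: Callable[[], str],
                     timeout_s: float = 1800.0,
                     interval_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state until it returns "Ready"; False if the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        if get_state() == "Ready":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage could look like `wait_until_ready(lambda: az_server_state("my-rg", "my-server"))`, where both names are placeholders. When it returns False, the orchestrator can raise an alert rather than assume the scale operation finished.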

So from our side there shouldn't have been any long running transactions, since everything depends on the steps before.

If I understand you correctly, it would be advisable to create a checkpoint before rescaling the Flexible Server. Is that correct?
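For reference, PostgreSQL has a CHECKPOINT command that forces an immediate checkpoint. A minimal sketch of issuing it from a pre-scale step follows, assuming a DB-API 2.0 connection (for example psycopg2) and a role that is permitted to run the command. In stock PostgreSQL that means a superuser or, from version 15, a member of pg_checkpoint; whether the admin role on a Flexible Server qualifies is an assumption to verify. The function name is hypothetical.

```python
from typing import Any


def checkpoint_before_scaling(conn: Any) -> None:
    """Force a checkpoint so less WAL is pending at the next restart.

    Assumes the connected role is allowed to run CHECKPOINT.
    """
    cur = conn.cursor()
    try:
        cur.execute("CHECKPOINT")
    finally:
        cur.close()
```

A clean PostgreSQL shutdown already performs a checkpoint of its own, so a manual one mainly reduces how much work is left for that final step.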

I also have the feeling that triggering a scale-up or scale-down of the Flexible Server through a runbook takes extremely long compared to triggering it manually in the Azure Portal. Rescaling via the runbook takes around 45 minutes, while in the Azure Portal it takes only about 5 minutes. Does this also stem from the checkpoint behavior?

Oury Ba-MSFT 16,636 Reputation points Microsoft Employee

2023-08-22T23:01:24.14+00:00

Tino Merl

Thank you for the detailed information.

Let us check internally and let you know our findings.

Regards,

Oury

Oury Ba-MSFT 16,636 Reputation points Microsoft Employee

2023-08-30T17:39:04.0566667+00:00

Tino Merl

Could you please open a support ticket so we can investigate further? From the Azure Database for PostgreSQL standpoint, there are currently no reported issues regarding delays in scaling up or down.

Regards,

Oury

Tino Merl 0 Reputation points

2023-08-31T06:00:58.34+00:00

Hey @Oury Ba-MSFT, it fixed itself, as I mentioned on this question on 21 August in direct response to your other comment.
Maryna Bohdan 0 Reputation points

2024-03-05T10:19:34.02+00:00

Hello, Oury Ba-MSFT!

I am currently facing the issue that was discussed on this page.

There was no load during the automated scaling (I am using automation tasks (preview) to scale our PostgreSQL server up and down), but an error occurred while scaling up, the task was rolled back, and the server has been in the Updating state ever since.

Here is the proof of low load at the time of scaling:

Here is the error text:

Could you please fix this through the backend, so that when there is an error or 100% DB resource utilization, the task is rolled back, no further logs are produced, and the server goes back immediately to its previous settings instead of getting stuck in the Updating state?

Thank you in advance,

please let me know if I can help with more information,

Maryna
Oury Ba-MSFT 16,636 Reputation points Microsoft Employee

2024-03-08T22:29:18.0766667+00:00

@Maryna Bohdan

Could you please send an email to azcommunity@microsoft.com with your server name, subscription ID, and region?

Please mention in the subject line ATTN: Oury so I don't miss it.

Regards,

Oury