SSL Error when rebalancing after scaling Hyperscale cluster out

Question

I scaled out an Azure Database for PostgreSQL Hyperscale cluster from 6 to 10 nodes. Scaling was initiated via the Worker Node count slider on the Configure page. There were no errors and the deployment reported success. The new nodes are now active, but when I call the rebalance_table_shards() function, the SSL certificate is refused from one of the new nodes. All worker nodes use the default configuration, with citus.replication_sslmode = REQUIRE. This is the error received:

could not connect to the publisher: SSL error: tlsv1 alert unknown ca

FATAL: no pg_hba.conf entry for replication connection from host "xx.x.x.33", user "postgres", SSL off

The same settings were in place with 6 nodes and we had no issue using the create_distributed_table() function. Is there an additional step needed after scaling that I missed?

[Note: As we migrate from MSDN, this question has been posted by an Azure Cloud Engineer as a frequently asked question] Source: MSDN

Accepted Answer

We were executing the full statement (SELECT rebalance_table_shards('distributed_table_name');) against the distributed table when we received the error. We ended up opening an Azure Support ticket. Support first said there was an issue with ca.pem creation while the new nodes were being deployed, and they applied a hot-fix as well as a permanent fix to be rolled out across all regions by the end of the week. The same error still occurred in spite of the hot-fix. Once the final fix was in place, everything has been working so far, including rebalancing after scaling out additional nodes again.

On our end, we found that whether the error occurs depends on which workers the data transfer involves.

We ran the following tests on our end.

A master_move_shard_placement() call doesn't use logical replication by default, so we passed 'force_logical'. It gives the same error when we run:

citus=> select master_move_shard_placement(102538, '10.0.0.34', 5432, '10.0.0.33', 5432, 'force_logical');
ERROR:  could not connect to the publisher: SSL error: tlsv1 alert unknown ca
CONTEXT:  while executing command on 10.0.0.33:5432

Oddly, this happens only when we need to move shards directly between newly added workers. The following queries, which route the same shard through 10.0.0.15 instead, work fine:

select master_move_shard_placement(102538, '10.0.0.34', 5432, '10.0.0.15', 5432, 'force_logical');
master_move_shard_placement
-----------------------------
(1 row)

select master_move_shard_placement(102538, '10.0.0.15', 5432, '10.0.0.33', 5432, 'force_logical');
master_move_shard_placement
-----------------------------
(1 row)
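
Before choosing a route like the two-step move above, it can help to see which workers exist and where the shard currently lives. The queries below are a minimal sketch, assuming they are run on the coordinator and that the Citus metadata view pg_dist_shard_placement is readable by your role on the managed service; the shard ID is the one from the tests above.

```sql
-- List the workers the coordinator currently considers active.
SELECT * FROM master_get_active_worker_nodes();

-- Show which worker(s) currently hold shard 102538.
SELECT shardid, nodename, nodeport
FROM pg_dist_shard_placement
WHERE shardid = 102538;
```

If both the source and the target of a planned move turn out to be newly added workers, that is the combination that failed in the tests above.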

Therefore, for the moment, engineering suggested retrying, which you already did, and it is working again. According to engineering, if the rebalancing doesn't require data transfer from another newly added worker to w6 (10.0.0.33), the operation should succeed; it fails only when such a transfer is needed.

We had the same incident before and there is an issue about this in the citus-enterprise repository. We do not know the fix for this yet.
As a last resort, if it fails again, you can call the rebalance_table_shards function with the option shard_transfer_mode := 'block_writes' and see if that works for you.
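
For reference, the last-resort call could look like the sketch below. The table name is the same placeholder used earlier in this answer. Note that 'block_writes' does not use logical replication, so writes to a shard are blocked while that shard is being moved.

```sql
-- Sketch only: 'distributed_table_name' is a placeholder for your table.
-- 'block_writes' avoids logical replication; writes to a shard are blocked
-- for the duration of its move.
SELECT rebalance_table_shards(
    'distributed_table_name',
    shard_transfer_mode := 'block_writes'
);
```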

Source: MSDN
