Avoiding blocking of REDO thread in readable secondary replica

Question

Hello,

We are currently running into this issue for one of our availability group design:-

On Primary :- Nightly Table switch/Purges are happening (DDL) for approx. 2-3 hours

On Secondary which is configured as readable replica there will be SELECT queries running every now and then , as we do not have a window defined for secondary, REDO thread on secondary tends to get blocked by those SELECT queries which keep coming.

Currently to mitigate this issue, manually kill all select queries which keep blocking REDO thread ( this is in turn block other readers) until we see last commit time or redone time from SSMS AG dashboard sync up with the time DDL operation is completed.

Database in question is RCSI

Questions:-

Is this expected way to keep killing SELECT if you don't have a defined maint'ce window for secondary server

How are you guys dealing with this situation? I mean is there a way or script available online that it is automated to KILL such transactions on secondary in FLY when DDL is running on primary and wait until committed on secondary "

Answer

I don't have a complete answer for you, but I can explain why the SELECT queries blocks the REDO threads.

Since you are doing DDL, these operations requires Sch-M (schema modification) locks, and Sch-M locks are incompatible with everything else. Even if the SELECT queries run with RSCI, they still take Sch-S (schema stability) locks on the tables. This is required, because there is no versioning of the schema.

Exactly what DDL operations are you running?

Answer

A better option may be to run

ALTER DATABASE db SET RESTRICTED_USER WITH ROLLBACK IMMEDIATE

on the secondary. This will still kill the running queries, but in one single go, and not by manually killing. I am assuming here that your ALTER TABLE SWITCH operations can be run in one single batch in your purging scheme and they are not interleaved by operations that take longer time. This minimises the downtime for the readable secondary.

I would not be surprised if there are better alternatives. I don't work much with AGs myself, so my own experience is limited. I'm thinking that if you have more than one secondary, you could more the secondary role to another node in the AG after the purge operation. The node will not block the REDO threads, and once the readers are out of the other node, the node will catch up. But I am not sure if this works out in practice.

Answer

Hi @Kapil ,

Welcome to Microsoft Q&A!

The steps in troubleshooting blocking as below, please get more details from Understand and resolve SQL Server blocking problems:

Identify the main blocking session (head blocker)
Find the query and transaction that is causing the blocking (what is holding locks for a prolonged period)
Analyze/understand why the prolonged blocking occurs
Resolve blocking issue by redesigning query and transaction

About SELECT queries blocks the REDO threads, please refer to Troubleshoot: Availability group exceeded RTO, see below:

Best regards,
Carrin

If the answer is helpful, please click "Accept Answer" and upvote it.
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

Avoiding blocking of REDO thread in readable secondary replica

3 answers