Replication latency?

Question

Greetings.

SQL 2019, not far behind in patches. Two node transactional replication, with the distributor on the Subscriber.

Recently we've had had sudden replication latency which is discovered through some custom alerting, and confirmed through replication monitor and the usual queries. It's worth noting that the last couple times it started on the same day/ time. However, even with that knowledge I can verify that nothing is long running, no blocking, no errors in the event viewer or sql error logs, no open transactions, no abnormally high replication commands in the distribution DB, no retries in the jobs for the log reader or distribution agents, etc.

The first time it happened I beat my head for hours and we finally rebooted the Subscriber, which fixed it in a couple minutes.

It happened again this morning and a reboot did NOT fix it (1) . Neither did a sql service restart(2), or an agent restart(3), or a second reboot(4). So altogether the log reader and distribution agents were restarted 4 times.

However, the latency suddenly caught up. The one thing I did do right before it caught up was a manual restart of the log reader agent job, as I'd enabled verbose logging, and that's a requirement to make it go.

I'd love to say that's what fixed it, but as mentioned above this guy was already restarted through other actions 4 times without any of them fixing the issue, so it would be tough to jump for joy just yet.

Based on all these circumstances where do I go from here?

Thanks!

Answer

So two things were the problem here, both confirmed by Microsoft:

The replication update sproc was indeed generating execution plans that join to another table. This is because there's a View with SCHEMABINDING between the replicated table and the one that the exec plan is joining to. It seems crazy to me that this is what happens in these circumstances, but MS has a long winded reason as to why this is the case. It should be noted that two different exec plans were being generated (one with 99% SORT), so forcing to the good one was obviously optimal.
Normal heavy hitting queries in another DB pegging disk use. What threw me off one of the times that this happened I was able to stop activity in this DB completely and replication still took a long time to catch up -- this is because # 1 was masking this issue. I hate to do this but Im going to mark this as the Answer due to how rare this scenario likely is and hoping it saves someone some pain in the future. Thanks to all that assisted here!

Answer

Hi,chrisrdba

Have you confirmed that the Log Reader Agent and the Distribution Agent are running properly when delays occur?

When the distribution server and the subscription server are deployed on the same node, the distribution service can cause competition for server resources.

Additionally, large volumes of logs can cause significant delays in the transaction transmission process.

The following article explains the approach to identifying delays and resolving issues. Please refer to it.

https://www.mssqltips.com/sqlservertip/3598/troubleshooting-transactional-replication-latency-issues-in-sql-server/

Best Regards,

Mikey Qiao

If you're satisfied with the answer, don't forget to "Accept it," as this will help others who have similar questions to yours.

Answer

I know it sounds crazy, but you can see for yourself in the attached.

Greetings. May I ask where is the attachment you mentioned? I didn't see it in this post. Out of curiosity I want to see.

Share via

Replication latency?

3 answers

Your answer