Greetings.
SQL Server 2019, not far behind on patches. Two-node transactional replication, with the Distributor on the Subscriber.
Recently we've had sudden replication latency, discovered through some custom alerting and confirmed through Replication Monitor and the usual queries. It's worth noting that the last couple of times it started on the same day and time. However, even with that knowledge, I can verify that nothing is long-running, there's no blocking, no errors in the Event Viewer or SQL error logs, no open transactions, no abnormally high command counts in the distribution database, no retries in the Log Reader or Distribution Agent jobs, etc.
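For reference, the "usual queries" are along these lines: the first two run on the Distributor against the distribution database, the last at the Publisher (all server, database, and publication names below are just placeholders for ours):

    -- Recent Log Reader Agent history, including delivery latency (ms)
    SELECT TOP (10) [time], delivery_latency, delivered_commands, comments
    FROM distribution.dbo.MSlogreader_history
    ORDER BY [time] DESC;

    -- Commands still waiting to be delivered to the Subscriber
    EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds
        @publisher = N'PUB',
        @publisher_db = N'PubDB',
        @publication = N'MyPublication',
        @subscriber = N'SUB',
        @subscriber_db = N'SubDB',
        @subscription_type = 0;  -- 0 = push subscription

    -- At the Publisher: any open transaction pinning the log
    DBCC OPENTRAN('PubDB');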
The first time it happened I beat my head against it for hours, and we finally rebooted the Subscriber, which fixed it within a couple of minutes.
It happened again this morning, and a reboot did NOT fix it (1). Neither did a SQL service restart (2), a SQL Server Agent restart (3), or a second reboot (4). So altogether the Log Reader and Distribution Agents were restarted four times.
However, the latency then suddenly caught up. The one thing I did right before it caught up was manually restart the Log Reader Agent job: I'd enabled verbose logging on it, and a restart is required for that to take effect (parameters below).
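In case it matters, verbose logging was enabled by appending the standard output parameters to the Log Reader Agent job step, something like this (the log path is just a placeholder):

    -Output C:\Temp\LogReader.log -OutputVerboseLevel 2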
I'd love to say that's what fixed it, but as mentioned above, this agent had already been restarted four times by the other actions without any of them resolving the issue, so it would be tough to jump for joy just yet.
Based on all these circumstances, where do I go from here?
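One thing I'm considering for the next occurrence is posting a tracer token, which should at least isolate which hop (Publisher-to-Distributor vs. Distributor-to-Subscriber) is holding the latency. A minimal sketch, run at the Publisher in the publication database (publication name again a placeholder):

    DECLARE @token int;
    EXEC sys.sp_posttracertoken
        @publication = N'MyPublication',
        @tracer_token_id = @token OUTPUT;

    -- Give the token time to flow through, then check per-hop latency
    WAITFOR DELAY '00:00:30';
    EXEC sys.sp_helptracertokenhistory
        @publication = N'MyPublication',
        @tracer_id = @token;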
Thanks!