We have a custom written software that uses SQL 2016 as the backend. We have this application installed on numerous different windows-based systems for thousands of customers, the majority of which operate without any issues or have causes that are easy enough to track down.
One particular site is having a perplexing issue with a very odd temporary fix.
VM - OS Server 2019, Version 1809, OS Build 17783.3650 (Though this problem has existed for the lifespan of the server across all updates)
Microsoft SQL Server 2016 (SP3) (KB5003279) - 13.0.6300.2 (X64) Aug 7 2021 01:20:37 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows Server 2019 Standard 10.0 <X64> (Build 17763: ) (Hypervisor)
What we are seeing is that SQL performance, as measured by both the end users and the measurable WAITs that have been recorded are increasing almost exactly 3-fold. activities inside the application that used to take 3 seconds are now taking 9 seconds or more, waits that were recorded would jump 3-fold as well.
We configure SQL to have a hard memory limit and this server is no different. However, what I have found is that when the server enters into this sluggish performance, I am able to correct the issue by forcing SQL to have a lower memory limit. Instantly the performance is restored back to normal. I can then set SQL back to its original maximum memory utilization and it will continue to operate without issues for a random period of time.
The amount of RAM assigned to SQL doesn't seem to matter for the issue. For example, if I started with 16gb of maximum RAM assigned the problem still occurs. The only thing I have been able to identify is that reducing the maximum amount enough to cause SQL to release some RAM back to the operating system fixes the performance issue.
To make matters more complex, we have seen that once the SQL server enters this sluggish state that it can persist through restarts of the virtual machine and even the host server.
It is possible that there is something occurring inside our custom software, but the nature of the issue and the temporary solution has us scratching our head.
Is there any other tools or insights that might be helpful in tracking down exactly what is going on inside SQL and why this memory adjustment seems to resolve the issue?
Is there any cached information in SQL that is saved between restarts that might explain how the slow performance can persists through restarting the OS and hypervisor/VM at times?