We have an Azure SQL database that is currently experiencing performance issues. The problem started sometime between 7:00 am CST on Saturday (10/23) and 2:00 am CST on Sunday (10/24).
Our Normal Activity/Workload
Our system runs 3 sets of "heavy" activities overnight. Each set maxes out the database for a few minutes and then drops way off. During normal US business hours (7:00 am - 7:00 pm CST, M-F) there is usually modest activity (~40% max, ~20% average). There is very little activity at other times.
On Saturday morning, our heavy overnight activities worked as expected.
The database is about 15 GB in size. It is currently running on the S4 (200 DTU) pricing tier.
Since Sunday Morning
On Sunday morning at 2:00 am, the heavy activities hit timeout errors while trying to bulk load data into the database. The queries that failed typically take 2-5 seconds to execute. On Sunday morning (and again on Monday morning) those queries timed out after 30 seconds.
Our normal user traffic on Monday has also been experiencing timeouts. The queries seem to time out at random: the same query will work one time and then time out the next. So far, I have not been able to find a pattern in which queries fail or when. This is happening with a variety of queries across the system. While the behavior of any individual query seems random, the system as a whole is consistently having timeouts.
During this entire time, we rarely go above 15% DTU usage on the database.
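For anyone who wants to cross-check that number: as far as I understand, the DTU percentage is the highest of the CPU, data IO, and log write percentages reported by the sys.dm_db_resource_stats view. Below is a minimal sketch of that calculation in Python. The sample rows are made-up illustrative values, not our actual measurements, and dtu_percent / peak_dtu are hypothetical helper names.

```python
from typing import Iterable, Mapping

# T-SQL that could be run against the user database itself (not master).
# sys.dm_db_resource_stats keeps roughly one row per 15 seconds for about an hour.
RESOURCE_STATS_QUERY = """
SELECT end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
"""

# Illustrative rows only (assumption): shaped like the query result above.
SAMPLE_ROWS = [
    {"avg_cpu_percent": 12.0, "avg_data_io_percent": 3.5, "avg_log_write_percent": 14.2},
    {"avg_cpu_percent": 8.0, "avg_data_io_percent": 1.0, "avg_log_write_percent": 2.5},
]


def dtu_percent(row: Mapping[str, float]) -> float:
    """DTU usage for one sample: the largest of the three resource percentages."""
    return max(
        row["avg_cpu_percent"],
        row["avg_data_io_percent"],
        row["avg_log_write_percent"],
    )


def peak_dtu(rows: Iterable[Mapping[str, float]]) -> float:
    """Highest DTU usage seen across all samples."""
    return max(dtu_percent(row) for row in rows)
```

Running the query through any SQL client and feeding the rows to peak_dtu should show whether the 15% figure holds for each individual resource or only on average.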
We restarted all services that connect to the database to make sure that any rogue process was killed. This did not help.
The query performance items in Azure just tell us what we already know: some queries are timing out.
We also attempted to scale up the database. We mostly did this because we wanted to "restart" the database and there is no direct way to do that in Azure. We are scaling from an S4 to an S6, but it is taking much longer than usual. We are currently almost 4 hours into the operation and only 32% complete. When we've done this in the past, it has normally taken less than 10 minutes.
What I'd like to know/help with
1) Why is the scaling of the database taking so long?
2) Why are we suddenly having timeouts with these queries?