Hi All,
Need some guidance here.
We have multiple integrations with our centralized data warehouse database.
Recently, a new application got integrated and management has defined an SLA of 2secs for every query coming from that application.
Its kind of an API call. Now, that particular team is complaining that sometimes the api calls are taking 16 secs and sometimes 25 secs and sometimes finishes fast.
The app team is complaining it is all database issue. A lot of drama is in place as of now. I need some inputs here.
My observations
1. They are tracking overall time taken for the API call and pointing fingers at DB Team.
2. API call not just sends one sql stmts but it does a bunch of things in the call ( eg . . selects , inserts , updates ...).
3. The app team cannot repro at will. API call happens randomly in a week by the external users. There is no pattern or a particular time frame.
4. App Team didnt provide the sql stmts which the call as part of the API call. All they say is, to run continous trace and 1 day it will trace filled like 2.6 TB worth data even after keeping db filter in the server side trace.
we are collecting sql:starting,sql stmt:completed. batch starting , batch completed, RPC starting, RPC completed. events with selected columns like spid,textdata,txnid,dbid,etc..
5. We don't have any monitoring tool in place to really tell whether it is db is the issue or db has completed run queries but it is something else which is slowing down.
They just chasing db team.
6. We are seeing heavy blocking and txn's being held for long durations. A lot of stale connections opened for more than 2-4 days status=sleeping coming from JDNC client app.
This issue is ongoing for past 2 weeks and we are no where. sometimes, we haven't collected the right events and we had to redo our data collections.
Sometimes, the issues doesn't happen at all because external users didn't make that call.
Initially , we thought of collecting statements taking more duration than 2 milli secs and when we show cased those sql stmts, they never belonged to that API call.
The App team were not able to provide us the right info and going through back and forth.
I suggested to call server side trace before API call and stop it after API call. But they are not listening and even management is supporting the App team.
Going through a rough time. We suggested few things like OPTION WITH RECOMPILE for statements sometimes taking more and sometimes taking less time.
Also, we have enabled RCSI and Snapshot isolation levels at database level. However, they are still seeing blocking.
Please suggest on how can we go about troubleshooting this particular issue. we don't really know if sql is problem or their app processing has any problem or any network issues.
We are on SQL 2017 Enterprise Edition.
Regards,
Sam