We have an Azure SQL Database to manage our Customer data.
I have a reasonably complicated query that has been getting stuck with an HTDELETE wait. Most recently it timed out after 4 hours. This was in our QA environment. The same query finishes in less than 1 minute in our Production environment. The former has tables that are all smaller than the latter, e.g. the largest table is 54 million rows vs. 72 million. Proportionally they are in a similar in a relative sense.
Using sp_whoisactive I could see that the execution plan had it using Batch mode even though all the tables are Rowstore.
I took the code that was running on QA and generated an estimated execution plan for it on the Production instance. The execution plan was quite different and estimated using row store not batch mode.
Eventually the ADF that was running the select on QA failed with a timeout. Using what I read about the HTDELETE wait here https://www.sqlskills.com/help/waits/htdelete/, I added OPTION(USE HINT('DISALLOW_BATCH_MODE')) to the end of the query and generated the plan. That matched the one in Production without that hint. I took the hint back off and generated the plan again, and it returned to the batch mode one. I tried running it directly and it hung up on the HTDELETE. Killed it and ran it with the hint and it was finished in 17 seconds.
I've checked all the obvious things like table structure and they are matched, including indices. I will also try rebuilding the indices which I assume will also refresh stats and may cause it to use the non-batch plan.
My main question is, why does the HTDELETE wait never finish? There were no other processes running on the QA server at the time except a couple of always running monitoring tasks that spend most of their time asleep. Using the sp_whoisactive output I can see the time waiting is almost the entire time the query has been running.