SQL Server Query Optimizer estimates ~1 row, but EQ_ROWS in the statistics shows ~12000

PetervdH 61 Reputation points
2021-12-17T19:45:26.623+00:00

I have a (composite) index consisting of columns A, B, C and D. The statistics are up to date and show an EQ_ROWS value of 12000 for value 'x' (which is the RANGE_HI_KEY) for column A. When I do a SELECT * FROM <table> WHERE A = 'x', the estimated query plan shows an estimated number of rows of ~1 for all operators (the index seek on the mentioned composite index, the key lookup and the nested loop). How is this possible? I would have expected ~12000 for the index seek, for example.

Note that in reality my columns have proper names, but for company security reasons I cannot use the real names. I hope my explanation is clear enough.
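
As a rough sketch with the placeholder names from above (the table and index names are made up), this is what I am comparing:

-- The histogram of the composite index shows ~12000 rows for value 'x' on column A
DBCC SHOW_STATISTICS ('dbo.MyTable', IX_MyTable_A_B_C_D) WITH HISTOGRAM

-- Yet the estimated plan for this query shows ~1 row on every operator
SELECT * FROM dbo.MyTable WHERE A = 'x'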

SQL Server | Other

Answer accepted by question author
  1. Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator
    2021-12-24T14:47:48.963+00:00

    I am afraid that I don't have a good answer for you. I know that there are imperfections in Showplan and sometimes information does not display fully accurately. In this case, I notice that there are a few more filter conditions on the DISPENSEEVENT table than just the CLIENTID. The remaining predicates are filtered in the Key Lookup, which is just below the Index Seek. It is possible that the total estimate for all these predicates combined is 1 row.

    In any case, it seems to me that there is every reason to consider adding an index on (CLIENTID, MDEMEDICALGUIDANCESTATEID, DISPENSEEVENTPROCESSORIGINID, PARENTDISPENSEEVENTID, MDEAPPROVALSTATEID) INCLUDE (STATE). I need to add the caveat that I only see this query, and I don't know how dynamic these conditions are. That is, if this query exists in multiple permutations, it may not be wise to add indexes for all of them.
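
    Expressed as a statement, it would look roughly like this (the index name is made up; the column list is the one listed above):

    CREATE INDEX clientid_covering_ix
        ON DISPENSEEVENT (CLIENTID, MDEMEDICALGUIDANCESTATEID, DISPENSEEVENTPROCESSORIGINID,
                          PARENTDISPENSEEVENTID, MDEAPPROVALSTATEID)
        INCLUDE (STATE)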


12 additional answers

  1. Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator
    2022-01-06T13:35:16.07+00:00

    A couple of years back I ran into a somewhat similar situation. In that case, there was a query where the optimizer lost track of three join conditions and only looked at a single WHERE filter on a non-key column, ending up with an estimate of billions of rows. The query had been in production at several sites for a few years without issues when one customer complained about poor performance when running a test. When I investigated it, I found that the bad estimate was always there, even when performance was good. We were about to open a support case, with the idea of telling Microsoft that they had to fix this bug. But I was able to make a small tweak to the query which removed the bad estimate.

    So those may be things you should look at: tweak the query or the indexes.

    In the meantime, to avoid pain in production, recall that you can use Query Store to force a good plan.
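
    For instance, once you have located the query and the good plan in the Query Store views (sys.query_store_query_text, sys.query_store_query and sys.query_store_plan), forcing the plan is a one-liner. The ids below are placeholders:

    -- Force the known-good plan for the query in question.
    EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 17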


  2. PetervdH 61 Reputation points
    2022-01-07T19:49:48.067+00:00

    @Erland Sommarskog , I am wondering whether I might have found a clue to this. Today I twice got a bad plan for this query. In both cases the compile-time value for CLIENTID was different, and the EQ_ROWS in INDX3 for these values was much larger than 1,00xxx, yet the estimate was again 1,00xxx. However, by looking at the plan XML I realized that many more statistics objects are listed for the table than what is used in the final plan. I found out that (among others) another index, let's call it INDX4, was listed as a statistics object, even though it is not used as an object in the plan itself at all; the object used there is INDX3. However, INDX4 had, for exactly these two CLIENTID values, an EQ_ROWS of 1,00xxx in its histogram. The reality is not ~1 row; the value is caused by bad samples when the statistics are recalculated overnight.

    Now, when a full scan is done, all stats objects that start with CLIENTID should have the same (or almost the same) EQ_ROWS. Could it be that the optimizer uses just one of the stats objects to get the value for EQ_ROWS and assumes it is the same for all objects starting with CLIENTID? And would that explain why it shows 1,00xxx as estimated rows for the CLIENTID (based on that particular stats object), whereas INDX3 has a very different EQ_ROWS? If this is true, I can imagine the optimizer assuming this; after all, statistics are the thing it treats as being correct. However, if you use sampled statistics and need to investigate a plan, you can easily be caught on the wrong foot here.
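
    As a sketch of how I am comparing the histograms from the DMVs (this assumes SQL Server 2016 SP1 CU2 or later for sys.dm_db_stats_histogram, and uses the table name mentioned earlier in the thread):

    -- Compare EQ_ROWS for one CLIENTID value across all statistics objects
    -- whose leading column is CLIENTID (e.g. INDX3 and INDX4).
    DECLARE @clientid int = 12345;   -- placeholder: the compile-time value in question

    SELECT s.name AS stats_name, h.range_high_key, h.equal_rows
    FROM   sys.stats AS s
    JOIN   sys.stats_columns AS sc ON sc.object_id = s.object_id
                                  AND sc.stats_id  = s.stats_id
                                  AND sc.stats_column_id = 1
    JOIN   sys.columns AS c ON c.object_id = sc.object_id
                           AND c.column_id = sc.column_id
    CROSS  APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
    WHERE  s.object_id = OBJECT_ID('dbo.DISPENSEEVENT')
      AND  c.name = 'CLIENTID'
      AND  h.range_high_key = @clientid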

    Do you know whether this is the way the optimizer works? And if so, do you know how it picks the stats object? Or is that more or less random, or ordered by object id, or something else?


  3. Erland Sommarskog 128.9K Reputation points MVP Volunteer Moderator
    2022-01-08T17:05:09.143+00:00

    Could it be that the optimizer uses just one of the stats objects to get the value for EQ_ROWS and assumes it is the same for all objects starting with CLIENTID? And would that explain why it shows 1,00xxx as estimated rows for the CLIENTID (based on that particular stats object), whereas INDX3 has a very different EQ_ROWS? If this is true, I can imagine the optimizer assuming this; after all, statistics are the thing it treats as being correct.

    Yes, I think you are on to something there.

    In the case where there are multiple indexes with the same leading column, I would expect the optimizer to look at all statistics for the density information. But I would only expect it to read one histogram. Reading more would only be confusing. (Just think if you were to optimize something and you were given multiple histograms for the same data!)

    One thing we should keep in mind: the optimizer reads statistics before it decides on indexes. So it may well read statistics for one index and then use a different one.

    So which statistics object does it read? I don't know. I did some googling and checked some books on my bookshelf, but came back empty-handed. To compensate for this, I did some testing. It was by no means thorough testing, so my results may be due to happenstance, and my conclusions may be incorrect.

    What I seemed to see, though, is that if there are two statistics for the same column, it picks the one with the lowest modification counter (that is, the most recently updated one), and if those are the same, it picks the one with the highest sample rate. This certainly makes sense. The fact that a statistics object was created with FULLSCAN is of little relevance if the data has changed a lot since it was produced.
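
    If you want to look at those two factors yourself, both can be read from sys.dm_db_stats_properties. A sketch (the table name is just an example):

    -- Modification counter and effective sample rate per statistics object on a table.
    SELECT s.name AS stats_name,
           sp.last_updated,
           sp.modification_counter,
           100.0 * sp.rows_sampled / NULLIF(sp.rows, 0) AS sample_pct
    FROM   sys.stats AS s
    CROSS  APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE  s.object_id = OBJECT_ID('dbo.TallOrders')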

    I was actually able to compose a repro. To try it, you need the database BigDB, which you find at https://www.sommarskog.se/present/BigDB.bak. (It's a compressed backup for SQL 2016 which expands to 20 GB.) First run this:

    -- Two indexes with the same leading column (EmployeeID).
    CREATE INDEX empid_shipper_ix ON TallOrders(EmployeeID, ShipVia)
    CREATE INDEX empid_custid_ix ON TallOrders(EmployeeID, CustomerID)
    
    -- Refresh the statistics on empid_custid_ix, so that they are the newer of the two.
    UPDATE STATISTICS TallOrders empid_custid_ix
    

    Note here that EmployeeID and ShipVia are int, while CustomerID is nchar(5), so for a query with only EmployeeID like the one below, the optimizer will always pick empid_shipper_ix since it is more slender.

    Run this query:

    SELECT SUM(Freight) FROM TallOrders WHERE EmployeeID = 39
    OPTION (RECOMPILE)
    

    When I run it, the estimate matches the actual number of rows perfectly, because there is a histogram step with 39 as RANGE_HI_KEY. Thus, the statistics are taken from empid_shipper_ix. And, thus, it seems that the optimizer does not care that the statistics on empid_custid_ix were updated later.
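
    If you want to verify this, you can inspect the histogram yourself:

    -- Look for the step with RANGE_HI_KEY = 39; its EQ_ROWS should match the estimate.
    DBCC SHOW_STATISTICS ('TallOrders', empid_shipper_ix) WITH HISTOGRAM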

    Now run:

    INSERT TallOrders (OrderID, CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate,
                       ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, 
                       ShipPostalCode, ShipCountry, Discount, Status, TotalAmount)
       SELECT TOP 10000 OrderID + 1000000000, CustomerID, EmployeeID, OrderDate, RequiredDate, ShippedDate,
                       ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, 
                       ShipPostalCode, ShipCountry, Discount, Status, TotalAmount
       FROM   TallOrders
    
    
    -- Resample empid_custid_ix (default sample rate), so that it again has the most recent statistics.
    UPDATE STATISTICS TallOrders empid_custid_ix
    

    Now try the query again:

    SELECT SUM(Freight) FROM TallOrders WHERE EmployeeID = 39
    OPTION (RECOMPILE)
    

    The exact outcome depends on how empid_custid_ix was sampled, but right now I see an estimate of 115138 rows, which matches very well with the RANGE_ROWS for a histogram step with 41 as RANGE_HI_KEY. Depending on how your sampled statistics turn out, you may want to try a different value.
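
    To find a value that demonstrates the effect on your copy, you can compare the two histograms after the resampling and pick an EmployeeID where the EQ_ROWS differ clearly:

    DBCC SHOW_STATISTICS ('TallOrders', empid_custid_ix) WITH HISTOGRAM
    DBCC SHOW_STATISTICS ('TallOrders', empid_shipper_ix) WITH HISTOGRAM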


  4. PetervdH 61 Reputation points
    2022-01-09T14:56:06.9+00:00

    Hi @Erland Sommarskog , thank you for this. I have downloaded your database and see exactly the same behavior as you described. All this indeed confirms the mysterious cause of the ~1 row estimate we have experienced; now I know where it is coming from. We have smaller and larger clients, so the real number of rows for a small and a large CLIENTID can be quite different. The fact that sampling is used for the nightly statistics updates sometimes causes 0, ~1 or very few rows in EQ_ROWS for a smaller CLIENTID, even though such a CLIENTID has thousands of rows in reality. I have noticed now that when EQ_ROWS is a few thousand, the right query plan is created. Or, at least, a query plan that works for all CLIENTID values.
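
    One thing I am considering, to keep the nightly update from producing these bad samples for that statistics object, is to pin a full scan on it. A sketch only; PERSIST_SAMPLE_PERCENT requires SQL Server 2016 SP1 CU4 or later, and DISPENSEEVENT and INDX3 are the placeholder names from earlier in the thread:

    -- Rebuild the statistics with a full scan and keep that sample rate for future updates.
    UPDATE STATISTICS DISPENSEEVENT INDX3 WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON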

    I will play around with your database a bit further to see whether I can find out more about the rules it uses to pick statistics objects. One question: you mentioned "the optimizer will always pick empid_shipper_ix since it is more slender." Do you mean more slender purely based on the data type definitions? So, suppose the more slender index were actually much larger in number of pages due to heavy fragmentation, would it not consider that, but still go by the data type definitions?

