SELECT *, Indexes, Data Access, and the Optimizer!
Another great “Database Byte” by our US PFE guest writer Susan Van Eyck! This time on the pitfalls of using SELECT * on huge tables and the effects on the optimizer.
A customer of mine recently had a problem – a legitimate need to run a SELECT * query (mind you, they had written out all of the column names!).
Why is this a problem? Because it’s hard to execute efficiently – especially if it’s run against a large table or returns a high row count. Let’s look at why.
Here are our options for retrieving data from a table with a clustered index. [Note that, under the hood, they’re all some combination of Index Seeks and/or Scans.]
Clustered Index Scan
The I/O cost is the number of data pages used to store the table’s data. Every page in the table is read looking for column values that match our search arguments. This is often how data is accessed for a SELECT * query.
This approach can actually be an efficient means of reading data from small tables, but it's quite costly for larger tables, especially if data needs to be read from disk (physical reads) to get the entire table into memory (the buffer pool). A side effect of the physical reads is often the eviction of data needed by other processes. This manifests as a plunge in Page Life Expectancy and poor performance server-wide.
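You can watch this happen with STATISTICS IO. Here's a minimal sketch (dbo.Orders and ShipCity are hypothetical stand-ins for your own table and column):

SET STATISTICS IO ON;

-- No index on ShipCity, so every page of the table must be read
SELECT *
FROM dbo.Orders
WHERE ShipCity = N'Redmond';

SET STATISTICS IO OFF;

The Messages tab will report the logical reads (from memory) and physical reads (from disk) the scan performed.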
Clustered Index Seek +/- Range Scan
Where data can be retrieved using a table's Clustered Index key(s) (as in Joins or Key Lookups), the I/O cost is the depth of the Clustered Index's B-Tree structure. If we're retrieving a range of values (WHERE CustomerId BETWEEN 1 AND 1000), query execution starts with a Clustered Index Seek to find the page containing the first row, then does a mini Clustered Index Scan to retrieve the additional rows from adjacent pages, adding a few more I/Os to the query cost.
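A quick sketch of the pattern (assuming a hypothetical dbo.Customers table clustered on CustomerId):

-- Seek to the first row where CustomerId = 1, then range-scan
-- adjacent leaf pages until CustomerId exceeds 1000
SELECT *
FROM dbo.Customers
WHERE CustomerId BETWEEN 1 AND 1000;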
You can use this query to see the depths and page counts for your Clustered Indexes (type = 1):
SELECT OBJECT_SCHEMA_NAME( i.object_id ) + N'.' + OBJECT_NAME( i.object_id ) AS table_name,
       i.name, s.index_depth, s.page_count
FROM sys.indexes i
CROSS APPLY sys.dm_db_index_physical_stats( DB_ID(), i.object_id, i.index_id, NULL, NULL ) s
WHERE i.type = 1          -- clustered indexes only
  AND s.index_level = 0   -- leaf level
ORDER BY table_name, i.index_id;
Nonclustered Index Seek + Key Lookup (into the Clustered Index)
When using a non-covering nonclustered Index (one that doesn't contain all the columns needed to execute a query), the I/O cost is the depth of the nonclustered Index (plus a few pages if we're after a range of values) plus (the number of rows returned from the nonclustered Index Seek * the depth of the Clustered Index). That last bit is important because it means the I/O cost grows in direct proportion to the row count: each Key Lookup requires a separate Clustered Index Seek.
| Row count | I/O cost if Clustered Index depth = 20 |
|----------:|---------------------------------------:|
|         1 |                                      20 |
|        10 |                                     200 |
|       100 |                                   2,000 |
|     1,000 |                                  20,000 |
That makes this a cost-effective approach for low row count queries, but as the row count increases, you will hit a “tipping point” where the I/O cost of many Key Lookups becomes greater than the cost of a single Clustered Index Scan. At that point the Optimizer will choose the Scan over the more costly Lookups. This is often what’s happening when the Optimizer refuses to use an index you’ve created.
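If you want to see the tipping point for yourself, here's a sketch (dbo.Orders and ix_InsertedDate are hypothetical): force the nonclustered index with a hint and compare its reads to the plan the Optimizer picks on its own.

SET STATISTICS IO ON;

-- Forced plan: nonclustered Index Seek + one Key Lookup per row
SELECT *
FROM dbo.Orders WITH ( INDEX ( ix_InsertedDate ) )
WHERE InsertedDate >= '20180201';

-- Optimizer's choice: past the tipping point it scans instead
SELECT *
FROM dbo.Orders
WHERE InsertedDate >= '20180201';

SET STATISTICS IO OFF;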
Nonclustered Index Seek +/- Range Scan
If you can create a covering nonclustered index (one containing all the columns needed for the query), the I/O cost is the depth of the index plus any additional pages you need to scan if you’re querying for a range of values. Period. You have reached index nirvana. Of course, there’s no free lunch! The trade-off here is that adding columns to make the index covering increases its size: a larger footprint on disk, longer-duration index maintenance and a higher I/O cost when reading from it. To optimize its performance, add only the necessary columns, and leverage the INCLUDE clause for less-selective columns and those not needed to locate the data in the index:
-- dbo.SomeTable is a placeholder; substitute your own table name
CREATE INDEX ix_yuckky ON dbo.SomeTable (ColA, ColB, ColC, ColD, ColE, ColF);
CREATE INDEX ix_smaller ON dbo.SomeTable (ColA, ColB) INCLUDE (ColC, ColD, ColE, ColF);
Nonclustered Index Scan
Lastly, if you have a nonclustered Index that contains all the right columns, but perhaps not in the right order, the Optimizer might still opt to Scan it. As with the Clustered Index Scan, the I/O cost is the number of pages at level 0 of the index (the leaf-level index pages). Since a nonclustered Index is typically a lot smaller than its clustered parent, this reduces the I/O cost.
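Using the hypothetical ix_smaller index above as an example: a query that filters only on ColB can't seek (the index leads with ColA), but since the index covers every referenced column the Optimizer may scan it rather than the much larger clustered index.

-- ix_smaller leads with ColA, so no seek is possible here,
-- but the index covers the query and is cheaper to scan
SELECT ColA, ColB, ColC
FROM dbo.SomeTable
WHERE ColB = 42;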
So – back to my customer’s problem. The table was quite wide, not partitioned, and contained 10 years’ data. They needed to grab all columns, but only recently inserted rows, to export them to another system. Someone had created a nonclustered Index on the InsertedDate column, but the row counts were high enough (and the Clustered Index deep enough) that the Optimizer chose to perform Clustered Index Scans, resulting in long-duration, high-I/O (logical and physical) queries that thrashed the buffer pool, impacting all database users.
Remediation options we discussed included…
- Changing the Clustered Index keys so the data was sorted by InsertedDate. This would have introduced all sorts of other complications.
- Creating a covering nonclustered Index. The index would be huge, being essentially a copy of the Clustered Index with the data sorted differently.
- Deleting older data that was no longer needed, or moving it out to separate Archive tables for safe-keeping.
- Partitioning the table on InsertedDate to leverage “partition elimination”. A partitioned table is actually multiple, physically separate tables that are logically and transparently joined. While partitioning is primarily about managing really large tables (divide and conquer), you can get performance benefits if queries are constructed so only relevant partitions need be read.
- Creation of an Indexed View containing only recently inserted data. This smaller view could then be scanned instead of the table, but would require rewriting code.
- Creation of a Filtered Index that contained all columns but only recently inserted rows. This looked really promising but turned out to have a few challenges: 1) the filter clause must use a literal value (‘20180201’) – no sliding windows – and 2) the Optimizer won’t use a Filtered Index in parameterized queries (look for UnmatchedIndexes warnings in Execution Plans). A sketch of the idea follows this list.
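Here's that sketch (dbo.Orders again, with hypothetical column names; the real index would INCLUDE every exported column):

-- The filter must be a literal date; no GETDATE(), so the index
-- would need periodic rebuilding to keep the "recent" window moving
CREATE NONCLUSTERED INDEX ix_Orders_Recent
ON dbo.Orders ( InsertedDate )
INCLUDE ( ColA, ColB /* ...and the rest of the exported columns */ )
WHERE InsertedDate >= '20180201';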
Our adventures are continuing as we try to find a suitable solution.