Distinct Aggregation Considered Harmful
Distinct aggregation (e.g. select count(distinct key) …) is a SQL language feature that results in some very slow queries. It’s particularly frustrating that you can take a perfectly efficient query with multiple aggregates, and make that query take forever just by adding a distinct keyword to one of the aggregates. For instance, it often makes logical sense to compute distinct counts “along the way” while you are computing other aggregates. Recently, I've seen a number of customer queries like this:
select
sum(salary),
max(salary),
count(employeeid),
count(distinct employeeid),
count(distinct salarygrade)
In practice, however, SQL Server 2008 does not compute these distinct aggregates “along the way.” Mixing one or more distinct aggregates with non-distinct aggregates in the same select list, or mixing two or more distinct aggregates, leads to spooling and rereading of intermediate results – which can be more expensive than computing the distinct aggregates in a separate query! In the worst case, such as the fragment of code above, this is sort of inevitable. The SQL Server 2008 query processor cannot compute aggregates on a stream without destroying the stream. So computing the distinct aggregate consumes the stream, which has to be recomputed for the other aggregates.
Fortunately, for select lists with only one distinct aggregate, you can rewrite the input SQL in a way that does not have this problem. SQL Server 2008’s optimizer does not do this rewrite for you. (Disclaimers: all the aggregates have to be algebraic [Gray et al 1996], and no guarantees are made that the behavior of the rewrite is exactly the same, particularly when there are arithmetic overflows.)
Getting Rid of Distinct Aggregates
The code below shows an example rewrite using the AdventureWorksDW database distributed with SQL Server 2008. The rewrite gives me 5x speedup on my desktop even on this small database! What’s going on is that we are breaking the aggregation into two aggregation steps. First, we build an intermediate result called PartialSums. We group together all the values for the distinct aggregate (adding the distinct aggregate’s column to the GROUP BY list), while building partial aggregations for all the non-distinct aggregates (in this example, just count(*)). Then in the second step, we use the original GROUP BY list, which in this example is empty, and collect the partial aggregates together. Note how the aggregate functions used depend on the initial aggregates: count becomes count, followed by sum. Note that in sum(1), the 1 is actually a constant number value, not a column reference.
As far as I can tell, this rewrite is never worse than the spooled plan. It parallelizes better, uses tempdb better, and does less logical I/O.
set statistics profile on
set statistics time on
go
use adventureworksdw
go
--
-- Use a distinct aggregate and a normal aggregate in the same select list
-- over some complex (one or more joins) derived table
--
with FISinFRS as (
select * from factinternetsales FIS
where ProductKey in
(select ProductKey from FactResellerSales)
)
select
count(*) as CountStar,
count(distinct ProductKey) as CountProductKeys
from FISinFRS
go
--
-- Now use partial aggregations in a new derived table
-- This is equivalent, but SQL Server 2008 does not do this transformation
-- for you
--
with FISinFRS as (
select * from factinternetsales FIS
where ProductKey in
(select ProductKey from FactResellerSales)
)
, PartialSums as (
select
count(*) as CountStarPartialCount
from FISinFRS
group by ProductKey
)
select
sum(CountStarPartialCount) as CountStar,
sum(1) as CountProductKeys
from PartialSums
References
[Gray et al 1996] Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals (1996), ftp://ftp.research.microsoft.com/pub/tr/tr-95-22.pdf.
By Marc Friedman, 9/2008
Marc.friedman@donotreply.microsoft.com
Comments
Anonymous
September 23, 2008
The comment has been removedAnonymous
September 23, 2008
Apologies, Marc Friedman, for addressing my previous comment to the wrong person. For some reason I thought this was Conor Cunningham's blog.Anonymous
September 25, 2008
The comment has been removedAnonymous
September 29, 2008
196 Microsoft Team blogs searched, 97 blogs have new articles in the past 7 days. 218 new articles found...Anonymous
July 09, 2009
How about using DENSE_RANK() select DENSE_RANK() over ( Order by a.col1, a.col2 ) as Ranky, a.col1, a.col2 from Emp where ... group by a.col1, a.col2 --the above query style is giving distinct rows.