Self Joins…ROW_NUMBER()…LAG()! … Oh My!

 

In this post, our US-based Senior PFE Susan Van Eyck walks through several query patterns for calculating differences between rows of cumulative data while keeping the query performant.  Check it out!

 

Today’s Byte was inspired by a question from a colleague.  He had a bunch of data from a monitoring system that sampled cumulative data every 5 minutes.  It looked a bit like this:

SampleTime      EventCount
12:00:00            2656
12:05:00            3310
12:10:00            3876
12:15:00            4561
12:20:00            5276
12:25:00            6759
12:30:00            8065

 

The customer he was working with wanted to see the data displayed as Events/Second.  To do that, he first needed to calculate the difference in EventCount between each pair of adjacent rows in the table.
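If you’d like to follow along, here’s a minimal sketch of the table loaded with the sample rows above (the column types are my assumption; the attached script below builds the real test data):

[sql] -- Minimal sketch of the sample table; the TIME/BIGINT types are assumptions.
CREATE TABLE EventTracking ( SampleTime TIME NOT NULL, EventCount BIGINT NOT NULL );

INSERT INTO EventTracking ( SampleTime, EventCount )
VALUES ( '12:00:00', 2656 ), ( '12:05:00', 3310 ), ( '12:10:00', 3876 ),
       ( '12:15:00', 4561 ), ( '12:20:00', 5276 ), ( '12:25:00', 6759 ),
       ( '12:30:00', 8065 ); [/sql]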

Back in the old days (I know some of you remember), we would have had to tackle this with a self join, something like this:

 

[sql] SELECT a.SampleTime AS IntervalStart, b.SampleTime AS IntervalEnd,
       ( b.EventCount - a.EventCount ) AS IntervalEvents
FROM EventTracking a
INNER JOIN EventTracking b ON b.SampleTime = DATEADD( MINUTE, 5, a.SampleTime ); [/sql]

 

It’s expensive since the table must be read twice, and it’s not very robust.  Consider what would happen if our timestamps had a little variability, looking more like 12:00:03, 12:05:01, 12:09:59.  Our date math would no longer be reliable.  I remember writing some crazy code to try cleaning up data like this!
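To see why the date math breaks, here’s a quick sketch (the jittered timestamps are hypothetical):

[sql] -- Adding 5 minutes to the jittered 12:00:03 yields 12:05:03, which will
-- never equal the next sample's 12:05:01, so the join silently drops rows.
SELECT DATEADD( MINUTE, 5, CAST( '12:00:03' AS TIME )) AS ComputedJoinKey; [/sql]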

Clearly we need a better column to join on – something like a row number – which we can easily generate using the ROW_NUMBER() function introduced in SQL Server 2005!  So my second iteration looked something like this:

 

[sql] WITH NumberedRows AS
( SELECT ROW_NUMBER() OVER ( ORDER BY SampleTime ) AS RowNum, SampleTime, EventCount
  FROM EventTracking )
SELECT a.SampleTime AS IntervalStart, b.SampleTime AS IntervalEnd,
       ( b.EventCount - a.EventCount ) AS IntervalEvents
FROM NumberedRows a
INNER JOIN NumberedRows b ON b.RowNum = ( a.RowNum + 1 ); [/sql]

 

It’s really a lot like the first query, but I’m using a CTE (Common Table Expression) to tack a row number onto each row in the table, then joining each row to the next one by an offset of 1 on that row number.  It’s more reliable, but just as expensive as the first query since the table is still being read twice.

Well, my colleague had proposed a solution using the LAG() function introduced with SQL Server 2012, and it turns out he was on the right track.  Here’s what our query looks like using LAG():

 

[sql] SELECT LAG( SampleTime, 1 ) OVER ( ORDER BY SampleTime ) AS IntervalStart,
       SampleTime AS IntervalEnd,
       ( EventCount - LAG( EventCount, 1 ) OVER ( ORDER BY SampleTime )) AS IntervalEvents
FROM EventTracking; [/sql]
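Run against the sample data above, the query returns something like this (note the NULLs in the first row, since there is no row lagging behind 12:00:00):

IntervalStart   IntervalEnd   IntervalEvents
NULL            12:00:00      NULL
12:00:00        12:05:00      654
12:05:00        12:10:00      566
12:10:00        12:15:00      685
12:15:00        12:20:00      715
12:20:00        12:25:00      1483
12:25:00        12:30:00      1306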

 

The snippet of code below translates to “relative to the current row, if we order the data by SampleTime, what is the value of EventCount on the row that lags behind the current row by 1 (i.e., the previous row)?”.

 

[sql] LAG( EventCount, 1 ) OVER ( ORDER BY SampleTime ) [/sql]

 

There’s some SQL Server magic that happens under the hood so the table only has to be read once, cutting the I/O cost of the query in half and giving us a great performance boost!
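You can see the difference for yourself with STATISTICS IO (a rough sketch; the attached script does a fuller comparison):

[sql] SET STATISTICS IO ON;

-- Run the LAG version, then one of the two-pass versions above.  The
-- Messages tab should show roughly twice the logical reads against
-- EventTracking for the self-join and ROW_NUMBER queries.
SELECT SampleTime AS IntervalEnd,
       EventCount - LAG( EventCount ) OVER ( ORDER BY SampleTime ) AS IntervalEvents
FROM EventTracking;

SET STATISTICS IO OFF; [/sql]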

In the attached script (below), we’ll generate some bogus data then look at the I/O cost of each query – and add the finishing touches to meet the customer’s ask that the data be displayed as Events per second.
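As a preview, here’s one way that finishing touch might look (a sketch, not necessarily the exact query in the script): divide each interval’s event delta by its duration in seconds, which also stays correct if the sample times drift off their 5-minute marks.

[sql] SELECT SampleTime AS IntervalEnd,
       ( EventCount - LAG( EventCount ) OVER ( ORDER BY SampleTime )) * 1.0
       / NULLIF( DATEDIFF( SECOND,
                           LAG( SampleTime ) OVER ( ORDER BY SampleTime ),
                           SampleTime ), 0 ) AS EventsPerSecond
FROM EventTracking; [/sql]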

 

20180228_SelfJoins_RowNumber_Lag

 

Happy Exploring