Alternative to using CROSS APPLY (SELECT TOP 1 .... ORDER BY DtTm Desc)

Michael MacGregor 86 Reputation points
2020-11-05T15:52:19.11+00:00

I will try my best to provide some example tables and as detailed an explanation as possible. If anything is not clear, please ask. The design is not in 3NF, don't blame me for it, I didn't design it and it would be a mammoth task to redesign it, and I'm not allowed to anyway.

This particular issue tends to occur a lot because of the nature of the data, which is time-based and voluminous: hundreds of millions of rows that are sliced and diced many different ways, in many different tables.

The base data is captured at particular times throughout the day for each product; let's call these ProductEvents. Each event has an EventWindow, which is the time in seconds between its EventDtTm and the next event's, pre-calculated by the process that creates the events. N.B. Neither the times of events nor the lengths of their windows are consistent throughout the day or across products:

CREATE TABLE dbo.Products
(
    ProductID INT IDENTITY(1,1) PRIMARY KEY,
    ProductName VARCHAR(10)
);

CREATE TABLE dbo.ProductEvents
(
    EventID INT IDENTITY(1,1) PRIMARY KEY,
    ProductID INT,
    EventTypeID INT,
    EventDtTm DATETIME2(7),
    EventWindow INT
);

Although there is an EventTypeID (and therefore an EventType table), I'm only interested in one particular event out of about a dozen or so. This is the case in almost all of the various instances of this coding issue; the tables and data do vary somewhat, but this is the most difficult of them, the rest are easier. Read on and it should become clear, I hope.

Now there is another set of events that are strictly taken, or rather pre-calculated, for every minute; some may be missing depending on the availability of the raw data we obtain from various external sources. The design is such that there is metadata for each product per day providing a MetaID, and then there is the detail data per minute per product per event type; again, I'm only interested in one particular event.

N.B. There are multiple sources for all of this data but it doesn't really affect the overall query as the sources have to match.

CREATE TABLE dbo.MinuteEventsMetaData
(
    MetaID INT IDENTITY(1,1) PRIMARY KEY,
    ProductID INT,
    MinuteEventDt DATE
);

CREATE TABLE dbo.MinuteEventsDetail
(
    MetaID INT,
    ValueTypeID INT,
    EventMinute TIME(2),
    EventValue FLOAT,
    PRIMARY KEY (MetaID, ValueTypeID, EventMinute) -- one row per minute per value type per MetaID
);

So the objective is to find the value for a particular product event at percentages of the event window - the percentages can change each time but let's stick with 25, 50, & 75 for the sake of simplicity.

If we have a ProductEvent for which the time is 8:30am, and a time window of 3600 seconds, then we would calculate that the window ends at 9:30am, and the times for the percentages would be 8:45am, 9am, and 9:15am respectively. Nice and simple.
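For illustration, the percentage times fall out of a simple DATEADD calculation; a minimal sketch (the @Percentages list is an assumption for the example):

DECLARE @Percentages TABLE (Percentage INT);
INSERT INTO @Percentages VALUES (25), (50), (75);

-- For an 8:30am event with a 3600-second window this yields 8:45am, 9am, and 9:15am.
SELECT PE.EventID,
       P.Percentage,
       CAST(DATEADD(SECOND, PE.EventWindow * P.Percentage / 100, PE.EventDtTm) AS TIME(2)) AS PercentageTm
FROM dbo.ProductEvents AS PE
CROSS JOIN @Percentages AS P;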

Well now, there are ~400 million rows in the ProductEvents table, of which only ~7 million are of interest, but combine that with the percentages and the result is 7 * 3 = 21 million rows. Actually it's worse than that: the scenario I'm looking at right now uses every 5% interval, so the total is 7 * 19 = 133 million rows.

Let's save this intermediate data in a table (variable or temp) as:

DECLARE @ProductPercentages TABLE
(
    ProductID INT,
    EventDtTm DATETIME2(7),
    EventDate DATE, -- calculated from ProductEvents.EventDtTm
    EventTime TIME(2), -- calculated from ProductEvents.EventDtTm
    Percentage INT,
    PercentageTm TIME(2),
    MetaID INT,
    PRIMARY KEY (ProductID, EventDtTm, Percentage)
);

Now we need to get the value from MinuteEventsDetail that is closest to, but less than or equal to, the calculated time for the percentage, and it must be later than the time of the ProductEvent itself, i.e. EventDtTm. First we need to get the MetaID, based on the ProductID and the MinuteEventDt matching the date of the EventDtTm, in order to query the MinuteEventsDetail table.
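For concreteness, populating the intermediate table might look something like this sketch (the @Percentages list and the EventTypeID filter value are assumptions for the example):

DECLARE @Percentages TABLE (Percentage INT);
INSERT INTO @Percentages VALUES (25), (50), (75);

INSERT INTO @ProductPercentages
    (ProductID, EventDtTm, EventDate, EventTime, Percentage, PercentageTm, MetaID)
SELECT PE.ProductID,
       PE.EventDtTm,
       CAST(PE.EventDtTm AS DATE),
       CAST(PE.EventDtTm AS TIME(2)),
       P.Percentage,
       CAST(DATEADD(SECOND, PE.EventWindow * P.Percentage / 100, PE.EventDtTm) AS TIME(2)),
       MD.MetaID
FROM dbo.ProductEvents AS PE
CROSS JOIN @Percentages AS P
INNER JOIN dbo.MinuteEventsMetaData AS MD
        ON MD.ProductID = PE.ProductID
       AND MD.MinuteEventDt = CAST(PE.EventDtTm AS DATE)
WHERE PE.EventTypeID = 1; -- hypothetical: the single event type of interest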

So I'd tried this:

SELECT PP.ProductID,
       PP.EventDate,
       PP.EventTime,
       PP.Percentage,
       PP.PercentageTm,
       PP.MetaID,
       vME.EventValue AS PercentageValue
FROM @ProductPercentages AS PP
CROSS APPLY
    ( SELECT TOP (1)
             ME.EventMinute,
             ME.EventValue
      FROM dbo.MinuteEventsDetail AS ME
      WHERE ME.MetaID = PP.MetaID
        AND ME.ValueTypeID = 3
        AND ME.EventMinute > PP.EventTime  -- strictly after the event itself
        AND ME.EventMinute <= PP.PercentageTm
      ORDER BY ME.EventMinute DESC ) AS vME;

As mentioned, @ProductPercentages could hold at least 21 million rows, and the MinuteEventsDetail table has over 60 billion rows, so the above query performs horrendously.

I've tried to break it down, and use an alternative approach but so far all I've achieved is to get frustrated.

Please help.

Thanks,

Michael MacGregor
Senior DBA


Accepted answer
  1. Jeffrey Williams 1,891 Reputation points
    2020-11-05T20:51:40.21+00:00

    I have no idea if this will work any better - you may have to adjust the partition:

     Select Top 1 With Ties
            PP.ProductID
          , PP.EventDate
          , PP.EventTime
          , PP.Percentage
          , PP.PercentageTm
          , PP.MetaID
          , PercentageValue = ME.EventValue
       From @ProductPercentages                    As PP
      Inner Join dbo.MinuteEventsDetail            As ME On ME.MetaID = PP.MetaID
      Where ME.ValueTypeID = 3
        And ME.EventMinute <= PP.PercentageTm
      Order By
            row_number() over(Partition By PP.ProductID, PP.EventDate, PP.EventTime Order By ME.EventMinute desc);
    

    I don't think you need to include the check on EventMinute against PP.EventTime - the partition should only include the event minutes associated with that EventDate and EventTime. You say those values are calculated... they may need to be computed and persisted in the source table, as well as indexed (not sure).
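
    For reference, one way to adjust the partition so that each percentage keeps its own closest row might be the following (a sketch only, not tested against your data):

         Select Top 1 With Ties
                PP.ProductID
              , PP.Percentage
              , PercentageValue = ME.EventValue
           From @ProductPercentages          As PP
          Inner Join dbo.MinuteEventsDetail  As ME On ME.MetaID = PP.MetaID
          Where ME.ValueTypeID = 3
            And ME.EventMinute <= PP.PercentageTm
          Order By
                row_number() over(Partition By PP.ProductID, PP.EventDtTm, PP.Percentage Order By ME.EventMinute desc);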


2 additional answers

  1. MelissaMa-MSFT 24,176 Reputation points
    2020-11-06T05:57:09.143+00:00

    Hi @Michael MacGregor ,

    Thank you so much for posting here.

    We recommend that you also post INSERT statements with some sample data and the expected result of the sample.

    Per my knowledge, you could try ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) or LEFT JOIN ... WHERE ... IS NULL instead of CROSS APPLY (SELECT TOP 1 .... ORDER BY DtTm DESC).
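
    For illustration, the LEFT JOIN ... IS NULL pattern finds the latest row per group by checking that no later sibling row exists; a minimal sketch against the detail table from the question:

        -- Latest minute per MetaID for value type 3: keep rows with no later sibling.
        SELECT ME.MetaID, ME.EventMinute, ME.EventValue
        FROM dbo.MinuteEventsDetail AS ME
        LEFT JOIN dbo.MinuteEventsDetail AS Later
               ON Later.MetaID = ME.MetaID
              AND Later.ValueTypeID = ME.ValueTypeID
              AND Later.EventMinute > ME.EventMinute
        WHERE ME.ValueTypeID = 3
          AND Later.MetaID IS NULL;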

    Best regards
    Melissa



  2. Erland Sommarskog 103.2K Reputation points MVP
    2020-11-06T07:55:21.953+00:00

    As Jeffrey says, a solution with row_number may work better. Generally there are two patterns for "give the most recent row". One is the cross apply pattern, the other is like this:

    ; WITH CTE AS (
        SELECT ..., row_number() OVER(PARTITION BY ... ORDER BY ... DESC) AS rowno
        FROM ...
    )
    SELECT ...
    FROM   CTE
    WHERE  rowno = 1
    

    The CROSS APPLY pattern often ends up with a loop join which runs the subquery for every row. Good if there are few partitions, but less so if there are many. The solution with row_number() typically results in a full scan of the inner table, in this case MinuteEvents, and if there are many partitions this is often the better solution.
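
    Applied to the tables in the question, that pattern might look like this (a sketch; column and filter names follow the DDL in the question):

        WITH CTE AS (
            SELECT PP.ProductID,
                   PP.Percentage,
                   ME.EventValue,
                   row_number() OVER (PARTITION BY PP.ProductID, PP.EventDtTm, PP.Percentage
                                      ORDER BY ME.EventMinute DESC) AS rowno
            FROM @ProductPercentages AS PP
            JOIN dbo.MinuteEventsDetail AS ME
              ON ME.MetaID = PP.MetaID
            WHERE ME.ValueTypeID = 3
              AND ME.EventMinute > PP.EventTime
              AND ME.EventMinute <= PP.PercentageTm
        )
        SELECT ProductID, Percentage, EventValue AS PercentageValue
        FROM CTE
        WHERE rowno = 1;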

    The CROSS APPLY pattern can sometimes backfire, because SQL Server applies the row goal optimistically: it runs a CI scan on the inner table in a CROSS APPLY, thinking that statistically it will find a matching row faster than if it used a non-clustered index + key lookup. But that plan is a very big gamble, because if there is no match at all, the entire inner table has to be scanned, and if that happens more than once... it's a disaster. Here various hints can help.
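
    For instance, the row goal can be suppressed with a documented hint (available from SQL Server 2016 SP1), or a seek can be forced on the inner table; whether either helps here would have to be tested:

        -- Sketch: rein in the optimistic row goal on the TOP (1) subquery.
        SELECT ...
        FROM @ProductPercentages AS PP
        CROSS APPLY ( SELECT TOP (1) ME.EventValue
                      FROM dbo.MinuteEventsDetail AS ME WITH (FORCESEEK)
                      WHERE ME.MetaID = PP.MetaID
                      ORDER BY ME.EventMinute DESC ) AS vME
        OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));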

    Since you don't share any plans, I can't make any specific comments, but I hope this input can be of some guidance. I have to apologise that I did not really have time to grasp your exact scenario in detail, so this answer is more of a general discussion of this type of query.