Count and running total of records with consecutive dates (window?)

PeterSh 176 Reputation points
2022-08-25T06:02:17.7+00:00

I have a record set of around 10 million rows or so with some timestamp for a range of items. A row is added when a value is detected on a given day, meaning days with no values are missing.

For simplicity, let's use this data as an example.

234772-image.png

We have two meters, 5555 and 6666, recording values over four days. Meter 5555 had no data on the third day, so has no corresponding row.

What I'm trying to get to is something like the following.

234773-image.png

When I load the most recent row for a meter, I need to be able to see how many consecutive days I have values for and what the cumulative value is across those days.

I think I should be able to use window action to do this, but I'm not having any luck getting my head around it. Anything I come up with seems to need to reference the row being generated by my window function in previous rows - but I don't think it works that way.

Perhaps I need two window functions? I'm not sure.

Any help on this would be great, thanks.

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,196 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. PeterSh 176 Reputation points
    2022-08-26T06:20:05.363+00:00

    I managed to get something working, though I don't know how 'best practice' it is.

    Essentially, I'm passing the rows through three consecutive window transformations.

    235105-2022-08-26-15-54-49-window.png

    So, based on a test dataset with the data from the example:

    235141-2022-08-26-15-56-18-window.png

    Window1: Over meterId, Sort date, Range by Unbounded, Window columns as below (generates consecutiveFlag).

    235112-2022-08-26-15-58-46-window.png

    To break that down, if the date from the previous row plus one day matches the date of this row, return 0, else return 1. The trick here is to give us something to sum that wont change the result if the date is consecutive.

    Window1 preview:

    235055-2022-08-26-16-05-07-window.png

    Window2: Over meterId, Sort date, Range by Unbounded, Window consecutiveGroup = sum(consecutiveFlag)

    Window2 preview:

    235096-2022-08-26-16-06-38-window.png

    Because the flag only changes consecutiveGroup when the row is not consecutive, this gives us something else to group on.

    Window3: Over meterId & consecutiveGroup, Sort date, Range by Unbounded, Window as below.

    235151-2022-08-26-16-11-26-window.png

    Simple stuff, but only possible because we generated another criteria to group.

    Window3 preview:

    235048-2022-08-26-16-13-14-window.png

    And that's it, but just to match the sample output, I sort on date & meterId, then use select to drop the interim rows to get the final output.

    235113-2022-08-26-16-15-22-window.png

    I hope someone finds this helpful. I was struggling to find a solution without taking it out of data factory.

    1 person found this answer helpful.