Plausibility check in Azure Data Factory

Poel van der, RE (Ron) 426 Reputation points
2022-10-25T15:38:40.543+00:00

Hi

I am curious to know if anyone out there build in 'plausibilty checks' in ADF.

On a daily base we load Parquet files from various producers in our SQLDB. Let me give two examples.
Example 1: Daily we get a snapshot of the payment agreements. Mostly that file contains around 8 million rows.
Example 2: Also daily we get the transactions done on those agreements. One day these might be 4 million transactions, another day 4.3 million transactions etcetera.
But around Christmas or other special events it can be more than 12 million rows.

Suppose on a normal day we only get 1 million transactions or 16 million agreements. Then almost for sure something went wrong on the producer side. And taking in that data might make a mess of our data warehouse.

So how do you, in a generic way, check that the number of records you receive are plausible?
And fail the pipeline if it is not?
How do you take care that it will take into consideration that on certain days/events like Christmas the number might be much higher?

Looking forward to your solutions

Regards
Ron

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,892 questions
0 comments No comments
{count} votes

Accepted answer
  1. ShaikMaheer-MSFT 38,301 Reputation points Microsoft Employee
    2022-10-28T13:17:06.793+00:00

    Hello @Poel van der, RE (Ron) ,

    Thanks for posting queries in Microsoft Q&A Platform.

    You can consider using dataflows and add your source file as source transformation and then you can get count of rows using aggregate transformation and then finally use cache sink and write output to activity. There by in pipeline you will get count and make a decision to perform copy or not.

    Kindly check below gif where it shows how to output count to activity.
    255057-1062260.gif

    Kindly check below video too where its explained in detail about writing output from dataflows to activities.
    Write Cache Sink to Activity Output in Azure Data factory

    Hope this helps. Please let me know if any further queries.

    ----------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    0 comments No comments

1 additional answer

Sort by: Most helpful
  1. Pratik Somaiya 4,201 Reputation points
    2022-10-26T06:24:48.293+00:00

    Hello @Poel van der, RE (Ron)

    If you have a list of events when you expect the record count to be higher then you can have a metadata or a date table where you can mark those events with a flag and then lookup those values

    In a switch activity you can have cases like if activity count (considering you are using a copy activity which can give us the record count) is between so and so range and if it satisfies the flag value then execute that case, else go to next one

    To fail the pipeline you can add a fail activity inside your case

    Does this sound feasible to you?

    0 comments No comments