Databricks vs ADF: counting rows in Delta Parquet files

arkiboys 9,696 Reputation points
2022-06-03T08:34:21.53+00:00

Hello,
Using ADF, each day I load data into a day folder as Delta Parquet files...
To count the rows in the sink Delta Parquet files, I use a data flow sink. I also run Databricks PySpark to count the rows in the same Delta Parquet files.
On the very first run, I get the same count from both ADF and Databricks...

Then, for testing, I update the sink Delta files by re-running the same load for the day, and I repeat the process above to count the rows in the Delta Parquet files using ADF and Databricks.
It turns out that ADF shows the correct count for the sink Delta Parquet files, but Databricks shows double the number.
If I rerun the load for the day a third time, Databricks shows triple the count, whereas ADF still shows the original count, which is correct.
Note that for testing I use the same load, so there are no new updates or inserts.

Any thoughts on why Databricks shows duplicated rows?

Thank you

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. HimanshuSinha-msft 19,476 Reputation points Microsoft Employee
    2022-06-06T20:19:50.237+00:00

    Hello @arkiboys,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the question is why the row counts differ between ADF and ADB. Please let us know if that is not accurate.
    When you say ADF, for clarity, I think you mean a data flow. I have been following your threads, and I believe you used UPSERT logic in the transformation. If so, is the same transformation also implemented in the ADB notebook scripts?
    To me it looks like ADB is just loading all the source data without the transformation.
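
To make the distinction in the answer concrete, here is a minimal pure-Python sketch of what happens when the same daily load is run three times, once as a blind append and once as an upsert. It is not the poster's pipeline: the row shape, the integer key column, and the names `append_load`, `upsert_load`, and `daily_batch` are all hypothetical, chosen only to illustrate why one path could report triple the rows while the other stays at the original count.

```python
from typing import Dict, List, Tuple

# Hypothetical row shape: (key, value). The key stands in for whatever
# business key the upsert in the data flow matches on.
Row = Tuple[int, str]


def append_load(sink: List[Row], batch: List[Row]) -> List[Row]:
    """Blind append: every rerun adds the whole batch again."""
    return sink + batch


def upsert_load(sink: Dict[int, str], batch: List[Row]) -> Dict[int, str]:
    """Upsert: update rows whose key already exists, insert the rest."""
    merged = dict(sink)
    for key, value in batch:
        merged[key] = value
    return merged


# Illustrative data only; the same batch is reloaded on every run.
daily_batch: List[Row] = [(1, "a"), (2, "b"), (3, "c")]

appended: List[Row] = []
upserted: Dict[int, str] = {}
for _ in range(3):  # run the identical load three times
    appended = append_load(appended, daily_batch)
    upserted = upsert_load(upserted, daily_batch)

print(len(appended))  # 9: triple the original count
print(len(upserted))  # 3: unchanged, because every key already existed
```

If the notebook side turns out to be doing the equivalent of `append_load`, one option would be to apply the same key-based logic there, for example with a Delta Lake `MERGE` on the same key columns. Which columns those are depends on the data flow, which this thread does not show.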

    Please do let me know if you have any queries.
    Thanks
    Himanshu

