question

RaGus-6891 avatar image
0 Votes"
RaGus-6891 asked RaGus-6891 answered

push down filter when reading delta folder using datafactory dataflow

Hi ADF users,

I am dealing with a not obvious behaviour using datafactory dataflow with a source folder that contains a delta table.

As far as I know dataflow engine is based on apache spark so if I have the following setup

source1: deltalake folder SRC with 3M rows and 3 partitions (country=IT,ES,UK) with 1M rows or each folder
filter1: country='IT'
sink1: deltalake folder DEST

In the stage section within debug dataflow i can see in the lineage section

  • source1: 3M rows

  • filter1: 1M rows

  • sink1: -


I am curious about the real meaning of these numbers:

  • If spark push down the filter the engine should read only 1 M rows, isn't it ?

  • Or ADF dataflow engine is not able to combine the full dag and it execute every task separately?

If someone can help me to understand how the engine behind the scenes works it would be really appreciated.

Thanks
Ra



azure-data-factory
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

HimanshuSinha-MSFT avatar image
0 Votes"
HimanshuSinha-MSFT answered

Hello @RaGus-6891,
Thanks for the question and using MS Q&A platform.
As we understand the ask here is If spark push down the filter the engine should read only 1 M rows , please do let us know if its not accurate.
Spark is smart enough to the determine and send the data for each partion to each executers ( if they exists ) . In this sceanrio I am sure that the 1 M record is what is sent to the executers . Spark also does something called lazy evaluation which means that if you put multiple filters ( internally its a where clause ) , it waits till the last minute and creates a plan to takes care of all all the filter and execute only once , this makes the whole operation effecient . I think this is also refred to as push down predicate .

Please do let me if you have any queries.
Thanks
Himanshu


  • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how

  • Want a reminder to come back and check responses? Here is how to subscribe to a notification

  • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators


5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

RaGus-6891 avatar image
0 Votes"
RaGus-6891 answered

Hi,

you are right, I am trying to find out a specific answer in order to be sure we are not scanning 3M rows nut 1M rows. Within this case I am sure due to the fact I am using a filter on a partition (country='IT')


My concern here is that the summary on dataflow lineage is reporting "source1: 3M rows" so it sounds the underlying engine is scanning all the rows.

Please can you point out a doc link or a feedback from product group in order to be sure we are not reading all the rows ?
We need to be sure (101%) we are really pushing down the predicates.

Can you provide a feedback ?

R.

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.