Hi ADF users,
I am dealing with some non-obvious behaviour in a Data Factory data flow whose source folder contains a Delta table.
As far as I know the data flow engine is based on Apache Spark, so suppose I have the following setup:
source1: Delta Lake folder SRC with 3M rows and 3 partitions (country=IT, ES, UK), 1M rows in each partition folder
filter1: a filter on the partition column country that keeps a single country
sink1: Delta Lake folder DEST
In the lineage section of the debug data flow stage I can see:
source1: 3M rows
filter1: 1M rows
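For reference, this is roughly what I believe the equivalent plain Spark pipeline would look like. It is only a minimal PySpark sketch (not actual Data Flow Script), and the paths and the country value "IT" are placeholders I made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Hypothetical paths; in the real data flow these are the SRC and DEST folders.
SRC_PATH = "/mnt/datalake/SRC"    # Delta table partitioned by country (IT/ES/UK)
DEST_PATH = "/mnt/datalake/DEST"

spark = SparkSession.builder.getOrCreate()

# source1: read the whole Delta table (3M rows across 3 partition folders)
source1 = spark.read.format("delta").load(SRC_PATH)

# filter1: keep a single country (1M rows); country is the partition column
filter1 = source1.filter(col("country") == "IT")

# sink1: write the filtered rows to the destination Delta folder
filter1.write.format("delta").mode("overwrite").save(DEST_PATH)
```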
I am curious about the real meaning of these numbers.
If Spark pushes the filter down to the source, shouldn't the engine read only 1M rows?
Or is the ADF data flow engine unable to combine the full DAG, so that it executes every transformation separately?
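For comparison, in plain Spark I would check whether the partition filter is actually pushed down by inspecting the physical plan. Again this is only a hypothetical PySpark sketch with placeholder paths, and the exact plan wording varies by Spark/Delta version:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Same hypothetical source as in the sketch above.
df = (spark.read.format("delta")
      .load("/mnt/datalake/SRC")
      .filter(col("country") == "IT"))

# If the filter on the partition column is pushed down, the scan node in the
# physical plan lists it under "PartitionFilters", i.e. only the country=IT
# folder should actually be read.
df.explain(True)
# Illustrative output fragment (details differ between versions):
#   ... PartitionFilters: [isnotnull(country#10), (country#10 = IT)], ...
```

What I would like to understand is whether the 3M figure shown for source1 in the lineage is the number of rows actually read from storage, or just a per-transformation row count reported before the filter is applied.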
If someone can help me understand how the engine works behind the scenes, it would be really appreciated.