Thanks for your reply.
For Parallel Data Flows, see the recommendations at concepts-data-flow-performance-pipelines.
Additionally, as suggested by Expert @MarkKromer-MSFT here:
1: If you execute data flows in a pipeline in parallel, ADF will spin up separate Spark clusters for each based on the settings in your Azure Integration Runtime attached to each activity.
2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.
3: Another option is to execute the activities in serial in the pipeline. If you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs), but you will still get a brand-new Spark context for each execution.
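To make the TTL mentioned in option 3 concrete, here is a minimal sketch assuming the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and IR names are placeholders, and the compute values are illustrative only:

```python
# Hedged sketch: configure a data-flow TTL on an Azure Integration Runtime so
# serial data flow runs can reuse warm compute. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# With time_to_live > 0, ADF keeps the data flow compute (VMs) alive for that
# many minutes after a run finishes, so the next serial execution skips most of
# the cluster start-up time. Each execution still gets a fresh Spark context.
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="AutoResolve",  # or a specific region
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",
                core_count=8,
                time_to_live=15,  # minutes of warm compute after a run
            ),
        )
    )
)

client.integration_runtimes.create_or_update(
    "<resource-group>", "<factory-name>", "DataFlowIRWithTTL", ir
)
```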
All are valid practices, and the one you choose should be driven by the requirements of your ETL process.
No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.
No. 2 could be more difficult to follow logically and doesn't give you much reusability.
No. 1 is really similar to No. 3, but you run them all in parallel. Of course, not every end-to-end process can run in parallel. You may require a data flow to finish before starting the next, in which case you're back in No. 3's serial mode.
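For illustration, here is a hedged sketch of how options 1 and 3 differ in the pipeline definition itself, again assuming the azure-mgmt-datafactory Python SDK; the data flow and pipeline names (OrdersDataFlow, CustomersDataFlow, and so on) are hypothetical:

```python
# Hedged sketch: two Execute Data Flow activities wired for option 1 (parallel)
# versus option 3 (serial). All resource and activity names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
    ActivityDependency,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

flow_a = ExecuteDataFlowActivity(
    name="TransformOrders",
    data_flow=DataFlowReference(type="DataFlowReference",
                                reference_name="OrdersDataFlow"),
)
flow_b = ExecuteDataFlowActivity(
    name="TransformCustomers",
    data_flow=DataFlowReference(type="DataFlowReference",
                                reference_name="CustomersDataFlow"),
)

# Option 1: no dependency between the activities, so ADF runs them in parallel
# and each one gets its own Spark cluster from the attached Azure IR.
parallel_pipeline = PipelineResource(activities=[flow_a, flow_b])

# Option 3: chain the second activity on the first, so they run serially and
# can reuse warm compute if the Azure IR has a TTL configured.
flow_b_serial = ExecuteDataFlowActivity(
    name="TransformCustomers",
    data_flow=DataFlowReference(type="DataFlowReference",
                                reference_name="CustomersDataFlow"),
    depends_on=[ActivityDependency(activity="TransformOrders",
                                   dependency_conditions=["Succeeded"])],
)
serial_pipeline = PipelineResource(activities=[flow_a, flow_b_serial])

client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "ParallelDataFlows", parallel_pipeline
)
client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "SerialDataFlows", serial_pipeline
)
```

The only structural difference between the two pipelines is the depends_on chain on the second activity; everything else, including which Azure IR each activity uses, is configured on the activity or data flow itself.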
Another useful Stack Overflow thread: adf-best-practice-for-dataflow-in-parallel
Depending on other factors, you may want to review Granular Billing for Azure Data Factory to analyze the costs involved.
Here is a video link demonstrating Granular Billing.
Let me know if this helps, and please reach out if you have further queries.