Understanding and Optimizing the Synapse Notebook (Apache Spark Pool)

Question

Hello Team,

We have a Synapse pipeline that runs a Notebook as an activity.

We tried the pipeline with multiple pool SKUs and recorded the timings below, but we are unable to work out which SKU is best to select for Production.

Pool Size                                   Time Taken
Large (3-200 Nodes) – Auto Scale enabled    ~17 mins
Large (3 Nodes) – Auto Scale disabled       ~30 mins
Medium (3 Nodes) – Auto Scale disabled      ~45 mins

Questions/Suggestions:

Given the times above for the different pool sizes, which one is recommended for Production workloads? How do we choose the most suitable one?
For one of the Notebook sessions, the execution details show a mismatch between the “Total Duration” and the “Playback” duration. Why is there so much difference? Is this expected?
Is it good practice to customize the number of executors? If so, what is the best way to do it?

Answer

Hello @Anonymous,
Thanks for the question and for using the Microsoft Q&A platform.

I would start with the workload you are trying to process, and whether the data that Spark is consuming is partitioned or not. For example, if you are processing a 100 GB CSV file on a small cluster (without partitioning), adding executors will not help, because there are not enough partitions to spread the work across them. I would also turn autoscale ON in production. You will also have to look into the internal details of how the data is processed. Have you gone through this link?
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-history-server
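To make the partitioning point concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It assumes Spark's default of 128 MB for spark.sql.files.maxPartitionBytes, treats an unpartitioned input as a single partition, and uses illustrative executor counts and 8 cores per executor. The function names are hypothetical helpers, and the real partition count in your job depends on the file format and other settings, so treat the numbers only as an illustration.

```python
import math

# Assumption: Spark's default spark.sql.files.maxPartitionBytes (128 MB).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024**2


def estimated_partitions(total_bytes, splittable=True,
                         max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Rough number of input partitions Spark could create for a file scan."""
    if not splittable:
        # An input that cannot be split is read as a single partition.
        return 1
    return max(1, math.ceil(total_bytes / max_partition_bytes))


def parallel_tasks(num_partitions, executors, cores_per_executor):
    """Tasks that can run at once: limited by partitions AND by core slots."""
    return min(num_partitions, executors * cores_per_executor)


def task_waves(num_partitions, executors, cores_per_executor):
    """How many rounds of tasks are needed to get through all partitions."""
    slots = executors * cores_per_executor
    return math.ceil(num_partitions / slots)


hundred_gb = 100 * 1024**3

# Unpartitioned input: one task no matter how many executors are added.
single = estimated_partitions(hundred_gb, splittable=False)
print(parallel_tasks(single, executors=2, cores_per_executor=8))    # 1
print(parallel_tasks(single, executors=50, cores_per_executor=8))   # 1

# Partitioned input: extra executors translate into extra parallelism.
many = estimated_partitions(hundred_gb)
print(many)                                                         # 800
print(parallel_tasks(many, executors=2, cores_per_executor=8))      # 16
print(parallel_tasks(many, executors=50, cores_per_executor=8))     # 400
print(task_waves(many, executors=2, cores_per_executor=8))          # 50
print(task_waves(many, executors=50, cores_per_executor=8))         # 2
```

In the first case the extra executors sit idle, which is why checking the partitioning comes before picking a bigger pool or tuning the executor count. The Spark history server at the link above shows the actual number of tasks per stage for your notebook run.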

Please do let me know how it goes.
Thanks
Himanshu
