Hello,
There is a 50 vCore limit on the workspace; that may be causing your issue. When you run all of your notebooks in parallel, the 50 vCores are consumed by roughly 20 notebooks, so the remaining notebooks are queued. They run once the running notebooks finish and vCores are freed.
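The queuing behaviour described above can be sanity-checked with back-of-envelope arithmetic. This is only a sketch: the node size (and therefore the vCores per job) is an assumption here, and the actual quota for a given workspace may differ, so check both in the Azure portal.

```python
# Back-of-envelope queuing arithmetic for the workspace vCore cap.
# All numbers below are illustrative assumptions -- verify your actual
# workspace quota and Spark pool node size before relying on them.
workspace_vcore_cap = 50      # example cap discussed in this thread
vcores_per_node = 4           # ASSUMPTION: a Small node size
nodes_per_job = 3             # 1 driver + 2 workers, per the question
total_jobs = 60

vcores_per_job = vcores_per_node * nodes_per_job      # 12
runnable_now = workspace_vcore_cap // vcores_per_job  # jobs admitted immediately
queued = total_jobs - runnable_now                    # jobs waiting for freed vCores

print(runnable_now, queued)  # 4 56
```

With these assumed numbers only 4 jobs fit under the cap; the exact concurrency you observe depends on the real node size and quota, which is why raising the workspace vCore quota (or shrinking the per-job footprint) changes how many notebooks run at once.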
Spark pool - multiple notebooks running parallel
We have an environment where we want to execute around 60 independent processes that load data via notebooks (they largely use the same notebook but work on different data). To parallelize as much as possible, we created a Spark pool with 130 nodes. None of the data sets are large, so we would not expect any one process to use more than 3 nodes (1 driver, 2 workers), and we have never noticed anything to the contrary. With a 130-node pool, we should therefore be able to run about 40 processes concurrently.
Control of the Spark process execution is done via Data Factory, which determines what needs to run and kicks off the notebooks.
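One way to reduce the per-notebook session overhead when many small, independent loads share the same notebook logic is to fan them out inside a single Spark session rather than launching one notebook run per data set. A minimal sketch, assuming a hypothetical `load_one` stand-in for the real per-data-set work (in Synapse this is roughly where a call such as `mssparkutils.notebook.run(...)` would go):

```python
# Sketch: fan out many small, independent loads inside ONE Spark session.
# `load_one` is a hypothetical placeholder for the real logic
# (read -> transform -> write); it is NOT part of any Synapse API.
from concurrent.futures import ThreadPoolExecutor

def load_one(dataset: str) -> str:
    # Placeholder for the actual load of one data set.
    return f"loaded {dataset}"

datasets = [f"dataset_{i:02d}" for i in range(60)]

# Throttle to what the driver can comfortably schedule
# (8 concurrent workers here is an assumption, not a recommendation).
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(load_one, datasets))

print(len(results))  # 60
```

This trades Data Factory-level fan-out (one Spark session per notebook run) for thread-level fan-out inside one session, which can sidestep per-session vCore allocation for very small jobs.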
What I find...
- Only around 20 of the processes kick off, with additional processes starting once others have finished
- If we run only one process, it takes around 3-5 minutes, but when executing multiple, the processing time extends so that those 20 processes take 15+ minutes each. If these are meant to be independent nodes, why do they appear to impact each other?
How do we get better performance so that we can run 40 processes concurrently without them impacting each other? Creating 40 Spark pools seems a ridiculous resolution; are there other settings we can adjust to get better performance?
Azure Synapse Analytics
1 answer
M Saad 36 Reputation points
2022-09-13T07:09:53.22+00:00