How is node allocation done for Spark pools in Synapse?

asciibscii
2023-01-23T08:39:53.75+00:00

Hi,

I am not quite sure I follow how the nodes of a Spark pool are allocated in Synapse. When attaching notebooks to a Spark pool, we have control over how many executors we want to allocate to each notebook. So if the pool has 5 nodes and we allocate 2 executors each to 2 different notebooks, how is the node allocation handled? Will both notebooks run in the same Spark instance, or is it possible that two Spark instances will be created? From my understanding, Spark instances are created based on node availability, not executor availability.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. Bhargava-MSFT, Microsoft Employee
    2023-01-23T22:55:05.6966667+00:00

    Hello @asciibscii ,

    Welcome to the MS Q&A platform.

    Yes, your understanding is correct. When attaching a notebook to a Spark pool, we control how many executors we want to allocate to it and what executor size to use. And Spark instances are created based on node availability.
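    For concreteness, a session's executor count and size can be set at the top of a Synapse notebook with the `%%configure` magic before the Spark session starts. The values below are only illustrative (they mirror the Small-size example that follows); the field names follow the Livy session settings, so please check the documentation for the exact options supported by your runtime.

```
%%configure -f
{
    "numExecutors": 2,
    "executorCores": 4,
    "executorMemory": "28g"
}
```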

    If we choose the Small node size (4 vCores / 28 GB) and 5 nodes, the total capacity is 4 × 5 = 20 vCores,

    and the maximum number of executors you can select in this case is 4 (when dynamically allocating executors, one node's worth of vCores is needed for the driver).

    Running 2 different notebooks with 2 executors each:

    In this case, executor size = Small (4 vCores, 28 GB).

    The first notebook will use 4 × 2 (executors) + 4 (one driver) = 12 vCores.

    Out of 20 vCores, 12 are used by the first notebook, leaving 8 vCores.

    So the first notebook uses 3 nodes (12 vCores ÷ 4 vCores per node = 3).

    That leaves only 2 nodes (8 vCores) for the second notebook.

    Now you submit the second notebook, and it requests the same resources as notebook 1 (12 vCores). Since there is no longer enough capacity in the Spark pool, an interactive notebook session will be rejected; if it is submitted as a batch job instead, it will be queued.

    If, on the other hand, 3 nodes were still available when notebook 2 was submitted, it would still fit in the pool's capacity and be processed by the same Spark instance (the instance that processed notebook 1).

    To answer your question: if there is capacity available for the second notebook, the same Spark instance will be used.
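    The vCore arithmetic above can be sketched in a few lines of Python. This is just a hypothetical accounting helper to make the numbers explicit, not a Synapse API; it assumes the Small size (4 vCores) for nodes, executors, and the driver, as in the example.

```python
# Hypothetical vCore accounting for the worked example above.
# Assumptions: Small size everywhere (4 vCores per node/executor/driver),
# a 5-node pool, and each notebook asking for 2 executors.

NODE_VCORES = 4                          # Small node size
POOL_NODES = 5
POOL_VCORES = NODE_VCORES * POOL_NODES   # 4 * 5 = 20 vCores total

def session_vcores(num_executors, executor_vcores=NODE_VCORES):
    # Driver size equals executor size, so count one extra
    # executor-sized unit for the driver.
    return (num_executors + 1) * executor_vcores

available = POOL_VCORES

# Notebook 1: 2 executors + 1 driver = 12 vCores.
nb1 = session_vcores(2)
available -= nb1                         # 20 - 12 = 8 vCores left

# Notebook 2 asks for the same 12 vCores, but only 8 remain, so an
# interactive session is rejected (a batch job would be queued).
nb2 = session_vcores(2)
accepted = nb2 <= available
print(available, accepted)
```

    Running this prints `8 False`: 8 vCores remain, which is not enough for the second 12-vCore session.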

    Please see the following document, which clearly explains how Spark instances are used in Synapse:

    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-concepts

    Please note: the driver size is equal to the executor size.

    Also, please check the following thread, where I explained the vCores concept:
    https://learn.microsoft.com/en-us/answers/questions/1011305/parallel-synapse-spark-application-run?childToView=1020997#comment-1020997

    I hope this clarifies your question. If you have any further questions, please let me know.

    If this answers your question, please consider accepting it by hitting the Accept answer button, as that helps the community.


