How is node allocation done for Spark pools in Synapse?

asciibscii
2023-01-23T08:39:53.75+00:00

Hi,

I am not quite sure I follow how the nodes of a Spark pool are allocated in Synapse. When attaching notebooks to a Spark pool, we have control over how many executors we want to allocate to each notebook. So if the pool has 5 nodes and we allocate 2 executors each to 2 different notebooks, how is the node allocation handled? Will both notebooks run in the same Spark instance, or is it possible that two Spark instances will be created? From my understanding, Spark instances are created based on node availability, not executor availability.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. Bhargava-MSFT, Microsoft Employee
    2023-01-23T22:55:05.6966667+00:00

    Hello @asciibscii ,

    Welcome to the MS Q&A platform.

    Yes, your understanding is correct. When attaching a notebook to a Spark pool, we control how many executors we want to allocate to it and what executor size to use. And Spark instances are created based on node availability.
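    For concreteness, a session's executor count and size can be set at the top of a Synapse notebook with the `%%configure` magic before the Spark session starts. The values below are only illustrative (they mirror the Small-size example that follows); the field names follow the Livy session settings, so please check the documentation for the exact options supported by your runtime.

```
%%configure -f
{
    "numExecutors": 2,
    "executorCores": 4,
    "executorMemory": "28g"
}
```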

    If we choose the Small node size (4 vCores / 28 GB) and 5 nodes, the total capacity is 4 × 5 = 20 vCores,

    and the maximum number of executors you can select in this case is 4 (when dynamically allocating executors, one node's worth of vCores is needed for the driver).

    Running 2 different notebooks with 2 executors each:

    In this case, executor size = Small (4 vCores, 28 GB).

    The first notebook will use 4 × 2 (executors) + 4 (one driver) = 12 vCores.

    Out of 20 vCores, 12 are used by the first notebook, leaving 8 vCores.

    So the first notebook uses 3 nodes (12 vCores ÷ 4 vCores per node = 3).

    That leaves only 2 nodes (8 vCores) for the second notebook.

    Now you submit the second notebook, and it requests the same resources as notebook 1 (12 vCores). Since there is no longer enough capacity in the Spark pool, an interactive notebook session will be rejected; if it is submitted as a batch job instead, it will be queued.

    If, on the other hand, 3 nodes were still available when notebook 2 was submitted, it would still fit in the pool's capacity and be processed by the same Spark instance (the instance that processed notebook 1).

    To answer your question: if there is capacity available for the second notebook, the same Spark instance will be used.
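    The vCore arithmetic above can be sketched in a few lines of Python. This is just a hypothetical accounting helper to make the numbers explicit, not a Synapse API; it assumes the Small size (4 vCores) for nodes, executors, and the driver, as in the example.

```python
# Hypothetical vCore accounting for the worked example above.
# Assumptions: Small size everywhere (4 vCores per node/executor/driver),
# a 5-node pool, and each notebook asking for 2 executors.

NODE_VCORES = 4                          # Small node size
POOL_NODES = 5
POOL_VCORES = NODE_VCORES * POOL_NODES   # 4 * 5 = 20 vCores total

def session_vcores(num_executors, executor_vcores=NODE_VCORES):
    # Driver size equals executor size, so count one extra
    # executor-sized unit for the driver.
    return (num_executors + 1) * executor_vcores

available = POOL_VCORES

# Notebook 1: 2 executors + 1 driver = 12 vCores.
nb1 = session_vcores(2)
available -= nb1                         # 20 - 12 = 8 vCores left

# Notebook 2 asks for the same 12 vCores, but only 8 remain, so an
# interactive session is rejected (a batch job would be queued).
nb2 = session_vcores(2)
accepted = nb2 <= available
print(available, accepted)
```

    Running this prints `8 False`: 8 vCores remain, which is not enough for the second 12-vCore session.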

    Please see the following document, which clearly explains how Spark instances are used in Synapse:

    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-concepts

    Please note: the driver size is equal to the executor size.

    Also, please check the following thread, where I explained the vCores concept:
    https://learn.microsoft.com/en-us/answers/questions/1011305/parallel-synapse-spark-application-run?childToView=1020997#comment-1020997

    I hope this clarifies your question. If you have any further questions, please let me know.

    If this answers your question, please consider accepting it by hitting the Accept answer button, as that helps the community.


