Synapse doesn't allow using all quota, queues jobs and restarts instead of failing

Rafi Trad 61 Reputation points
2022-11-17T11:57:16.087+00:00

I have several issues when using Synapse to submit Apache Spark 3.1 jobs, where I use only one executor. I disabled autoscaling and dynamic allocation of executors, i.e., my Apache Spark pool scale settings are:

  • Node size: XLarge (32 vCores, 256 GB)
  • Autoscale: disabled
  • Number of nodes: 3 (couldn't choose 1, but specified 1 executor in the job definition itself)
  • Dynamically allocate executors: disabled
  • Intelligent cache size: 50% (default)
  • Subscription Apache Spark (CPU vCore) quota per workspace: 32

I am using the Spark 3.1 runtime but not distributing work over several executors, so I use only one executor with many cores.
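
For context, here is roughly how the job is configured, as a PySpark sketch of the equivalent Spark properties (the executor memory value is approximate, and in practice these are set through the Spark job definition rather than in code):

    from pyspark.sql import SparkSession

    # Sketch of the single-executor setup: dynamic allocation off, one executor,
    # sized to fill an XLarge node (exact memory figure is approximate).
    spark = (
        SparkSession.builder
        .appName("single-executor-job")
        .config("spark.dynamicAllocation.enabled", "false")  # no dynamic allocation
        .config("spark.executor.instances", "1")              # one executor only
        .config("spark.executor.cores", "32")                 # XLarge node: 32 vCores
        .config("spark.executor.memory", "200g")              # leave headroom below 224 GB
        .getOrCreate()
    )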

The Issues:

1- When I try to run the job with the XLarge executor size (32 vCores / 224 GB memory) and 1 executor only, I get the error:

Failed to submit the Spark job
Error:
{
"code": "SparkJobDefinitionActionFailedWithBadRequest",
"message": "Spark job batch request for workspace asa-synapseworkspace, spark compute synspSpark3p1 with session id null failed with a bad request. Reason: {\n \"TraceId\": \"x-x-x-x | client-request-id : x-x-x-x\",\n \"Message\": \"Your Spark job requested 64 vcores. However, the workspace has a 32 core limit. Try reducing the numbers of vcores requested or increasing your vcore quota.\"\n}",
"target": null,
"details": null,
"error": null
}

Why? What should I do to use XLarge executors? Should I increase the vCore quota to 64?

2- When I submit the job with a smaller executor size (Large, 16 vCores), it stays queued for hours without being submitted. The same happened with 8-vCore executors. I then scaled the pool down to the Large node size instead of XLarge and tried to submit the job with Large and Medium executors, to no avail: the job remained stuck in the Queued status.
Finally, I scaled the Apache Spark pool down to the Medium node size and submitted the job with a Medium executor, which succeeded, but that executor is too small for the workload.

I should be able to submit jobs with XLarge or at least Large executors. What is keeping the jobs stuck in the Queued status with specific Apache Spark pool node size settings?

3- I continued running the job with the only option left, a Medium executor, and understandably hit a memory error (not enough RAM). The error occurred after 36 hours, but instead of failing, the job restarted, and the logs from the previous run attempt were lost.

Why is that? I expect the job to shut down when there is an error, not to restart and continue billing.
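
One mitigation I am considering (an assumption on my part; I have not verified that Synapse honours it) is capping YARN application attempts in the job's Spark configuration, so the batch fails outright instead of being re-attempted:

    # Hypothetical mitigation, not verified on Synapse: limit the application to a
    # single attempt. This would need to be set at submit time in the job
    # definition's Spark configuration, not from inside a running session.
    extra_spark_conf = {
        "spark.yarn.maxAppAttempts": "1",
    }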

Thank you.


1 answer

  1. Rafi Trad 61 Reputation points
    2022-11-18T12:35:04.357+00:00

    All the issues apart from issue 3, which needs time-consuming investigation, were related to quota limitations. We resolved them by increasing the quota from 32 to 100 vCores per workspace per subscription. This allowed us to run jobs on 32-vCore executor nodes instead of only 8-vCore ones.

    The thing to keep in mind is that there will always be one node reserved for the driver, and the maximum usable executor size is directly tied to the quota limit. I was hitting a wall with the 32-vCore quota when trying to run Large (16 vCores) or XLarge (32 vCores) jobs because 8 vCores was the maximum usable executor size: 32 quota = 4 * 8, where 4 is the minimum pool size of 3 executors (which I used) plus 1 driver.
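
    To make that arithmetic concrete, a small worked example using the numbers above:

        # Worked example of the quota arithmetic described above.
        quota_vcores = 32        # workspace vCore quota before the increase
        min_pool_nodes = 3       # minimum number of executor nodes in the pool
        driver_nodes = 1         # one node is always reserved for the driver
        total_nodes = min_pool_nodes + driver_nodes       # = 4

        max_vcores_per_executor = quota_vcores // total_nodes
        print(max_vcores_per_executor)  # 8, so Large (16) and XLarge (32) never fit under a 32-vCore quota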

    As for the indefinite Queued status of jobs, this is likely a validation gap in Synapse; Microsoft will try to improve the behaviour of Synapse Spark jobs in this situation and make it less confusing. Increasing the quota allowed the previously queued jobs to be submitted.
