Synapse doesn't allow using all quota, queues jobs and restarts instead of failing

Rafi Trad 56 Reputation points

I have several issues when using Synapse to submit Apache Spark 3.1 jobs that use only one executor. I disabled autoscaling and dynamic allocation of executors, i.e., my Apache Spark pool scale settings are:

  • Node size: XLarge (32 vCores, 256 GB)
  • Autoscale: disabled
  • Number of nodes: 3 (couldn't choose 1, but specified 1 executor in the job definition itself)
  • Dynamically allocate executors: disabled
  • Intelligent cache size: 50% (default)
  • Subscription quota, Apache Spark (CPU vCore) per workspace: 32

I am using the Spark 3.1 library but not distributing work over several executors, so I use only one executor with many cores.

The Issues:

1- When I try to run the job with XLarge executor size (32 vCores / 224GB memory), 1 executor only, I get the error:

Failed to submit the Spark job
"code": "SparkJobDefinitionActionFailedWithBadRequest",
"message": "Spark job batch request for workspace asa-synapseworkspace, spark compute synspSpark3p1 with session id null failed with a bad request. Reason: {\n \"TraceId\": \"x-x-x-x | client-request-id : x-x-x-x\",\n \"Message\": \"Your Spark job requested 64 vcores. However, the workspace has a 32 core limit. Try reducing the numbers of vcores requested or increasing your vcore quota.\"\n}",
"target": null,
"details": null,
"error": null

Why? What should I do to use XLarge executors? Re-increase the quota to 64 vCores?

2- When I submit the job to a smaller executor (Large, 16 vCores), it gets queued for hours without being submitted. The same happened with 8 vCores executors. I then scaled down the pool to Large node size instead of XLarge and tried to submit the job to Large and Medium executors, to no avail: the job stayed stuck in queued status.
Finally, I scaled down the Apache Spark pool to Medium node sizes, and tried to submit the job to a Medium executor, which succeeded and the job got submitted, but that is too small of an executor for the workload.

I should be able to submit jobs to XLarge or Large executors at least. What is happening here that is keeping the jobs in queued status with specific Apache Spark Pool Node Size settings?

3- I continued to run the job with my only remaining option, a Medium executor, and understandably hit a memory error (not enough RAM). The error occurred after 36 hours, but the job restarted instead of failing, and the logs from the previous run attempt were lost.

Why is that? I expect the job to shut down when there is an error, not to restart and continue billing.

Thank you.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. Rafi Trad 56 Reputation points

    All the issues apart from issue 3, which needs time-consuming investigation, were related to quota limitations. We resolved them by increasing the quota from 32 to 100 vCores per workspace per subscription. This allowed us to run jobs on 32 vCores executor nodes, rather than on 8 vCores nodes only.

    The thing to keep in mind is that there will always be one executor locked away as a driver, and that the maximum usable executor size is directly tied to the quota. With a quota of 32, I was hitting a wall when trying to run Large (16 vCores) or XLarge (32 vCores) jobs because 8 vCores was the maximum usable size, given that 32 quota = 4 * 8. The 4 comes from the minimum pool size of 3 executors, which I used, plus 1 driver.
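    As a rough illustration of the driver overhead, the 64 vCores figure in the error message from issue 1 is what you get if the driver is counted at the same size as the single XLarge executor. The sketch below is only a back-of-the-envelope check under that assumption, not Synapse's actual validation logic; the function names are made up for the example, and it does not explain the queued Large and Medium jobs.

```python
# Back-of-the-envelope quota check.
# Assumption (not confirmed by Synapse docs here): the request counts
# one driver sized like the executors, plus the executors themselves.

def requested_vcores(cores_per_node: int, num_executors: int) -> int:
    """vCores requested by a job with one driver plus `num_executors` executors."""
    driver_count = 1
    return (driver_count + num_executors) * cores_per_node


def fits_quota(requested: int, quota: int) -> bool:
    """True when the request does not exceed the workspace vCore quota."""
    return requested <= quota


xlarge_cores = 32  # XLarge executor size from the question
request = requested_vcores(xlarge_cores, num_executors=1)

print(request)                  # 64, the figure in the error message
print(fits_quota(request, 32))  # False: rejected with the original quota
print(fits_quota(request, 100)) # True: accepted after the quota increase
```

    Under this assumption, a single-executor XLarge job needs a quota of at least 64 vCores, which matches both the rejection at 32 and the successful runs at 100.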

    As for the indefinite queued status of jobs, this is likely a validation mishap in Synapse, and MS will try to improve how Synapse Spark jobs behave in such a situation and make it less confusing. Increasing the quota allowed the previously queued jobs to be submitted.
