I am hitting several issues when submitting Apache Spark 3.1 jobs from Synapse, using a single executor only. I have disabled autoscaling and dynamic allocation of executors, i.e., my Apache Spark pool scale settings are:
- Node size: XLarge (32 vCores, 256 GB)
- Autoscale: disabled
- Number of nodes: 3 (couldn't choose 1, but specified 1 executor in the job definition itself)
- Dynamically allocate executors: disabled
- Intelligent cache size: 50% (default)
- Subscription quota: Apache Spark (CPU vCore) per workspace / 32
I am using a Spark 3.1 library that does not distribute work across executors, so I run a single executor with many cores.
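For context, my single-executor setup amounts to Spark conf along these lines (a sketch using standard Spark properties; the exact field names in the Synapse job definition may differ, and the memory value reflects the XLarge 224 GB figure):

```
spark.dynamicAllocation.enabled  false
spark.executor.instances         1
spark.executor.cores             32
spark.executor.memory            224g
```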
The Issues:
1- When I try to run the job with the XLarge executor size (32 vCores / 224 GB memory) and only 1 executor, I get this error:
Failed to submit the Spark job
Error:
{
"code": "SparkJobDefinitionActionFailedWithBadRequest",
"message": "Spark job batch request for workspace asa-synapseworkspace, spark compute synspSpark3p1 with session id null failed with a bad request. Reason: {\n \"TraceId\": \"x-x-x-x | client-request-id : x-x-x-x\",\n \"Message\": \"Your Spark job requested 64 vcores. However, the workspace has a 32 core limit. Try reducing the numbers of vcores requested or increasing your vcore quota.\"\n}",
"target": null,
"details": null,
"error": null
}
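If I understand the accounting correctly, Synapse counts the driver against the workspace quota in addition to the executors, which would explain 64. A minimal sketch of that arithmetic (assuming the driver is allocated at the same size as the executor, which is my guess, not something the error message states):

```python
# Hypothetical vCore accounting: one driver plus N executors.
# Assumption: the driver container is sized the same as an executor.
XLARGE_VCORES = 32

def requested_vcores(num_executors: int, executor_vcores: int,
                     driver_vcores: int) -> int:
    """Total vCores a job submission asks for: driver + executors."""
    return driver_vcores + num_executors * executor_vcores

total = requested_vcores(num_executors=1,
                         executor_vcores=XLARGE_VCORES,
                         driver_vcores=XLARGE_VCORES)
print(total)  # 64 -- which would exceed the 32-vCore workspace limit
```

If that is the right model, even a single XLarge executor can never fit under a 32-vCore quota.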
Why? What should I do to use XLarge executors? Request a quota increase to 64 vCores?
2- When I submit the job to a smaller executor (Large, 16 vCores), it sits queued for hours without being submitted. The same happened with 8-vCore executors. I then scaled the pool down to Large node size instead of XLarge and tried submitting to Large and Medium executors, to no avail: the job stayed stuck in Queued status.
Finally, I scaled the Apache Spark pool down to Medium node size and submitted the job to a Medium executor. That succeeded, but a Medium executor is too small for the workload.
I should be able to submit jobs to XLarge or, at least, Large executors. What is keeping the jobs in Queued status with specific Apache Spark pool node size settings?
3- I continued running the job with the only option left, a Medium executor, and, as expected, hit a memory error (not enough RAM). The error occurred after 36 hours, but instead of failing, the job restarted, and the logs from the previous run attempt were lost.
Why is that? I expect the job to shut down when there is an error, not restart and continue billing.
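Is there a setting to make the application fail fast instead of retrying? Assuming Synapse honors standard Spark-on-YARN properties (which I have not verified), I would expect something like this to disable the automatic re-attempt:

```
spark.yarn.maxAppAttempts  1
```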
Thank you.