@Agarwal, Abhishek - Thanks for the question and for using the MS Q&A platform.
This error message indicates that the job failed due to a stage failure caused by a lost task. The most recent failure was an ExecutorLostFailure: executor 36 exited, caused by one of the running tasks. This typically happens when containers exceed their memory thresholds or when there are network issues.
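If the root cause is memory pressure inside the containers or network timeouts, a few standard Spark properties can help, in addition to the cluster-level options below. This is only a minimal sketch, assuming you create the Spark session yourself and can set these properties before the job starts; the values are illustrative and need tuning for your workload:

```python
# Minimal sketch: Spark properties that commonly mitigate ExecutorLostFailure.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-loss-mitigation-sketch")
    # Give each executor more headroom for off-heap/overhead memory,
    # a common cause of containers being killed for exceeding limits.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    # Be more tolerant of slow networks before declaring an executor lost.
    # The heartbeat interval must stay well below spark.network.timeout.
    .config("spark.network.timeout", "600s")
    .config("spark.executor.heartbeatInterval", "60s")
    .getOrCreate()
)
```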
To solve this problem, you can use the following options:
Option-1: Use a powerful cluster (both driver and executor nodes have enough memory to handle big data) to run your data flow pipelines, with "Compute type" set to "Memory optimized".
Option-2: Use a larger cluster size (for example, 48 cores) to run your job.
Option-3: Check the driver logs for WARN messages to identify any potential issues with the driver.
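For Option-3, a quick way to surface the relevant lines is to scan the active driver log from a notebook cell. This is only a sketch: the log path below is an assumption based on the usual layout of a Databricks driver node and may differ on your cluster.

```python
# Minimal sketch for Option-3: scan the driver log for WARN/ERROR lines.
LOG_PATH = "/databricks/driver/logs/log4j-active.log"  # assumed location; adjust for your cluster

with open(LOG_PATH, "r", errors="ignore") as log_file:
    for line in log_file:
        if " WARN " in line or " ERROR " in line:
            print(line.rstrip())
```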
If none of these options work, please provide more information about your specific scenario so that I can assist you better.
Here are some suggestions that could help you reduce your execution time:
- **Increase the number of worker nodes:** Adding more worker nodes to your cluster lets you process data in parallel and reduces the execution time of your Spark jobs. Choose the worker type based on the amount of memory and CPU your workload requires.
- **Increase the number of cores per worker node:** More cores per worker node let you process more data in parallel, which also reduces execution time. Keep in mind that this increases the cost of your cluster.
- **Increase the amount of memory per worker node:** More memory per worker node lets you process larger datasets and reduces spilling to disk during shuffles, which further shortens execution time.
- **Use a larger driver node:** If your PySpark code needs a lot of driver memory (for example, because it collects results to the driver), a larger driver node helps you avoid out-of-memory errors and improves the performance of your Spark jobs.
- **Use autoscaling:** Databricks can automatically add or remove worker nodes based on the workload. This helps you optimize the cost of your cluster while ensuring you have enough resources to process your data (a sketch follows this list).
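As a sketch of the autoscaling suggestion, the cluster below is created through the Databricks Clusters REST API (`api/2.0/clusters/create`) with an `autoscale` range instead of a fixed worker count. The workspace URL, token, runtime version, and node type are placeholders, not values from your environment:

```python
# Minimal sketch: create an autoscaling Databricks cluster via the Clusters REST API.
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder; prefer a secret store

cluster_spec = {
    "cluster_name": "etl-autoscaling",
    "spark_version": "13.3.x-scala2.12",   # illustrative runtime version
    "node_type_id": "Standard_E8ds_v4",    # illustrative memory-optimized VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print("Created cluster:", response.json().get("cluster_id"))
```

The same minimum/maximum worker range can also be set when creating or editing the cluster in the Databricks UI.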
As for the worker and driver types, the main difference between them is the amount of memory and CPU available. The more memory and CPU a node has, the more data you can process in parallel, which reduces the execution time of your Spark jobs. Choose the worker and driver types based on the requirements of your workload.
For more details, refer to Best practices: Cluster configuration.
Hope this helps. If this answers your query, do click **Accept Answer** and **Yes** for "Was this answer helpful". And, if you have any further query, do let us know.