Azure Databricks performance tuning

Pankaj Joshi 331 Reputation points
2024-08-29T17:50:14.14+00:00

I have an Azure Databricks notebook with 40 PySpark tasks processing 2.5 million database records (Azure SQL). It takes 4.5 hours to run in the acceptance environment, but only 35 minutes in the test environment with 1 million records. Could you please advise how I can tune performance so that the run takes less time with 2.5 million records? I tried increasing the number of cluster worker nodes, but that didn't improve performance. Please advise.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. Vinodh247 20,476 Reputation points
    2024-08-30T06:40:15.9033333+00:00

    Hi Pankaj Joshi,

    Thanks for reaching out to Microsoft Q&A.

    The significant difference in runtime between your environments suggests that there might be some optimizations needed in data handling, cluster configuration, or Spark job tuning. Focus on identifying and eliminating any bottlenecks, optimizing your data transformations, and making sure that your cluster is appropriately configured for the workload.

    Here are some common strategies to help optimize your Azure Databricks performance:

    Data Skew: If your data is unevenly distributed, some tasks might be handling much larger chunks of data than others, causing those tasks to take longer and bottleneck the entire process. You can check for data skew by monitoring the stages and tasks in the Spark UI.
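
    For example, a quick way to get a rough picture of skew is to count rows per partition (a minimal sketch; df is a placeholder for your DataFrame):

        from pyspark.sql.functions import spark_partition_id

        # Count rows per Spark partition; a handful of partitions with far more
        # rows than the rest is a sign of skew worth investigating in the Spark UI.
        (df.withColumn("partition_id", spark_partition_id())
           .groupBy("partition_id")
           .count()
           .orderBy("count", ascending=False)
           .show(20))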

    Partitioning: Ensure your data is well-partitioned. You can use repartition() or coalesce() to adjust the number of partitions. Generally, aim for partitions that have about 128 MB each, which is a good balance between task granularity and overhead.
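
    A minimal sketch (df and the output path are placeholders):

        # Increase parallelism before wide transformations; repartition() performs a full shuffle.
        df = df.repartition(200)

        # Reduce the number of partitions/output files before writing;
        # coalesce() avoids a full shuffle when decreasing the partition count.
        df.coalesce(32).write.mode("overwrite").parquet("/mnt/output/orders")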

    Broadcast Smaller Tables: If you are joining large tables, try to broadcast the smaller table using broadcast() to avoid shuffling the larger table. This can significantly speed up joins.
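
    For example (large_df, small_df, and the join key are placeholders for your tables):

        from pyspark.sql.functions import broadcast

        # Ship the small lookup table to every executor so the large table is not shuffled.
        joined = large_df.join(broadcast(small_df), on="customer_id", how="inner")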

    Cache Intermediate Results: If you are performing multiple operations on the same dataset, caching the data in memory can prevent re-computation. Use df.cache(), or df.persist(StorageLevel.MEMORY_AND_DISK) if the data is too large to fit entirely in memory.
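
    A rough sketch of the pattern (df is a placeholder):

        from pyspark import StorageLevel

        # Persist once when the same DataFrame feeds several downstream tasks.
        df.persist(StorageLevel.MEMORY_AND_DISK)
        df.count()   # materialize the cache with a cheap action

        # ... run the downstream transformations/actions on df ...

        df.unpersist()   # release memory when it is no longer needed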

    Check Cache: Ensure the cached data is being utilized correctly. Sometimes data might be evicted from the cache due to memory pressure, leading to recomputation.

    Auto-scaling: Ensure that auto-scaling is enabled for your cluster, which can help handle variable loads efficiently.

    Instance Types: Choose instance types with higher memory and CPU if your workload is memory or CPU-bound.

    I/O Bound: If your workload is I/O-bound (e.g., reading from or writing to Azure SQL), consider using premium storage or increasing I/O throughput.

    Adjust Executor and Driver Memory: Make sure your executor and driver memory are appropriately set. Sometimes, too much or too little memory can lead to inefficient resource utilization.

    Task Parallelism: Adjust the spark.sql.shuffle.partitions parameter to ensure optimal parallelism. By default, this is set to 200, but it might need to be increased for larger datasets.
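
    For example (400 is only an illustrative value; tune it to your data volume and total executor cores):

        # Raise shuffle parallelism for the larger dataset.
        spark.conf.set("spark.sql.shuffle.partitions", 400)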

    Execution Plan Analysis: Review the execution plan using explain() to understand if there are any inefficiencies in how Spark is executing the tasks.
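
    For example:

        # Print the physical plan; look for unexpected shuffles (Exchange),
        # sort-merge joins where a broadcast would do, or scans that are not pushed down.
        df.explain(mode="formatted")   # mode="formatted" requires Spark 3.0+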

    Avoid Collecting Data to Driver: Avoid using actions like collect() that bring data to the driver unless absolutely necessary. This can cause memory issues and slow down the overall process.
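
    A sketch of the safer pattern (the output path is a placeholder):

        # Risky for millions of rows: rows = df.collect()

        # Inspect only a small sample on the driver...
        sample_rows = df.limit(20).collect()

        # ...and keep large results distributed, e.g. write them out instead.
        df.write.mode("overwrite").format("delta").save("/mnt/output/results")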

    Vectorized Operations: Prefer Spark's built-in DataFrame operations (e.g., withColumn() with built-in functions) over row-wise Python UDFs or RDD operations, which force per-row serialization to Python.
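
    For example (the "amount" column is a placeholder):

        from pyspark.sql import functions as F

        # Preferred: built-in column expressions run inside Spark's optimized engine.
        df = df.withColumn("amount_with_tax", F.col("amount") * 1.1)

        # Slower alternative to avoid: a row-wise Python UDF serializes every row to Python.
        # add_tax = F.udf(lambda x: x * 1.1, "double")
        # df = df.withColumn("amount_with_tax", add_tax("amount"))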

    Optimize Azure SQL Queries: Make sure your SQL queries are optimized, for example by creating indexes on the columns you are filtering or joining on.
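
    One way to keep the filtering on the Azure SQL side (so its indexes can be used) is to push a query down through the JDBC reader; a sketch with placeholder connection details and table/column names:

        jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

        pushed_down = (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("query", "SELECT id, amount, updated_at FROM dbo.Orders WHERE updated_at >= '2024-01-01'")
            .option("user", "<user>")          # in practice, read these from a secret scope
            .option("password", "<password>")
            .load())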

    Parallel Reads: Utilize partitioned reads from Azure SQL to parallelize the data loading process.
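
    For example, a partitioned JDBC read (all names and bounds are placeholders; the partition column should be numeric, date, or timestamp and reasonably evenly distributed):

        jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

        df = (spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", "dbo.Orders")
            .option("user", "<user>")
            .option("password", "<password>")
            .option("partitionColumn", "id")    # numeric key column
            .option("lowerBound", "1")
            .option("upperBound", "2500000")
            .option("numPartitions", "16")      # 16 parallel connections to Azure SQL
            .load())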

    Spark UI: Use the Spark UI to monitor task durations, shuffle read/write sizes, and the amount of data processed per task. This can help identify bottlenecks.

    Driver and Executor Logs: Check the logs for any warnings or errors that might indicate issues such as memory pressure or task failures.

    Adaptive Query Execution (AQE): Consider enabling AQE, which dynamically optimizes the query execution plan at runtime.

    Skew Join Optimization: If you have skewed data, consider using Spark's skew join optimization by setting spark.sql.adaptive.skewJoin.enabled to true (it takes effect only when AQE is enabled).
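
    A minimal configuration sketch covering the last two points (recent Databricks runtimes enable AQE by default, in which case these are no-ops):

        spark.conf.set("spark.sql.adaptive.enabled", "true")
        spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
        spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")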

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

