PySpark DataFrame is taking too long to save to ADLS from Databricks.

Pratik Roy 1 Reputation point
2022-11-29T10:09:42.24+00:00

I'm running a notebook on Azure Databricks using a multi-node cluster with 1 driver and 1-8 workers (each with 16 cores and 56 GB RAM), reading source data with 30K records from ADLS. The notebook consists of a few transformation steps and uses two UDFs that are necessary for the implementation. The entire set of transformations runs within 12 minutes (which is expected), but saving the final DataFrame to the ADLS Delta table takes more than 2 hours. I'm providing a code snippet here (I can't share the entire code); please suggest ways to reduce this save time.

# All the data reading and transformation code
# Only one display statement before saving to the Delta table;
# up to this statement the notebook takes 12 minutes to run.
data.display()

# Persist the DataFrame
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)

mount_path = "/mnt/********/"
table_name = "********"
adls_path = mount_path + table_name
(data.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(adls_path))

This last part takes 2-2.5 hours to finish.


1 answer

  1. BhargavaGunnam-MSFT 26,496 Reputation points Microsoft Employee
    2022-11-30T20:25:49.443+00:00

    Hello @Pratik Roy ,

    Welcome to the MS Q&A platform.

    I see you are using UDFs. In general, Python UDFs are slow because Spark's Catalyst optimizer cannot optimize them the way it does built-in functions; every row has to be serialized and shipped to a Python worker for evaluation.
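
    For example, here is a minimal sketch of that trade-off (the column name and cleaning logic are hypothetical, since your transformation code isn't shown). Where the logic can be expressed with built-in functions, Catalyst can optimize it; a vectorized pandas UDF is a middle ground when it can't:

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import StringType
    import pandas as pd

    # Slow: row-at-a-time Python UDF; every row is serialized to a Python worker
    @udf(StringType())
    def clean_name_udf(s):
        return s.strip().upper() if s is not None else None

    # Fast: the same logic with built-in functions, which Catalyst can optimize
    data = data.withColumn("name_clean", F.upper(F.trim(F.col("name"))))

    # Middle ground when built-ins can't express the logic: a vectorized pandas UDF
    @pandas_udf(StringType())
    def clean_name_vec(s: pd.Series) -> pd.Series:
        return s.str.strip().str.upper()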

    This blog post explains UDF performance in more detail.

    Apart from this, I suspect the transformations are producing a large DataFrame that does not fit in memory. Note that persist(StorageLevel.MEMORY_ONLY) is lazy and silently drops any partitions that don't fit, so the write can end up re-running the whole UDF pipeline anyway.
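
    A minimal sketch of the fix for that (assuming the same data DataFrame and adls_path from your snippet): persist to memory and disk, force materialization once with an action, and only then write, so the expensive UDF stage runs a single time:

    from pyspark import StorageLevel

    # MEMORY_AND_DISK spills partitions that don't fit instead of dropping them
    data = data.persist(StorageLevel.MEMORY_AND_DISK)

    # persist() is lazy; trigger an action once so the UDF pipeline
    # is computed and cached before the write starts
    data.count()

    (data.write
         .format('delta')
         .mode('overwrite')
         .option('overwriteSchema', 'true')
         .save(adls_path))

    data.unpersist()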

    Can you please increase the cluster size (the number of worker nodes) beyond the current maximum of 8 and see if it makes any difference?
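
    Along the same lines, the write's parallelism is bounded by the number of partitions in the final DataFrame; if the UDF stage left only a handful of partitions, the Delta write will crawl regardless of cluster size. A quick check and a sketch of the adjustment (64 is an arbitrary illustrative value, not a recommendation for your data):

    # See how many tasks the write stage will get
    print(data.rdd.getNumPartitions())

    # If the count is very small, repartition before writing so the save
    # can use all of the cluster's cores
    data = data.repartition(64)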

    Also, please check the network configuration of ADLS; using Azure Private Link to connect to ADLS is recommended.