PySpark DataFrame is taking too long to save to ADLS from Databricks.

Pratik Roy 1 Reputation point
2022-11-29T10:09:42.24+00:00

I'm running a notebook on Azure Databricks using a multi-node cluster with 1 driver and 1-8 workers (each with 16 cores and 56 GB RAM), reading source data with 30K records from ADLS. The notebook consists of a few transformation steps and uses two UDFs that are necessary for the implementation. The entire set of transformations runs within 12 minutes (which is expected), but saving the final DataFrame to the ADLS Delta table takes more than 2 hours. I'm providing a code snippet here (I can't share the entire code); please suggest ways to reduce this save time.

# All the data reading and transformation code
# Only one display statement before saving to the Delta table;
# up to this statement the notebook takes 12 minutes to run.
data.display()

# Persist the DataFrame
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)

mount_path = "/mnt/********/"
table_name = "********"
adls_path = mount_path + table_name
(data.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(adls_path))

This last part takes 2-2.5 hours to finish.


1 answer

  1. BhargavaGunnam-MSFT 26,496 Reputation points Microsoft Employee
    2022-11-30T20:25:49.443+00:00

    Hello @Pratik Roy ,

    Welcome to the MS Q&A platform.

    I see you are using UDFs. In general, Python UDFs are slow because Spark's Catalyst optimizer cannot optimize them the way it does built-in functions; every row has to be serialized and shipped to a Python worker for evaluation.
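
    For example, here is a minimal sketch of that trade-off (the column name and cleaning logic are hypothetical, since your transformation code isn't shown). Where the logic can be expressed with built-in functions, Catalyst can optimize it; a vectorized pandas UDF is a middle ground when it can't:

    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import StringType
    import pandas as pd

    # Slow: row-at-a-time Python UDF; every row is serialized to a Python worker
    @udf(StringType())
    def clean_name_udf(s):
        return s.strip().upper() if s is not None else None

    # Fast: the same logic with built-in functions, which Catalyst can optimize
    data = data.withColumn("name_clean", F.upper(F.trim(F.col("name"))))

    # Middle ground when built-ins can't express the logic: a vectorized pandas UDF
    @pandas_udf(StringType())
    def clean_name_vec(s: pd.Series) -> pd.Series:
        return s.str.strip().str.upper()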

    This blog post explains UDF performance in more detail.

    Apart from this, I suspect the transformations are producing a large DataFrame that does not fit in memory. Note that persist(StorageLevel.MEMORY_ONLY) is lazy and silently drops any partitions that don't fit, so the write can end up re-running the whole UDF pipeline anyway.
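
    A minimal sketch of the fix for that (assuming the same data DataFrame and adls_path from your snippet): persist to memory and disk, force materialization once with an action, and only then write, so the expensive UDF stage runs a single time:

    from pyspark import StorageLevel

    # MEMORY_AND_DISK spills partitions that don't fit instead of dropping them
    data = data.persist(StorageLevel.MEMORY_AND_DISK)

    # persist() is lazy; trigger an action once so the UDF pipeline
    # is computed and cached before the write starts
    data.count()

    (data.write
         .format('delta')
         .mode('overwrite')
         .option('overwriteSchema', 'true')
         .save(adls_path))

    data.unpersist()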

    Can you please increase the cluster size (the number of worker nodes) beyond the current maximum of 8 and see if it makes any difference?
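
    Along the same lines, the write's parallelism is bounded by the number of partitions in the final DataFrame; if the UDF stage left only a handful of partitions, the Delta write will crawl regardless of cluster size. A quick check and a sketch of the adjustment (64 is an arbitrary illustrative value, not a recommendation for your data):

    # See how many tasks the write stage will get
    print(data.rdd.getNumPartitions())

    # If the count is very small, repartition before writing so the save
    # can use all of the cluster's cores
    data = data.repartition(64)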

    Also, please check the network configuration of ADLS; using Azure Private Link to connect to ADLS is recommended.