Empty files generated when writing to Data Lake

Rakesh Kumar 20 Reputation points
2023-12-18T11:53:39.24+00:00

Hi,

When I save a DataFrame to the Data Lake, the write creates a file in the specified path, but it also generates an empty (0 KB) file outside the designated folder.

These are the mount points:

'/mnt/schema/table1/2023/09/'

'/mnt/schema/table1/2023/10/'

'/mnt/schema/table1/2023/11/'

'/mnt/schema/table1/2023/12/'

Code:

# dataframes_dict maps each mount path to the DataFrame to be written there
for path, dataframe in dataframes_dict.items():
    # display(path)
    dataframe.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(path)


Azure Data Lake Storage
Azure Blob Storage
Azure Databricks

1 answer

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-12-20T21:59:53.2366667+00:00

    Hello Rakesh Kumar,

    Thanks for confirming this.

    Based on the forum threads linked below, the issue could be due to the way Spark handles data partitioning. When writing data, Spark creates a separate file for each partition of the DataFrame. If some partitions are empty, Spark still creates a file for them, which results in empty files.
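
To check whether empty partitions are actually present before changing the write, the per-partition row counts can be inspected. This is a minimal sketch and not part of the original answer: the helper names `partition_row_counts` and `count_empty_partitions` are hypothetical, and it assumes the cluster allows access to the DataFrame's underlying RDD through `df.rdd`.

```python
def partition_row_counts(df):
    """Return the number of rows in each partition of a Spark DataFrame."""
    # glom() gathers each partition into a list, so len() gives rows per partition.
    return df.rdd.glom().map(len).collect()


def count_empty_partitions(df):
    """Return how many partitions of the DataFrame contain no rows."""
    return sum(1 for rows in partition_row_counts(df) if rows == 0)
```

For example, `count_empty_partitions(dataframe)` could be called inside the loop before each write. A non-zero result would be consistent with the partitioning explanation.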

    Can you try repartitioning your DataFrame before writing it and see whether that stops the empty files from being created?

    for path, dataframe in dataframes_dict.items():
        dataframe = dataframe.repartition(1)  
        dataframe.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(path)
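
As a variation on the loop above, `coalesce(1)` can be used in place of `repartition(1)`. It also reduces the DataFrame to a single partition but avoids a full shuffle. The sketch below is an illustration under assumptions rather than a confirmed fix: the function name `write_single_csv_per_path` is hypothetical, and the explicit `.format(...)` call is dropped because `.csv(path)` already selects the CSV writer.

```python
def write_single_csv_per_path(dataframes_dict):
    """Write each DataFrame in the dict to its path as CSV from a single partition."""
    for path, dataframe in dataframes_dict.items():
        (
            dataframe.coalesce(1)        # collapse to one partition without a full shuffle
            .write.mode("overwrite")     # replace any existing output at this path
            .option("header", "true")    # write column names as the first row
            .csv(path)                   # CSV output under the given directory
        )
```

Called as `write_single_csv_per_path(dataframes_dict)`, this mirrors the loop above. Spark may still write marker files such as `_SUCCESS` alongside the data file.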
    

    https://stackoverflow.com/questions/46436077/how-to-avoid-empty-files-while-writing-parquet-files

    https://stackoverflow.com/questions/55994456/azure-databricks-writing-a-file-into-azure-data-lake-gen-2
