generating empty files when writing in datalake

Question

Rakesh Kumar 20

Hi,

When I save a DataFrame to the data lake, the write creates a file in the specified path, but it also generates an empty 0 KB file outside the designated folder.

These are the mount points:

'/mnt/schema/table1/2023/09/'

'/mnt/schema/table1/2023/10/'

'/mnt/schema/table1/2023/11/'

'/mnt/schema/table1/2023/12/'

Code:

# dataframes_dict maps each target path to the DataFrame to write there
for path, dataframe in dataframes_dict.items():
    # display(path)
    dataframe.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(path)

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2023-12-18T19:50:44.94+00:00

Hello Rakesh Kumar,

Welcome to the Microsoft Q&A forum.

The empty 0 KB file outside the designated folder could be due to the storage account not having a hierarchical namespace enabled, that is, a plain Blob Storage account rather than ADLS Gen2.

In a flat namespace, folders are only simulated through blob name prefixes, and some drivers write a zero-byte marker blob to represent a directory. This can show up as an empty 0 KB file next to the designated folder.

To resolve this issue, you can try using a Gen2 storage account, which supports a hierarchical namespace and stores files in a real folder structure. This should ensure that the files are saved within the specified folder and prevent an empty file from being generated outside of it.

I hope this helps! Let me know if you have any further questions.

Rakesh Kumar 20 Reputation points

2023-12-19T13:07:11.5066667+00:00

@Bhargava-MSFT
I have checked my storage account; it is an ADLS Gen2 storage account.

Rakesh Kumar 20 Reputation points

2023-12-21T05:05:27.11+00:00

@Bhargava-MSFT

Thanks for the update. Empty file issue is resolved.

But there is a new problem: an empty folder named '_$azuretmpfolder$' is being created in the data lake.

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2023-12-21T23:06:20.99+00:00

Hello Rakesh Kumar,

Glad to hear that the issue with empty files was resolved. Did you use the code that I provided to fix the issue, or did you try a different solution?

If the issue with empty folders is a new problem, I would encourage you to open a new thread.

Answer 1

Hello Rakesh Kumar,

Thanks for confirming this.

After going through the threads linked below, I think the issue could be due to the way Spark handles data partitioning. When writing data, Spark creates a separate file for each partition of the DataFrame. If some partitions are empty, Spark may still create a file for them, which results in empty files.
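To illustrate how partitions can end up empty, here is a small plain-Python sketch. It is not Spark code and does not use Spark's actual hash function; `assign_partition`, `rows_per_partition`, and the month keys are hypothetical names and data made up for this example. It only shows that when rows are spread across more partitions than there are distinct keys, some partitions necessarily receive no rows.

```python
import zlib
from collections import Counter


def assign_partition(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for a hash partitioner: hash(key) mod num_partitions.
    # (Assumption for illustration; Spark uses a different hash internally.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    # Count how many rows land in each partition index.
    counts = Counter(assign_partition(k, num_partitions) for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


# Hypothetical data: 1,000 rows but only four distinct month keys.
keys = ["2023-09", "2023-10", "2023-11", "2023-12"] * 250
sizes = rows_per_partition(keys, num_partitions=8)
empty = sum(1 for n in sizes if n == 0)

print(f"rows per partition: {sizes}")
print(f"empty partitions: {empty} of {len(sizes)}")
```

With four distinct keys and eight partitions, at least four partitions must be empty, and each empty partition is a candidate for an empty output file.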

Can you try repartitioning your DataFrame before writing it and see whether that stops the empty files from being created?

for path, dataframe in dataframes_dict.items():
    # Collapse to a single partition so only one part file is written per path
    dataframe = dataframe.repartition(1)
    dataframe.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(path)

https://stackoverflow.com/questions/46436077/how-to-avoid-empty-files-while-writing-parquet-files

https://stackoverflow.com/questions/55994456/azure-databricks-writing-a-file-into-azure-data-lake-gen-2
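If it helps to confirm where the 0 KB files actually end up after a write, a sketch like the following can list them. It assumes the mount is reachable as a local file path (on Databricks that is typically under `/dbfs/mnt/...`, which is an assumption about your setup), and `find_empty_files` is a hypothetical helper written for this example, not part of any library. Files whose names start with `_` are skipped by default because Spark's `_SUCCESS` job marker is zero bytes by design.

```python
import tempfile
from pathlib import Path


def find_empty_files(root, include_markers=False):
    """Return zero-byte files under root, sorted by path."""
    empty = []
    for p in Path(root).rglob("*"):
        if not p.is_file() or p.stat().st_size != 0:
            continue
        # Skip marker files such as _SUCCESS unless explicitly requested.
        if not include_markers and p.name.startswith("_"):
            continue
        empty.append(p)
    return sorted(empty)


# Demo on a throwaway local directory (a stand-in for a mounted path).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "table1" / "2023" / "09"
    out.mkdir(parents=True)
    (out / "part-00000.csv").write_text("id,value\n1,a\n")
    (out / "part-00001.csv").write_text("")  # empty part file
    (out / "_SUCCESS").write_text("")        # zero-byte job marker
    found = [p.name for p in find_empty_files(tmp)]
    found_all = [p.name for p in find_empty_files(tmp, include_markers=True)]

print(found)
print(found_all)
```

On a real mount you would call something like `find_empty_files("/dbfs/mnt/schema/table1")` after the write and compare the result before and after adding `repartition(1)`.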
