writing to parquet creates empty blob

Question

braxx 456

When writing to parquet, I get an extra empty file created alongside the folder with the data.

I do not need it; it only creates a mess.

Here are the commands I tried. Both produce this file.

output_path = "/mnt/cointainer/folder/subfolder/sub_subfolder_" + currentdate  
....  
childitems.write.mode('overwrite').parquet(output_path)

or

output_path = "/mnt/cointainer/folder/subfolder/sub_subfolder_" + currentdate  
....  
childitems.write.format("parquet").mode('overwrite').save(output_path)

How can I get rid of this unwanted file?

2 answers

Answer 1

PRADEEPCHEEKATLA 90,641 Moderator

Hello @braxx ,

Thanks for asking and using Microsoft Q&A.

This is expected behaviour: any Spark job that writes output creates these files.

Expected output:

When DBIO transactional commit is enabled, metadata files starting with _started_<id> and _committed_<id> will accompany data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly. Rather, use the VACUUM command to clean the files.

A combination of the three properties below disables all the transactional files whose names start with "_".

The transaction logs for Spark parquet writes can be disabled using

spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

This disables the _committed_<TID> and _started_<TID> files, but the _SUCCESS, _common_metadata and _metadata files are still generated.

We can disable the _common_metadata and _metadata files using

parquet.enable.summary-metadata=false

We can also disable the _SUCCESS file using

mapreduce.fileoutputcommitter.marksuccessfuljobs=false
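
A minimal sketch of how the three properties above might be applied from a PySpark notebook, assuming a `spark` session is already available, as it is in a Databricks notebook. The names `MARKER_FILE_SETTINGS` and `apply_settings` are illustrative and not part of any API, and whether a session-level setting takes effect for every key may depend on the runtime version.

```python
# Property names and values are the ones quoted in the answer above.
MARKER_FILE_SETTINGS = {
    # Use the plain Hadoop commit protocol instead of DBIO, which
    # stops the _started_<TID> and _committed_<TID> files.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    # Skip the _metadata and _common_metadata summary files.
    "parquet.enable.summary-metadata": "false",
    # Skip the _SUCCESS marker file.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_settings(spark, settings=MARKER_FILE_SETTINGS):
    """Set each property on the session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

In a notebook this would be called as `apply_settings(spark)` before the `childitems.write` call. Note that these settings only affect files inside the output folder.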

For more details, refer to "Transactional Writes to Cloud Storage with DBIO", "Stop Azure Databricks auto creating files" and "How do I prevent _success and _committed files in my write output?".

Hope this helps. Do let us know if you have any further queries.

------------

Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.

braxx 456 Reputation points

2021-05-05T12:57:59.317+00:00

Thank you for the explanation. That's helpful for sure, although my case is slightly different.

You explained what is inside the folder created by Databricks.

I am OK with that and understand it. But if I go one level up, outside the folder, I see an empty blob with the same name as the folder. It is created alongside the folder, not inside it. See the screenshot, where it is marked in yellow.

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2021-05-06T07:23:52.96+00:00

Hello @braxx ,

This looks strange. I could not find any files created outside the folder.

To investigate further, could you please share the Databricks runtime version you are using, along with a sample dataset to reproduce your scenario?

braxx 456 Reputation points

2021-05-06T09:35:27.79+00:00

Sure, I appreciate your help.

Steps to reproduce the issue:

1. Save the attached JSON to blob storage: 94349-sample-json.txt
2. Mount the blob storage to Databricks.
3. Run the attached script from a notebook in Databricks (adjust the input and output folders). The script parses the JSON and saves it as parquet: 94367-sample-notebook.txt

Here is what I believe is the runtime version: DBR 6.4 | Spark 2.4.5 | Scala 2.11, but I think running this on a different cluster causes the same issue.

What is weird is that when I delete the empty blob, the whole folder is deleted as well.

braxx 456 Reputation points

2021-05-07T18:00:17.607+00:00

@PRADEEPCHEEKATLA were you able to reproduce the issue?

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2021-05-10T04:18:20.27+00:00

Hello @braxx ,

Thanks for sharing the details. I will try to reproduce this issue and let you know the findings.

For a deeper investigation and immediate assistance on this issue, if you have a support plan you may file a support ticket.

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2021-05-11T07:25:32.797+00:00

Hello @braxx ,

I tested with the sample data and notebook you provided, and I was able to see the expected files.

Tested on runtime version: DBR 6.4 | Spark 2.4.5 | Scala 2.11

Note: Unfortunately I could not find any files created outside the folder.

For a deeper investigation and immediate assistance on this issue, if you have a support plan you may file a support ticket.

Kuldeep Singh 0 Reputation points

2023-04-27T09:53:49.7933333+00:00

Hello @braxx, have you found any solution to this problem, in which files are created outside the folder?
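
The thread does not reach a resolution. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the WASB driver behind a `wasbs://` mount represents a directory in the flat blob namespace as a zero-byte marker blob named after the directory and tagged with the metadata `hdi_isfolder=true`. That would also be consistent with the folder disappearing when the empty blob is deleted. The sketch below shows how such markers could be picked out of a blob listing; the function name, the dictionary layout and the sample names are hypothetical, and the listing would have to come from whatever storage client is in use.

```python
def find_folder_markers(blobs):
    """Return names of zero-byte blobs that look like directory markers.

    Each entry in `blobs` is assumed to be a dict with the keys
    "name", "size" and, optionally, "metadata".
    """
    names = [b["name"] for b in blobs]
    markers = []
    for blob in blobs:
        if blob["size"] != 0:
            continue
        metadata = blob.get("metadata") or {}
        flagged = metadata.get("hdi_isfolder") == "true"
        # A blob that shares its name with a "folder" prefix of other blobs.
        prefix = blob["name"].rstrip("/") + "/"
        has_children = any(n.startswith(prefix) for n in names)
        if flagged or has_children:
            markers.append(blob["name"])
    return markers


# Hypothetical listing shaped like the output described in the question.
sample = [
    {"name": "folder/sub_subfolder_20210505", "size": 0,
     "metadata": {"hdi_isfolder": "true"}},
    {"name": "folder/sub_subfolder_20210505/part-00000.snappy.parquet",
     "size": 1024, "metadata": {}},
    {"name": "folder/sub_subfolder_20210505/_SUCCESS", "size": 0,
     "metadata": {}},
]
print(find_folder_markers(sample))
```

If the empty blob in question shows up in such a check, it is probably the directory itself as seen by the driver rather than a stray output file.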

Answer 2

Thank you for your effort, I really appreciate it. Here is a related thread, also unsolved. Would it be possible to report it as a bug for the product team to investigate?

databricks-dbutils-creates-empty-blob-files-for-az.html

Maybe it is related to how I mounted the container?

storagename = "AAAA"
containername = "BBBB"
saskey = dbutils.secrets.get(scope="CCCCC", key="DDDD")

# Mount the container over WASB, authenticating with the SAS token.
dbutils.fs.mount(
  source="wasbs://" + containername + "@" + storagename + ".blob.core.windows.net/",
  mount_point="/mnt/" + containername + "/",
  extra_configs={"fs.azure.sas." + containername + "." + storagename + ".blob.core.windows.net": saskey})
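
To make the structure of the mount arguments easier to inspect, the string building above can be pulled into a small helper. This is only a sketch: `wasbs_mount_args` is a hypothetical name, and it does nothing beyond assembling the same three values used in the mount call.

```python
def wasbs_mount_args(container, account, sas_token):
    """Build the keyword arguments for a SAS-authenticated WASB mount."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}/",
        "mount_point": f"/mnt/{container}/",
        # The config key names the container and account the SAS token is for.
        "extra_configs": {f"fs.azure.sas.{container}.{host}": sas_token},
    }


# Placeholder values matching the snippet above; the token is a dummy.
args = wasbs_mount_args("BBBB", "AAAA", "dummy-sas-token")
print(args["source"])
```

In a notebook the result could be passed on as `dbutils.fs.mount(**args)`.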
