How to save a Spark dataframe (with Synapse) to a data container (without creating a folder and _SUCCESS file)

Question

Vivian Nguyen 40

I want to save a Spark dataframe to my data container. It worked with this code:

df.write.csv(path_name + "test5.csv")

However, this creates a folder called test5.csv with two files in it: one is my dataframe (but with a randomly generated string as its name) and the other is a _SUCCESS file.

How do I prevent it from creating a new folder and the _SUCCESS file, and write only the dataframe file with a name I specify (instead of the random string)?

Vivian Nguyen 40 Reputation points

2024-04-25T08:32:32.3033333+00:00

Here is an image of the folder that gets generated

Accepted answer

Answer 1

@Vivian Nguyen

Thank you for your query on the Microsoft Q&A platform.

It seems that you want to prevent the creation of the _SUCCESS file and also specify a name for the target file.

It is the default behavior of Spark to create transactional files such as the _SUCCESS, _committed, and _metadata files.

You can consider the following steps to remove the generated transactional files and give the target file a specific name:

1. Use coalesce(1) to write the dataframe as a single partition file into a temp folder.
2. Loop through the files in that folder, keeping the .csv files and ignoring the transactional files.
3. Copy only the .csv file to the target folder under the specified file name.
4. Remove the temp folder with recursive set to True.

Relevant resources: How to Write Dataframe as single file with specific name in PySpark
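
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name write_single_csv and its parameters are hypothetical, header=True is an assumption that the original call did not include, and fs stands for any file system helper with ls, cp and rm methods (in a Synapse notebook, mssparkutils.fs is assumed to fit this shape).

```python
def write_single_csv(df, temp_dir, target_path, fs):
    """Write `df` as one CSV file at `target_path`, using `temp_dir` as scratch space.

    `fs` is any object exposing ls(path), cp(src, dest) and rm(path, recurse),
    for example `mssparkutils.fs` in a Synapse notebook (an assumption here).
    """
    # Step 1: collapse to one partition so Spark emits a single part file.
    # header=True is an assumption; drop it to match the original write call.
    df.coalesce(1).write.mode("overwrite").csv(temp_dir, header=True)

    # Step 2: keep only the data file; skip _SUCCESS and other marker files.
    part_files = [f for f in fs.ls(temp_dir) if f.name.endswith(".csv")]
    if len(part_files) != 1:
        raise RuntimeError(f"expected one .csv part file, found {len(part_files)}")

    # Step 3: copy the part file to the name we actually want.
    fs.cp(part_files[0].path, target_path)

    # Step 4: remove the temp folder and everything in it.
    fs.rm(temp_dir, True)
    return target_path
```

In a Synapse notebook the call might look like write_single_csv(df, path_name + "tmp_test5", path_name + "test5.csv", mssparkutils.fs), where tmp_test5 is a hypothetical scratch folder name. Note that coalesce(1) pulls all data into one task, so this approach suits dataframes small enough for a single executor.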

Alternatively, you can try the following solution:

You can disable Spark's transaction logs for Parquet writes by setting spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.

This disables the "_committed_<TID>" and "_started_<TID>" files, but the _SUCCESS, _common_metadata, and _metadata files are still generated.

You can disable the _common_metadata and _metadata files with "parquet.enable.summary-metadata=false".
You can also disable the _SUCCESS file with "mapreduce.fileoutputcommitter.marksuccessfuljobs=false".

Related documentation: https://community.databricks.com/t5/data-engineering/how-do-i-prevent-success-and-committed-files-in-my-write-output/td-p/28690
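
As a rough illustration, the three settings quoted above could be applied to a Spark session before writing. This is only a sketch: the names OUTPUT_SETTINGS and apply_output_settings are hypothetical, and whether these settings take effect when set at runtime on a given Synapse pool (rather than at session creation) is an assumption worth verifying.

```python
# The three settings quoted in the answer; values are passed as strings.
OUTPUT_SETTINGS = {
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    "parquet.enable.summary-metadata": "false",
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_output_settings(spark, settings=OUTPUT_SETTINGS):
    """Apply each setting to the session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings
```

In a notebook this would be called as apply_output_settings(spark) before df.write. These settings only suppress the extra marker files; Spark still writes the output as a folder containing part files with generated names, so they do not by themselves produce a single file with a chosen name.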

Hope it helps. Kindly accept the answer by clicking the Accept answer button. Thank you.

Vivian Nguyen 40 Reputation points

2024-04-26T08:28:25.35+00:00

Wow, this seems like a great solution. I will try it out soon. I will accept the answer for now.
