How to save a Spark dataframe (with Synapse) to a data container (without creating a folder and _SUCCESS file)

Vivian Nguyen 40 Reputation points
2024-04-25T08:31:33.9933333+00:00

I want to save a spark dataframe to my data container. It worked with this code:

df.write.csv(path_name + "test5.csv")

However, this creates a folder called test5.csv with 2 files in it. One is my dataframe (but with a randomly generated string as its name) and the other is a _SUCCESS file.

How do I prevent it from creating a new folder, prevent it from creating the _SUCCESS file, and write only the dataframe file with a name I specify (instead of the random string)?

Azure Synapse Analytics

Accepted answer
  1. AnnuKumari-MSFT 31,726 Reputation points Microsoft Employee
    2024-04-26T07:10:01.1533333+00:00

    @Vivian Nguyen

    Thank you for your query on the Microsoft Q&A platform.

    It seems that you want to prevent the creation of the _SUCCESS file and also specify a name for the target file.

    It is the default behavior of Spark to create transactional files such as the _SUCCESS, _committed, and _metadata files.

    You can use the following steps to remove the generated transactional files and give the target file a specific name:

    • Use the coalesce(1) function to write a single partition file into a temp folder.
    • Loop through all the files in that folder, keep the .csv files, and ignore the transactional files.
    • Copy only the .csv file to the target folder under the file name you specify.
    • Remove the temp folder with recursive set to True.
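
    The four steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name write_single_csv and its parameters are hypothetical, and fs is assumed to be an object that offers ls, cp and rm the way mssparkutils.fs does in a Synapse notebook. Because coalesce(1) pushes all rows through a single task, it is only reasonable for data that fits comfortably in one file.

```python
def write_single_csv(df, temp_dir, target_path, fs):
    """Write `df` as a single CSV file at `target_path`.

    `temp_dir` is a scratch folder that is deleted at the end.
    `fs` is assumed to expose ls(path), cp(src, dest) and rm(path, recurse),
    as mssparkutils.fs does in Synapse notebooks.
    """
    # Step 1: one partition means Spark writes one part-*.csv file in temp_dir.
    df.coalesce(1).write.mode("overwrite").option("header", True).csv(temp_dir)

    # Step 2: keep only the data file; skip _SUCCESS and other marker files.
    part_files = [
        f for f in fs.ls(temp_dir)
        if f.name.startswith("part-") and f.name.endswith(".csv")
    ]
    if len(part_files) != 1:
        raise RuntimeError(
            f"expected exactly one part file, found {len(part_files)}"
        )

    # Step 3: copy the data file to its final location and name.
    fs.cp(part_files[0].path, target_path)

    # Step 4: remove the temp folder and everything in it.
    fs.rm(temp_dir, True)
    return target_path


# Assumed usage in a Synapse notebook (the temp folder name is hypothetical):
#   from notebookutils import mssparkutils
#   write_single_csv(df, path_name + "_tmp_test5",
#                    path_name + "test5.csv", mssparkutils.fs)
```

    Passing the file system helper in as an argument keeps the function independent of the notebook environment, so the same logic can be exercised with a stand-in object outside Synapse.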

    Relevant resources: How to Write Dataframe as single file with specific name in PySpark

    Alternatively, you can try the below solution:

    We can disable Spark's transaction log files for Parquet writes by setting spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.

    This disables the "_committed_<TID>" and "_started_<TID>" files, but the _SUCCESS, _common_metadata and _metadata files will still be generated.

    1. We can disable the _common_metadata and _metadata files using "parquet.enable.summary-metadata=false".
    2. We can also disable the _SUCCESS file using "mapreduce.fileoutputcommitter.marksuccessfuljobs=false".
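
    One way to apply the three settings above is at session level, before the write. This is a sketch under assumptions: the helper name disable_marker_files is hypothetical, and setting these keys through spark.conf.set is assumed to be honored by the Spark runtime in use, which is worth verifying on your pool. Even with all three settings, Spark still writes a folder containing part files, so this does not by itself produce a single named file.

```python
# Settings quoted in the answer above, collected in one place.
MARKER_FILE_SETTINGS = {
    # Use the plain Hadoop commit protocol (no _committed/_started files).
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    # Skip the Parquet summary files _common_metadata and _metadata.
    "parquet.enable.summary-metadata": "false",
    # Skip the _SUCCESS marker file.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def disable_marker_files(spark):
    """Apply the settings to an existing SparkSession-like object."""
    for key, value in MARKER_FILE_SETTINGS.items():
        spark.conf.set(key, value)


# Assumed usage in a notebook where `spark` is the active session:
#   disable_marker_files(spark)
#   df.write.csv(path_name + "test5.csv")
```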

    Related documentation: https://community.databricks.com/t5/data-engineering/how-do-i-prevent-success-and-committed-files-in-my-write-output/td-p/28690

    Hope it helps. Kindly accept the answer by clicking the Accept answer button. Thank you.

