Partitioning CSV files creates log files

Rakesh Kumar 45 Reputation points
2023-11-30T07:14:37.42+00:00

I am partitioning CSV files and storing them in Azure Data Lake. The destination contains:

_committed_138917450370135985

_started_138917450370135985

_SUCCESS

part-00000-tid-138917450370135985-822eee2b-508b-46ea-9ed6-c426f350d05c-223-1-c000.csv

I only want a single file, which should be named table.csv.

I don't want the _committed, _started, or _SUCCESS files.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. PRADEEPCHEEKATLA-MSFT 85,351 Reputation points Microsoft Employee
    2023-12-01T05:12:28.5033333+00:00

    @Rakesh Kumar - Thanks for the question and using MS Q&A platform.

    This is expected behaviour: any Spark job that writes output creates these files.

    When DBIO transactional commit is enabled, metadata files whose names start with _started_<id> and _committed_<id> accompany the data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly. Rather, use the VACUUM command to clean them up.

    A combination of the three properties below disables writing of all the transactional files whose names start with "_".

    We can disable the transaction logs for Spark Parquet writes using

    spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

    This disables the _committed and _started files, but the _SUCCESS, _common_metadata and _metadata files are still generated.

    We can disable the _common_metadata and _metadata files using

    parquet.enable.summary-metadata=false

    We can also disable the _SUCCESS file using

    mapreduce.fileoutputcommitter.marksuccessfuljobs=false
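
    As a minimal sketch, the three properties above could be applied from a Python notebook before the write runs. It assumes `spark` is the active SparkSession (as in a Databricks notebook) and that session-level settings made with `spark.conf.set` are picked up by the write; the helper name `disable_marker_files` is made up for this example.

```python
# Property names and values are the ones quoted in the answer above.
MARKER_FILE_SETTINGS = {
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    "parquet.enable.summary-metadata": "false",
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def disable_marker_files(spark):
    """Set each property on the session through spark.conf.set.

    `spark` is assumed to be a SparkSession (or any object exposing
    conf.set(key, value)). Returns a copy of the settings that were applied.
    """
    for key, value in MARKER_FILE_SETTINGS.items():
        spark.conf.set(key, value)
    return dict(MARKER_FILE_SETTINGS)


# Typical use in a notebook, before calling df.write...:
#   disable_marker_files(spark)
```

    The same key/value pairs can also be placed in the cluster's Spark configuration, so that every job on the cluster uses them.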

    For more details, refer to "Transactional Writes to Cloud Storage with DBIO", "Stop Azure Databricks auto creating files" and "How do I prevent _success and _committed files in my write output?".
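
    The settings above only stop the marker files; the data file still gets a generated part-... name. One common workaround, sketched here under assumptions, is to write to a temporary folder and then move the single part file to the name you want. The sketch assumes `fs` behaves like `dbutils.fs` (`ls`, `mv`, `rm`) and that the temporary folder holds exactly one part file (for example after `df.coalesce(1)`); the function names and paths are made up for illustration.

```python
import posixpath


def find_part_file(names):
    """Return the single part-*.csv name from a directory listing."""
    parts = [n for n in names if n.startswith("part-") and n.endswith(".csv")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


def publish_single_csv(fs, tmp_dir, target_path):
    """Move the lone part file in tmp_dir to target_path, then delete tmp_dir.

    `fs` is assumed to expose ls(path), mv(src, dst) and rm(path, recurse),
    as dbutils.fs does in a Databricks notebook.
    """
    names = [info.name for info in fs.ls(tmp_dir)]
    part = find_part_file(names)
    fs.mv(posixpath.join(tmp_dir, part), target_path)
    fs.rm(tmp_dir, True)  # recurse=True also removes any leftover marker files
    return target_path


# Hypothetical use in a notebook (paths are placeholders):
#   df.coalesce(1).write.option("header", "true").mode("overwrite").csv(tmp_dir)
#   publish_single_csv(dbutils.fs, tmp_dir, target_dir + "/table.csv")
```

    Note that `coalesce(1)` sends all the data through a single task, so this suits outputs small enough to be written as one file.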

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.
