partition the csv files creating log files

Question

Rakesh Kumar 45

I am partitioning CSV files and storing them in Azure Data Lake. The destination folder contains:

_committed_138917450370135985

_started_138917450370135985

_SUCCESS

part-00000-tid-138917450370135985-822eee2b-508b-46ea-9ed6-c426f350d05c-223-1-c000.csv

I only want a single file, named table.csv.

I don't want the _committed, _started, or _SUCCESS files.

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-12-07T07:03:50.8833333+00:00

@Rakesh Kumar - Just checking in to see if the below answer helped. If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Answer 1

PRADEEPCHEEKATLA 90,641 Moderator

@Rakesh Kumar - Thanks for the question and using MS Q&A platform.

This is expected behaviour: Spark jobs create these files when they write output.

When DBIO transactional commit is enabled, metadata files whose names start with _started_ and _committed_ accompany the data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly. Rather, use the VACUUM command to clean them up.

A combination of the three properties below will disable writing all the transactional files whose names start with "_".

We can disable the transaction logs of Spark parquet writes using

spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

This disables the _committed and _started files, but the _SUCCESS, _common_metadata and _metadata files will still be generated.

We can disable the _common_metadata and _metadata files using

parquet.enable.summary-metadata=false

We can also disable the _SUCCESS file using

mapreduce.fileoutputcommitter.marksuccessfuljobs=false
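As a rough sketch, the three properties above could be applied from a Databricks notebook before the write runs. The helper name `apply_clean_output_settings` and the dictionary name are made up for illustration, and the sketch assumes `spark.conf.set` is how the session is configured on your cluster; the property names and values are the ones quoted above.

```python
# The three settings quoted in the answer, as plain string values.
CLEAN_OUTPUT_SETTINGS = {
    # Switch from the DBIO commit protocol to the standard Hadoop one,
    # so the _started_ and _committed_ files are not written.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    # Skip the _common_metadata and _metadata summary files.
    "parquet.enable.summary-metadata": "false",
    # Skip the _SUCCESS marker file.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_clean_output_settings(spark):
    """Set each property on the given SparkSession and return the settings.

    `spark` is the SparkSession object (predefined in Databricks notebooks).
    This helper is hypothetical, not part of any library.
    """
    for key, value in CLEAN_OUTPUT_SETTINGS.items():
        spark.conf.set(key, value)
    return CLEAN_OUTPUT_SETTINGS
```

In a notebook you would call `apply_clean_output_settings(spark)` once, then run the write as usual. Note that with DBIO turned off you also give up its transactional guarantees for that session.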

For more details, refer "Transactional Writes to Cloud Storage with DBIO" and "Stop Azure Databricks auto creating files" and "How do I prevent _success and _committed files in my write output?".

Hope this helps. Do let us know if you have any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Rakesh Kumar 45 Reputation points

2023-12-01T06:30:20.1233333+00:00

Thanks @PRADEEPCHEEKATLA, it works. Now I can see only one file ("part-00000-ef2cc21b-db94-4f09-8f09-be02c6510150-c000.csv"). When I partition the file, it should be named Table1.csv at the destination (ADLS). Can you help with how we can rename the file at the time of partitioning?
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-12-01T06:35:02.4533333+00:00

@Rakesh Kumar - Glad to know it helped. You could rename the file after it is created, as discussed in this SO thread: https://stackoverflow.com/questions/54101135/how-do-i-rename-the-file-that-was-saved-on-a-datalake-in-azure

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.
Rakesh Kumar 45 Reputation points

2023-12-01T07:25:29.8066667+00:00
@PRADEEPCHEEKATLA Can we rename the file at creation time, when partitioning? In the solution you provided, they partition the data and then move it to another location.

But I want to rename at the time of partitioning; I don't want to create another folder.
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-12-05T04:06:36.31+00:00

@Rakesh Kumar - Unfortunately, you cannot change the file name directly, because parquet files are generated as part-0000* files by default.

You can rename the files once they are generated, as shown above. This is easily achieved using dbutils.fs.mv(old_name, new_name), replacing the paths with those of the part-00000 files.
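A minimal sketch of that rename, assuming the output folder holds exactly one part file. The helper name `rename_part_file` and its parameters are hypothetical; it relies only on `dbutils.fs.ls` and `dbutils.fs.mv`, and renames the file inside the same folder rather than moving it elsewhere.

```python
def rename_part_file(dbutils, output_dir, new_name):
    """Rename the single part-*.csv file in output_dir to new_name, in place.

    `dbutils` is the Databricks utilities object available in notebooks.
    Returns the full path of the renamed file.
    """
    folder = output_dir.rstrip("/")

    # dbutils.fs.ls returns entries that expose .path and .name
    part_files = [
        entry.path
        for entry in dbutils.fs.ls(folder)
        if entry.name.startswith("part-") and entry.name.endswith(".csv")
    ]

    # Refuse to guess when the folder holds zero or several part files.
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {folder}, found {len(part_files)}"
        )

    target = f"{folder}/{new_name}"
    dbutils.fs.mv(part_files[0], target)
    return target
```

For example, `rename_part_file(dbutils, output_dir, "Table1.csv")`, where `output_dir` is the folder you passed to the write. The file stays inside that folder, so no second location is needed.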

You may check out the video which explains the same: Rename spark generated part files in data lake.

Hope this helps. Do let us know if you have any further queries.
Rakesh Kumar 20 Reputation points

2023-12-05T05:13:23.93+00:00

@PRADEEPCHEEKATLA
My end result is in CSV format, not in parquet format.
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-12-05T05:30:35.5433333+00:00

@Rakesh Kumar - Apologies for the confusion. This applies irrespective of the file format.
Spark uses Hadoop File Format, which requires data to be partitioned; that's why you have part- files. You can easily change the file name after processing, as described above: use dbutils.fs.mv(old_name, new_name), replacing the paths with those of the part-00000 files.

Hope this helps. Do let us know if you have any further queries.
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-12-13T02:53:12.28+00:00

@Rakesh Kumar - Following up to see if the above answer was helpful. If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.
