Hi pankaj mahra,
Welcome to Microsoft Q&A platform and thanks for posting your query here.
As per my understanding, you are looking for suggestions on how to partition the files for optimal usage when you do not have a suitable partition column. Please let me know if that is not the case.
Delta Lake supports partitioning data by one or more columns. Partitioning can improve query performance by reducing the amount of data that needs to be scanned, and it also helps organize data into more manageable chunks. However, since you mentioned that you do not have a column to partition the data on, you can instead control the number of output files with the 'repartition' or 'coalesce' function.
You can use the coalesce function to reduce the number of files: it merges the existing partitions down to a fixed, smaller number without triggering a full shuffle (note that coalesce can only decrease the partition count, while repartition can increase or decrease it but performs a full shuffle). For example:
df.coalesce(10).write.format("delta").mode("overwrite").save("/delta/path")
Check the below resources for more details:
Spark - Repartition Or Coalesce
How to Write Dataframe as single file with specific name in PySpark
I hope this helps. If it does, kindly accept the answer by clicking on the Accept answer button.