Do I have to avoid multi-column partitioning in Pyspark?

eugenia apostolopoulou


I want to create (via code in an Azure Synapse notebook) a folder hierarchy based on the year, month, and day columns of my dataframe, as depicted in the attached screenshot (folder-hierarchy.png).
I read that the PySpark method partitionBy() is not recommended for use with multiple columns, so it's not best practice to create such a taxonomy. Why is that?
Thanks in advance!

Azure Synapse Analytics

1 answer

  1. HimanshuSinha-msft (Microsoft Employee)

    Hello @eugenia apostolopoulou ,
    Thanks for the question and using MS Q&A platform.
    As we understand it, the ask here is how to partition in Spark; please do let us know if that is not accurate.

    I read that the Pyspark method PartitionBy() is not recommended

    It would be great if you could point me to that document. I do know that partitioning too little or too much can affect performance.
    Looking at the folder structure, I think you can go with the code below.

    from pyspark.sql.functions import to_date

    jdbc_df1 = jdbc_df.withColumn("NewDate", to_date("DateOpened"))

    # The output path is illustrative; point it at your own storage location.
    jdbc_df1.write.option("header", True) \
        .partitionBy("NewDate") \
        .mode("overwrite") \
        .csv("/mnt/output/by-date")
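    For the year/month/day hierarchy the question asks about, partitionBy() does accept multiple columns and simply nests the folders in the order given. A minimal runnable sketch follows; the sample data, column names, and output path are illustrative, not taken from the thread:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, year, month, dayofmonth

    spark = SparkSession.builder.master("local[1]").appName("partition-demo").getOrCreate()

    # Sample frame with a string date column (names are illustrative)
    df = spark.createDataFrame(
        [("2021-09-01", 10), ("2021-09-02", 20), ("2021-10-05", 30)],
        ["DateOpened", "Amount"],
    )

    # Derive year/month/day columns from the date
    df = (df.withColumn("Date", to_date("DateOpened"))
            .withColumn("year", year("Date"))
            .withColumn("month", month("Date"))
            .withColumn("day", dayofmonth("Date")))

    # partitionBy with several columns nests the output folders in the
    # order given: .../year=2021/month=9/day=1/part-*.csv
    (df.write.option("header", True)
       .partitionBy("year", "month", "day")
       .mode("overwrite")
       .csv("/tmp/partition_demo"))
    ```

    Note that each distinct (year, month, day) combination produces its own directory and at least one file, so partitioning on very fine-grained or high-cardinality columns can create many small files; that overhead, not the column count itself, is the usual reason for caution.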

    Please do let me know if you have any queries.
