Hi Vineet S,
Thanks for reaching out to Microsoft Q&A.
Types of partitioning:
- DataFrame/RDD Level Partitioning: This type of partitioning distributes data across cluster nodes for parallel processing (see the first sketch after this list).
- Table Level Partitioning: Here, the goal is to organize data within storage systems to optimize query performance.
- When you create a table in Databricks, you can define the set of partitioning columns using the PARTITIONED BY clause (see the second sketch after this list).
- As you insert or manipulate rows in the table, Databricks automatically dispatches rows into the appropriate partitions based on the specified partitioning columns. See https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-partition
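
Here is a minimal PySpark sketch of DataFrame-level partitioning; the column name `region`, the sample rows, and the partition count of 8 are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; "region" is an assumed column name.
df = spark.createDataFrame(
    [("east", 10), ("west", 20), ("east", 30)],
    ["region", "amount"],
)

# Redistribute rows across the cluster for parallel processing.
df_fixed = df.repartition(8)          # shuffle into a fixed number of partitions
df_by_col = df.repartition("region")  # hash-partition by column value

print(df_fixed.rdd.getNumPartitions())  # -> 8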
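
And a sketch of table-level partitioning using the PARTITIONED BY clause, issued through spark.sql; the table and column names are placeholders:

```python
# Create a Delta table partitioned by a date column (all names are placeholders).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id   BIGINT,
        amount     DOUBLE,
        order_date DATE
    )
    PARTITIONED BY (order_date)
""")

# Databricks routes each inserted row to the matching partition automatically.
spark.sql("INSERT INTO sales VALUES (1, 99.50, DATE'2024-01-15')")
```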
Recommendations for partitioning:
- Most tables with less than 1 TB of data do not require partitioning; Delta Lake's built-in optimizations, such as data skipping, already handle tables of that size well (see the size-check sketch after this list).
- Databricks recommends that each partition contains at least 1 GB of data. Tables with fewer, larger partitions tend to perform better than those with many smaller partitions.
- Databricks automatically clusters data in unpartitioned tables by ingestion time (available in Databricks Runtime 11.2 and above). This provides query benefits similar to datetime-based partitioning without manual tuning.
- While Delta Lake uses Parquet as its underlying file format, partitioning strategies differ: the Hive-style partitioning that Apache Spark applies when saving Parquet data does not carry over directly to Delta tables. Always interact with Delta Lake data using officially supported clients and APIs.
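
Before deciding to partition, you can check a Delta table's current size. A minimal sketch, assuming the `sales` table from the earlier example:

```python
# DESCRIBE DETAIL returns Delta table metadata, including total size in bytes.
detail = spark.sql("DESCRIBE DETAIL sales").collect()[0]

size_tb = detail["sizeInBytes"] / 1e12
if size_tb < 1:
    print(f"Table is only {size_tb:.4f} TB; partitioning is likely unnecessary.")
```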
When to partition tables on Databricks:
https://docs.databricks.com/en/tables/partitions.html
Please 'Upvote' (Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.