When to partition tables on Azure Databricks

Note

Databricks recommends using liquid clustering for all new Delta tables. See Use liquid clustering for Delta tables.

Liquid clusters are sometimes also referred to as “liquid partitions”. This article only refers to legacy Delta partitioning and not liquid clustering.
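
For new tables, liquid clustering is enabled at creation time with CLUSTER BY. A minimal sketch, using a hypothetical events table and clustering column:

    CREATE TABLE events (
      event_id BIGINT,
      event_time TIMESTAMP,
      payload STRING
    )
    CLUSTER BY (event_time);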

This article provides an overview of how you can partition tables on Azure Databricks and specific recommendations around when you should use partitioning for tables backed by Delta Lake. Because of built-in features and optimizations, most tables with less than 1 TB of data do not require partitions.

Azure Databricks uses Delta Lake for all tables by default. The following recommendations assume you are working with Delta Lake for all tables.

In Databricks Runtime 11.3 LTS and above, Azure Databricks automatically clusters data in unpartitioned tables by ingestion time. See Use ingestion time clustering.

Do small tables need to be partitioned?

Databricks recommends that you do not partition tables that contain less than a terabyte of data.

What is the minimum size for each partition in a table?

Databricks recommends all partitions contain at least a gigabyte of data. Tables with fewer, larger partitions tend to outperform tables with many smaller partitions.

Use ingestion time clustering

When you use Delta Lake on Databricks Runtime 11.3 LTS or above, unpartitioned tables you create automatically benefit from ingestion time clustering. Ingestion time clustering provides query benefits similar to partitioning strategies based on datetime fields, without any need to optimize or tune your data.

Note

To maintain ingestion time clustering when you perform a large number of modifications using UPDATE or MERGE statements on a table, Databricks recommends running OPTIMIZE with ZORDER BY using a column that matches the ingestion order. For instance, this could be a column containing an event timestamp or a creation date.
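
For example, a minimal sketch of this maintenance step, assuming a hypothetical events table whose event_time column matches the ingestion order:

    -- Recluster files so data remains ordered by ingestion time.
    OPTIMIZE events
    ZORDER BY (event_time);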

Do Delta Lake and Parquet share partitioning strategies?

Delta Lake uses Parquet as the primary format for storing data, and some Delta tables with partitions specified demonstrate organization similar to Parquet tables stored with Apache Spark. Apache Spark uses Hive-style partitioning when saving data in Parquet format. Hive-style partitioning is not part of the Delta Lake protocol, and workloads should not rely on this partitioning strategy to interact with Delta tables.
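
As an illustration, a partitioned Delta table such as the hypothetical sales table below is laid out in Hive-style directories (for example, .../order_date=2024-01-01/), but workloads should query the table itself rather than rely on that directory layout:

    CREATE TABLE sales (
      order_id BIGINT,
      customer_id BIGINT,
      amount DECIMAL(10, 2),
      order_date DATE
    )
    PARTITIONED BY (order_date);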

Many Delta Lake features break assumptions about data layout that might have been transferred from Parquet, Hive, or even earlier Delta Lake protocol versions. You should always interact with data stored in Delta Lake using officially supported clients and APIs.

How are Delta Lake partitions different from partitions in other data lakes?

While Azure Databricks and Delta Lake build upon open source technologies like Apache Spark, Parquet, Hive, and Hadoop, partitioning motivations and strategies useful in these technologies do not generally hold true for Azure Databricks. If you do choose to partition your table, consider the following facts before choosing a strategy:

  • Transactions are not defined by partition boundaries. Delta Lake ensures ACID through transaction logs, so you do not need to separate a batch of data by a partition to ensure atomic discovery.
  • Azure Databricks compute clusters do not have data locality tied to physical media. Data ingested into the lakehouse is stored in cloud object storage. While data is cached to local disk storage during data processing, Azure Databricks uses file-based statistics to identify the minimal amount of data for parallel loading.

How do Z-order and partitions work together?

You can use Z-order indexes alongside partitions to speed up queries on large datasets.

Note

Most tables can rely on ingestion time clustering, so you don't need to worry about Z-order or partition tuning.

The following rules are important to keep in mind while planning a query optimization strategy based on partition boundaries and Z-order:

  • Z-order works in tandem with the OPTIMIZE command. You cannot combine files across partition boundaries, and so Z-order clustering can only occur within a partition. For unpartitioned tables, files can be combined across the entire table.
  • Partitioning works well only for low or known cardinality fields (for example, date fields or physical locations), but not for fields with high cardinality such as timestamps. Z-order works for all fields, including high cardinality fields and fields that may grow infinitely (for example, timestamps or the customer ID in a transactions or orders table).
  • You cannot Z-order on fields used for partitioning; on a partitioned table, Z-order by a different column, as the sketch after this list shows.
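
A hedged sketch of both techniques together, reusing the hypothetical sales table partitioned by order_date:

    -- Z-order clustering occurs within each order_date partition;
    -- customer_id is a high-cardinality, non-partition column.
    OPTIMIZE sales
    ZORDER BY (customer_id);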

If partitions are so bad, why do some Azure Databricks features use them?

Partitions can be beneficial, especially for very large tables. Many performance enhancements around partitioning focus on very large tables (hundreds of terabytes or greater).

Many customers migrate to Delta Lake from Parquet-based data lakes. The CONVERT TO DELTA statement allows you to convert an existing Parquet-based table to a Delta table without rewriting existing data. As such, many customers have large tables that inherit previous partitioning strategies. Some optimizations developed by Databricks seek to leverage these partitions when possible, mitigating some potential downsides for partitioning strategies not optimized for Delta Lake.
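
A minimal sketch of such a conversion, assuming a hypothetical Parquet table at an example path, partitioned by year and month:

    -- Converts in place; existing Parquet data files are not rewritten.
    CONVERT TO DELTA parquet.`abfss://container@account.dfs.core.windows.net/tables/sales`
      PARTITIONED BY (year INT, month INT);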

Delta Lake and Apache Spark are open-source technologies. While Databricks continues to introduce features that reduce reliance on partitioning, the open source community might continue to build new features that add complexity.

Is it possible to outperform Azure Databricks built-in optimizations with custom partitioning?

Some experienced users of Apache Spark and Delta Lake might be able to design and implement a pattern that provides better performance than ingestion time clustering. Implementing a bad partitioning strategy can have very negative repercussions on downstream performance and might require a full rewrite of data to fix. Databricks recommends that most users use default settings to avoid introducing expensive inefficiencies.