Applies to: Databricks SQL, Databricks Runtime
This page describes the OPTIMIZE command, which optimizes the layout of Delta Lake data. You can optimize a subset of data or collocate data by column. If you don't specify collocation and the table doesn't use liquid clustering, Delta Lake performs bin-packing optimization.
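Which of these behaviors applies depends on the table's layout. One way to check before optimizing is DESCRIBE DETAIL, whose output includes the table's clustering and partition columns. A minimal sketch, assuming an existing Delta table named events:

-- Inspect the table layout before running OPTIMIZE.
-- clusteringColumns is non-empty when liquid clustering is enabled;
-- partitionColumns lists Hive-style partition columns.
DESCRIBE DETAIL events;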
Syntax
OPTIMIZE table_name [FULL] [WHERE predicate]
[ZORDER BY (col_name1 [, ...] ) ]
Note
Bin-packing optimization is idempotent: if you run it twice on the same data set, the second run has no effect. It produces evenly-balanced data files with respect to their size on disk, but not necessarily the number of tuples per file. The two measures are most often correlated.
Z-Ordering is not idempotent, but operates incrementally. The time Z-Ordering takes is not guaranteed to decrease over multiple runs. However, if no new data was added to a partition that was just Z-Ordered, running Z-Ordering again on that partition has no effect. Z-Ordering produces evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but skew in optimize task times can occur when they diverge.
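The idempotency of bin-packing can be observed directly by running the command twice. A sketch, assuming a Delta table named events with no writes between the two runs:

-- First run compacts small files into larger, evenly sized files.
OPTIMIZE events;
-- Second run on the same, unchanged data has no effect: the returned
-- metrics report zero files added and zero files removed.
OPTIMIZE events;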
Note
When using Databricks Runtime, to control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize. The default value is 1073741824 (1 GB). Specifying 104857600 sets the file size to 100 MB.
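For example, to target roughly 100 MB output files for the current Databricks Runtime session before compacting, set the configuration named above with the session-level SET command (a sketch, assuming a Delta table named events):

-- Lower the target output file size from the 1 GB default to 100 MB
SET spark.databricks.delta.optimize.maxFileSize = 104857600;
OPTIMIZE events;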
Parameters
- table_name

  Identifies an existing Delta table. The name must not include a temporal specification or options specification.

- FULL

  Applies to: Databricks Runtime 16.0 and above

  Rewrites all data files in the table. Use OPTIMIZE table_name FULL to:

  - Optimize the whole table, including data that was previously clustered (for tables using liquid clustering).
  - Recompress existing data files when you change the table's compression codec using the delta.parquet.compression.codec property.

  To recompress existing data after changing the compression codec, run OPTIMIZE table_name FULL:

  -- Change compression codec
  ALTER TABLE table_name SET TBLPROPERTIES ('delta.parquet.compression.codec' = 'ZSTD');
  -- Recompress all existing data files
  OPTIMIZE table_name FULL;

  To optimize a subset of files in a table with liquid clustering enabled, combine FULL with a WHERE predicate (Databricks Runtime 18.1 and above). Only simple range predicates on a single clustering column are supported. A file is included if any part of its range overlaps with the predicate. For example, OPTIMIZE events FULL WHERE date > 15 includes a file with the range (col_min = 10, col_max = 20) because the file's range overlaps with the predicate.

- WHERE

  Optimizes the subset of rows matching a partition or clustering predicate. Only filters on partition or clustering key attributes are supported.

  For tables that use liquid clustering, use OPTIMIZE table_name FULL WHERE predicate (Databricks Runtime 18.1 and above) instead.

- ZORDER BY

  Collocates column information in the same set of files. Delta Lake data-skipping algorithms use this co-locality to reduce the amount of data that needs to be read. You can specify multiple columns as a comma-separated list, but the effectiveness of co-locality decreases with each additional column.

  You can't use this clause on tables that use liquid clustering.
Examples
> OPTIMIZE events;
> OPTIMIZE events FULL;
> -- Partitioned table
> OPTIMIZE events WHERE date >= '2017-01-01';
> -- Table with liquid clustering enabled (DBR 18.1 and above)
> OPTIMIZE events FULL WHERE date >= '2025-01-01';
> OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType);
For more information, see Optimize data file layout.