OPTIMIZE

Applies to: Databricks SQL and Databricks Runtime

This page describes the OPTIMIZE command, which optimizes the layout of Delta Lake data. You can optimize a subset of data or collocate data by column. If you don't specify collocation and the table doesn't use liquid clustering, Delta Lake performs bin-packing optimization.

Syntax

OPTIMIZE table_name [FULL] [WHERE predicate]
  [ZORDER BY (col_name1 [, ...] ) ]

Note

Bin-packing optimization is idempotent: if you run it twice on the same data set, the second run has no effect. It produces evenly-balanced data files with respect to their size on disk, but not necessarily the number of tuples per file. The two measures are most often correlated.

Z-Ordering is not idempotent, but operates incrementally. The time Z-Ordering takes is not guaranteed to decrease over multiple runs. However, if no new data was added to a partition that was just Z-Ordered, running Z-Ordering again on that partition has no effect. Z-Ordering produces evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but skew in optimize task times can occur when they diverge.

Note

When using Databricks Runtime, to control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize. The default value is 1073741824 (1 GB). Specifying 104857600 sets the file size to 100 MB.
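For example, to cap OPTIMIZE output files at 100 MB for the current session, set the configuration before running the command (a sketch using the property and value from the note above; `events` is a placeholder table name):

    -- Cap OPTIMIZE output files at 100 MB (default is 1073741824, i.e. 1 GB)
    SET spark.databricks.delta.optimize.maxFileSize = 104857600;

    OPTIMIZE events;

A smaller target file size produces more, smaller files; the 1 GB default is generally a good balance between file count and scan efficiency.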

Parameters

  • table_name

    Identifies an existing Delta table. The name must not include a temporal specification or options specification.

  • FULL

    Applies to: Databricks Runtime 16.0 and above

    Rewrites all data files in the table.

    To recompress existing data after changing the compression codec, run OPTIMIZE table_name FULL:

    -- Change compression codec
    ALTER TABLE table_name SET TBLPROPERTIES ('delta.parquet.compression.codec' = 'ZSTD');
    
    -- Recompress all existing data files
    OPTIMIZE table_name FULL;
    

    To optimize a subset of files in a table with liquid clustering enabled, combine FULL with a WHERE predicate (Databricks Runtime 18.1 and above). Only simple range predicates on a single clustering column are supported. A file is included if any part of its range overlaps with the predicate. For example, OPTIMIZE events FULL WHERE date > 15 includes a file with range (col_min = 10, col_max = 20) because the file's range overlaps with the predicate.

  • WHERE

    Optimizes the subset of rows matching a partition / clustering predicate. Only filters on partition / clustering key attributes are supported.

    For tables that use liquid clustering, use OPTIMIZE table_name FULL WHERE predicate (Databricks Runtime 18.1 and above) instead.

  • ZORDER BY

    Collocates column information in the same set of files. Delta Lake data-skipping algorithms use co-locality to reduce the amount of data that needs to be read. You can specify multiple columns as a comma-separated list, but the effectiveness of co-locality decreases with each additional column.

    You can't use this clause on tables that use liquid clustering.

Examples

> OPTIMIZE events;

> OPTIMIZE events FULL;

> -- Partitioned table
> OPTIMIZE events WHERE date >= '2017-01-01';

> -- Table with liquid clustering enabled (DBR 18.1 and above)
> OPTIMIZE events FULL WHERE date >= '2025-01-01';

> OPTIMIZE events
    WHERE date >= current_timestamp() - INTERVAL 1 day
    ZORDER BY (eventType);

For more information, see Optimize data file layout.