Collation support for Delta Lake

You can specify collations on string fields in Delta tables to control how string comparisons and sorting behave, such as enabling case-insensitive matching or locale-aware ordering. This requires Databricks Runtime 16.4 LTS and above.

For a full explanation of collation types, naming conventions, and precedence rules, see Collation.

By default, Delta Lake sets the collation for string fields to UTF8_BINARY.
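The practical difference between the two collations can be illustrated outside SQL. The following is a plain Python sketch (not a Delta Lake or Spark API) approximating the comparison semantics: UTF8_BINARY compares strings by raw code points and is case-sensitive, while UTF8_LCASE compares case-insensitively.

```python
# Illustrative only: approximates collation comparison semantics in plain
# Python; these helper functions are hypothetical, not Delta Lake APIs.

def utf8_binary_eq(a: str, b: str) -> bool:
    # UTF8_BINARY: code-point-for-code-point comparison, case-sensitive.
    return a == b

def utf8_lcase_eq(a: str, b: str) -> bool:
    # UTF8_LCASE: case-insensitive comparison (approximated with casefold).
    return a.casefold() == b.casefold()

print(utf8_binary_eq("Alice", "alice"))  # False: bytes differ
print(utf8_lcase_eq("Alice", "alice"))   # True: equal ignoring case
```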

Important

Enabling collation adds the collations writer table feature to your Delta table, which affects compatibility with external readers and other platform features. Review the Limitations section before enabling collation on production tables.

Create a table with collation

You can specify collation at the column level when creating a new table. Collation can be applied to top-level string columns and string fields inside nested types:

CREATE TABLE catalog.schema.my_table (
  id BIGINT,
  name STRING COLLATE UTF8_LCASE,
  metadata STRUCT<label: STRING COLLATE UNICODE>,
  tags ARRAY<STRING COLLATE UTF8_LCASE>,
  properties MAP<STRING, STRING COLLATE UTF8_LCASE>
) USING delta

Note

MAP keys can't use collation. Only MAP values support collated strings.
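Once a column is declared with a collation, equality predicates and sorting on it follow that collation automatically. For example, a sketch against the hypothetical `my_table` defined above:

```sql
-- name is STRING COLLATE UTF8_LCASE, so this predicate matches
-- 'Alice', 'ALICE', and 'alice' alike.
SELECT id, name
FROM catalog.schema.my_table
WHERE name = 'alice';

-- ORDER BY also follows the column's collation (case-insensitive here).
SELECT name FROM catalog.schema.my_table ORDER BY name;
```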

Alter collation on an existing column

You can change the collation of an existing column using ALTER TABLE:

-- Set a column to case-insensitive collation
ALTER TABLE my_table ALTER COLUMN name TYPE STRING COLLATE UTF8_LCASE

-- Revert a column to the default binary collation
ALTER TABLE my_table ALTER COLUMN name TYPE STRING COLLATE UTF8_BINARY

Update statistics and data layout after altering collation

Changing a column's collation doesn't rewrite existing data or update statistics. Queries return correct results under the new collation immediately, but file skipping and clustering might be less effective until you take the following steps:

  1. Update file-skipping statistics for the column:
   ANALYZE TABLE my_table COMPUTE DELTA STATISTICS
  2. If your table uses liquid clustering, rewrite the clustering layout:
   OPTIMIZE FULL my_table
  3. If your table uses ZORDER, disable incremental optimization and rewrite all files:
   SET spark.databricks.optimize.incremental = false;
   OPTIMIZE my_table ZORDER BY zorder_column;

Skipping these steps won't cause incorrect results, but might reduce query performance on historical data until the next full rewrite.

Azure Databricks always respects collation in query results.

Disable collation for a table

To remove the collation table feature, first set every collated column back to UTF8_BINARY:

-- Repeat for each collated column
ALTER TABLE my_table ALTER COLUMN name TYPE STRING COLLATE UTF8_BINARY

Then drop the table feature:

ALTER TABLE my_table DROP FEATURE collations

See Drop a Delta Lake table feature and downgrade table protocol for more details.

Schema evolution and collation

When schema evolution adds or merges columns that have collation specified, the following rules apply:

  • If a source column already exists in the target table, the target table's collation for that column is preserved. The source column's collation is ignored.
  • If a source column is new and has a collation specified, the target table adopts that collation for the new column.
  • If the target table doesn't already have the collations table feature enabled, adding a collated column enables it automatically.
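These rules can be seen in a MERGE with automatic schema evolution. The following is a hedged sketch assuming hypothetical `target` and `source` tables, where `target.name` is already declared `STRING COLLATE UTF8_LCASE` and `source` adds a new collated column:

```sql
-- Enable automatic schema evolution for MERGE.
SET spark.databricks.delta.schema.autoMerge.enabled = true;

-- Assumes: target.name is STRING COLLATE UTF8_LCASE;
-- source has a new column region STRING COLLATE UNICODE.
MERGE INTO catalog.schema.target t
USING catalog.schema.source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Result: t.name keeps UTF8_LCASE (the source column's collation is
-- ignored), and the new column region is added with collation UNICODE.
```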

Limitations

The following limitations apply to Delta tables with collation enabled:

Compatibility and interoperability:

  • External readers that don't recognize the collations table feature fall back to UTF8_BINARY, which might produce incorrect sort orders or comparisons
  • Delta Sharing is not supported for tables with collation
  • UniForm is not supported for tables with collation
  • Delta tables created externally with a collation not recognized by Databricks Runtime throw an exception when queried
  • OSS Delta Lake APIs for Scala or Python don't support collation. Use Spark SQL or DataFrame APIs

Query and feature restrictions:

  • Collated columns can't be used in CHECK constraints
  • Collated columns can't be referenced in generated column expressions
  • Collated columns can't be used with bloom filter indexes
  • Collated columns can't be referenced in Structured Streaming stateful queries (aggregations, joins, deduplication)
  • A MAP key can't be a collated string. Only MAP values support collation