इसके माध्यम से साझा किया गया


Collation

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 16.1 and above

A collation is a set of rules that determines how string comparisons are performed. Collations support case-insensitive, accent-insensitive, and trailing-space-insensitive comparisons, as well as language-aware string ordering.

Strings in Azure Databricks are represented as UTF-8 encoded Unicode characters. By default, Azure Databricks compares strings by their binary UTF-8 representation, known as UTF8_BINARY collation. UTF8_BINARY comparisons are fast and appropriate in many cases, but might not be suitable for applications that require language-aware sorting or comparisons.

Beyond binary comparisons, a common use case is case-insensitive matching. The UTF8_LCASE collation is designed for this purpose. It converts strings to lowercase before comparing them using UTF8_BINARY.

For language-aware comparisons, Azure Databricks uses the following technologies:

These technologies are encapsulated in a set of named collations available for use in SQL statements.

Note

For limitations when using collations with Delta Lake tables, see Limitations.

Collation names

Azure Databricks provides named system collations to simplify identification. LDML specifications can be complex to read and use directly.

Syntax

{ UTF8_BINARY |
  UTF8_LCASE |
  { UNICODE | locale } [ _ modifier [...] ] }

locale
  language_code [ _ script_code ] [ _ country_code ]

modifier
  { CS | CI | AS | AI | RTRIM }
  • UTF8_BINARY

    A binary collation that compares strings byte by byte based on their UTF-8 representation. UTF8_BINARY is the default and most lightweight collation in Azure Databricks.

    In this collation: 'A' (x'65') < 'B' (x'66') < … < 'Z' (x'90'). However, 'Z' (x'90') < 'a' (x'97'), and 'A' (x'65') <> 'a' (x'97'). Characters such as 'Ä' (x'C384') are greater than both 'Z' and 'z'.

  • UTF8_LCASE

    A lightweight, case-insensitive collation that converts strings to lowercase before comparing them using UTF8_BINARY.

    UTF8_LCASE is the collation used for Identifiers in Azure Databricks.

    For example:

    ORDER BY col COLLATE UTF8_LCASE
    

    is equivalent to:

    ORDER BY LOWER(col) COLLATE UTF8_BINARY
    
  • UNICODE

    The ICU root locale, known in CLDR as the root locale (LDML specification: und-u). This collation applies a language-agnostic order that groups similar characters together. For example: 'a' < 'A' < 'Ä' < 'b'. This collation is case-sensitive and accent-sensitive by default.

  • locale

    A locale-aware collation based on CLDR tables. The locale is specified as a language code, an optional script code, and an optional country code. Locale values are case-insensitive.

  • modifier

    Controls case sensitivity, accent sensitivity, and trailing space behavior. Modifiers are case-insensitive and can be specified in any order.

    • CS: Case-sensitive. The default behavior.
    • CI: Case-insensitive.
    • AS: Accent-sensitive. The default behavior.
    • AI: Accent-insensitive.

    Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 16.2 and above

    • RTRIM: Trailing-space insensitive. Trims trailing spaces (u0020) before comparison.

    Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 16.2 and above

    You can specify RTRIM, at most one of CS or CI, and at most one of AS or AI.

When Azure Databricks processes a collation name, it normalizes the name by removing defaults. For example, SR_CYR_SRN_CS_AS normalizes to SR.

For a list of supported collations, see Supported collations.

Examples

-- Fully qualified collation names are supported; case doesn't matter.
system.builtin.unicode

-- All collations are system-defined and do not require qualification.
unicode

-- Two-letter language code for German collation.
DE

-- Two-letter language code and three-letter country code for French Canadian collation.
fr_CAN

-- Two-letter language code, four-letter script code, and three-letter country code
-- for Traditional Chinese in Macao.
zh_Hant_MAC

-- German collation with case-insensitive and accent-insensitive modifiers.
-- 'Ä', 'A', and 'a' are all considered equal.
de_CI_AI

-- Backticks are allowed but not required for built-in collations.
`UTF8_BINARY`

Compare UTF8_LCASE and UNICODE collations

Both UTF8_LCASE and UNICODE-based collations such as UNICODE_CI and UNICODE_CI_AI support case-insensitive comparisons, but they differ in how they handle accented characters and locale-specific rules.

Behavioral differences

UTF8_LCASE converts strings to lowercase and compares them byte by byte using UTF8_BINARY. It is lightweight and fast, but treats accented characters as distinct from their unaccented forms.

UNICODE-based collations use the ICU library and CLDR locale data. The CI and AI modifiers extend this to treat characters as equivalent even when they differ in case, accent, or both. These collations are more thorough but have higher computational cost.

Behavior UTF8_LCASE UNICODE_CI UNICODE_CI_AI
'A' = 'a' true true true
'Café' = 'café' true true true
'Cafe' = 'Café' false false true
'ß' = 'ss' (German sharp s) false false false
'resume' = 'résumé' false false true
Supports LIKE and RLIKE Yes No No
Supports locale-specific rules No Yes Yes

When to use each collation

  • Use UTF8_LCASE for fast case-insensitive comparisons when accented characters should remain distinct. This collation is well-suited for data warehouse migrations from systems that use simple case-insensitive string matching.

  • Use UNICODE_CI for case-insensitive comparisons that follow Unicode rules, particularly when data includes characters from multiple languages where simple lowercasing is insufficient.

  • Use UNICODE_CI_AI when both case and accent differences should be ignored. For example, if a search for "resume" should match "résumé". This is common in user-facing search and multilingual applications.

Examples

-- UTF8_LCASE lowercases then compares bytes.
-- Accented and unaccented characters are not equivalent.
> SELECT 'Cafe' = 'café' COLLATE UTF8_LCASE;
  false

> SELECT 'Café' = 'café' COLLATE UTF8_LCASE;
  true

-- UNICODE_CI is case-insensitive but accent-sensitive.
-- 'Café' equals 'café' (case differs) but not 'Cafe' (accent differs).
> SELECT 'Café' = 'cafe' COLLATE UNICODE_CI;
  false

> SELECT 'Café' = 'café' COLLATE UNICODE_CI;
  true

-- UNICODE_CI_AI is case-insensitive and accent-insensitive.
-- 'Café' matches 'cafe' because both case and accent differences are ignored.
> SELECT 'Cafe' = 'café' COLLATE UNICODE_CI_AI;
  true

> SELECT 'resume' = 'résumé' COLLATE UNICODE_CI_AI;
  true

-- UTF8_LCASE sorts by lowercase byte order.
-- UNICODE collations sort by linguistic similarity.
> SELECT col FROM VALUES('Banana'), ('apple'), ('Ångström'), ('äpfel') AS t(col)
    ORDER BY col COLLATE UTF8_LCASE;
  apple
  Banana
  Ångström
  äpfel

> SELECT col FROM VALUES('Banana'), ('apple'), ('Ångström'), ('äpfel') AS t(col)
    ORDER BY col COLLATE UNICODE_CI;
  apple
  Ångström
  äpfel
  Banana

Default collation

The default collation applies to STRING literals, parameter markers, functions without STRING parameters that produce strings, and column, field, or variable type definitions without a COLLATE clause.

The default collation is derived as follows:

  • For DDL statements such as ALTER TABLE, CREATE VIEW, CREATE TABLE, and CREATE FUNCTION:

    • The default collation is the default collation of the object being created or altered.
    • If no DEFAULT COLLATION clause is specified, the default collation is UTF8_BINARY.
  • For DML statements (UPDATE, DELETE FROM, INSERT, MERGE INTO) and Query, the default collation is UTF8_BINARY.

Collation precedence

Azure Databricks applies collation precedence rules to determine which collation to use for a given string. Four precedence levels are defined:

  1. Explicit

    The collation is explicitly assigned using collate expression.

    -- Force binary collation to check whether a VIN matches a Ferrari.
    vin COLLATE UTF8_BINARY LIKE 'ZFF%'
    
    -- Force German collation to order German first names.
    ORDER BY vorname COLLATE DE
    
  2. Implicit

    The collation is implicitly assigned by a Column name, Field name, Column alias, Variable name, or Parameter name reference, including the result of a subquery where the collation is not None.

    -- Use the collation of the column as defined.
    employee.name LIKE 'Mc%'
    
    -- Use the collation of the variable as defined.
    translate(session.tempvar, 'Z', ',')
    
  3. Default

    A STRING literal, named or unnamed Parameter markers, or a STRING produced by a function from another type.

    -- A literal string has the default collation.
    'Hello'
    
    -- :parm1 is a parameter marker using the session default collation.
    EXECUTE IMMEDIATE 'SELECT :parm1' USING 'Hello' AS parm1;
    
    -- ? is a parameter marker using the session default collation.
    EXECUTE IMMEDIATE 'SELECT ?' USING 'Hello';
    
    -- The result of casting a non-STRING to STRING uses the default collation.
    CAST(5 AS STRING)
    
    -- The date is converted to a string using the default collation.
    to_char(DATE'2016-04-08', 'y')
    
    -- session_user() returns a STRING with the default collation.
    session_user()
    

    The assigned collation is the default collation.

  4. None

    A STRING result of a function, operator, or set operation such as UNION that takes more than one STRING argument with different implicit collations.

    -- Concatenating two strings with different explicit collations results in no collation.
    SELECT fr || de AS freutsch FROM VALUES('Voulez vous ' COLLATE FR, 'Kartoffelsupp...' COLLATE DE) AS T(fr, de)
    
    -- A union of two strings with different explicit collations results in no collation.
    SELECT 'Voulez vous ' COLLATE FR UNION ALL SELECT 'Kartoffelsupp...' COLLATE DE
    

Collation derivation

When deriving the collation for a STRING result, precedence rules apply as follows:

  1. If the expression matches one of the definitions above, the collation and precedence are as defined.

  2. If the expression is a function or operator with a single STRING parameter returning a STRING, the collation and precedence are those of the STRING parameter.

  3. If the expression is a function or operator with two or more STRING parameters:

    1. If all parameters have the same collation and precedence, the result uses that collation and precedence.

    2. If parameters have different collations or precedence, let C1 and C2 be distinct collations and D be the default collation. The result is determined by the following table:

Collation and Precedence C1 Explicit C1 Implicit D Default None
C2 Explicit Error C2 Explicit C2 Explicit C2 Explicit
C2 Implicit Explicit C1 None C2 Implicit None
D Default C1 Explicit C1 Implicit D Default None
None C1 Explicit None None None

Examples

> SELECT 'hello' = 'hello   ' COLLATE UNICODE_RTRIM;
  true

> CREATE TABLE words(fr STRING COLLATE FR, de STRING COLLATE DE, en STRING COLLATE EN);
> INSERT INTO words VALUES ('Salut', 'Hallo', 'Hello');

-- A literal string has the default collation.
> SELECT collation('Ciao');
  UTF8_BINARY

-- A function producing a STRING has the default collation.
> SELECT collation(user());
  UTF8_BINARY

-- A function that modifies a STRING passes the collation through.
> SELECT collation(upper('Ciao'));
  UTF8_BINARY

-- Implicit collation (French) wins over default collation.
> SELECT collation(fr || 'Ciao') FROM words;
  FR

-- Explicit collation (French) wins over implicit collation (German).
> SELECT collation('Salut' COLLATE FR || de) FROM words;
  FR

-- Implicit German collides with implicit French; the result has no collation.
> SELECT collation(de || fr) FROM words;
  null

-- Explicit collation (French) wins over default collation (Italian).
> SELECT collation('Salut' COLLATE FR || 'Ciao');
  FR

-- Explicit collation (French) collides with explicit collation (German).
> SELECT collation('Salut' COLLATE FR || 'Hallo' COLLATE DE);
  COLLATION_MISMATCH.EXPLICIT

-- Explicit collation wins over no collation.
> SELECT collation('Ciao' COLLATE IT || (fr || de)) FROM words;
  IT

-- Implicit collation (English) does not win over None.
> SELECT collation(en || (fr || de)) FROM words;
  null

-- Explicit collation (English) wins over implicit collation anywhere in the expression.
> SELECT collation((fr || ltrim('H' COLLATE EN, fr)) || fr) FROM words;
  EN

Limitations

Many pattern-matching and regex functions support only UTF8_BINARY and UTF8_LCASE collations. Azure Databricks raises an error when these functions are used with other collations such as UNICODE or locale-specific collations like DE or FR_CI_AI.

Affected functions include LIKE, ILIKE, RLIKE, and the regexp_* function family. For string matching on columns with other collations, use contains function, startswith function, or endswith function.