नोट
इस पेज तक पहुँच के लिए प्रमाणन की आवश्यकता होती है. आप साइन इन करने या निर्देशिकाओं को बदलने का प्रयास कर सकते हैं.
इस पेज तक पहुँच के लिए प्रमाणन की आवश्यकता होती है. आप निर्देशिकाओं को बदलने का प्रयास कर सकते हैं.
Applies to:
Databricks SQL
Databricks Runtime 16.1 and above
A collation is a set of rules that determines how string comparisons are performed. Collations support case-insensitive, accent-insensitive, and trailing-space-insensitive comparisons, as well as language-aware string ordering.
Strings in Azure Databricks are represented as UTF-8 encoded Unicode characters. By default, Azure Databricks compares strings by their binary UTF-8 representation, known as UTF8_BINARY collation. UTF8_BINARY comparisons are fast and appropriate in many cases, but might not be suitable for applications that require language-aware sorting or comparisons.
Beyond binary comparisons, a common use case is case-insensitive matching. The UTF8_LCASE collation is designed for this purpose. It converts strings to lowercase before comparing them using UTF8_BINARY.
For language-aware comparisons, Azure Databricks uses the following technologies:
- International Components for Unicode (ICU) library to compute collation.
- Common Locale Data Repository (CLDR) tables for locale-aware collation.
- Unicode Locale Data Markup Language (LDML) to encode collations internally.
These technologies are encapsulated in a set of named collations available for use in SQL statements.
Note
For limitations when using collations with Delta Lake tables, see Limitations.
Collation names
Azure Databricks provides named system collations to simplify identification. LDML specifications can be complex to read and use directly.
Syntax
{ UTF8_BINARY |
UTF8_LCASE |
{ UNICODE | locale } [ _ modifier [...] ] }
locale
language_code [ _ script_code ] [ _ country_code ]
modifier
{ CS | CI | AS | AI | RTRIM }
UTF8_BINARYA binary collation that compares strings byte by byte based on their UTF-8 representation.
UTF8_BINARYis the default and most lightweight collation in Azure Databricks.In this collation:
'A'(x'65') <'B'(x'66') < … <'Z'(x'90'). However,'Z'(x'90') <'a'(x'97'), and'A'(x'65') <>'a'(x'97'). Characters such as'Ä'(x'C384') are greater than both'Z'and'z'.UTF8_LCASEA lightweight, case-insensitive collation that converts strings to lowercase before comparing them using
UTF8_BINARY.UTF8_LCASEis the collation used for Identifiers in Azure Databricks.For example:
ORDER BY col COLLATE UTF8_LCASEis equivalent to:
ORDER BY LOWER(col) COLLATE UTF8_BINARYUNICODEThe ICU root locale, known in CLDR as the
rootlocale (LDML specification:und-u). This collation applies a language-agnostic order that groups similar characters together. For example:'a'<'A'<'Ä'<'b'. This collation is case-sensitive and accent-sensitive by default.localeA locale-aware collation based on CLDR tables. The locale is specified as a language code, an optional script code, and an optional country code. Locale values are case-insensitive.
language_code: A two-letter ISO 639-1 language code.script_code: A four-letter ISO 15924 script code.country_code: A three-letter ISO 3166-1 alpha-3 country code.
modifierControls case sensitivity, accent sensitivity, and trailing space behavior. Modifiers are case-insensitive and can be specified in any order.
CS: Case-sensitive. The default behavior.CI: Case-insensitive.AS: Accent-sensitive. The default behavior.AI: Accent-insensitive.
Applies to:
Databricks SQL
Databricks Runtime 16.2 and aboveRTRIM: Trailing-space insensitive. Trims trailing spaces (u0020) before comparison.
Applies to:
Databricks SQL
Databricks Runtime 16.2 and aboveYou can specify
RTRIM, at most one ofCSorCI, and at most one ofASorAI.
When Azure Databricks processes a collation name, it normalizes the name by removing defaults. For example, SR_CYR_SRN_CS_AS normalizes to SR.
For a list of supported collations, see Supported collations.
Examples
-- Fully qualified collation names are supported; case doesn't matter.
system.builtin.unicode
-- All collations are system-defined and do not require qualification.
unicode
-- Two-letter language code for German collation.
DE
-- Two-letter language code and three-letter country code for French Canadian collation.
fr_CAN
-- Two-letter language code, four-letter script code, and three-letter country code
-- for Traditional Chinese in Macao.
zh_Hant_MAC
-- German collation with case-insensitive and accent-insensitive modifiers.
-- 'Ä', 'A', and 'a' are all considered equal.
de_CI_AI
-- Backticks are allowed but not required for built-in collations.
`UTF8_BINARY`
Compare UTF8_LCASE and UNICODE collations
Both UTF8_LCASE and UNICODE-based collations such as UNICODE_CI and UNICODE_CI_AI support case-insensitive comparisons, but they differ in how they handle accented characters and locale-specific rules.
Behavioral differences
UTF8_LCASE converts strings to lowercase and compares them byte by byte using UTF8_BINARY. It is lightweight and fast, but treats accented characters as distinct from their unaccented forms.
UNICODE-based collations use the ICU library and CLDR locale data. The CI and AI modifiers extend this to treat characters as equivalent even when they differ in case, accent, or both. These collations are more thorough but have higher computational cost.
| Behavior | UTF8_LCASE |
UNICODE_CI |
UNICODE_CI_AI |
|---|---|---|---|
'A' = 'a' |
true |
true |
true |
'Café' = 'café' |
true |
true |
true |
'Cafe' = 'Café' |
false |
false |
true |
'ß' = 'ss' (German sharp s) |
false |
false |
false |
'resume' = 'résumé' |
false |
false |
true |
Supports LIKE and RLIKE |
Yes | No | No |
| Supports locale-specific rules | No | Yes | Yes |
When to use each collation
Use
UTF8_LCASEfor fast case-insensitive comparisons when accented characters should remain distinct. This collation is well-suited for data warehouse migrations from systems that use simple case-insensitive string matching.Use
UNICODE_CIfor case-insensitive comparisons that follow Unicode rules, particularly when data includes characters from multiple languages where simple lowercasing is insufficient.Use
UNICODE_CI_AIwhen both case and accent differences should be ignored. For example, if a search for"resume"should match"résumé". This is common in user-facing search and multilingual applications.
Examples
-- UTF8_LCASE lowercases then compares bytes.
-- Accented and unaccented characters are not equivalent.
> SELECT 'Cafe' = 'café' COLLATE UTF8_LCASE;
false
> SELECT 'Café' = 'café' COLLATE UTF8_LCASE;
true
-- UNICODE_CI is case-insensitive but accent-sensitive.
-- 'Café' equals 'café' (case differs) but not 'Cafe' (accent differs).
> SELECT 'Café' = 'cafe' COLLATE UNICODE_CI;
false
> SELECT 'Café' = 'café' COLLATE UNICODE_CI;
true
-- UNICODE_CI_AI is case-insensitive and accent-insensitive.
-- 'Café' matches 'cafe' because both case and accent differences are ignored.
> SELECT 'Cafe' = 'café' COLLATE UNICODE_CI_AI;
true
> SELECT 'resume' = 'résumé' COLLATE UNICODE_CI_AI;
true
-- UTF8_LCASE sorts by lowercase byte order.
-- UNICODE collations sort by linguistic similarity.
> SELECT col FROM VALUES('Banana'), ('apple'), ('Ångström'), ('äpfel') AS t(col)
ORDER BY col COLLATE UTF8_LCASE;
apple
Banana
Ångström
äpfel
> SELECT col FROM VALUES('Banana'), ('apple'), ('Ångström'), ('äpfel') AS t(col)
ORDER BY col COLLATE UNICODE_CI;
apple
Ångström
äpfel
Banana
Default collation
The default collation applies to STRING literals, parameter markers, functions without STRING parameters that produce strings, and column, field, or variable type definitions without a COLLATE clause.
The default collation is derived as follows:
For DDL statements such as
ALTER TABLE,CREATE VIEW,CREATE TABLE, andCREATE FUNCTION:- The default collation is the default collation of the object being created or altered.
- If no
DEFAULT COLLATIONclause is specified, the default collation isUTF8_BINARY.
For DML statements (
UPDATE,DELETE FROM,INSERT,MERGE INTO) and Query, the default collation isUTF8_BINARY.
Collation precedence
Azure Databricks applies collation precedence rules to determine which collation to use for a given string. Four precedence levels are defined:
Explicit
The collation is explicitly assigned using
collateexpression.-- Force binary collation to check whether a VIN matches a Ferrari. vin COLLATE UTF8_BINARY LIKE 'ZFF%' -- Force German collation to order German first names. ORDER BY vorname COLLATE DEImplicit
The collation is implicitly assigned by a Column name, Field name, Column alias, Variable name, or Parameter name reference, including the result of a subquery where the collation is not None.
-- Use the collation of the column as defined. employee.name LIKE 'Mc%' -- Use the collation of the variable as defined. translate(session.tempvar, 'Z', ',')Default
A
STRINGliteral, named or unnamed Parameter markers, or aSTRINGproduced by a function from another type.-- A literal string has the default collation. 'Hello' -- :parm1 is a parameter marker using the session default collation. EXECUTE IMMEDIATE 'SELECT :parm1' USING 'Hello' AS parm1; -- ? is a parameter marker using the session default collation. EXECUTE IMMEDIATE 'SELECT ?' USING 'Hello'; -- The result of casting a non-STRING to STRING uses the default collation. CAST(5 AS STRING) -- The date is converted to a string using the default collation. to_char(DATE'2016-04-08', 'y') -- session_user() returns a STRING with the default collation. session_user()The assigned collation is the default collation.
None
A
STRINGresult of a function, operator, or set operation such asUNIONthat takes more than oneSTRINGargument with different implicit collations.-- Concatenating two strings with different explicit collations results in no collation. SELECT fr || de AS freutsch FROM VALUES('Voulez vous ' COLLATE FR, 'Kartoffelsupp...' COLLATE DE) AS T(fr, de) -- A union of two strings with different explicit collations results in no collation. SELECT 'Voulez vous ' COLLATE FR UNION ALL SELECT 'Kartoffelsupp...' COLLATE DE
Collation derivation
When deriving the collation for a STRING result, precedence rules apply as follows:
If the expression matches one of the definitions above, the collation and precedence are as defined.
If the expression is a function or operator with a single
STRINGparameter returning aSTRING, the collation and precedence are those of theSTRINGparameter.If the expression is a function or operator with two or more
STRINGparameters:If all parameters have the same collation and precedence, the result uses that collation and precedence.
If parameters have different collations or precedence, let
C1andC2be distinct collations andDbe the default collation. The result is determined by the following table:
| Collation and Precedence | C1 Explicit | C1 Implicit | D Default | None |
|---|---|---|---|---|
| C2 Explicit | Error | C2 Explicit | C2 Explicit | C2 Explicit |
| C2 Implicit | Explicit C1 | None | C2 Implicit | None |
| D Default | C1 Explicit | C1 Implicit | D Default | None |
| None | C1 Explicit | None | None | None |
Examples
> SELECT 'hello' = 'hello ' COLLATE UNICODE_RTRIM;
true
> CREATE TABLE words(fr STRING COLLATE FR, de STRING COLLATE DE, en STRING COLLATE EN);
> INSERT INTO words VALUES ('Salut', 'Hallo', 'Hello');
-- A literal string has the default collation.
> SELECT collation('Ciao');
UTF8_BINARY
-- A function producing a STRING has the default collation.
> SELECT collation(user());
UTF8_BINARY
-- A function that modifies a STRING passes the collation through.
> SELECT collation(upper('Ciao'));
UTF8_BINARY
-- Implicit collation (French) wins over default collation.
> SELECT collation(fr || 'Ciao') FROM words;
FR
-- Explicit collation (French) wins over implicit collation (German).
> SELECT collation('Salut' COLLATE FR || de) FROM words;
FR
-- Implicit German collides with implicit French; the result has no collation.
> SELECT collation(de || fr) FROM words;
null
-- Explicit collation (French) wins over default collation (Italian).
> SELECT collation('Salut' COLLATE FR || 'Ciao');
FR
-- Explicit collation (French) collides with explicit collation (German).
> SELECT collation('Salut' COLLATE FR || 'Hallo' COLLATE DE);
COLLATION_MISMATCH.EXPLICIT
-- Explicit collation wins over no collation.
> SELECT collation('Ciao' COLLATE IT || (fr || de)) FROM words;
IT
-- Implicit collation (English) does not win over None.
> SELECT collation(en || (fr || de)) FROM words;
null
-- Explicit collation (English) wins over implicit collation anywhere in the expression.
> SELECT collation((fr || ltrim('H' COLLATE EN, fr)) || fr) FROM words;
EN
Limitations
Many pattern-matching and regex functions support only UTF8_BINARY and UTF8_LCASE collations. Azure Databricks raises an error when these functions are used with other collations such as UNICODE or locale-specific collations like DE or FR_CI_AI.
Affected functions include LIKE, ILIKE, RLIKE, and the regexp_* function family. For string matching on columns with other collations, use contains function, startswith function, or endswith function.