Full-text search indexes on Unity Catalog managed tables

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

A full-text search index accelerates lookups on one or more text columns of a managed Delta Lake or Iceberg table. The index supports substring matching and word matching. When you query the table with the search or isearch functions, Azure Databricks uses the index to skip files that are guaranteed not to contain matching rows. This significantly reduces the amount of data scanned, especially for selective lookups.

Important

Indexes created during the Beta release aren't guaranteed to be compatible with later releases. When the feature reaches Public Preview, you must drop existing indexes and create new ones.

Requirements

Full-text search indexes have requirements for compute, base table and schema permissions, and base table configuration.

Compute

Full-text search indexes are available only in Azure Databricks Runtime 18.2 and above, and you must enable this Beta feature in your workspace settings. See Manage Azure Databricks previews.

Permissions

To create a search index:

  • You must have the MODIFY permission for the table referenced in the search index.
  • You must have the CREATE TABLE permission on the parent schema. A schema owner or user with the MANAGE privilege can grant you CREATE TABLE privileges on the schema.

Table configuration

Before you create a full-text search index, the base table must satisfy all of the following:

  • You must create the index in the same catalog and schema as the base table.
  • The table is a managed Delta Lake table or a managed Iceberg table.
  • Row tracking is enabled (delta.enableRowTracking = true). See Row tracking in Databricks.
  • Indexed columns are of type STRING, VARIANT, STRUCT, or ARRAY. STRING columns use the UTF8_BINARY collation.
  • A STRUCT column contains at least one STRING, VARIANT, or ARRAY leaf field at any nesting depth; other leaf fields are ignored.
  • The table doesn't use any features from the list of limitations, including: Delta Sharing, shallow cloning, attribute-based access controls, row-level security policies, and column masks. See Limitations.

For information about table protocol requirements, which apply to both Delta Lake and Iceberg tables, see Delta Lake feature compatibility and protocols.

Create a full-text search index

Use CREATE SEARCH INDEX to create an index over one or more text columns. The following example indexes two text columns of an existing log table:

CREATE SEARCH INDEX log_idx
ON logs (message, error_detail);

The full syntax is:

CREATE SEARCH INDEX [IF NOT EXISTS] index_name
  ON table_name ( column_name [, column_name ...] )
  [OPTIONS ( option_key = option_value [, ... ] )]

index_name must be unique within the schema and can't match an existing table name.

To control how the text is tokenized, see Options.

Warning

If CREATE SEARCH INDEX and REFRESH INDEX fail mid-execution, run REFRESH INDEX to recover from a partial failure.

Options

The OPTIONS clause accepts the following keys:

Key Values Default Description
tokenizer ngram, split ngram How the text is tokenized for indexing. See Select a tokenizer for your use case.
ngram_size integer in [3, 10] 5 Length of the n-grams produced. Only valid when tokenizer = 'ngram'.
min_token_length integer >= 1 3 Minimum length of tokens to keep. Tokens shorter than this are dropped during indexing. Only valid when tokenizer = 'split'.

For detailed information about invalid option errors, see SEARCH_INDEX_INVALID_PARAMETERS error condition.

Select a tokenizer for your use case

Search indexes have 2 tokenizer options available, depending on your use case:

Tokenizer Use case Description
ngram Substring matching. Splits text into overlapping n-grams of length ngram_size.
split Whole-word containment checks. Splits text into word tokens. A token is a run of Unicode letters (\p{L}) and combining marks (\p{M}); any other character is a delimiter.

To create an n-gram index with an n-gram size of 4:

CREATE SEARCH INDEX log_ngram_idx
  ON logs (message)
  OPTIONS (tokenizer = 'ngram', ngram_size = 4);

To create a split index with a minimum token length of 2:

CREATE SEARCH INDEX log_word_idx
  ON logs (message)
  OPTIONS (tokenizer = 'split', min_token_length = 2);

Query data using search and isearch

Azure Databricks has two SQL functions to test whether a search pattern is present in one or more text targets:

  • search: Case-sensitive.
  • isearch: Case-insensitive.

Select search or isearch based on your case-sensitivity requirement. When the indexed columns are covered by a full-text search index, Azure Databricks uses the index to skip files guaranteed not to contain matching rows. Search indexes do not affect results.

Indexes accelerate queries the most when the search pattern appears in a small fraction of the table's files.

search( target [, target ... ] , 'pattern' [, mode => 'substring' | 'word' ] )
isearch( target [, target ... ] , 'pattern' [, mode => 'substring' | 'word' ] )

Arguments

search and isearch accept the following arguments:

  • target must be of type STRING, VARIANT, STRUCT, or ARRAY, the same types that indexing allows. Targets are deduplicated.
  • pattern must be a non-null string literal.
  • mode specifies how pattern matches each target:
    • substring (default): pattern is matched as a substring within each target.
    • word: pattern is split into word tokens using the same rule as the split tokenizer. The function returns true if every word in pattern appears in at least one target, regardless of order. See Select a tokenizer for your use case.

Returns

search and isearch return a BOOLEAN value with three-valued logic:

  • true if at least one non-null target matches.
  • null if no non-null target matches but at least one target is null.
  • false if all targets are non-null and none match.

Examples

The following examples show common search and isearch queries:

-- Case-insensitive substring search across one column.
SELECT * FROM logs
WHERE isearch(message, 'connection refused');

-- Case-sensitive substring search across multiple columns.
SELECT * FROM logs
WHERE search(message, error_detail, '550e8400-e29b-41d4-a716-446655440000');

-- Word search: matches rows containing all three words, in any order.
SELECT * FROM audit_logs
WHERE search(message, 'user admin login', mode => 'word');

Manage indexes

Important

Full-text search indexes don't update automatically when the base table changes. See Refresh an index.

Azure Databricks maintains query correctness, regardless of index freshness. When a table contains non-indexed data, the query uses the existing index to accelerate access to the indexed records and uses a table scan for the non-indexed records.

Use the following operations to manage full-text search indexes:

Describe or view an index

To view information about an index:

DESCRIBE INDEX log_idx;

Refresh an index

Full-text search indexes don't update automatically when the base table changes.

To update the index, adding entries for new rows:

REFRESH INDEX log_idx;

REFRESH INDEX is an incremental, append-only operation. It indexes new data but doesn't remove entries for deleted rows.

To update the index, both adding entries for new rows and removing entries for deleted rows, use REFRESH INDEX ... FULL:

REFRESH INDEX log_idx FULL;

A full refresh requires more compute resources than an incremental refresh. Over time, incremental refreshes accumulate stale entries, which increase the size of the index and negatively affect performance.

Drop an index

To drop an index, run the following:

DROP INDEX log_idx;

To avoid an error for missing indexes, use:

DROP INDEX IF EXISTS log_idx;

Note

If you drop the base table, the command also drops the full-text search indexes.

Limitations

Full-text search indexes have the following limitations:

  • Renaming an indexed column on the base table, or changing its data type, is not supported.
  • Tables with Delta Sharing are not supported. If you add the base table as a Delta Sharing source or target after creating the index, Azure Databricks ignores the search index.
  • Tables with shallow clones are not supported. If you add the base table as a shallow clone source after creating the index, Azure Databricks ignores the search index.
  • Tables with attribute-based access controls, column masks, or row-level security policies are not supported. If you add any of these controls to a table with a search index, Azure Databricks ignores the search index. See Core concepts for attribute-based access control (ABAC).