lakebase_vector

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;

The CASCADE keyword automatically installs pgvector as a dependency.
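
To confirm that both extensions are present after the install, you can query the standard pg_extension catalog. This is a minimal sketch; it assumes pgvector is registered under its usual extension name, vector.

```sql
-- List the installed versions of pgvector ('vector') and lakebase_vector.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('vector', 'lakebase_vector');
```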

Quick start

-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));

-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);

-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);

-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
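
To check whether the planner uses the new index for the query above, standard PostgreSQL EXPLAIN applies. This is only a sketch: the plan output for lakebase_ann is not shown here, and on a table this small the planner may still choose a sequential scan.

```sql
-- Show the chosen plan without executing the query.
EXPLAIN
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```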

Configure the index

Set build_mode when you create the index to trade recall against build time:

  • standard (default): optimizes for recall. Use for most workloads.
  • fast: builds faster at lower recall. Use when build time matters more than search quality.

CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');

Build indexes concurrently

Use CREATE INDEX CONCURRENTLY to build an index without blocking writes to the table. To rebuild an existing index without downtime, use REINDEX INDEX CONCURRENTLY:

CREATE INDEX CONCURRENTLY items_embedding_ann ON items
  USING lakebase_ann (embedding vector_l2_ops);

REINDEX INDEX CONCURRENTLY items_embedding_ann;
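
In standard PostgreSQL, a CREATE INDEX CONCURRENTLY that fails partway leaves behind an invalid index. Assuming lakebase_ann indexes follow the same catalog conventions, the following sketch checks the validity flag of the index built above:

```sql
-- indisvalid = false means the concurrent build did not complete.
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indexrelid = 'items_embedding_ann'::regclass;
```

If indisvalid is false, the usual PostgreSQL remedy is to drop the index and build it again.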

Tune search accuracy

Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.
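
For example, for the items_embedding_ann index from the concurrent-build example (the exact layout of the returned text is not specified here):

```sql
-- Returns index metadata as text, including lists, default_probes, and default_epsilon.
SELECT lakebase_ann_index_info('items_embedding_ann');
```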

Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.

SET lakebase_ann.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
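
A plain SET lasts for the rest of the session. To limit a setting to a single transaction, standard PostgreSQL SET LOCAL can be used. This sketch assumes lakebase_ann.probes behaves like any other session-level parameter; the value 20 is purely illustrative.

```sql
BEGIN;
-- Applies only until the end of this transaction.
SET LOCAL lakebase_ann.probes TO '20';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;  -- lakebase_ann.probes reverts to its previous value here
```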

lakebase_ann.epsilon controls the re-ranking margin. The default value of 1.9 works well for most workloads.

SET lakebase_ann.epsilon TO '1.5';
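
One way to judge whether a probes or epsilon setting is adequate is to compare the approximate results with an exact scan. The sketch below is an assumption-laden illustration, not a documented procedure: it relies on the standard enable_indexscan planner setting to force an exact sequential scan, as is commonly done with pgvector, and the temporary table name exact_top10 is made up for this example. If the planner does not use the index for the second query, the reported recall is trivially 1.0.

```sql
BEGIN;
-- Exact top 10: discourage index scans so the planner sorts the whole table.
SET LOCAL enable_indexscan = off;
CREATE TEMP TABLE exact_top10 ON COMMIT DROP AS
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;

-- Approximate top 10: allow index scans again and count the overlap.
SET LOCAL enable_indexscan = on;
SELECT count(*) / 10.0 AS recall_at_10
FROM (
  SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10
) AS approx
JOIN exact_top10 USING (id);
COMMIT;
```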

Operator classes

Distance metric    Operator class       Query operator
L2 (Euclidean)     vector_l2_ops        <->
Inner product      vector_ip_ops        <#>
Cosine distance    vector_cosine_ops    <=>
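
As an illustration of pairing an operator class with its query operator, this sketch indexes the same column for the cosine metric and queries it with the matching operator. The index name items_embedding_cos is a made-up example.

```sql
-- Build an index with the cosine operator class.
CREATE INDEX items_embedding_cos ON items
  USING lakebase_ann (embedding vector_cosine_ops);

-- Query with the matching operator so the planner can use that index.
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;
```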

Index options reference

Option        Type      Default       Description
build_mode    string    'standard'    Controls the accuracy/speed tradeoff at index build time. 'standard' optimizes for recall; 'fast' builds faster with lower recall.

GUC reference

Parameter               Type       Default    Description
lakebase_ann.probes     integer    (unset)    Number of IVF partitions to scan at query time. Higher values improve recall at the cost of query speed.
lakebase_ann.epsilon    float      1.9        Re-ranking margin. Valid range: 0.0 to 4.0.

Utility functions

Function                             Returns    Description
lakebase_ann_prewarm(regclass)       void       Loads an index into memory to eliminate cold-start latency on the first query.
lakebase_ann_index_info(regclass)    text       Returns index metadata as text, including lists, default_probes, and default_epsilon.
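
For example, to load the items_embedding_ann index from the concurrent-build example into memory before running queries:

```sql
-- Returns void; the effect is that the index is loaded into memory.
SELECT lakebase_ann_prewarm('items_embedding_ann');
```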

Next steps