lakebase_vector

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.

Install

First, enable Lakebase Search in your project settings. Then install the extension:

CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;

The CASCADE keyword automatically installs pgvector as a dependency.
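
To confirm that both extensions are present after installation, you can query the standard PostgreSQL catalog. This is a minimal sketch; it assumes pgvector is registered under its usual extension name, vector.

```sql
-- List the installed extensions and their versions.
-- Assumption: pgvector appears in the catalog as "vector".
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector');
```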

Quick start

-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));

-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);

-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);

-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
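
To check whether the planner uses the index for a query, you can inspect the plan with EXPLAIN. This is a sketch only: the plan output depends on table size and planner settings, and on a very small table PostgreSQL may choose a sequential scan instead of the index.

```sql
-- Show the plan for the nearest-neighbor query from the quick start.
EXPLAIN
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```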

Configure the index

Set build_mode at index creation to control the accuracy/speed tradeoff:

standard (default): optimizes for recall. Use for most workloads.
fast: builds faster at lower recall. Use when build time matters more than search quality.

CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');

Build indexes concurrently

Use CREATE INDEX CONCURRENTLY to build an index without blocking writes to the table. To rebuild an existing index without downtime, use REINDEX INDEX CONCURRENTLY:

CREATE INDEX CONCURRENTLY items_embedding_ann ON items
  USING lakebase_ann (embedding vector_l2_ops);

REINDEX INDEX CONCURRENTLY items_embedding_ann;
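
In standard PostgreSQL, an interrupted concurrent build leaves the index behind in an invalid state. Whether lakebase_ann indexes behave identically is an assumption here. A catalog query like the following sketch lists invalid indexes on the items table, which you can then drop and recreate, or rebuild with REINDEX INDEX CONCURRENTLY.

```sql
-- Find indexes on "items" that are marked invalid, for example after
-- an interrupted CREATE INDEX CONCURRENTLY.
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'items'::regclass
  AND NOT i.indisvalid;
```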

Tune search accuracy

Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.
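
For example, assuming an index named items_embedding_ann exists (as created in the concurrent build example), the call might look like this. The exact layout of the returned text is not shown here because it is not specified on this page.

```sql
-- Assumption: an index named items_embedding_ann exists on the items table.
SELECT lakebase_ann_index_info('items_embedding_ann');
```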

Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.

SET lakebase_ann.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
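
A plain SET changes the value for the rest of the session. To limit a probes setting to a single query, standard PostgreSQL lets you use SET LOCAL inside a transaction. The sketch below assumes lakebase_ann.probes behaves like any other session-level parameter; the value 20 is illustrative, not a recommendation.

```sql
BEGIN;
-- Applies only until the end of this transaction.
SET LOCAL lakebase_ann.probes TO '20';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;
```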

lakebase_ann.epsilon controls the re-ranking margin. Valid values range from 0.0 to 4.0, and the default value of 1.9 works well for most workloads.

SET lakebase_ann.epsilon TO '1.5';
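
To inspect the current setting or undo a change, the standard PostgreSQL SHOW and RESET commands should apply. This sketch assumes lakebase_ann.epsilon is exposed like any other configuration parameter.

```sql
-- Inspect the current value for this session.
SHOW lakebase_ann.epsilon;

-- Return to the default value.
RESET lakebase_ann.epsilon;
```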

Operator classes

Distance metric	Operator class	Query operator
L2 (Euclidean)	`vector_l2_ops`	`<->`
Inner product	`vector_ip_ops`	`<#>`
Cosine distance	`vector_cosine_ops`	`<=>`
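
As an illustration of pairing an operator class with its query operator, the sketch below builds a cosine index on the quick start table and queries it with `<=>`. The index name items_embedding_cosine is hypothetical. For the index to be usable, the operator in ORDER BY needs to match the operator class the index was built with.

```sql
-- Hypothetical index name; uses the cosine operator class.
CREATE INDEX items_embedding_cosine ON items
  USING lakebase_ann (embedding vector_cosine_ops);

-- Query with the matching cosine operator.
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;
```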

Index options reference

Option	Type	Default	Description
`build_mode`	string	`'standard'`	Controls the accuracy/speed tradeoff at index build time. `'standard'` optimizes for recall; `'fast'` builds faster with lower recall.

GUC reference

Parameter	Type	Default	Description
`lakebase_ann.probes`	integer	(unset)	Number of IVF partitions to scan at query time. Higher values improve recall at the cost of query speed.
`lakebase_ann.epsilon`	float	`1.9`	Re-ranking margin. Valid range: `0.0` to `4.0`.

Utility functions

Function	Returns	Description
`lakebase_ann_prewarm(regclass)`	void	Loads an index into memory to eliminate cold-start latency on the first query.
`lakebase_ann_index_info(regclass)`	text	Returns index metadata as text, including `lists`, `default_probes`, and `default_epsilon`.
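
For example, you might prewarm an index before it serves its first queries. This sketch assumes an index named items_embedding_ann exists; when and how often to prewarm depends on your workload.

```sql
-- Assumption: an index named items_embedding_ann exists on the items table.
SELECT lakebase_ann_prewarm('items_embedding_ann');
```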

Last updated on 2026-06-16