Important
This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.
The lakebase_vector extension adds approximate nearest-neighbor (ANN) vector search to Lakebase via the lakebase_ann index type. It is a drop-in companion to pgvector: the same vector types, distance operators, and query syntax work without modification.
Install
First, enable Lakebase Search in your project settings. Then install the extension:
CREATE EXTENSION IF NOT EXISTS lakebase_vector CASCADE;
The CASCADE keyword automatically installs pgvector as a dependency.
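To confirm that both extensions are present after installation, you can query the standard PostgreSQL pg_extension catalog. This is a minimal sketch; it assumes pgvector is registered under its usual extension name, vector.

```sql
-- List the installed versions of lakebase_vector and its pgvector dependency.
-- Assumption: pgvector is registered under its standard extension name, "vector".
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('lakebase_vector', 'vector')
ORDER BY extname;
```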
Quick start
-- Create a table with a vector column
CREATE TABLE items (id BIGSERIAL PRIMARY KEY, embedding VECTOR(3));
-- Insert sample data
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::real[]
FROM generate_series(1, 1000);
-- Create a lakebase_ann index
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops);
-- Query using standard pgvector distance operators
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
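To check whether the query is served by the index, you can inspect its plan with the standard PostgreSQL EXPLAIN command. Treat this as a sketch: the exact plan output for lakebase_ann is not documented here, and on a table this small the planner may still choose a sequential scan.

```sql
-- Show the plan for the nearest-neighbor query from the quick start.
-- An index scan on the lakebase_ann index means the index is being used;
-- a sequential scan followed by a sort means an exact, non-indexed search.
EXPLAIN
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```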
Configure the index
Set build_mode at index creation to control the accuracy/speed tradeoff:
- standard (default): optimizes for recall. Use for most workloads.
- fast: builds faster at lower recall. Use when build time matters more than search quality.
CREATE INDEX ON items USING lakebase_ann (embedding vector_l2_ops) WITH (build_mode = 'fast');
Build indexes concurrently
Use CREATE INDEX CONCURRENTLY to build the index without blocking writes to the table. To rebuild an existing index later without downtime, use REINDEX INDEX CONCURRENTLY:
CREATE INDEX CONCURRENTLY items_embedding_ann ON items
USING lakebase_ann (embedding vector_l2_ops);
REINDEX INDEX CONCURRENTLY items_embedding_ann;
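In standard PostgreSQL, a concurrent build that fails or is cancelled leaves behind an invalid index. Assuming lakebase_ann indexes behave the same way (this page does not confirm it), the catalog query sketched below shows whether each index on the table is valid.

```sql
-- List every index on the items table with its validity flag.
-- indisvalid = false means a concurrent build did not finish; drop the index
-- and re-create it, or rebuild it with REINDEX INDEX CONCURRENTLY.
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indrelid = 'items'::regclass;
```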
Tune search accuracy
Before tuning, call lakebase_ann_index_info(index_name) to get the index's lists, default_probes, and default_epsilon values.
Set lakebase_ann.probes at query time to control the accuracy/speed tradeoff. Higher values improve recall but slow queries.
SET lakebase_ann.probes TO '10';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
lakebase_ann.epsilon controls the re-ranking margin. The default value of 1.9 works well for most workloads.
SET lakebase_ann.epsilon TO '1.5';
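The tuning workflow above can be sketched as follows. The example reads the index defaults, then scopes both settings to a single transaction with SET LOCAL, a standard PostgreSQL feature, so that other queries in the session keep their defaults. The index name items_embedding_ann comes from the concurrent build example, and the values 10 and 1.5 are illustrative rather than recommended.

```sql
-- Inspect the index's lists, default_probes, and default_epsilon first.
SELECT lakebase_ann_index_info('items_embedding_ann');

BEGIN;
-- SET LOCAL applies only until the end of this transaction.
SET LOCAL lakebase_ann.probes TO '10';
SET LOCAL lakebase_ann.epsilon TO '1.5';
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 10;
COMMIT;

-- Values changed with a plain SET persist for the session; RESET restores the defaults.
RESET lakebase_ann.probes;
RESET lakebase_ann.epsilon;
```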
Operator classes
| Distance metric | Operator class | Query operator |
|---|---|---|
| L2 (Euclidean) | vector_l2_ops | <-> |
| Inner product | vector_ip_ops | <#> |
| Cosine similarity | vector_cosine_ops | <=> |
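As an illustration of pairing an operator class with its query operator, the sketch below builds a cosine index on the items table from the quick start and queries it. In pgvector, <=> returns cosine distance (1 minus cosine similarity) and <#> returns the negative inner product, so for all three operators smaller values mean closer matches and an ascending ORDER BY is correct. As with pgvector indexes, the query operator is assumed to need to match the index's operator class for the index to be considered.

```sql
-- Build an index for cosine distance.
CREATE INDEX ON items USING lakebase_ann (embedding vector_cosine_ops);

-- Query with the matching operator; a smaller cosine distance means more similar.
SELECT id, embedding <=> '[3,1,2]' AS cosine_distance
FROM items
ORDER BY embedding <=> '[3,1,2]'
LIMIT 5;
```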
Index options reference
| Option | Type | Default | Description |
|---|---|---|---|
| build_mode | string | 'standard' | Controls the accuracy/speed tradeoff at index build time. 'standard' optimizes for recall; 'fast' builds faster with lower recall. |
GUC reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| lakebase_ann.probes | integer | (unset) | Number of IVF partitions to scan at query time. Higher values improve recall at the cost of query speed. |
| lakebase_ann.epsilon | float | 1.9 | Re-ranking margin. Valid range: 0.0 to 4.0. |
Utility functions
| Function | Returns | Description |
|---|---|---|
| lakebase_ann_prewarm(regclass) | void | Loads an index into memory to eliminate cold-start latency on the first query. |
| lakebase_ann_index_info(regclass) | text | Returns index metadata as text, including lists, default_probes, and default_epsilon. |
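A short usage sketch for prewarming, assuming the index items_embedding_ann from the concurrent build example. Because the argument is a regclass, the index name can be passed as a quoted string.

```sql
-- Load the index into memory, for example after a restart or a rebuild,
-- so that the first search does not pay the cold-start cost.
SELECT lakebase_ann_prewarm('items_embedding_ann');

-- The first query after prewarming should then run against a warm index.
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```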