Mosaic AI Vector Search

10/15/2024

This article gives an overview of Databricks’ vector database solution, Mosaic AI Vector Search, including what it is and how it works.

What is Mosaic AI Vector Search?

Mosaic AI Vector Search is a vector database that is built into the Databricks Data Intelligence Platform and integrated with its governance and productivity tools. A vector database is a database that is optimized to store and retrieve embeddings. Embeddings are mathematical representations of the semantic content of data, typically text or image data. Embeddings are generated by a large language model and are a key component of many GenAI applications that depend on finding documents or images that are similar to each other. Examples are RAG systems, recommender systems, and image and video recognition.

With Mosaic AI Vector Search, you create a vector search index from a Delta table. The index includes embedded data with metadata. You can then query the index using a REST API to identify the most similar vectors and return the associated documents. You can structure the index to automatically sync when the underlying Delta table is updated.

Mosaic AI Vector Search supports the following:

How does Mosaic AI Vector Search work?

Mosaic AI Vector Search uses the Hierarchical Navigable Small World (HNSW) algorithm for its approximate nearest neighbor searches and the L2 distance metric to measure embedding vector similarity. If you want to use cosine similarity, you need to normalize your data point embeddings before feeding them into vector search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity.
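The equivalence between the two rankings can be checked numerically. The following is a minimal sketch in plain Python, not part of the Vector Search API; the helper names and the sample vectors are illustrative assumptions. For unit-length vectors, the squared L2 distance equals 2 - 2 * (cosine similarity), so sorting by ascending L2 distance gives the same order as sorting by descending cosine similarity.

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]

def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical query and document embeddings, for illustration only.
query = [1.0, 2.0, 3.0]
docs = {"a": [2.0, 4.0, 6.5], "b": [-1.0, 0.5, 2.0], "c": [3.0, 0.0, 0.1]}

# Ranking by cosine similarity on the raw vectors (most similar first).
by_cosine = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)

# Ranking by L2 distance on the raw vectors (closest first).
by_l2_raw = sorted(docs, key=lambda k: l2_distance(query, docs[k]))

# Ranking by L2 distance after normalizing every vector to unit length.
unit_query = normalize(query)
unit_docs = {k: normalize(v) for k, v in docs.items()}
by_l2_normalized = sorted(unit_docs, key=lambda k: l2_distance(unit_query, unit_docs[k]))
```

With these sample vectors, `by_l2_normalized` matches `by_cosine`, while `by_l2_raw` does not, which is why normalization is needed before indexing.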

Mosaic AI Vector Search also supports hybrid keyword-similarity search, which combines vector-based embedding search with traditional keyword-based search techniques. This approach matches exact words in the query while also using a vector-based similarity search to capture the semantic relationships and context of the query.

By integrating these two techniques, hybrid keyword-similarity search retrieves documents that contain not only the exact keywords but also those that are conceptually similar, providing more comprehensive and relevant search results. This method is particularly useful in RAG applications where source data has unique keywords such as SKUs or identifiers that are not well suited to pure similarity search.

For details about the API, see the Python SDK reference and Query a vector search endpoint.

Similarity search calculation

The similarity search calculation uses the following formula:

$$\text{score}(q, x) = \frac{1}{1 + \text{dist}(q, x)^2}$$

where dist is the Euclidean distance between the query q and the index entry x:

$$\text{dist}(q, x) = \sqrt{(q_1 - x_1)^2 + (q_2 - x_2)^2 + \dots + (q_d - x_d)^2}$$
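As a worked illustration, the score described above (the reciprocal of 1 plus the squared Euclidean distance) can be computed as follows. This is a sketch in plain Python with hypothetical function names, not the service's implementation.

```python
import math

def euclidean_distance(q, x):
    """Square root of the sum of squared coordinate differences."""
    if len(q) != len(x):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((qi - xi) ** 2 for qi, xi in zip(q, x)))

def similarity_score(q, x):
    """Reciprocal of 1 plus the squared Euclidean distance.

    The score is 1.0 for identical vectors and approaches 0 as the
    distance grows.
    """
    return 1.0 / (1.0 + euclidean_distance(q, x) ** 2)

# Identical vectors have distance 0, so the score is 1.
identical = similarity_score([1.0, 2.0], [1.0, 2.0])

# Distance between (0, 0) and (3, 4) is 5, so the score is 1 / (1 + 25).
far_apart = similarity_score([0.0, 0.0], [3.0, 4.0])
```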

Keyword search algorithm

Relevance scores are calculated using Okapi BM25. All text or string columns are searched, including the source text embedding and metadata columns in text or string format. The tokenization function splits at word boundaries, removes punctuation, and converts all text to lowercase.
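The following sketch illustrates the general idea of the tokenization and scoring described above. It is an approximation written for illustration, not the service's implementation: the regular expression, the `k1` and `b` values, and the particular BM25 variant are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a common Okapi BM25 variant.

    k1 and b are commonly used defaults, assumed here for illustration.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))  # each distinct query term counts once

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical documents containing SKU-style identifiers.
documents = [
    "Blue widget, SKU-1234.",
    "Red widget (SKU-9999)",
    "Gardening tips for spring",
]
scores = bm25_scores("sku-1234", documents)
```

In this example the first document matches both query tokens (`sku` and `1234`), the second matches only `sku`, and the third matches neither, so the scores decrease in that order.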

How similarity search and keyword search are combined

The similarity search and keyword search results are combined using the Reciprocal Rank Fusion (RRF) function.

RRF rescores each document from each method using the score:

$$\text{score} = \frac{1}{\text{rank} + \text{rrf\_param}}$$

In the above equation, rank starts at 0. RRF sums the scores for each document across both methods and returns the highest-scoring documents.

rrf_param controls the relative importance of higher-ranked and lower-ranked documents. Based on the literature, rrf_param is set to 60.
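A minimal sketch of rank fusion follows. It assumes the usual RRF form in which each method contributes 1 / (rank + rrf_param) for a document at a given zero-based rank; the function name and the sample document IDs are hypothetical.

```python
from collections import defaultdict

RRF_PARAM = 60  # value stated in this article

def reciprocal_rank_fusion(rankings, rrf_param=RRF_PARAM):
    """Fuse several ranked lists of document IDs (best first) into one.

    Each list contributes 1 / (rank + rrf_param) for every document it
    contains, with rank starting at 0. Contributions are summed per document.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (rank + rrf_param)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results from the two search methods.
similarity_ranking = ["d1", "d2", "d3"]
keyword_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([similarity_ranking, keyword_ranking])
fused_order = [doc_id for doc_id, _ in fused]
```

Here `d1` and `d3` appear in both lists, so they rank above `d2` and `d4`, which appear in only one.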

Scores are normalized so that the highest score is 1 and the lowest score is 0 using the following equation:

normalization

Options for providing vector embeddings

To create a vector database in Databricks, you must first decide how to provide vector embeddings. Databricks supports three options:

Option 1: Delta Sync Index with embeddings computed by Databricks

You provide a source Delta table that contains data in text format. Databricks calculates the embeddings, using a model that you specify, and optionally saves the embeddings to a table in Unity Catalog. As the Delta table is updated, the index stays synced with the Delta table.

The following diagram illustrates the process:
1. Calculate query embeddings. Query can include metadata filters.
2. Perform similarity search to identify most relevant documents.
3. Return the most relevant documents and append them to the query.

Option 2: Delta Sync Index with self-managed embeddings

You provide a source Delta table that contains pre-calculated embeddings. As the Delta table is updated, the index stays synced with the Delta table.

The following diagram illustrates the process:
1. Query consists of embeddings and can include metadata filters.
2. Perform similarity search to identify most relevant documents.
3. Return the most relevant documents and append them to the query.

Option 3: Direct Vector Access Index

You must manually update the index using the REST API when the embeddings table changes.

The following diagram illustrates the process:

How to set up Mosaic AI Vector Search

To use Mosaic AI Vector Search, you must create the following:

A vector search endpoint. This endpoint serves the vector search index. You can query and update the endpoint using the REST API or the SDK. Endpoints scale automatically to support the size of the index or the number of concurrent requests. See Create a vector search endpoint for instructions.
A vector search index. The vector search index is created from a Delta table and is optimized to provide real-time approximate nearest neighbor searches. The goal of the search is to identify documents that are similar to the query. Vector search indexes appear in and are governed by Unity Catalog. See Create a vector search index for instructions.

In addition, if you choose to have Databricks compute the embeddings, you can use a pre-configured Foundation Model APIs endpoint or create a model serving endpoint to serve the embedding model of your choice. See Pay-per-token Foundation Model APIs or Create generative AI model serving endpoints for instructions.

To query the model serving endpoint, you use either the REST API or the Python SDK. Your query can define filters based on any column in the Delta table. For details, see Use filters on queries, the API reference, or the Python SDK reference.
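As a rough sketch of a filtered query with the Python SDK, the helper below wraps a call to `similarity_search` on an index object. The helper name, the column names, and the example filter are hypothetical, and the assumed response shape (rows under `result.data_array`) should be checked against the Python SDK reference. Passing `query_text` assumes an index whose embeddings are computed by Databricks.

```python
def search_with_filter(index, query_text, filters=None, num_results=5):
    """Run a similarity search with optional metadata filters.

    `index` is expected to be the object returned by
    VectorSearchClient.get_index(...) from the databricks-vector-search
    package. The column names below are hypothetical placeholders.
    """
    response = index.similarity_search(
        query_text=query_text,
        columns=["id", "text"],  # hypothetical columns to return
        filters=filters,         # for example {"category": "footwear"}
        num_results=num_results,
    )
    # Assumed response shape: matching rows live under result.data_array.
    return response.get("result", {}).get("data_array", [])

# Example usage in a Databricks notebook (endpoint and index names are hypothetical):
#
# from databricks.vector_search.client import VectorSearchClient
# index = VectorSearchClient().get_index(
#     endpoint_name="my_endpoint",
#     index_name="main.default.my_index",
# )
# rows = search_with_filter(index, "waterproof boots", filters={"category": "footwear"})
```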

Requirements

Unity Catalog enabled workspace.
Serverless compute enabled. For instructions, see Connect to serverless compute.
Source table must have Change Data Feed enabled. For instructions, see Use Delta Lake change data feed on Azure Databricks.
CREATE TABLE privileges on catalog schema(s) to create indexes.
Personal access tokens enabled.

Permission to create and manage vector search endpoints is configured using access control lists. See Vector search endpoint ACLs.

Data protection and authentication

Databricks implements the following security controls to protect your data:

Every customer request to Mosaic AI Vector Search is logically isolated, authenticated, and authorized.
Mosaic AI Vector Search encrypts all data at rest (AES-256) and in transit (TLS 1.2+).

Mosaic AI Vector Search supports two modes of authentication:

Personal Access Token - You can use a personal access token (PAT) to authenticate with Mosaic AI Vector Search. See personal access authentication token. If you use the SDK in a notebook environment, it automatically generates a PAT for authentication.
Service Principal Token - An admin can generate a service principal token and pass it to the SDK or API. See use service principals. For production use cases, Databricks recommends using a service principal token.

Customer Managed Keys (CMK) are supported on endpoints created on or after May 8, 2024.

Monitor usage and costs

The billable usage system table lets you monitor usage and costs associated with vector search indexes and endpoints. Here is an example query:

WITH all_vector_search_usage AS (
  -- Classify each usage record as serving (tied to an endpoint) or ingest
  SELECT *,
         CASE WHEN usage_metadata.endpoint_name IS NULL
              THEN 'ingest'
              ELSE 'serving'
         END AS workload_type
    FROM system.billing.usage
   WHERE billing_origin_product = 'VECTOR_SEARCH'
),
daily_dbus AS (
  -- Total usage per workspace, cloud, day, workload type, and endpoint
  SELECT workspace_id,
         cloud,
         usage_date,
         workload_type,
         usage_metadata.endpoint_name AS vector_search_endpoint,
         SUM(usage_quantity) AS dbus
    FROM all_vector_search_usage
   GROUP BY ALL
)
-- Sort in the outer query; ordering inside a CTE is not guaranteed to be kept
SELECT *
  FROM daily_dbus
 ORDER BY 1, 2, 3, 4, 5 DESC

For details about the contents of the billing usage table, see Billable usage system table reference. Additional queries are in the following example notebook.

Vector search system tables queries notebook


Resource and data size limits

The following table summarizes resource and data size limits for vector search endpoints and indexes:

Resource	Granularity	Limit
Vector search endpoints	Per workspace	100
Embeddings	Per endpoint	320,000,000
Embedding dimension	Per index	4096
Indexes	Per endpoint	50
Columns	Per index	50
Columns		Supported types: Bytes, short, integer, long, float, double, boolean, string, timestamp, date
Metadata fields	Per index	20
Index name	Per index	128 characters

The following limits apply to the creation and update of vector search indexes:

Resource	Granularity	Limit
Row size for Delta Sync Index	Per index	100 KB
Embedding source column size for Delta Sync Index	Per index	32764 bytes
Bulk upsert request size limit for Direct Vector Access Index	Per index	10 MB
Bulk delete request size limit for Direct Vector Access Index	Per index	10 MB

The following limits apply to the query API:

Resource	Granularity	Limit
Query text length	Per query	32764
Maximum number of results returned	Per query	10,000

Limitations

Row and column level permissions are not supported. However, you can implement your own application level ACLs using the filter API.
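One possible way to implement application-level ACLs is to add a mandatory filter to every query based on the caller's group membership. The sketch below only builds the filter dictionary. It assumes the index has a string metadata column (here called `allowed_group`, a hypothetical name) and that a list value in a filter matches any of the listed values, as in the Python SDK's dictionary filter syntax.

```python
def build_acl_filters(user_groups, extra_filters=None, acl_column="allowed_group"):
    """Combine caller-supplied filters with a mandatory group-based ACL filter.

    `acl_column` is a hypothetical metadata column that stores which group
    may read each row.
    """
    if not user_groups:
        # Fail closed: never issue an unfiltered query for a user with no groups.
        raise PermissionError("user belongs to no groups; refusing to build a query")

    filters = dict(extra_filters or {})
    for key in filters:
        # Filter keys may carry an operator suffix such as "year >=", so
        # compare only the column-name part of the key.
        if key.split(" ")[0] == acl_column:
            raise ValueError(f"callers may not filter on the ACL column {acl_column!r}")

    # A list value restricts results to rows whose ACL column is in the list.
    filters[acl_column] = sorted(set(user_groups))
    return filters

# Example: a user in two groups who also filters on a hypothetical `year` column.
acl_filters = build_acl_filters(["sales", "emea", "sales"], {"year >=": 2023})
```

The resulting dictionary can be passed as the `filters` argument of a query, so that every request is restricted to rows the caller's groups are allowed to see.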
