Best practices for Mosaic AI Vector Search
This article provides tips for using Mosaic AI Vector Search effectively.
Recommendations for optimizing latency
- Use the service principal authorization flow to take advantage of network-optimized routes, as in the sketch after this list.
- Use the latest version of the Python SDK.
- When testing, start with a concurrency of around 16 to 32; higher concurrency does not yield higher throughput.
- Use a model served with provisioned throughput (for example, bge-large-en or a fine-tuned version) instead of a pay-per-token foundation model.
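The following is a minimal sketch of querying an index with service principal credentials through the Python SDK. The workspace URL, credential values, endpoint name, index name, columns, and query text are all placeholders, not values from this article.

```python
# A minimal sketch of service principal authentication with the Python SDK.
# All names and credentials below are placeholders.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient(
    workspace_url="https://<your-workspace>.cloud.databricks.com",
    service_principal_client_id="<sp-client-id>",
    service_principal_client_secret="<sp-client-secret>",
)

index = client.get_index(
    endpoint_name="vs_endpoint",           # placeholder endpoint name
    index_name="catalog.schema.my_index",  # placeholder index name
)

# Query the index; columns and query text are illustrative.
results = index.similarity_search(
    query_text="example query",
    columns=["id", "text"],
    num_results=5,
)
print(results)
```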
When to use GPUs
- Use CPUs only for basic testing and for small datasets (up to hundreds of rows).
- For GPU compute type, Databricks recommends using GPU-small or GPU-medium (see the sketch after this list).
- For GPU compute scale-out, higher concurrency might improve ingestion times, but this depends on factors such as total dataset size and index metadata.
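As an illustration, the following sketch creates an embedding model serving endpoint on GPU_SMALL compute through the Model Serving REST API. The host, token, endpoint name, and the served model name and version are assumptions for the example, not values prescribed by this article.

```python
# A hedged sketch: serving an embedding model on GPU_SMALL compute via the
# Model Serving REST API. Host, token, and the entity name/version are
# placeholders; replace them with your own values.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<api-token>"

resp = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "bge-embeddings",  # placeholder endpoint name
        "config": {
            "served_entities": [
                {
                    "entity_name": "<catalog>.<schema>.<registered_model>",  # placeholder
                    "entity_version": "1",
                    "workload_type": "GPU_SMALL",  # or "GPU_MEDIUM", per the guidance above
                    "workload_size": "Small",      # controls scale-out concurrency
                    "scale_to_zero_enabled": False,
                }
            ]
        },
    },
)
resp.raise_for_status()
```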
Working with images, video, or non-text data
- Pre-compute the embeddings and use a Delta Sync Index with self-managed embeddings, as in the sketch after this list.
- Don’t store binary formats such as images as metadata; this adversely affects latency. Instead, store the file path as metadata.
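The following sketch creates a Delta Sync Index over pre-computed embeddings. The endpoint, index, table, and column names are placeholders, and the source Delta table is assumed to hold one embedding vector per row along with a file-path metadata column rather than raw image bytes.

```python
# A minimal sketch of a Delta Sync Index with self-managed embeddings.
# All names below are placeholders.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

index = client.create_delta_sync_index(
    endpoint_name="vs_endpoint",                           # placeholder
    index_name="catalog.schema.image_index",               # placeholder
    source_table_name="catalog.schema.image_embeddings",   # placeholder
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=1024,             # must match your embedding model
    embedding_vector_column="embedding",  # pre-computed vectors, not raw images
)
```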
Embedding sequence length
- Check the embedding model's sequence length to make sure documents are not being truncated, as sketched below. For example, BGE supports a context of 512 tokens. For longer context requirements, use gte-large-en-v1.5.
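One way to catch truncation before indexing is to count tokens with the model's tokenizer. This sketch assumes the Hugging Face `transformers` library and the `BAAI/bge-large-en-v1.5` tokenizer with its 512-token limit; adjust both for your model.

```python
# A hedged sketch: flagging documents that exceed the embedding model's
# sequence length before indexing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
MAX_TOKENS = 512  # BGE context length

def is_truncated(text: str) -> bool:
    # Count tokens without truncating so overflow is detectable.
    n_tokens = len(tokenizer.encode(text, truncation=False))
    return n_tokens > MAX_TOKENS

docs = ["a short document", "a much longer document ..."]
for doc in docs:
    if is_truncated(doc):
        print("Document exceeds 512 tokens; chunk it or use gte-large-en-v1.5")
```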
Use Triggered sync mode to reduce costs
- The most cost-effective option for updating a vector search index is Triggered. Select Continuous only if you need to incrementally sync the index with changes to the source table at a latency of seconds. Both sync modes perform incremental updates: only data that has changed since the last sync is processed. A triggered sync can be run on demand, as in the sketch below.
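The following is a minimal sketch of running an on-demand sync for a Triggered-mode index with the Python SDK. The endpoint and index names are placeholders.

```python
# A minimal sketch: trigger an incremental sync on demand.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="vs_endpoint",           # placeholder
    index_name="catalog.schema.my_index",  # placeholder
)

# Incrementally processes only rows changed since the last sync.
index.sync()
```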