Scale endpoint throughput with high QPS (Beta)

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

By default, standard endpoints support 20–200 QPS depending on index size. Real-time applications such as search bars, recommendation systems, and entity matching often require 100–1000+ QPS. On standard endpoints only, you can set a minimum QPS. Databricks provisions the infrastructure to support that throughput level when indexes are created or synced.

Important

Setting a minimum QPS provisions additional capacity, which increases the cost of the endpoint. You are charged for this additional capacity regardless of actual query traffic. To stop incurring these charges, reset the endpoint to the default configuration using min_qps=-1. Throughput scaling is best-effort and not guaranteed during Beta.

Use high QPS when:

Your application requires more than 50 QPS of sustained throughput.
You receive 429 (Too Many Requests) errors under normal load.
Latency degrades as traffic ramps up, even when average utilization appears low.

Requirements

High QPS is available for standard endpoints only. Storage-optimized endpoints are not supported.
OAuth authentication is required for endpoints that handle more than 70–100 QPS, because personal access tokens (PATs) are rate-limited to 70–100 QPS. See Use service principals with OAuth tokens.

Configure minimum QPS

Set a minimum QPS when creating a new endpoint or updating an existing one. The additional capacity needed to achieve the target throughput is calculated automatically the next time an index on the endpoint is created or synced. In Beta, throughput scaling is best-effort and not guaranteed: actual QPS depends on your index size, vector dimensionality, query complexity, and filter usage.

Databricks UI

When creating a new endpoint:

In the left sidebar, click Compute.
Click the Vector Search tab and click Create.
Under Advanced Settings, enter the Min QPS value.

When updating an existing endpoint:

Navigate to the endpoint detail page.
Locate the Min QPS field in the right panel and click the pencil icon next to the current value.
Enter the new value and click Save.

After changing minimum QPS, sync your indexes to apply the new configuration.

Python SDK

from databricks.vector_search.client import VectorSearchClient, MIN_QPS_RESET_TO_DEFAULT

client = VectorSearchClient()

# Create a new endpoint with minimum QPS
endpoint = client.create_endpoint(
    name="my-high-qps-endpoint",
    endpoint_type="STANDARD",
    min_qps=500,
)

# Update an existing endpoint's minimum QPS
response = client.update_endpoint(name="my-endpoint", min_qps=500)

# Check scaling status
scaling_info = response.get("endpoint", {}).get("scaling_info", {})
print(f"Requested min QPS: {scaling_info.get('requested_min_qps')}")
print(f"State: {scaling_info.get('state')}")
# State is "SCALING_CHANGE_IN_PROGRESS" until the next index sync,
# then transitions to "SCALING_CHANGE_APPLIED"

# Reset to default (remove high QPS configuration)
client.update_endpoint(name="my-endpoint", min_qps=MIN_QPS_RESET_TO_DEFAULT)
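
Because the state stays at SCALING_CHANGE_IN_PROGRESS until the next index sync completes, it can be useful to poll for completion before you send production traffic. The following is a minimal sketch, not part of the SDK. The helper name wait_for_scaling_applied, its timeout defaults, and the fetch_scaling_info callable are assumptions for illustration. Supply a callable that returns the scaling_info dictionary for your endpoint.

```python
import time
from typing import Any, Callable, Dict


def wait_for_scaling_applied(
    fetch_scaling_info: Callable[[], Dict[str, Any]],
    timeout_s: float = 1800.0,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until scaling_info reports SCALING_CHANGE_APPLIED.

    fetch_scaling_info is a hypothetical callable that you provide. It should
    return the endpoint's scaling_info dictionary each time it is called.
    Raises TimeoutError if the state has not changed before the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        info = fetch_scaling_info()
        if info.get("state") == "SCALING_CHANGE_APPLIED":
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Scaling not applied in time; last state: {info.get('state')}"
            )
        sleep(poll_interval_s)
```

The sleep parameter is injectable so the helper can be exercised in tests without real delays. The state only advances after an index sync, so trigger the sync before you start polling.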

REST API

Create an endpoint with minimum QPS:

POST /api/2.0/vector-search/endpoints
{
  "name": "my-high-qps-endpoint",
  "endpoint_type": "STANDARD",
  "min_qps": 500
}

Update minimum QPS on an existing endpoint:

PATCH /api/2.0/vector-search/endpoints/<ENDPOINT_NAME>
{
  "min_qps": 500
}

Check scaling status:

GET /api/2.0/vector-search/endpoints/<ENDPOINT_NAME>

The scaling_info field in the response shows the requested minimum QPS and the scaling state. The state is SCALING_CHANGE_IN_PROGRESS until the next index sync completes, then transitions to SCALING_CHANGE_APPLIED.

Reset to default (remove high QPS):

PATCH /api/2.0/vector-search/endpoints/<ENDPOINT_NAME>
{
  "min_qps": -1
}
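
The PATCH requests above can be assembled with the Python standard library. The following is a sketch under stated assumptions: the helper name build_min_qps_request is hypothetical, the workspace URL and token are placeholders, and the request is assumed to authenticate with a bearer token in the Authorization header. The function only builds the request; it does not send it.

```python
import json
import urllib.parse
import urllib.request


def build_min_qps_request(
    workspace_url: str,
    token: str,
    endpoint_name: str,
    min_qps: int,
) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that sets min_qps.

    Pass min_qps=-1 to reset the endpoint to the default configuration.
    The bearer-token header is an assumption about how you authenticate.
    """
    # Percent-encode the endpoint name so it is safe as a URL path segment.
    path = "/api/2.0/vector-search/endpoints/" + urllib.parse.quote(endpoint_name, safe="")
    body = json.dumps({"min_qps": min_qps}).encode("utf-8")
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + path,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and parse the JSON response.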

How scaling applies

After you set a minimum QPS, the required capacity is provisioned the next time an index on that endpoint is created or synced. To apply the change immediately, trigger a sync on each index hosted on the endpoint.

Note

Attempting to update minimum QPS while a scaling operation is in progress returns a RESOURCE_CONFLICT error. Wait for the current operation to complete before retrying.
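
One way to handle this error is to retry the update with a growing delay. The following is a minimal sketch, not an SDK feature. The helper name retry_on_resource_conflict and its defaults are hypothetical, and it assumes the error surfaces as an exception whose message contains the text RESOURCE_CONFLICT, which might not match how your client reports it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_resource_conflict(
    update: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call update(), retrying with exponential backoff on RESOURCE_CONFLICT.

    Assumption: a conflict is detected by looking for "RESOURCE_CONFLICT"
    in the exception message. Any other exception is re-raised immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return update()
        except Exception as exc:
            is_conflict = "RESOURCE_CONFLICT" in str(exc)
            if not is_conflict or attempt == max_attempts:
                raise
            # Delay doubles on each retry: base, 2*base, 4*base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

For example, you can wrap the update call in a lambda, such as lambda: client.update_endpoint(name="my-endpoint", min_qps=500), and pass it as update.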

Limitations

No autoscaling: You must set minimum QPS manually based on expected traffic. If traffic exceeds the provisioned level, 429 errors occur. See Plan for query spikes.
Standard endpoints only: Storage-optimized endpoints do not support min_qps.
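
Because there is no autoscaling, you need an estimate of peak traffic before you choose a value. The following sketch shows one possible heuristic, not a Databricks recommendation: count requests per one-second bucket from your own query logs, take the peak, and add headroom. The function name suggest_min_qps and the default headroom of 1.2 are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Iterable


def suggest_min_qps(request_timestamps_s: Iterable[float], headroom: float = 1.2) -> int:
    """Suggest a min_qps value from observed request timestamps (in seconds).

    Heuristic (an assumption, not an official sizing rule):
    peak requests in any one-second bucket, multiplied by a headroom factor
    and rounded up to a whole number.
    """
    per_second = Counter(math.floor(t) for t in request_timestamps_s)
    if not per_second:
        raise ValueError("no traffic samples provided")
    peak_qps = max(per_second.values())
    return math.ceil(peak_qps * headroom)
```

Actual throughput also depends on index size, vector dimensionality, query complexity, and filter usage, so validate the chosen value with a load test.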

Last updated on 2026-02-20