Important
This feature is in Beta. Account admins can control access to this feature from the account console Previews page. See Manage Azure Databricks previews.
This page describes how to configure rate limits for AI Gateway (Beta) endpoints. Rate limits allow you to enforce consumption limits on an endpoint to manage capacity and costs.
Requirements
- AI Gateway (Beta) preview enabled for your account. See Manage Azure Databricks previews.
- An Azure Databricks workspace in an AI Gateway (Beta) supported region.
Configure rate limits on an endpoint
You can limit the number of queries per minute (QPM) or tokens per minute (TPM) that your endpoint accepts.
To enable rate limits, select Rate limits when configuring your AI Gateway endpoint. You can define query-based and token-based rate limits at the following levels:
| Field | Description |
|---|---|
| Endpoint | Specify the maximum QPM or TPM that the entire endpoint can handle. This limit applies to all traffic, regardless of the user. |
| User (Default) | Specify a default per-user rate limit that applies to all users of the endpoint, unless a more specific, custom rate limit is defined. |
| Custom rate limits | Specify rate limits for individual users, service principals, or user groups. |
Details and behavior
- Rate limits apply only to users with permission to query the endpoint.
- By default, there are no rate limits configured for users or the endpoint.
- The endpoint rate limit is a global maximum. If this limit is exceeded, all requests to the endpoint are blocked, regardless of any user-specific or group-specific rate limits.
- If an endpoint, user, or service principal has both a query-based rate limit and token-based rate limit specified, the more restrictive rate limit is enforced.
- Custom rate limits override the User (Default) rate limit.
- If both a user-specific limit and a group-specific limit apply to a user, the user-specific limit is enforced.
- If a user belongs to multiple user groups with different QPM or TPM rate limits, the user is rate limited if they exceed all of the QPM rate limits or all of the TPM rate limits of their user groups.
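The precedence rules above can be sketched as a small decision function. This is an illustrative model only, not the AI Gateway implementation: the `Limit` and `EndpointLimits` classes, the `is_rate_limited` function, and the principal and group names are all hypothetical. It also assumes that a limit counts as exceeded when usage in the current minute is strictly greater than the limit, and that a group with no limit in a given dimension is ignored for that dimension.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Limit:
    """One rate limit. None means that dimension is not limited."""
    qpm: Optional[int] = None
    tpm: Optional[int] = None


@dataclass
class EndpointLimits:
    """Hypothetical in-memory view of an endpoint's rate limit settings."""
    endpoint: Limit = field(default_factory=Limit)
    user_default: Limit = field(default_factory=Limit)
    custom_users: Dict[str, Limit] = field(default_factory=dict)
    custom_groups: Dict[str, Limit] = field(default_factory=dict)


def exceeds(limit: Limit, queries: int, tokens: int) -> bool:
    """With both QPM and TPM set, whichever is hit first applies."""
    over_qpm = limit.qpm is not None and queries > limit.qpm
    over_tpm = limit.tpm is not None and tokens > limit.tpm
    return over_qpm or over_tpm


def is_rate_limited(cfg: EndpointLimits, principal: str, groups: List[str],
                    usage: Tuple[int, int],
                    endpoint_usage: Tuple[int, int]) -> bool:
    """usage and endpoint_usage are (queries, tokens) in the current minute."""
    # 1. The endpoint limit is a global maximum for all traffic.
    if exceeds(cfg.endpoint, *endpoint_usage):
        return True
    queries, tokens = usage
    # 2. A user-specific limit takes precedence over group limits.
    if principal in cfg.custom_users:
        return exceeds(cfg.custom_users[principal], queries, tokens)
    # 3. With several group limits, the user is limited only after exceeding
    #    all QPM limits or all TPM limits of their groups.
    group_limits = [cfg.custom_groups[g] for g in groups if g in cfg.custom_groups]
    if group_limits:
        qpms = [g.qpm for g in group_limits if g.qpm is not None]
        tpms = [g.tpm for g in group_limits if g.tpm is not None]
        over_all_qpm = bool(qpms) and all(queries > q for q in qpms)
        over_all_tpm = bool(tpms) and all(tokens > t for t in tpms)
        return over_all_qpm or over_all_tpm
    # 4. Otherwise the User (Default) limit applies.
    return exceeds(cfg.user_default, queries, tokens)


cfg = EndpointLimits(
    endpoint=Limit(qpm=100),
    user_default=Limit(qpm=10),
    custom_users={"alice": Limit(qpm=50)},
    custom_groups={"ds": Limit(qpm=20), "ml": Limit(qpm=30)},
)
print(is_rate_limited(cfg, "bob", [], (11, 0), (11, 0)))              # True
print(is_rate_limited(cfg, "alice", ["ds"], (25, 0), (25, 0)))        # False
print(is_rate_limited(cfg, "carol", ["ds", "ml"], (25, 0), (25, 0)))  # False
print(is_rate_limited(cfg, "carol", ["ds", "ml"], (31, 0), (31, 0)))  # True
```

In this sketch, `carol` at 25 queries is over the `ds` limit but not the `ml` limit, so she is not yet rate limited. `alice` is governed by her user-specific limit of 50 QPM even though her group allows only 20.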
Limitations
- You can specify a maximum of 20 rate limits per endpoint.
- You can specify a maximum of 5 group-specific rate limits per endpoint.
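If you generate rate limit settings programmatically, a pre-flight check against these limits can catch problems before you apply them. The following is a minimal sketch: the `validate_rate_limits` function and the `scope` key are hypothetical and do not reflect the AI Gateway configuration schema. It also assumes that group-specific limits count toward the total of 20.

```python
from typing import Dict, List

MAX_RATE_LIMITS_PER_ENDPOINT = 20
MAX_GROUP_RATE_LIMITS_PER_ENDPOINT = 5


def validate_rate_limits(rate_limits: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the limits fit."""
    errors = []
    if len(rate_limits) > MAX_RATE_LIMITS_PER_ENDPOINT:
        errors.append(
            f"{len(rate_limits)} rate limits defined; "
            f"the maximum is {MAX_RATE_LIMITS_PER_ENDPOINT} per endpoint."
        )
    # "scope" is a hypothetical key that marks a limit as group-specific.
    group_count = sum(1 for r in rate_limits if r.get("scope") == "group")
    if group_count > MAX_GROUP_RATE_LIMITS_PER_ENDPOINT:
        errors.append(
            f"{group_count} group-specific rate limits defined; "
            f"the maximum is {MAX_GROUP_RATE_LIMITS_PER_ENDPOINT} per endpoint."
        )
    return errors
```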