Model serving with Serverless Real-Time Inference
This feature is in Public Preview.
This article describes model serving with Azure Databricks Serverless Real-Time Inference, including its advantages and limitations compared to Classic MLflow model serving.
Serverless Real-Time Inference exposes your MLflow machine learning models as scalable REST API endpoints. This functionality uses Serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account. See the Serverless Real-Time Inference pricing page for more details.
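As a minimal sketch of what querying such a REST endpoint involves, the snippet below builds a scoring request body in the `dataframe_split` orientation that MLflow-based serving endpoints commonly accept. The column names and feature values are illustrative placeholders, not part of any real model.

```python
import json

def build_scoring_payload(columns, rows):
    """Serialize feature rows into a dataframe_split JSON request body.

    This is the pandas "split" orientation used by MLflow scoring
    servers: a list of column names plus a list of row values.
    """
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})

# Hypothetical feature columns and two rows to score.
payload = build_scoring_payload(
    columns=["feature_a", "feature_b"],
    rows=[[1.0, 2.0], [3.0, 4.0]],
)
print(payload)
```

The resulting JSON string would be sent as the body of an HTTP POST to the endpoint's invocation URL, with a Databricks access token in the `Authorization` header.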
Classic MLflow model serving uses a single-node cluster that runs under your own account within what is now called the Classic data plane. This data plane includes the virtual network and its associated compute resources such as clusters for notebooks and jobs, pro and classic SQL warehouses, and Classic model serving endpoints.
Why use Serverless Real-Time Inference?
Serverless Real-Time Inference offers:
- Ability to launch an endpoint with one click: Databricks automatically prepares a production-ready environment for your model and offers serverless configuration options for compute.
- High availability and scalability: Serverless Real-Time Inference is intended for production use and can support up to 3000 queries-per-second (QPS). Serverless Real-Time Inference endpoints automatically scale up and down, which means that endpoints automatically adjust based on the volume of scoring requests.
- Dashboards: Use the built-in Serverless Real-Time Inference dashboard to monitor the health of your model endpoints using metrics such as QPS, latency, and error rate.
- Feature store integration: When your model is trained with features from Databricks Feature Store, the model is packaged with feature metadata. If you configure your online store, these features are incorporated in real-time as scoring requests are received.
While this service is in preview, the following limits apply:
- Payload size limit of 16 MB per request.
- Default limit of 200 QPS of scoring requests per enrolled workspace. You can increase this limit up to 3000 QPS per workspace by reaching out to your Databricks support contact.
- Best-effort support for latency overhead of less than 100 milliseconds and for availability.
- It is possible for a workspace to be deployed in a supported region but be served by a control plane in a different region. Such workspaces do not support Serverless Real-Time Inference and return a `Your workspace is not currently supported.` message. In this case, create a new workspace in a supported region, or enable the feature on a different workspace that does not have this issue. Reach out to your Databricks representative for more information.
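To avoid rejected requests under the 16 MB payload limit above, a client can check the serialized request size before sending. The helper below is an illustrative client-side check only; the service enforces the real limit server-side.

```python
import json

MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # 16 MB per-request limit

def payload_within_limit(payload: dict) -> bool:
    """Return True if the JSON-serialized payload fits under 16 MB.

    Illustrative sketch: measures the UTF-8 encoded size of the JSON
    body that would be POSTed to the endpoint.
    """
    return len(json.dumps(payload).encode("utf-8")) <= MAX_PAYLOAD_BYTES

# A tiny scoring payload is well under the limit.
small = {"dataframe_split": {"columns": ["x"], "data": [[1.0]]}}
print(payload_within_limit(small))
```

If a batch of rows exceeds the limit, splitting it into several smaller requests is a straightforward workaround.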
Serverless Real-Time Inference endpoints are open to the internet for inbound traffic unless an IP allowlist is enabled in the workspace, in which case the allowlist applies to the endpoints as well.
Serverless Real-Time Inference is available in the following Azure regions:
Staging and production time expectations
Transitioning a model from staging to production takes time. Deploying a newly registered model version involves building a model container image and provisioning the model endpoint. This process can take ~5 minutes.
Databricks performs a zero-downtime update of production endpoints by keeping the existing model deployment up until the new one becomes ready. Doing so ensures no interruption for model endpoints that are in use.
If model computation takes longer than 60 seconds, requests will time out. If you believe your model computation will take longer than 60 seconds, please reach out to your Databricks support contact.
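A client can handle the 60-second limit gracefully by retrying transient timeouts with backoff. The sketch below uses a hypothetical `send_request` callable as a stand-in for the HTTP POST to the endpoint; it is assumed to raise `TimeoutError` when the request times out.

```python
import time

def score_with_retries(send_request, max_attempts=3, base_delay=1.0):
    """Call send_request(timeout=60), retrying with exponential backoff.

    `send_request` is a placeholder for an HTTP call to a serving
    endpoint; the 60-second client timeout mirrors the server-side
    computation limit described above.
    """
    for attempt in range(max_attempts):
        try:
            return send_request(timeout=60)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the timeout to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Retrying only helps with transient slowness; if the model itself consistently needs more than 60 seconds, contact Databricks support as noted above.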
During the Public Preview, you must reach out to your Databricks support contact to enable Serverless Real-Time Inference on your workspace before you can create endpoints. See Enable Serverless Real-Time Inference endpoints for model serving.
After Serverless Real-Time Inference endpoints have been enabled on your workspace, you need the following permissions to create endpoints for model serving:
- Cluster Creation permissions on the workspace.
- Can Manage Production Versions permission on the registered model you want to serve.