Serve multiple models to a Model Serving endpoint
This article describes how to serve multiple models from a single CPU serving endpoint with Azure Databricks Model Serving.
For serving multiple generative AI models, like those provided by external models, see Serve multiple external models to an endpoint.
Requirements
See Requirements for model serving endpoint creation.
To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoint ACLs.
Create an endpoint and set the initial traffic split
You can create model serving endpoints with the Databricks Machine Learning serving API or the Databricks Machine Learning UI. An endpoint can serve any Python MLflow model registered in the Model Registry.
The following API example creates a single endpoint with two models and sets the endpoint traffic split between those models. The served model, current, hosts version 1 of model-A and gets 90% of the endpoint traffic, while the other served model, challenger, hosts version 1 of model-B and gets 10% of the endpoint traffic.
POST /api/2.0/serving-endpoints
{
  "name":"multi-model",
  "config":{
    "served_entities":[
      {
        "name":"current",
        "entity_name":"model-A",
        "entity_version":"1",
        "workload_size":"Small",
        "scale_to_zero_enabled":true
      },
      {
        "name":"challenger",
        "entity_name":"model-B",
        "entity_version":"1",
        "workload_size":"Small",
        "scale_to_zero_enabled":true
      }
    ],
    "traffic_config":{
      "routes":[
        {
          "served_model_name":"current",
          "traffic_percentage":"90"
        },
        {
          "served_model_name":"challenger",
          "traffic_percentage":"10"
        }
      ]
    }
  }
}
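The request body above can also be assembled programmatically. The following Python sketch builds the same payload from a list of served models and a traffic split. The helper name build_create_request is hypothetical, and the checks (routes must reference a served model, percentages must total 100) are assumptions added for illustration, not documented API behavior. The sketch only constructs the request and does not send it.

```python
import json


def build_create_request(endpoint_name, served, split):
    """Return (method, path, body) for creating a multi-model endpoint.

    served: list of (served_name, model_name, model_version) tuples
    split:  dict mapping served_name -> integer traffic percentage
    """
    names = {name for name, _, _ in served}
    unknown = set(split) - names
    if unknown:
        raise ValueError(f"route refers to unknown served model: {sorted(unknown)}")
    # Assumption for this sketch: the routes should account for all traffic.
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")

    body = {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "name": name,
                    "entity_name": model,
                    "entity_version": str(version),
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }
                for name, model, version in served
            ],
            "traffic_config": {
                "routes": [
                    # Percentages are emitted as strings to mirror the example body.
                    {"served_model_name": name, "traffic_percentage": str(pct)}
                    for name, pct in split.items()
                ]
            },
        },
    }
    return "POST", "/api/2.0/serving-endpoints", body


method, path, body = build_create_request(
    "multi-model",
    [("current", "model-A", 1), ("challenger", "model-B", 1)],
    {"current": 90, "challenger": 10},
)
print(method, path)
print(json.dumps(body, indent=2))
```

Validating the split before sending keeps a typo in a served model name from reaching the API as a malformed configuration.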
Update the traffic split between served models
You can also update the traffic split between served models. The following API example sets the served model, current, to get 50% of the endpoint traffic and the other model, challenger, to get the remaining 50% of the traffic.
You can also make this update from the Serving tab in the Databricks Machine Learning UI using the Edit configuration button.
PUT /api/2.0/serving-endpoints/{name}/config
{
  "served_entities":[
    {
      "name":"current",
      "entity_name":"model-A",
      "entity_version":"1",
      "workload_size":"Small",
      "scale_to_zero_enabled":true
    },
    {
      "name":"challenger",
      "entity_name":"model-B",
      "entity_version":"1",
      "workload_size":"Small",
      "scale_to_zero_enabled":true
    }
  ],
  "traffic_config":{
    "routes":[
      {
        "served_model_name":"current",
        "traffic_percentage":"50"
      },
      {
        "served_model_name":"challenger",
        "traffic_percentage":"50"
      }
    ]
  }
}
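Because the update request repeats the served entities and changes only the routes, the new body can be derived from the existing configuration. The following Python sketch shows one way to do that. The helper name with_traffic_split is hypothetical, and the validation rules are assumptions for illustration. It returns a new dictionary suitable as the body of the PUT request and leaves the input untouched.

```python
import copy


def with_traffic_split(config, split):
    """Return a copy of config whose routes follow the given split.

    config: dict with "served_entities" and "traffic_config" keys
    split:  dict mapping served model name -> integer traffic percentage
    """
    known = {entity["name"] for entity in config["served_entities"]}
    unknown = set(split) - known
    if unknown:
        raise ValueError(f"no served entity named: {sorted(unknown)}")
    # Assumption for this sketch: the routes should account for all traffic.
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")

    updated = copy.deepcopy(config)  # do not mutate the caller's config
    updated["traffic_config"] = {
        "routes": [
            # Percentages are emitted as strings to mirror the example body.
            {"served_model_name": name, "traffic_percentage": str(pct)}
            for name, pct in split.items()
        ]
    }
    return updated


current_config = {
    "served_entities": [
        {"name": "current", "entity_name": "model-A", "entity_version": "1",
         "workload_size": "Small", "scale_to_zero_enabled": True},
        {"name": "challenger", "entity_name": "model-B", "entity_version": "1",
         "workload_size": "Small", "scale_to_zero_enabled": True},
    ],
    "traffic_config": {
        "routes": [
            {"served_model_name": "current", "traffic_percentage": "90"},
            {"served_model_name": "challenger", "traffic_percentage": "10"},
        ]
    },
}

new_config = with_traffic_split(current_config, {"current": 50, "challenger": 50})
update_path = "/api/2.0/serving-endpoints/{}/config".format("multi-model")
print("PUT", update_path)
```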
Query individual models behind an endpoint
In some scenarios, you may want to query individual models behind the endpoint.
You can do so by using:
POST /serving-endpoints/{endpoint-name}/served-models/{served-model-name}/invocations
This request queries the specified served model directly. The request format is the same as for querying the endpoint. When you query an individual served model, the traffic settings are ignored.
In the context of the multi-model endpoint example, if all requests are sent to /serving-endpoints/multi-model/served-models/challenger/invocations, then all requests are served by the challenger served model.
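As a rough illustration, the following Python sketch builds such a request with the standard library. The workspace host, the token, and the payload shape are placeholders assumed for the example. Bearer-token authentication is also an assumption here. Use whatever host, credentials, and input format you already use to query the endpoint. The helper name served_model_request is hypothetical, and the sketch constructs the request without sending it.

```python
import json
import urllib.request
from urllib.parse import quote


def served_model_request(host, token, endpoint_name, served_model_name, payload):
    """Build a POST request that targets one served model behind an endpoint."""
    path = "/serving-endpoints/{}/served-models/{}/invocations".format(
        quote(endpoint_name, safe=""),
        quote(served_model_name, safe=""),
    )
    return urllib.request.Request(
        url=host.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: replace with your workspace URL, token, and model input.
payload = {"inputs": [[1.0, 2.0, 3.0]]}
req = served_model_request(
    "https://workspace.example.com", "MY_TOKEN", "multi-model", "challenger", payload
)
print(req.get_method(), req.full_url)
# To send the request: urllib.request.urlopen(req)
```

Because the served model name is part of the path, every request built this way goes to challenger regardless of the 90/10 or 50/50 split configured on the endpoint.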