Deploy an agent for generative AI applications

2025-06-05

This article shows how to deploy your AI agent to Mosaic AI Model Serving using the deploy() function from the Agent Framework Python API.

Deploying agents on Mosaic AI Model Serving provides the following benefits:

Model Serving manages autoscaling, logging, version control, and access control allowing you to focus on developing quality agents.
Subject matter experts can use the Review App to interact with the deployed agent and provide feedback that you can incorporate into your monitoring and evaluations.
You can monitor the agent by running evaluation on live traffic. Although user traffic won't include the ground truth, LLM judges (and the custom metric you created) perform an unsupervised evaluation.

Requirements

MLflow 2.13.1 or above to deploy agents using the deploy() API from databricks.agents.
Register an AI agent to Unity Catalog. See Register the agent to Unity Catalog.
Deploying agents from outside a Databricks notebook requires databricks-agents SDK version 0.12.0 or above.
The endpoint creator (the user deploying the agent) must have CREATE VOLUME permissions on the Unity Catalog schema selected to store the inference tables at deployment time. This ensures that relevant assessment and logging tables can be created in the schema. See Enable and disable inference tables.

Install the databricks-agents SDK.

%pip install databricks-agents
dbutils.library.restartPython()

Deploy an agent using `deploy()`

Use deploy() to deploy your agent to a model serving endpoint.

from databricks import agents

deployment = agents.deploy(uc_model_name, uc_model_info.version)

# Retrieve the query endpoint URL for making API requests
deployment.query_endpoint

The deploy() function performs the following actions by default:

`deploy()` action	Description
Create CPU model serving endpoints	Makes your agent accessible to user-facing applications by serving it through a model serving endpoint.
Provision short-lived service principal credentials	Databricks automatically provides short-lived credentials with minimum permissions to access the Databricks-managed resources defined when logging your model. Databricks checks that the endpoint owner has the necessary permissions before issuing the credentials to prevent privilege escalation and unauthorized access. See Authentication for dependent resources. If your agent depends on non-Databricks-managed resource, you can pass environment variables with secrets to `deploy()`. See Configure access to resources from model serving endpoints.
Enable Review App	Allows stakeholders to interact with the agent and provide feedback. See Use the review app for human reviews of a gen AI app (MLflow 2).
Enable inference tables	Monitor and debug agents by logging request inputs and responses. For `ChatAgent` and `ChatModel` agents, inference tables are enabled with AI Gateway. For other deprecated agent schemas, standard inference tables are used. For streaming response logs, only fields and traces compatible with `ChatAgent` and `ChatCompletion` are aggregated.
Log REST API requests and Review App feedback	Logs API requests and feedback to an inference table. Create feedback model to accept and log feedback from the Review App. This model is served in the same CPU model serving endpoint as your deployed agent.
Enable Lakehouse Monitoring for Gen AI (beta)	Requires enrollment in the Lakehouse Monitoring for Gen AI beta. Basic monitoring is automatically enabled for deployed agent traces.
Enable real-time tracing and monitoring with MLflow 3 (beta)	Requires enrollment in the Lakehouse Monitoring for Gen AI beta and use of MLflow 3.0 or above. In addition to logging traces from deployed agents to inference tables for longer-term storage, Databricks logs traces from your deployed agent to an MLflow experiment for real-time visibility. This reduces monitoring and debugging latencies. When you create a new endpoint via `agents.deploy()`, monitoring and tracing are configured to read and write from the currently active MLflow experiment. Configure the experiment for a particular endpoint by calling `mlflow.set_experiment()` before invoking `agents.deploy()` to create the endpoint. Traces from all served agents in the endpoint (including agents added to the endpoint via subsequent calls to `agents.deploy()`) are written to this experiment. Monitoring computes quality metrics on traces in this experiment. By default, only basic monitoring metrics are configured. To add LLM judges and more, see Set up monitoring.

Note

Deployments can take up to 15 minutes to complete. Raw JSON payloads take 10 - 30 minutes to arrive, and the formatted logs are processed from the raw payloads about every hour.

Customize deployment

To customize the deployment, you can pass additional arguments to deploy(). For example, you can enable scale to zero for idle endpoints by passing scale_to_zero_enabled=True. This reduces costs but increases the time to serve initial queries.

For more parameters, see Databricks Agents Python API.

Retrieve and delete agent deployments

Retrieve or manage existing agent deployments:

from databricks.agents import list_deployments, get_deployments, delete_deployment

# Print all current deployments
deployments = list_deployments()
print(deployments)

# Get the deployment for a specific agent model name and version
agent_model_name = ""  # Set to your Unity Catalog model name
agent_model_version = 1  # Set to your agent model version
deployment = get_deployments(model_name=agent_model_name, model_version=agent_model_version)

# Delete an agent deployment
delete_deployment(model_name=agent_model_name, model_version=agent_model_version)

Authentication for dependent resources

AI agents often need to authenticate to other resources to complete tasks. For example, an agent may need to access a Vector Search index to query unstructured data.

Your agent can use one of the following methods to authenticate to dependent resources when you serve it behind a Model Serving endpoint:

Automatic authentication passthrough: Declare Databricks resource dependencies for your agent during logging. Databricks can automatically provision, rotate, and manage short-lived credentials when your agent is deployed to securely access resources. Databricks recommends using automatic authentication passthrough where possible.
On-behalf-of-user authentication: Allows using agent end user credentials to access Databricks REST APIs and resources
Manual authentication: Manually specify long-lived credentials during agent deployment. Use manual authentication for Databricks resources that do not support automatic authentication passthrough, or for external API access.

Automatic authentication passthrough

Model Serving supports automatic authentication passthrough for the most common types of Databricks resources used by agents.

To enable automatic authentication passthrough, you must specify dependencies during agent logging.

Then, when you serve the agent behind an endpoint, Databricks performs the following steps:

Permission verification: Databricks verifies that the endpoint creator can access all dependencies specified during agent logging.
Service principal creation and grants: A service principal is created for the agent model version and is automatically granted read access to agent resources.

Note

The system-generated service principal does not appear in API or UI listings. If the agent model version is removed from the endpoint, the service principal is also deleted.
Credential provisioning and rotation: Short-lived credentials (an M2M OAuth token) for the service principal are injected into the endpoint, allowing agent code to access Databricks resources. Databricks also rotates the credentials, ensuring that your agent has continued, secure access to dependent resources.

This authentication behavior is similar to the “Run as owner” behavior for Databricks dashboards - downstream resources like Unity Catalog tables are accessed using the credentials of a service principal with least-privilege access to dependent resources.

The following table lists the Databricks resources that support automatic authentication passthrough and the permissions the endpoint creator must have when deploying the agent.

Note

Unity Catalog resources also require USE SCHEMA on the parent schema and USE CATALOG on the parent catalog.

Resource type	Permission
SQL Warehouse	Use Endpoint
Model Serving endpoint	Can Query
Unity Catalog Function	EXECUTE
Genie space	Can Run
Vector Search index	Can Use
Unity Catalog Table	SELECT

On-behalf-of-user authentication

On-behalf-of-user authentication allows agent developers to access sensitive Databricks resources using the agent's end user credentials. To enable on-behalf-of-user access to resources, there are two steps:

In agent code, ensure a databricks resource is being accessed with a client that has on-behalf-of-user authentication enabled. See Deploy an agent using on-behalf-of-user authentication for more information.
At agent logging time, specify the end user REST API scopes (for example, vectorsearch.vector-search-endpoints) required by your agent. When your agent is subsequently deployed, it can access Databricks resources on behalf of the end user, but only using the specified scopes. For more information on API scopes, see On-behalf-of-user authentication.

Manual authentication

You can also manually provide credentials using secrets-based environment variables. Manual authentication can be helpful in the following scenarios:

The dependent resource does not support automatic authentication passthrough.
The agent is accessing an external resource or API.
The agent needs to use credentials other than those of the agent deployer.

For example, to use the Databricks SDK in your agent to access other dependent resources, you can set the environment variables described in Databricks client unified authentication.

Monitor deployed agents

After an agent is deployed to Databricks Model Serving, you can use AI Gateway inference tables to monitor the deployed agent. The inference tables contain detailed logs of requests, responses, agent traces, and agent feedback from the review app. This information lets you debug issues, monitor performance, and create a golden dataset for offline evaluation.

Important

If MLflow 3 is installed in your development environment when you call agents.deploy(), your endpoint will log MLflow Traces in real-time to the MLflow Experiment activated at the time of calling agents.deploy(). You can call mlflow.set_experiment() to change the active experiment before deployment.

See the MLflow docs for more details.

See Debug & Observe Your App with Tracing.

Get deployed applications

The following shows how to get your deployed agents.

from databricks.agents import list_deployments, get_deployments

# Get the deployment for specific model_fqn and version
deployment = get_deployments(model_name=model_fqn, model_version=model_version.version)

deployments = list_deployments()
# Print all the current deployments
deployments

See Databricks Agents Python API.

Provide feedback on a deployed agent (experimental)

When you deploy your agent with agents.deploy(), agent framework also creates and deploys a “feedback” model version within the same endpoint, which you can query to provide feedback on your agent application. Feedback entries appear as request rows within the inference table associated with your agent serving endpoint.

Note that this behavior is experimental: Databricks may provide a first-class API for providing feedback on a deployed agent in the future, and future functionality may require migrating to this API.

Limitations of this API include:

The feedback API lacks input validation - it always responds successfully, even if passed invalid input.
The feedback API requires passing in the Databricks-generated request_id of the agent endpoint request on which you wish to provide feedback. To get the databricks_request_id, include {"databricks_options": {"return_trace": True}} in your original request to the agent serving endpoint. The agent endpoint response will then include the databricks_request_id associated with the request so that you can pass that request ID back to the feedback API when providing feedback on the agent response.
Feedback is collected using inference tables. See inference table limitations.

The following example request provides feedback on the agent endpoint named “your-agent-endpoint-name”, and assumes that the DATABRICKS_TOKEN environment variable is set to a Databricks REST API token.

curl \
  -u token:$DATABRICKS_TOKEN \
  -X POST \
  -H "Content-Type: application/json" \
  -d '
      {
          "dataframe_records": [
              {
                  "source": {
                      "id": "user@company.com",
                      "type": "human"
                  },
                  "request_id": "573d4a61-4adb-41bd-96db-0ec8cebc3744",
                  "text_assessments": [
                      {
                          "ratings": {
                              "answer_correct": {
                                  "value": "positive"
                              },
                              "accurate": {
                                  "value": "positive"
                              }
                          },
                          "free_text_comment": "The answer used the provided context to talk about Lakeflow Declarative Pipelines"
                      }
                  ],
                  "retrieval_assessments": [
                      {
                          "ratings": {
                              "groundedness": {
                                  "value": "positive"
                              }
                          }
                      }
                  ]
              }
          ]
      }' \
https://<workspace-host>.databricks.com/serving-endpoints/<your-agent-endpoint-name>/served-models/feedback/invocations

You can pass additional or different key-value pairs in the text_assessments.ratings and retrieval_assessments.ratings fields to provide different types of feedback. In the example, the feedback payload indicates that the agent's response to the request with ID 573d4a61-4adb-41bd-96db-0ec8cebc3744 was correct, accurate, and grounded in context fetched by a retriever tool.

Share via