Debug a deployed AI agent

This page covers how to debug common issues with AI agents deployed on Azure Databricks.

Most debugging sections on this page apply to agents deployed to Databricks Apps. However, you can also find debugging information for agents deployed on Model Serving (legacy) using the tab selectors.

Author agents using best practices

Use the following best practices when authoring agents:

  • Enable MLflow tracing: Follow the best practices in Author an AI agent and deploy it on Databricks Apps. Enable MLflow trace autologging to make your agents easier to debug.
  • Document tools clearly: Clear tool and parameter descriptions ensure your agent understands your tools and uses them appropriately. See Improve tool-calling with clear documentation.
  • Add timeouts and token limits to LLM calls: Add timeouts and token limits to the LLM calls in your code to avoid delays caused by long-running steps.
    • If your agent uses the OpenAI client to query an Azure Databricks LLM serving endpoint, set custom timeouts on the serving endpoint calls as needed.
  • Validate configuration before deployment: Run databricks bundle validate before you deploy to catch YAML configuration issues early. This helps identify mismatched resource references, invalid permissions, and syntax errors.
  • Test locally first: Use local development to catch issues before you deploy. Start your agent server locally, test with sample requests, and verify that MLflow traces appear correctly before you deploy to Databricks Apps.
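The timeout advice above can be applied generically. The following is a minimal sketch, not a Databricks API: a wrapper that puts a hard deadline on any blocking call, such as an LLM request made with a client that lacks a native timeout option.

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run a blocking call (for example, an LLM request) with a hard deadline.

    Raises concurrent.futures.TimeoutError if fn does not return in time,
    so one slow serving-endpoint call cannot stall the whole agent step.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout_s)
```

If your client library supports a native timeout parameter (the OpenAI client does), prefer that over a wrapper like this one.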

Debug local development issues

Test your agent locally to identify issues before deployment.

Before you run your agent locally, verify that your environment is configured correctly:

  1. Check Databricks CLI version: Run databricks -v to verify that you have version 0.283.0 or later.

  2. Verify CLI profiles: Run databricks auth profiles to see the configured authentication profiles.

  3. Validate environment configuration: Check that your .env file contains the required variables, especially MLFLOW_TRACKING_URI, which must use the format databricks://PROFILE_NAME to include your CLI profile.
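A minimal .env for local development might look like the following. The profile name and experiment ID are illustrative placeholders; use your own values.

```ini
# .env — values are illustrative placeholders
MLFLOW_TRACKING_URI=databricks://DEFAULT
MLFLOW_EXPERIMENT_ID=<your-experiment-id>
```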

Common local development errors

| Error | Cause | Solution |
| --- | --- | --- |
| The provided MLFLOW_EXPERIMENT_ID does not exist | Wrong tracking URI format, or the experiment was deleted | Verify that MLFLOW_TRACKING_URI uses the databricks://PROFILE_NAME format with your CLI profile name |
| Module not found | Dependencies not installed | Run uv sync to install dependencies |
| Port already in use | Another process is using the port | Use the --port flag to specify a different port (for example, uv run start-app --port 8001) |
| Authentication errors when running locally | The environment is not configured | Run the quickstart script or manually configure the .env file with your CLI profile |
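The first error in the table is usually a malformed MLFLOW_TRACKING_URI. A quick sanity check for the expected databricks://PROFILE_NAME format (an illustrative helper, not part of MLflow):

```python
import re

def is_valid_tracking_uri(uri: str) -> bool:
    """Check that the URI names a CLI profile: databricks://PROFILE_NAME."""
    return re.fullmatch(r"databricks://[\w.-]+", uri) is not None
```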

Test the agent locally

To test your agent before deployment:

  1. Start the agent server locally:

    uv run start-app
    
  2. In another terminal, send a test request:

    curl -X POST http://localhost:8000/invocations \
      -H "Content-Type: application/json" \
      -d '{"input": [{"role": "user", "content": "hello"}]}'
    
  3. View MLflow traces in the Azure Databricks UI to verify your agent is logging traces correctly.
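The curl request above can also be scripted for repeated local testing. The payload shape matches the example request; the helper itself is an illustrative stdlib-only sketch, not a Databricks SDK call.

```python
import json
import urllib.request

def invoke_agent(url, messages, timeout=30):
    """POST an invocations-style payload and return the parsed JSON response."""
    payload = json.dumps({"input": messages}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, invoke_agent("http://localhost:8000/invocations", [{"role": "user", "content": "hello"}]) sends the same request as the curl command.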

Debug configuration issues

Configuration errors in databricks.yml and app.yaml are common sources of deployment failures.

Validate the Databricks Asset Bundles configuration

Validate the Databricks Asset Bundles configuration before deploying the app:

databricks bundle validate

This command checks your configuration for:

  • YAML syntax errors
  • Missing required fields
  • Invalid resource references
  • Permission configuration issues

Common configuration mismatches

| Configuration point | Rule | How to debug |
| --- | --- | --- |
| valueFrom references in app.yaml | Must exactly match a resource name in databricks.yml | Search for the exact string in both files to verify they match |
| App name | Must start with the agent- prefix (for example, agent-data-analyst) | Check the name field under resources.apps in databricks.yml |
| Genie space ID | Must be the 32-character hex string from the Genie URL | Extract it from the URL path: https://workspace.cloud.databricks.com/genie/rooms/{SPACE_ID} |
| Unity Catalog function reference | Must use the format catalog.schema.function_name | Verify the function exists using databricks unity-catalog functions list |
| Lakebase instance reference | Must use value (not valueFrom) in the app.yaml file | The instance name is a literal string, not a resource reference |
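The Genie space ID check from the table can be automated with a small helper (illustrative, not part of the Databricks CLI), which pulls the 32-character hex ID out of a Genie room URL:

```python
import re

def extract_genie_space_id(url: str) -> str:
    """Pull the 32-character hex space ID out of a Genie room URL."""
    match = re.search(r"/genie/rooms/([0-9a-f]{32})", url)
    if not match:
        raise ValueError(f"No Genie space ID found in {url!r}")
    return match.group(1)
```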

Debug deployment issues

Agents deployed to Apps

App already exists error

If you see Error: failed to create app - An app with the same name already exists, you have two options:

Option 1: Bind to existing app (recommended)

# Get existing app configuration
databricks apps get <app-name> --output json

# Sync the configuration to your databricks.yml, then bind
databricks bundle deployment bind <bundle-name> <app-name> --auto-approve

# Deploy
databricks bundle deploy
databricks bundle run <bundle-name>

Option 2: Delete and recreate

databricks apps delete <app-name>
databricks bundle deploy
databricks bundle run <bundle-name>

App not updating after deployment

databricks bundle deploy only uploads files to the workspace. You must also run databricks bundle run <bundle-name> to restart the app with the new code.

Always deploy using both commands:

databricks bundle deploy && databricks bundle run <bundle-name>

View deployment status and logs

To check your app's deployment status:

databricks apps get <app-name>

To view app logs in real-time:

databricks apps logs <app-name> --follow

Agents on Model Serving (legacy)

If you deployed your agent using agents.deploy() to a Model Serving endpoint, review Debugging guide for Model Serving for deployment-specific issues.

To debug runtime issues such as slow or failing requests, see Debug runtime errors.

Debug runtime errors

Agents deployed to Apps

Use app logs and request testing to identify issues with your deployed agent.

Analyze app logs

View real-time logs from your deployed app:

databricks apps logs <app-name> --follow

Look for:

  • Stack traces indicating code errors
  • Permission denied messages for resources
  • Connection errors to external services
  • Timeout messages

Common runtime errors

| Error | Cause | Solution |
| --- | --- | --- |
| 302 redirect when querying the app | Using a personal access token instead of OAuth | Get an OAuth token with databricks auth token |
| Agent not using available tools | Tools not returned from the MCP client | Verify that the MCP server URL is correct and that the resource has the proper permissions in databricks.yml |
| Streaming response breaks mid-response | Connection timeout | Increase the CHAT_PROXY_TIMEOUT_SECONDS environment variable in app.yaml |
| Agent returns "Memory not available" | Missing user_id in the request | Pass custom_inputs.user_id in the request payload |
| Empty or error responses despite a 200 status | An error occurred within the streamed response | Check the actual stream content and app logs, not just the HTTP status code |
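The last row in the table is easy to miss: a 200 status does not guarantee a clean stream. A small sketch of scanning streamed chunks for embedded error payloads follows; the exact chunk schema depends on your agent framework, so this assumes newline-delimited JSON with optional SSE "data:" framing and a top-level "error" key.

```python
import json

def find_stream_errors(lines):
    """Scan newline-delimited JSON stream chunks for embedded error payloads."""
    errors = []
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            line = line[len("data:"):].strip()  # strip SSE framing, if present
        if not line:
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON chunk; ignore
        if isinstance(chunk, dict) and "error" in chunk:
            errors.append(chunk["error"])
    return errors
```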

Agents on Model Serving (legacy)

Use inference tables and MLflow traces to identify issues with agents deployed to Model Serving endpoints.

Identify problematic requests

If you enabled MLflow trace autologging while authoring your agent, traces are automatically logged in inference tables. Use these traces to identify agent components that are slow or failing.

  1. In your workspace, go to the Serving tab and select your deployment name.
  2. In the Inference tables section, find the inference table's fully-qualified name. For example, my-catalog.my-schema.my-table.
  3. Run the following in a Databricks notebook:
    %sql
    SELECT * FROM `my-catalog`.`my-schema`.`my-table`
    
  4. Inspect the Response column for detailed trace information.
  5. Filter on request_time, databricks_request_id, or status_code to narrow down the results.
    %sql
    SELECT * FROM `my-catalog`.`my-schema`.`my-table`
    WHERE status_code != 200
    

Analyze root cause issues

After identifying failing or slow requests, use the mlflow.models.validate_serving_input API to invoke your agent against the failed input request. View the resulting trace and perform root cause analysis on the failed response.

For a faster development loop, update your agent code directly and iterate by invoking your agent against the failed input example.

Debug authentication errors

Agents deployed to Apps

OAuth token authentication required

You must use a Databricks OAuth token to query agents deployed to Apps. Using a Personal Access Token (PAT) results in a 302 redirect error.

To get an OAuth token:

databricks auth token

Use the token in requests to your deployed app:

TOKEN=$(databricks auth token | jq -r '.access_token')
curl -X POST <app-url>/invocations \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"input": [{"role": "user", "content": "hello"}]}'

Resource permission errors

When your agent cannot access workspace resources, verify the resource is properly configured in databricks.yml. Each resource type requires specific permissions:

| Error | Cause | Solution |
| --- | --- | --- |
| Permission denied on a Genie space | Missing genie_space resource | Add a genie_space resource with permission: 'CAN_RUN' |
| Vector search index not accessible | Missing uc_securable resource for the index | Add a uc_securable resource with securable_type: 'TABLE' and permission: 'SELECT' |
| Unity Catalog function execution denied | Missing uc_securable resource for the function | Add a uc_securable resource with securable_type: 'FUNCTION' and permission: 'EXECUTE' |
| Serving endpoint access denied | Missing serving_endpoint resource | Add a serving_endpoint resource with permission: 'CAN_QUERY' |
| SQL warehouse access denied | Missing sql_warehouse resource | Add a sql_warehouse resource with permission: 'CAN_USE' |

Example resource configuration in databricks.yml:

resources:
  apps:
    my_agent:
      name: 'agent-my-app'
      resources:
        - name: 'my_genie_space'
          genie_space:
            space_id: '01234567890abcdef01234567890abcd'
            permission: 'CAN_RUN'
        - name: 'my_vector_index'
          uc_securable:
            securable_full_name: 'catalog.schema.index_name'
            securable_type: 'TABLE'
            permission: 'SELECT'

Custom MCP server permissions

If your agent connects to a custom MCP server running as a Databricks app, you must manually grant permissions since apps are not yet supported as resource dependencies in databricks.yml.

# Get your agent app's service principal
AGENT_SP=$(databricks apps get <agent-app-name> --output json | jq -r '.service_principal_name')

# Grant permission on the MCP server app
databricks apps update-permissions <mcp-server-app-name> \
  --json "{\"access_control_list\": [{\"service_principal_name\": \"$AGENT_SP\", \"permission_level\": \"CAN_USE\"}]}"

Agents on Model Serving (legacy)

If your deployed agent encounters authentication errors while accessing resources such as vector search indexes or LLM endpoints, verify that it was logged with the necessary resources for automatic authentication passthrough. See Automatic authentication passthrough.

To inspect the logged resources, run the following in a notebook:

%pip install -U mlflow[databricks]
%restart_python

import mlflow
mlflow.set_registry_uri("databricks-uc")

# Replace with the model name and version of your deployed agent
agent_registered_model_name = ...
agent_model_version = ...

model_uri = f"models:/{agent_registered_model_name}/{agent_model_version}"
agent_info = mlflow.models.Model.load(model_uri)
print(f"Resources logged for agent model {model_uri}:", agent_info.resources)

To re-add missing or incorrect resources, log the agent and deploy it again.

If you use manual authentication for resources, verify that environment variables are correctly set. Manual settings override any automatic authentication configurations. See Manual authentication.

Debug memory and storage issues

For agents using Lakebase for memory storage, the following issues are common:

| Error | Cause | Solution |
| --- | --- | --- |
| relation 'store' does not exist | Memory tables not initialized | Run await store.setup() locally before deploying to create the required tables |
| Unable to resolve Lakebase instance | Wrong instance name or incorrect configuration | Verify that LAKEBASE_INSTANCE_NAME uses value (not valueFrom) in app.yaml and matches the instance_name in databricks.yml |
| permission denied for table store | Missing Lakebase permissions | Add a database resource in databricks.yml with permission: 'CAN_CONNECT_AND_CREATE' |
| Memory not persisting across conversations | Different user_id per request | Ensure that you pass a consistent user_id in custom_inputs for each user |
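The last row in the table matters in practice: memory lookups key off user_id, so every request for the same user should carry the same value in custom_inputs. The payload shape below matches the request examples on this page; the ID value is a placeholder.

```json
{
  "input": [{"role": "user", "content": "What did we discuss last time?"}],
  "custom_inputs": {"user_id": "user-123"}
}
```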

Example Lakebase resource configuration:

resources:
  apps:
    my_agent:
      resources:
        - name: 'memory_database'
          database:
            instance_name: '<lakebase-instance-name>'
            database_name: 'postgres'
            permission: 'CAN_CONNECT_AND_CREATE'

Before deploying an agent with memory, initialize the tables locally:

import asyncio
from databricks_langchain import AsyncDatabricksStore

async def setup_memory():
    async with AsyncDatabricksStore(
        instance_name='your-lakebase-instance',
        embedding_endpoint='databricks-gte-large-en',
        embedding_dims=1024,
    ) as store:
        await store.setup()

asyncio.run(setup_memory())