Guide: Agents development workflow

This guide provides a starting point to understand the full lifecycle of building an AI application or AI agent. Throughout this guide, "AI agent" is an umbrella term for GenAI-powered systems, including simple LLM calls, AI functions, and agent-based implementations.

Overview of the development lifecycle

  1. Understand use case, scope, and success metrics
  2. Build an initial AI agent
  3. Iterate on AI agent quality
  4. Align with stakeholders before production
  5. Release to production and continuously monitor quality

1. Understand use case, scope, and success metrics

Before building anything, clarify what the AI agent is meant to do. Align with stakeholders, including the people who will sign off on deploying to production.

  • What types of inputs will the agent handle (the "domain" or "scope")? Which users will submit these inputs?
  • How should the agent ideally respond to common inputs? What information or context should it use?
  • What criteria define a good or bad response: tone, accuracy, completeness, response length, safety, citations, or other requirements?
  • What system requirements and constraints are there in production: cost, latency, and scalability?
  • What are potential failure modes, and how should the agent handle them: bad user inputs, insufficient information to answer, user feedback indicating a bad answer, or others?

Choose the simplest viable approach. Many use cases do not require complex agentic or multi-agent systems. Before building, assess where your problem falls on the complexity continuum. Will simple deterministic logic or batch AI functions suffice? If dynamic tool-calling, reasoning, or coordination are needed, then consider tool-calling agents or multi-agent systems. For deeper guidance, see the Agent system design patterns.

This foundation enables you to:

  1. Identify the data sources and tools your agent will need
  2. Write initial instructions or prompts that reflect the intended behavior
  3. Identify domain experts or testers who can provide representative examples and early feedback
  4. Create automated judges that encode the assessment criteria and accelerate iteration

You do not need perfect clarity at this stage, and your understanding will improve as you iterate. But stronger early alignment, especially on how quality will be measured and what "production ready" means, makes later quality improvements and sign-off significantly faster.

2. Build an initial AI agent

After your use case and goals are well-defined, you are ready to prototype your AI agent. Databricks provides both guided, UI-based routes and fully custom, code-based routes for building AI agents.

2.1. Prepare data and tools

AI agents generally use data and tools to provide context and abilities. See AI agent tools for an overview of working with data and tools on Databricks.

Search for existing data and tools before creating new ones:

  • Explore available data in Unity Catalog or workspace search to understand what governed assets already exist, and what context and capabilities they can provide, before creating new ones.
  • In AI Playground, you can view and select tools that are already available to agents, such as Vector Search indexes, MCP servers, or UC Functions.

Create and manage new assets as needed. All of these data assets and tools are governed and versioned in Unity Catalog, making them discoverable and reusable across AI agents and applications.

2.2. Build an initial agent

Before building a custom agent, assess whether a declarative Agent Bricks offering or an existing Databricks solution accelerator already matches your use case. For common patterns, these guided approaches can significantly reduce setup effort, improve default quality, and speed time to production.

If a custom agent is still required, new builders should start with the fastest way to experiment. Use AI Playground to prototype an agent without writing code. AI Playground allows you to try different models, do prompt engineering, and test tools so you can quickly understand data quality, agent behavior, and the potential of your approach. You can then export the agent as code for further customization and iteration.

If you already have agent code, you can bring existing code into Databricks and deploy it as a Databricks App.

As you build your agent, plan ahead for evaluation and production:

  • Instrument your agent with MLflow Tracing to record and analyze agent behavior.
    • At this stage, focus on functional correctness: make sure the agent runs end to end and can access the required data and tools.
    • Vibe check for early issues such as wrong tool selection, missing context, or hallucinations.
    • Later, these traces will be used for evaluating agent quality.
  • During implementation, consider the right authentication method for your production application.
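The instrumentation bullet above can be sketched in plain Python. This is a schematic of what span-based tracing records, not the MLflow API itself (in MLflow, the `@mlflow.trace` decorator or autologging for supported libraries handles this for you); `retrieve_docs` and `answer` are hypothetical:

```python
import functools
import time

TRACE = []  # collected spans for one request; stands in for a tracing backend

def trace(fn):
    """Record each call's name, inputs, output, and latency, similar to span-based tracing."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACE.append({
            "span": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "seconds": round(time.time() - start, 3),
        })
        return result
    return wrapper

@trace
def retrieve_docs(query):
    # hypothetical retrieval tool
    return ["doc about " + query]

@trace
def answer(query):
    # hypothetical agent entry point: retrieve context, then respond
    docs = retrieve_docs(query)
    return f"Based on {len(docs)} document(s): ..."

answer("billing policy")
```

Recording inputs and outputs at each step is what later lets you debug wrong tool selection or missing context from the trace alone.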

3. Iterate on AI agent quality

After a working prototype exists, the next phase is a tight loop of measuring, understanding, and improving quality. Databricks places MLflow Evaluation at the center of this loop, supported by MLflow Tracing, evaluation datasets, and LLM judges.

Automated scorers and LLM judges provide scale and consistency, but human feedback is critical for validating real-world usefulness and understanding subtle failures. Human feedback also guides the development and calibration of LLM judges, and it typically enters in three stages as the agent matures:

  1. Early developer and stakeholder validation
  2. Broader review by domain experts
  3. End user feedback

3.1. Validate early behavior

Developers and a small group of stakeholders or domain experts can provide quick, early feedback. Before scaling testing and evaluation, confirm the agent does the right things in the most obvious situations.

During prototyping, developers often perform an informal "vibe check" by manually querying the agent to confirm it runs end to end and behaves as expected. With the MLflow Tracing UI, developers can attach feedback or expectations directly to traces to flag quality issues, mark successful examples, and capture notes for future evaluation and iteration.

After you deploy an internal prototype, the Review App Chat UI provides a simple UI for collecting feedback. Share the Chat UI for your prototype with a small set of developers or domain experts who can ask both reasonable and problematic queries.

MLflow Tracing records the interactions and feedback to build an initial dataset of results. Analyze traces with the MLflow UI or code to understand the agent's performance and behavior. If results are bad or unexpected, use the traces to debug:

  • Analyze quality issues in the agent, such as tool misuse, hallucinations, or missing context. Apply fixes, such as tuning prompts, adjusting tool usage, or improving data. See 3.4. Fix issues and re-verify improvements.
  • As you iterate, you can use the trace dataset as representative user inputs to generate traces for your new prototype.
  • Repeat this loop: run, inspect, fix, and re-run, until the agent handles all or most of the representative inputs as expected.
  • More issues may be uncovered and addressed in later iterations. Quality improvement is iterative and not limited to this early phase.

After this step, you can feel confident the prototype behaves sensibly in common cases and achieves a reasonable level of quality, before investing in more extensive testing.

3.2. Expand testing and feedback

After the prototype works in simple cases, scale up quality evaluation by broadening your set of beta testers and by collecting more customized feedback. This phase reveals blind spots such as unexpected topics, misunderstood queries, tool and retrieval gaps, or emerging usage patterns. It also expands your evaluation datasets.

  • Roll out the application to a broader set of stakeholders and domain experts, or to beta end users. Incorporate their feedback as the agent is exposed to broader usage patterns.
  • Capture more detailed feedback and expectations using Review App labeling sessions with custom schema for expert feedback.
  • Build evaluation datasets by syncing human feedback and labeled traces, preparing for systematic evaluation and monitoring in the next step.
  • To further enrich the evaluation dataset, consider generating synthetic evaluation sets.
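As a sketch of the dataset-building step above, the following converts hypothetical labeled traces into evaluation rows of inputs plus expectations. The field names are illustrative, not a required schema:

```python
# Hypothetical labeled traces: recorded interactions with human feedback attached.
labeled_traces = [
    {"request": "How do I reset my password?",
     "response": "Go to Settings > Security.",
     "feedback": {"is_good": True,
                  "expected": "Go to Settings > Security and click Reset."}},
    {"request": "What is the refund window?",
     "response": "30 days, I think.",
     "feedback": {"is_good": False,
                  "expected": "Refunds are accepted within 14 days."}},
]

def to_eval_example(trace):
    """Turn a labeled trace into an evaluation row: inputs plus expert expectations."""
    return {
        "inputs": {"request": trace["request"]},
        "expectations": {"expected_response": trace["feedback"]["expected"]},
    }

eval_dataset = [to_eval_example(t) for t in labeled_traces]
```

Keeping the expectation separate from the original response lets the same row score any future agent version.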

3.3. Evaluate quality and debug systematically

As your evaluation datasets become larger and more diverse, you will need structured and more automated ways to detect issues, surface the most important failures, and understand root causes.

In practice, you will likely divide your data into two types of evaluation datasets:

  • Regression testing: Data with high-quality AI responses helps to define expected behavior. Use these datasets to validate that new versions of the agent continue to perform well across a broad and diverse set of expected scenarios.
  • Issue-focused debugging: Data with low-quality AI responses may include a variety of unwanted behaviors. Isolate groups of traces that exhibit the same types of low-quality behavior so you can understand the root causes and iterate on targeted fixes.

The tools below help to build and analyze both types of evaluation datasets.
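A minimal sketch of this split, assuming each trace already carries an overall quality score and an issue tag from a judge or human reviewer (the threshold and tags are illustrative):

```python
from collections import defaultdict

# Hypothetical scored traces: each has an overall quality score and,
# for failures, a tag describing the type of issue.
scored_traces = [
    {"request": "q1", "score": 0.95, "issue": None},
    {"request": "q2", "score": 0.30, "issue": "hallucination"},
    {"request": "q3", "score": 0.90, "issue": None},
    {"request": "q4", "score": 0.40, "issue": "wrong_tool"},
    {"request": "q5", "score": 0.25, "issue": "hallucination"},
]

THRESHOLD = 0.8

# High-quality traces define expected behavior for regression testing.
regression_set = [t for t in scored_traces if t["score"] >= THRESHOLD]

# Low-quality traces are grouped by issue type for targeted root-cause analysis.
debug_set = [t for t in scored_traces if t["score"] < THRESHOLD]
by_issue = defaultdict(list)
for t in debug_set:
    by_issue[t["issue"]].append(t)
```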

Run regression tests

  • Build regression tests by selecting representative subsets of data for which you have high-quality AI responses or human expectations.
  • Define evaluation criteria using built-in or custom LLM judges and scorers. Automated evaluation can use LLMs alone to assess response quality, or it can compare responses against ground-truth answers or expectations.
  • Run evaluation on new versions of your agent to ensure updates do not degrade previously good behavior.
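The regression gate above can be as simple as comparing aggregate metrics between versions. A hedged sketch, with hypothetical metric values and a small tolerance for run-to-run noise:

```python
# Hypothetical aggregate judge scores for the baseline and a candidate version.
baseline = {"correctness": 0.86, "groundedness": 0.91, "safety": 0.99}
candidate = {"correctness": 0.88, "groundedness": 0.92, "safety": 0.95}

TOLERANCE = 0.02  # allow small metric noise between evaluation runs

# A metric regresses if the candidate falls below the baseline by more
# than the tolerance; gate the release on "no metric regressed".
regressions = {
    metric: (baseline[metric], candidate[metric])
    for metric in baseline
    if candidate[metric] < baseline[metric] - TOLERANCE
}
passed = not regressions
```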

Identify types of low-quality responses

Improve the accuracy of automated detection

Though you can begin to build evaluation datasets using mostly human feedback, you can scale evaluation with automated detection. As you iterate, invest in LLM judges or code-based scorers tailored to your application and domain.

  • Start with built-in judges, and add custom judges and code-based scorers as needed. When you observe a failure mode not captured by a built-in judge, you can automate future detection with a custom judge or scorer designed to detect that specific type of failure.
  • Use human feedback to align custom judges with expert understanding. Tuning judges to reduce false positives and negatives will increase trust in automated evaluation and triage.
  • Your new judges and scorers can be used both for automated evaluation and monitoring and for filtering traces to build datasets for debugging.
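Code-based scorers are often simple deterministic checks. The examples below are illustrative heuristics (citation presence and length budget), which you could register as custom scorers in your evaluation harness; the exact checks should come from your own assessment criteria:

```python
import re

def cites_sources(response: str) -> bool:
    """Code-based scorer: pass if the response contains at least one [n]-style citation."""
    return bool(re.search(r"\[\d+\]", response))

def within_length(response: str, max_words: int = 150) -> bool:
    """Code-based scorer: pass if the response stays within the length budget."""
    return len(response.split()) <= max_words

results = [
    {"check": "cites_sources", "passed": cites_sources("Refunds take 14 days [1].")},
    {"check": "cites_sources", "passed": cites_sources("Refunds take 14 days.")},
]
```

Deterministic scorers like these are cheap to run on every trace, so they complement LLM judges rather than replace them.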

Root cause issues effectively

After a failure is identified, you need to determine why it occurred.

  • Use MLflow Tracing to inspect each step of the agent's reasoning manually:
    • Which tools were selected
    • How tool inputs and outputs were used
    • Whether retrieval returned relevant context
    • How model responses influenced downstream decisions
  • Apply MLflow AI Insights or agent-as-a-judge to analyze traces and point to likely causes such as poor grounding, bad prompt structure, or incorrect tool arguments.
  • Compare versions in MLflow's evaluation UI to see whether issues regress or persist across iterations.

The ideal outcome of this step is to have a structured understanding of what is failing, why it fails, and how to fix it. Automation and application-specific judges allow you to iterate confidently as your agent grows more capable and the test set grows more complex.

3.4. Fix issues and re-verify improvements

Just as issues are application-specific, fixes must be tailored to your application. Examples of common fixes include:

  • Prompt optimization: Refine the agent's instructions manually, or use data-driven prompt optimization. For broader agent optimization such as tuning multi-step reasoning or tool use, use DSPy tuning.
  • Tools and data: Improve tools or retrieval flows when traces show missing facts or poor grounding.
  • Routing: When traces show the wrong tools or sub-agents were called, improve tool or agent metadata, prompts, or the routing model.
  • Guardrails: When responses violate safety rules or leak information, use either AI Gateway guardrails or customized guardrails in your agent.
  • Fallbacks: Handle extreme cases, missing data, or API call failures gracefully using fallback mechanisms such as alternate API endpoints or fallback responses.
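As an illustration of the fallback pattern in the last bullet, here is a minimal sketch in plain Python; `call_primary` and `call_backup` are hypothetical endpoint wrappers, and the primary is simulated as failing:

```python
FALLBACK_RESPONSE = "Sorry, I can't answer that right now. Please try again later."

def call_primary(query):
    # hypothetical primary model endpoint; simulated as unavailable here
    raise TimeoutError("primary endpoint timed out")

def call_backup(query):
    # hypothetical alternate endpoint
    return f"(backup) answer to: {query}"

def answer_with_fallbacks(query):
    """Try each endpoint in order; return a static fallback response if all fail."""
    for call in (call_primary, call_backup):
        try:
            return call(query)
        except Exception:
            continue  # fall through to the next option
    return FALLBACK_RESPONSE
```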

As you iterate on fixes, use app versioning and the Prompt Registry to record versions for simpler comparisons and regression testing.

Each fix to prompts, retrieval, tools, data, or other parts of your agent should be validated the same way it was discovered. Re-run the new agent version on the same evaluation datasets to confirm that the issue is fixed and no regressions have been introduced.

4. Align with stakeholders before production

Before releasing an agent into a real environment, teams need a shared understanding of its current capabilities, limitations, and measured quality. Getting to this point typically requires multiple rounds of iteration and quality improvement in Step 3. At this stage, translate the technical signals (such as evaluation metrics, system metrics, and example traces) into the business context that ultimately determines whether the agent is truly "ready."

  • Translate evaluation results into clear business signals: Summarize accuracy, stability, safety, and known limitations in language stakeholders can act on.
  • Confirm standardized quality checks are met: Make sure required evaluation metrics, regression checks, and dataset coverage thresholds pass for the candidate version.
  • Validate operational readiness and obtain sign-off: Review the monitoring setup, guardrails, and rollout plan. Document risks and acceptance criteria before production.

5. Release to production and continuously monitor quality

Reaching production is a major milestone! It means the agent is ready for real users and real impact. At the same time, production is also the beginning of a new cycle. After an agent is live, it enters continuous monitoring and improvement because real usage will surface new behaviors, edge cases, and issues.

  • Collect feedback from end users in production. Link user feedback to specific traces so it can be analyzed alongside model behavior. You can do this by logging feedback as assessments attached to the original trace.
  • Leverage AI Gateway for guardrails, routing, and consistent logging. Make sure each new agent version can be evaluated against real traffic without operational friction.
  • Monitor quality on live traffic by running evaluation on sampled production traces. Confirm the new version performs at least as well as prior versions, and look for new issues as users submit new types of queries. Continuous monitoring keeps the agent reliable, safe, and aligned with business needs as it evolves. MLflow provides a monitoring dashboard, but since traces can be stored in Unity Catalog, you can customize dashboards and alerts.
  • Act on production insights:
    • For high-risk use cases, link monitoring to automated or gated rollback mechanisms to fix critical issues.
    • Use your production insights in your next iteration. Convert real-world failures into new evaluation data, and return to the evaluation and debugging loop to build the next, better version of your agent.
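A minimal sketch of sampled monitoring, assuming production traces already carry pass/fail scores from an automated judge; the traffic, sample size, and alert threshold are all illustrative:

```python
import random

# Hypothetical production traces, each scored by an automated judge (1 = pass).
production_traces = [{"id": i, "score": 1 if i % 5 else 0} for i in range(200)]

random.seed(7)  # deterministic sample for illustration
sample = random.sample(production_traces, 25)  # evaluate a sample, not all traffic

pass_rate = sum(t["score"] for t in sample) / len(sample)
ALERT_THRESHOLD = 0.7  # trigger an alert (or gated rollback) below this rate
needs_alert = pass_rate < ALERT_THRESHOLD
```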

Next steps