What is the agent optimizer? (preview)

Important

Agent Optimizer is currently in limited preview and only available through a sign-up process. To access the service, complete the intake form. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

The agent optimizer in Foundry Agent Service automatically improves your hosted agents by evaluating their behavior and generating better configurations. These configurations primarily include improved system instructions and discovered skills.

Building effective AI agents requires extensive prompt engineering. You deploy an agent with handcrafted instructions, test it against real scenarios, identify weaknesses, revise the prompt, and repeat. This loop is slow, subjective, and doesn't scale. The agent optimizer automates this cycle so you can focus on your agent's core logic.

How the agent optimizer works

The agent optimizer runs a closed-loop evaluation and improvement cycle:

  1. Evaluate the baseline. The optimizer invokes your agent against a dataset of tasks and scores each response against criteria you define or a built-in default set. The baseline is your agent's score before any changes.
  2. Generate candidates. The optimizer produces alternative configurations called candidates—rewritten instructions or discovered skills—designed to improve scores.
  3. Evaluate candidates. The optimizer tests each candidate against the same dataset.
  4. Rank and recommend. The optimizer ranks results by composite score, a value between 0.0 and 1.0 that represents aggregate performance, and marks the best candidate with ★.
  5. Deploy the winner. A single command promotes the winning candidate and saves its configuration to your agent's environment.

The entire process runs in the cloud. Start it with azd ai agent optimize (requires the azd CLI extension). The run takes 5 to 20 minutes depending on dataset size.
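The first four steps of the cycle described above can be pictured as a short loop. The following is a minimal sketch under stated assumptions, not the service's implementation: `run_cycle`, `toy_evaluate`, and `toy_generate` are hypothetical names, and the toy scoring rule only stands in for the eval model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    name: str
    score: float  # composite score in [0.0, 1.0]


def run_cycle(
    baseline: Dict[str, str],
    generate: Callable[[Dict[str, str], float], List[Dict[str, str]]],
    evaluate: Callable[[Dict[str, str]], float],
) -> List[Result]:
    """Evaluate the baseline, generate and evaluate candidates, then rank them."""
    # Step 1: score the unchanged configuration.
    baseline_score = evaluate(baseline)
    results = [Result("baseline", baseline_score)]

    # Steps 2 and 3: produce candidates and score each one the same way.
    for i, candidate in enumerate(generate(baseline, baseline_score), start=1):
        results.append(Result(f"candidate_{i}", evaluate(candidate)))

    # Step 4: rank by composite score, best first.
    return sorted(results, key=lambda r: r.score, reverse=True)


# Hypothetical stand-ins: the "score" is the share of required words present.
REQUIRED_WORDS = ("refund", "policy")


def toy_evaluate(config: Dict[str, str]) -> float:
    text = config["instructions"].lower()
    return sum(word in text for word in REQUIRED_WORDS) / len(REQUIRED_WORDS)


def toy_generate(config: Dict[str, str], score: float) -> List[Dict[str, str]]:
    base = config["instructions"]
    return [
        {"instructions": base + " Cite the refund rules."},
        {"instructions": base + " Cite the refund policy."},
    ]


for r in run_cycle({"instructions": "Help the customer."}, toy_generate, toy_evaluate):
    print(f"{r.name:<12} {r.score:.2f}")
```

The first row printed corresponds to the candidate that the optimizer would mark with ★.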

Warning

During optimization, the optimizer evaluates your agent by invoking it against every task in your dataset. If your agent calls external tools—such as APIs, databases, or third-party services—those calls execute during each evaluation run. To avoid unintended side effects (charges, state mutations, or rate limiting), consider using test endpoints or mocking tool implementations during optimization.
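One way to mock a tool during optimization is to switch implementations with an environment variable. This is only a sketch: `AGENT_TOOL_MODE`, `charge_card_real`, and `charge_card_mock` are hypothetical names chosen for illustration and are not part of Foundry Agent Service.

```python
import os
from typing import Callable, Dict, Mapping


def charge_card_real(amount_cents: int) -> Dict[str, object]:
    """Placeholder for a real payment call; deliberately not implemented here."""
    raise RuntimeError("real payment API must not run during optimization")


def charge_card_mock(amount_cents: int) -> Dict[str, object]:
    """Return a canned response with no side effects."""
    return {"status": "approved", "amount_cents": amount_cents, "mock": True}


def select_tool(env: Mapping[str, str]) -> Callable[[int], Dict[str, object]]:
    """Pick the mock when AGENT_TOOL_MODE=mock (a hypothetical variable name)."""
    if env.get("AGENT_TOOL_MODE", "real") == "mock":
        return charge_card_mock
    return charge_card_real


# Example: force mock mode for the environment used by optimization runs.
tool = select_tool({**os.environ, "AGENT_TOOL_MODE": "mock"})
print(tool(1999))
```

Keeping the switch in one place means the agent code that calls the tool doesn't change between optimization runs and production.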

Tip

For the best results, generate a dataset tailored to your agent with azd ai agent eval init before running optimization. The optimizer auto-detects the generated eval.yaml. For details, see Create an evaluation dataset.

Optimization targets

An optimization target is a specific aspect of your agent's configuration that the optimizer can improve. The agent optimizer automatically determines which targets to activate based on your agent's baseline configuration and the eval.yaml settings.

Instruction tuning

The optimizer rewrites and refines your agent's system prompt. It analyzes baseline performance and generates prompt variations that score higher.

When it activates: Instruction tuning runs when your agent has an instructions.md file in the baseline config directory. This is the most common optimization target. It works well for improving response quality, strengthening adherence to task requirements, and reducing inaccurate outputs.

Skill improvement

The optimizer improves reusable skills your agent uses. It refines existing skill bodies (the implementation content in each SKILL.md file) while keeping skill descriptions unchanged. The agent loads these skills through load_config() and appends them to the instruction set.

When it activates: Skill improvement runs when your agent has a skills/ directory in the baseline config. Use skills for agents that need structured, repeatable behaviors, such as a support agent that follows a specific escalation procedure or a travel agent that checks budget policies.

Tool optimization

The optimizer improves tool descriptions and parameter descriptions to help the model call tools more accurately. It does not change parameter types, defaults, or required fields—only the natural-language descriptions are refined.

When it activates: Tool optimization runs when your agent has a tools.json file in the baseline config. The optimizer analyzes which tool calls succeed or fail and generates clearer tool and parameter descriptions.

Model selection

The optimizer evaluates your agent across multiple model deployments in a single run to find the best quality-to-cost trade-off. For example, it can determine whether gpt-4.1-mini handles your workload at lower cost or whether gpt-4.1 provides a quality improvement that justifies the extra token cost.

When it activates: Model selection runs when you include optimization_config.model in your eval.yaml with a list of model deployments to evaluate. The optimizer scores each model option against the same dataset and shows the trade-offs.

Note

If the model list includes your agent's current model deployment, it is automatically removed from the candidates (the baseline already represents that model). If no models remain after this removal, you receive a validation error.

Configure model candidates in your eval.yaml:

# eval.yaml
options:
  optimization_config:
    model:
      - gpt-4.1
      - gpt-4.1-mini
      - gpt-4o
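The removal rule described in the note can be sketched as follows. This is an illustration only: `filter_model_candidates` is a hypothetical helper, and `ValueError` stands in for whatever validation error the service actually reports.

```python
from typing import List


def filter_model_candidates(models: List[str], current: str) -> List[str]:
    """Drop the baseline deployment; fail if nothing is left to compare."""
    remaining = [m for m in models if m != current]
    if not remaining:
        # Assumption: the real service reports a validation error here.
        raise ValueError("no model candidates remain after removing the baseline model")
    return remaining


# With the list from eval.yaml and an agent currently deployed on gpt-4.1:
print(filter_model_candidates(["gpt-4.1", "gpt-4.1-mini", "gpt-4o"], "gpt-4.1"))
```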

You can combine model selection with instruction and skill optimization in the same run. The optimizer automatically determines which targets to improve based on your baseline configuration and the optimization_config settings.
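To make the activation rules concrete, the sketch below derives the active targets from a baseline config directory and the eval.yaml options. The function `active_targets` is hypothetical, and the real optimizer might apply further checks. The file names and target names are the ones used in this article.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def active_targets(config_dir: Path, eval_options: Dict) -> List[str]:
    """Return the targets that would activate for this baseline (illustrative)."""
    targets = []
    if (config_dir / "instructions.md").is_file():
        targets.append("instruction")
    if (config_dir / "skills").is_dir():
        targets.append("skill")
    if (config_dir / "tools.json").is_file():
        targets.append("tool")
    # Model selection is opt-in through optimization_config.model.
    if eval_options.get("optimization_config", {}).get("model"):
        targets.append("model")
    return targets


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "instructions.md").write_text("You are a helpful agent.", encoding="utf-8")
    (base / "skills").mkdir()
    options = {"optimization_config": {"model": ["gpt-4.1-mini"]}}
    print(active_targets(base, options))
```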

Config resolution

When your agent starts, the load_config() function checks three sources in order:

| Priority | Source | Environment variables | When it's used |
|---|---|---|---|
| 1 | Inline JSON | OPTIMIZATION_CONFIG | After deploying directly through the API |
| 2 | Local directory | OPTIMIZATION_LOCAL_DIR (defaults to .agent_configs/) | After azd ai agent optimize apply writes config locally |
| 3 | No config | | Raises ValueError (or returns None if required=False) |

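Your agent works with or without an optimized configuration, so you don't need feature flags or conditional logic. Call load_config() and use the values it returns. If the agent should start when no config exists, pass required=False so that load_config() returns None instead of raising ValueError. For implementation details, see Make your agent optimizer-ready.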
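The resolution order can be illustrated with a small stand-in function. This is not the real load_config(): `resolve_config` is a hypothetical name, and the assumption that the local directory contains an instructions.md file is made only for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_config(env: Mapping[str, str], required: bool = True) -> Optional[dict]:
    """Check inline JSON, then the local directory, then give up (illustrative)."""
    # Priority 1: inline JSON in OPTIMIZATION_CONFIG.
    inline = env.get("OPTIMIZATION_CONFIG")
    if inline:
        return json.loads(inline)

    # Priority 2: local directory, defaulting to .agent_configs/.
    local_dir = Path(env.get("OPTIMIZATION_LOCAL_DIR", ".agent_configs/"))
    instructions = local_dir / "instructions.md"  # assumed layout
    if instructions.is_file():
        return {"instructions": instructions.read_text(encoding="utf-8")}

    # Priority 3: nothing found.
    if required:
        raise ValueError("no optimization config found")
    return None


with tempfile.TemporaryDirectory() as empty_dir:
    env = {"OPTIMIZATION_LOCAL_DIR": empty_dir}
    print(resolve_config(env, required=False))  # nothing configured yet
    env["OPTIMIZATION_CONFIG"] = '{"instructions": "Be concise."}'
    print(resolve_config(env))  # inline JSON takes priority
```

In an agent you would pass the process environment, for example `resolve_config(os.environ, required=False)`, and fall back to your handwritten instructions when the result is None.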

What gets optimized

| Field | Description | Target |
|---|---|---|
| instructions | System prompt and instructions | instruction, skill |
| skills | Discovered skill catalog | skill |
| model | Model deployment name | model |
| tools | Tool definitions (descriptions, parameters) | tool |

Models

The agent optimizer uses two models during an optimization run. Both must be deployed in your Foundry project.

| Model | Config key | CLI flag | Role | Supported models |
|---|---|---|---|---|
| Eval model | eval_model | --eval-model | Scores agent responses against criteria in the dataset | Any chat-completion model (for example, gpt-4.1-mini) |
| Optimization model | optimization_model | --optimize-model | Generates candidate configurations (instructions, skills, tools, model selection) | gpt-5, gpt-5.1, gpt-5.2, gpt-5.4, gpt-5.5, DeepSeek-V4-Pro, DeepSeek-V-3.2 |

The eval model runs once per task per candidate. It reads the agent's response and each criterion, then returns a binary score. The optimization model analyzes baseline results and generates improved candidates across the configured targets (instructions, skills, tools, and models). Because it reasons over the full dataset, a more capable optimization model typically produces better candidates.

# eval.yaml
options:
  eval_model: gpt-4.1-mini
  optimization_model: gpt-5.1

Important

The optimization model must be from the supported list above. If you don't specify optimization_model, the optimizer falls back to the eval model. In that case, the eval model must also be a supported optimization model.
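The fallback rule can be checked before you start a run. The sketch below is an illustration: `resolve_optimization_model` is a hypothetical helper, the supported list is copied from the table in this section, and `ValueError` is an assumed error type.

```python
from typing import Dict

# Copied from the "Supported models" column for the optimization model.
SUPPORTED_OPTIMIZATION_MODELS = (
    "gpt-5", "gpt-5.1", "gpt-5.2", "gpt-5.4", "gpt-5.5",
    "DeepSeek-V4-Pro", "DeepSeek-V-3.2",
)


def resolve_optimization_model(options: Dict[str, str]) -> str:
    """Use optimization_model if set, else fall back to eval_model, then validate."""
    model = options.get("optimization_model") or options.get("eval_model")
    if model not in SUPPORTED_OPTIMIZATION_MODELS:
        raise ValueError(f"{model!r} is not a supported optimization model")
    return model


print(resolve_optimization_model(
    {"eval_model": "gpt-4.1-mini", "optimization_model": "gpt-5.1"}
))
```

With only `eval_model: gpt-4.1-mini` set, this check fails, which mirrors the requirement that the fallback eval model must itself be a supported optimization model.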

Understand optimization results

This section describes the results table structure, how scores are computed, what score improvements mean, and how to diagnose common issues.

Tip

You can also view optimization results in the Azure AI Foundry portal. Navigate to your project, select Agents, choose your agent, and then select the Optimize tab to see score comparisons, charts, and deployment options.

After an optimization run completes, you see a results table:

Results:
  Candidate              Score    Pass  Eval
  ──────────────────── ─────── ───────  ──────
  baseline                0.76     83%  View
  candidate_1             0.78     73%  View
  candidate_2             0.79     78%  View
  candidate_3             0.77     71%  View
  candidate_4 ★           0.80     80%  View

  Candidate IDs:
      baseline             cand_abc123...
      candidate_1          cand_def456...
      candidate_2          cand_ghi789...
      candidate_3          cand_jkl012...
    ★ candidate_4          cand_mno345...

  Apply the best candidate locally, then deploy:
    azd ai agent optimize apply --candidate cand_mno345...
    azd deploy

Results table columns

| Column | Description |
|---|---|
| Candidate | Name of the configuration. baseline is your current agent before optimization. |
| Score | Composite score across all tasks and criteria, ranging from 0.0 to 1.0. |
| Pass | Percentage of evaluator scores that meet the pass threshold. |
| Eval | Link to the evaluation job in the Azure AI Foundry portal. |

The ★ marks the candidate with the highest composite score. This is the recommended candidate to deploy.

How scores are computed

Each evaluator in your dataset produces a raw score for the agent's response. The optimizer processes these scores to produce the composite score shown in results:

  • Rescale: Each evaluator's raw score is rescaled to 0–1.
  • Flip if needed: If an evaluator is configured so that lower is better, the score is flipped so that all evaluators use "higher is better" semantics.
  • Average: The rescaled scores across all evaluators and tasks are averaged to produce the composite score.

Composite score: The average of all rescaled evaluator scores across all tasks.
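The rescale, flip, and average steps can be sketched in a few lines. This is an illustration under assumptions: the article doesn't state how each evaluator's range is defined, so `min_value` and `max_value` are hypothetical fields, as are the names `EvaluatorScore`, `normalize`, and `composite_score`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluatorScore:
    raw: float
    min_value: float  # assumed lower bound of the evaluator's scale
    max_value: float  # assumed upper bound of the evaluator's scale
    lower_is_better: bool = False


def normalize(score: EvaluatorScore) -> float:
    """Rescale to 0-1, then flip so that higher is always better."""
    scaled = (score.raw - score.min_value) / (score.max_value - score.min_value)
    return 1.0 - scaled if score.lower_is_better else scaled


def composite_score(scores: List[EvaluatorScore]) -> float:
    """Average the normalized scores across all evaluators and tasks."""
    return sum(normalize(s) for s in scores) / len(scores)


scores = [
    EvaluatorScore(raw=4, min_value=1, max_value=5),                           # 0.75
    EvaluatorScore(raw=1, min_value=0, max_value=1),                           # 1.00
    EvaluatorScore(raw=0.25, min_value=0, max_value=1, lower_is_better=True),  # 0.75
]
print(f"{composite_score(scores):.2f}")
```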

Interpret score improvements

| Improvement | Interpretation |
|---|---|
| Less than 0.03 | Noise. Not a meaningful improvement. |
| 0.03 to 0.10 | Moderate improvement. Worth deploying. |
| 0.10 to 0.20 | Significant improvement. |
| Greater than 0.20 | Major improvement. Likely from a poor baseline. |
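If you compare scores in a script, the bands above translate into a small helper. This is a sketch with labeled assumptions: `interpret_improvement` is a hypothetical name, a difference of exactly 0.10 is treated as significant because the table doesn't say which band owns the boundary, and the "no improvement" case for a non-positive difference is added here for completeness.

```python
def interpret_improvement(baseline: float, candidate: float) -> str:
    """Map a composite-score difference to the interpretation bands."""
    # Round to avoid floating-point noise at the band edges.
    delta = round(candidate - baseline, 4)
    if delta <= 0:
        return "no improvement"  # not covered by the table; added for this sketch
    if delta < 0.03:
        return "noise"
    if delta < 0.10:
        return "moderate"
    if delta <= 0.20:
        return "significant"
    return "major"


# Scores from the sample results: baseline 0.76, candidate_4 0.80.
print(interpret_improvement(0.76, 0.80))
```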

Token trade-offs

Optimized instructions are often longer and more detailed, which can increase response token usage. Consider these factors:

  • Whether the token increase is proportional to the score improvement
  • Whether the cost increase fits your budget
  • Whether the extra length adds value or makes responses unnecessarily verbose

Pass rate

Pass rate is computed from each evaluator's pass threshold. For each evaluator score:

  • If the evaluator's raw score is less than its configured threshold, the result is a fail.
  • If the evaluator's raw score is equal to or greater than the threshold, the result is a pass.
  • For evaluators where lower is better, the logic is reversed (score above threshold is a fail).

The pass rate percentage shown in results is the proportion of evaluator scores that passed across all tasks.
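The threshold rules can be expressed as a short function. This is a minimal sketch, assuming each evaluator result carries its raw score, threshold, and direction. `Check`, `passed`, and `pass_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    raw: float
    threshold: float
    lower_is_better: bool = False


def passed(check: Check) -> bool:
    """Apply the threshold, reversing the comparison when lower is better."""
    if check.lower_is_better:
        return check.raw <= check.threshold  # above the threshold is a fail
    return check.raw >= check.threshold      # below the threshold is a fail


def pass_rate(checks: List[Check]) -> float:
    """Proportion of evaluator scores that passed across all tasks."""
    return sum(passed(c) for c in checks) / len(checks)


checks = [
    Check(raw=4, threshold=3),                            # pass
    Check(raw=2, threshold=3),                            # fail
    Check(raw=0.1, threshold=0.2, lower_is_better=True),  # pass
    Check(raw=0.2, threshold=0.2, lower_is_better=True),  # pass (not above threshold)
]
print(f"{pass_rate(checks):.0%}")
```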

All scores are zero

If all candidates (including baseline) score 0.00, the likely cause is a missing eval model. The eval model scores agent responses against criteria and must be deployed in your Foundry project.

azd ai agent optimize --eval-model gpt-4.1-mini

Important

If the eval model isn't deployed, all scores are zero with no error message. Always verify that your eval model exists in the project.

Limitations and availability