MLflow Prompt Optimization (beta)

Important

This feature is currently in Beta.

MLflow offers the mlflow.genai.optimize_prompts() API, which enables you to automatically improve your prompts using evaluation metrics and training data. This feature allows you to enhance prompt effectiveness across any agent framework by applying prompt optimization algorithms, reducing manual effort and ensuring consistent quality.

MLflow supports the GEPA optimization algorithm through the GepaPromptOptimizer researched and validated by the Mosaic Research Team. GEPA iteratively refines prompts using LLM-driven reflection and automated feedback, leading to systematic and data-driven improvements.
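To make the workflow concrete, here is a minimal end-to-end sketch: register a prompt, wrap your agent in a predict_fn that loads that prompt from the registry, and pass both to optimize_prompts with training data and a scorer. The inputs/expectations row format follows MLflow's GenAI evaluation conventions, and call_llm is a hypothetical stand-in for your own LLM client; adapt names and model URIs to your setup.

import mlflow
from mlflow.genai.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

# Register a starting prompt in the MLflow Prompt Registry.
prompt = mlflow.genai.register_prompt(
    name="qa_prompt",
    template="Answer this question: {{question}}",
)

def predict_fn(question: str) -> str:
    # Load the prompt from the registry so the optimizer can substitute
    # candidate prompts during optimization (see Troubleshooting below).
    p = mlflow.genai.load_prompt(prompt.uri)
    return call_llm(p.format(question=question))  # call_llm: your LLM client (hypothetical)

# Training rows: "inputs" are passed to predict_fn as keyword arguments;
# "expectations" hold the reference answers used by the scorer.
dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {"expected_response": "Paris"},
    },
    # ... more examples
]

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-gpt-5"),
    scorers=[Correctness(model="databricks:/databricks-claude-sonnet-4-5")],
)
# The optimized prompt is registered as a new version in the Prompt Registry.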

Key benefits

  • Automatic Improvement: Optimizes prompts based on evaluation metrics without manual tuning.
  • Data-Driven Optimization: Uses your training data and custom scorers to guide optimization.
  • Framework Agnostic: Works with any agent framework, providing broad compatibility.
  • Joint Optimization: Enables simultaneous refinement of multiple prompts for the best overall performance.
  • Flexible Evaluation: Supports custom scorers and aggregation functions (see the sketch after this list).
  • Version Control: Automatically registers optimized prompts in MLflow Prompt Registry.
  • Extensible: Plug in custom optimization algorithms by extending the base class.
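As a sketch of the custom-scorer and aggregation bullets above, a custom scorer is an ordinary function decorated with @scorer from mlflow.genai.scorers. The objective callable and the score-key names below are assumptions about how multiple scorer values are combined into the single number the optimizer maximizes; check the API reference for the exact signature.

import mlflow
from mlflow.genai.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness, scorer

# A custom scorer: reward short answers (at most 20 words).
@scorer
def conciseness(outputs) -> float:
    return 1.0 if len(str(outputs).split()) <= 20 else 0.0

# Illustrative aggregation: weight correctness over conciseness.
# (Assumption: an `objective` callable maps per-scorer values to one
# number - verify the parameter name against the API reference.)
def objective(scores: dict) -> float:
    return 0.8 * scores["correctness"] + 0.2 * scores["conciseness"]

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-gpt-5"),
    scorers=[Correctness(model="databricks:/databricks-claude-sonnet-4-5"), conciseness],
    objective=objective,
)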

Important

The optimize_prompts API requires MLflow >= 3.5.0.
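A quick way to confirm the installed version at runtime (packaging ships as an MLflow dependency):

from packaging.version import Version

import mlflow

# optimize_prompts ships with MLflow 3.5.0 and later.
assert Version(mlflow.__version__) >= Version("3.5.0"), mlflow.__version__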

Prompt optimization example

See the Optimize prompts tutorial for a simple example of prompt optimization.

The API produces an improved prompt that performs better on your evaluation criteria.

Example: Simple Prompt → Optimized Prompt

Before Optimization:

Answer this question: {{question}}

After Optimization:

Answer this question: {{question}}.
Focus on providing precise, factual information without additional commentary or explanations.

1. **Identify the Subject**: Clearly determine the specific subject of the question (e.g., geography, history) and provide a concise answer.

2. **Clarity and Precision**: Your response should be a single, clear statement that directly addresses the question. Do not add extra details, context, or alternatives.

3. **Expected Format**: The expected output should be the exact answer with minimal words where appropriate. For instance, when asked about capitals, the answer should simply state the name of the capital city, e.g., "Tokyo" for Japan, "Rome" for Italy, and "Paris" for France.

4. **Handling Variations**: If the question contains multiple parts or variations, focus on the primary query and answer it directly. Avoid over-complication.

5. **Niche Knowledge**: Ensure that the responses are based on commonly accepted geographic and historical facts, as this type of information is crucial for accuracy in your answers.

Adhere strictly to these guidelines to maintain consistency and quality in your responses.

For a complete explanation, see the MLflow documentation.

Advanced usage

See the following guides for advanced use cases:

Common use cases

The following sections provide example code for common use cases.

Improving accuracy

Optimize prompts to produce more accurate outputs:

import mlflow
from mlflow.genai.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness

# predict_fn, dataset, and prompt are defined as in the sketch above.
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-gpt-5"),
    scorers=[Correctness(model="databricks:/databricks-claude-sonnet-4-5")],
)

Optimize for safety

Ensure outputs are safe:

import mlflow
from mlflow.genai.optimizers import GepaPromptOptimizer
from mlflow.genai.scorers import Safety

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-claude-sonnet-4-5"),
    scorers=[Safety(model="databricks:/databricks-claude-sonnet-4-5")],
)

Troubleshooting

The following sections provide troubleshooting guidance for common errors.

Issue: Optimization takes too long

Solution: Reduce the dataset size or the optimizer's evaluation budget:

# Use fewer training examples
small_dataset = dataset[:20]

# Use a faster reflection model and cap the metric-call budget
optimizer = GepaPromptOptimizer(
    reflection_model="databricks:/databricks-gpt-5-mini", max_metric_calls=100
)

Issue: No improvement observed

Solution: Check your evaluation metrics and increase dataset diversity:

  • Ensure scorers accurately measure what you care about (see the sanity-check sketch after this list).
  • Increase training data size and diversity.
  • Try different optimizer configurations.
  • Verify that the output format matches what your scorers expect.
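
As a first diagnostic, score the unoptimized baseline with the same scorers. A minimal sketch, assuming the same dataset and predict_fn as above; mlflow.genai.evaluate is MLflow's GenAI evaluation entry point, and the metrics attribute on its result holds the aggregate scores:

import mlflow
from mlflow.genai.scorers import Correctness

# Score the current prompt before optimizing: if the baseline is already
# near the scorer's ceiling, or the scorer returns constant values, the
# optimizer has no signal to improve on.
baseline = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict_fn,
    scorers=[Correctness(model="databricks:/databricks-claude-sonnet-4-5")],
)
print(baseline.metrics)  # aggregate per-scorer values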

Issue: Prompts not being used

Solution: Ensure predict_fn loads the prompt from the registry and calls mlflow.entities.model_registry.PromptVersion.format:

# ✅ Correct - loads the prompt from the registry so the optimizer can substitute candidates
def predict_fn(question: str):
    prompt = mlflow.genai.load_prompt(f"prompts:/{prompt_location}@latest")
    return llm_call(prompt.format(question=question))


# ❌ Incorrect - hardcoded prompt that the optimizer can never rewrite
def predict_fn(question: str):
    return llm_call(f"Answer: {question}")

Next steps

To learn more about the API, see Optimize Prompts (Beta).

To learn more about tracing and evaluation for GenAI applications, see the following articles: