Important
This feature is currently in Beta.
MLflow offers the mlflow.genai.optimize_prompts() API, which enables you to automatically improve your prompts using evaluation metrics and training data. This feature allows you to enhance prompt effectiveness across any agent framework by applying prompt optimization algorithms, reducing manual effort and ensuring consistent quality.
MLflow supports the GEPA optimization algorithm through the GepaPromptOptimizer researched and validated by the Mosaic Research Team. GEPA iteratively refines prompts using LLM-driven reflection and automated feedback, leading to systematic and data-driven improvements.
Key benefits
- Automatic Improvement: Optimizes prompts based on evaluation metrics without manual tuning.
- Data-Driven Optimization: Uses your training data and custom scorers to guide optimization.
- Framework Agnostic: Works with any agent framework, providing broad compatibility.
- Joint Optimization: Enables simultaneous refinement of multiple prompts for the best overall performance.
- Flexible Evaluation: Supports custom scorers and aggregation functions.
- Version Control: Automatically registers optimized prompts in MLflow Prompt Registry.
- Extensible: Plug in custom optimization algorithms by extending the base class.
Important
The optimize_prompts API requires MLflow >= 3.5.0.
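You can verify the installed version at runtime before calling the API. A minimal sketch, assuming only the standard mlflow.__version__ attribute and the packaging library:

from packaging.version import Version

import mlflow

# optimize_prompts() is available in MLflow 3.5.0 and later; fail fast otherwise.
assert Version(mlflow.__version__) >= Version("3.5.0"), (
    f"MLflow {mlflow.__version__} does not include optimize_prompts; upgrade to >= 3.5.0."
)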
Prompt optimization example
See Optimize prompts tutorial for a simple example of prompt optimization.
The API produces an improved prompt that performs better on your evaluation criteria.
Example: Simple Prompt → Optimized Prompt
Before Optimization:
Answer this question: {{question}}
After Optimization:
Answer this question: {{question}}.
Focus on providing precise,
factual information without additional commentary or explanations.
1. **Identify the Subject**: Clearly determine the specific subject
of the question (e.g., geography, history)
and provide a concise answer.
2. **Clarity and Precision**: Your response should be a single,
clear statement that directly addresses the question.
Do not add extra details, context, or alternatives.
3. **Expected Format**: The expected output should be the exact answer
with minimal words where appropriate.
For instance, when asked about capitals, the answer should
simply state the name of the capital city,
e.g., "Tokyo" for Japan, "Rome" for Italy, and "Paris" for France.
4. **Handling Variations**: If the question contains multiple
parts or variations, focus on the primary query
and answer it directly. Avoid over-complication.
5. **Niche Knowledge**: Ensure that the responses are based on
commonly accepted geographic and historical facts,
as this type of information is crucial for accuracy in your answers.
Adhere strictly to these guidelines to maintain consistency
and quality in your responses.
For a complete explanation, see the MLflow documentation.
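The following sketch shows the shape of the call that produces an optimization like the one above. It is illustrative rather than the tutorial's exact code: the prompt name qa_prompt and the llm_call helper are assumptions, the GepaPromptOptimizer import path may differ across MLflow versions, and the training records follow the inputs/expectations convention used elsewhere in MLflow GenAI evaluation.

import mlflow
from mlflow.genai.optimize import GepaPromptOptimizer  # import path is an assumption; see the API reference
from mlflow.genai.scorers import Correctness

# Register the baseline prompt ("qa_prompt" is an illustrative name).
prompt = mlflow.genai.register_prompt(
    name="qa_prompt",
    template="Answer this question: {{question}}",
)

# predict_fn must load and format the registered prompt so the optimizer can
# evaluate candidate templates. llm_call is a placeholder for your model call.
def predict_fn(question: str) -> str:
    loaded = mlflow.genai.load_prompt(prompt.uri)
    return llm_call(loaded.format(question=question))

# Training records pair inputs with expectations.
dataset = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "expectations": {"expected_response": "Paris"},
    },
    # ... more examples
]

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-gpt-5"),
    scorers=[Correctness(model="databricks:/databricks-claude-sonnet-4-5")],
)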
Advanced usage
See the MLflow documentation for guides to advanced use cases.
Common use cases
The following sections provide example code for common use cases. The snippets assume the registered prompt, predict_fn, and dataset from the sketch above.
Improving accuracy
Optimize prompts to produce more accurate outputs:
from mlflow.genai.scorers import Correctness

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-gpt-5"),
    scorers=[Correctness(model="databricks:/databricks-claude-sonnet-4-5")],
)
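The returned object also carries the optimized prompts, which are automatically registered in the Prompt Registry (see Key benefits). A sketch of inspecting the result; the optimized_prompts attribute name is an assumption, so check the optimize_prompts API reference for the exact result schema:

# Attribute names below are assumptions; consult the API reference for the
# exact result schema in your MLflow version.
for optimized in result.optimized_prompts:
    print(optimized.uri)       # registry URI of the new prompt version
    print(optimized.template)  # the rewritten template text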
Optimizing for safety
Ensure outputs are safe:
from mlflow.genai.scorers import Safety

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model="databricks:/databricks-claude-sonnet-4-5"),
    scorers=[Safety(model="databricks:/databricks-claude-sonnet-4-5")],
)
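Scorers can also be combined. Because optimize_prompts accepts a list of scorers and, per the Flexible Evaluation benefit above, supports aggregation across them, a sketch like the following should optimize for correctness and safety jointly (treat the default aggregation behavior as an assumption and check the API reference):

from mlflow.genai.scorers import Correctness, Safety

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=dataset,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="databricks:/databricks-claude-sonnet-4-5"
    ),
    # Multiple scorers guide the optimizer toward both objectives.
    scorers=[
        Correctness(model="databricks:/databricks-claude-sonnet-4-5"),
        Safety(model="databricks:/databricks-claude-sonnet-4-5"),
    ],
)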
Troubleshooting
The following sections provide troubleshooting guidance for common errors.
Issue: Optimization takes too long
Solution: Reduce the dataset size or the optimizer budget:

# Use fewer examples
small_dataset = dataset[:20]

# Use a faster reflection model and a smaller metric-call budget
optimizer = GepaPromptOptimizer(
    reflection_model="databricks:/databricks-gpt-5-mini", max_metric_calls=100
)
Issue: No improvement observed
Solution: Check your evaluation setup and training data:
- Ensure scorers accurately measure what you care about (see the sketch below).
- Increase training data size and diversity.
- Try different optimizer configurations.
- Verify that the output format matches what your scorers expect.
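To sanity-check what your scorers reward, you can start from a minimal custom scorer. A sketch using the @scorer decorator from mlflow.genai.scorers; the expected_response key is an assumption about your dataset's expectations schema:

from mlflow.genai.scorers import scorer

# Minimal exact-match scorer: passes only when the output equals the expected
# answer. The "expected_response" key is an assumption; align it with your
# training data's expectations schema.
@scorer
def exact_match(outputs, expectations):
    return outputs.strip() == expectations["expected_response"].strip()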
Issue: Prompts not being used
Solution: Ensure that predict_fn loads the prompt from the registry and formats it with mlflow.entities.model_registry.PromptVersion.format:
# ✅ Correct - loads from registry
def predict_fn(question: str):
    prompt = mlflow.genai.load_prompt(f"prompts:/{prompt_location}@latest")
    return llm_call(prompt.format(question=question))

# ❌ Incorrect - hardcoded prompt
def predict_fn(question: str):
    return llm_call(f"Answer: {question}")
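Loading from the registry matters because the optimizer evaluates candidate templates by having predict_fn load them from the registry; a hardcoded string never picks up those candidates. A one-time registration sketch for the predict_fn above (the qa_prompt name is illustrative):

import mlflow

# Register the prompt that predict_fn loads ("qa_prompt" is an illustrative name).
prompt = mlflow.genai.register_prompt(
    name="qa_prompt",
    template="Answer this question: {{question}}",
)
prompt_location = "qa_prompt"  # used in the load_prompt URI above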
Next steps
To learn more about the API, see Optimize Prompts (Beta).
To learn more about tracing and evaluation for GenAI applications, see the following articles: