Use agent metrics & LLM judges to evaluate app performance
Important
This feature is in Public Preview.
This article describes the agent metrics and large language model (LLM) judge evaluations that are computed during Mosaic AI Agent Evaluation runs.
Databricks is dedicated to enhancing the quality of judges by measuring their agreement with human raters. Databricks uses diverse, challenging examples from academic and proprietary datasets to benchmark and improve judges against state-of-the-art LLM-judge approaches, ensuring continuous improvement and high accuracy.
Agent metrics and judges
Mosaic AI Agent Evaluation uses two approaches to evaluate the quality of an agentic application:
- LLM judges: A separate LLM acts as a judge to evaluate the application's retrieval and response quality. Agent Evaluation includes a suite of built-in LLM judges that make it possible to scale up the evaluation process and include a large set of test cases.
- Deterministic calculations: Performance is assessed by deriving deterministic metrics from the application's trace and, optionally, the ground truth recorded in the evaluation set. Examples include token count and latency metrics, and retrieval recall based on ground-truth documents.
The following table lists the built-in metrics and the questions they can answer:
Metric name | Question | Metric type |
---|---|---|
chunk_relevance | Did the retriever find relevant chunks? | LLM judged |
document_recall | How many of the known relevant documents did the retriever find? | Deterministic (ground truth required) |
context_sufficiency | Did the retriever find documents sufficient to produce the expected response? | LLM judged (ground truth required) |
correctness | Overall, did the agent generate a correct response? | LLM judged (ground truth required) |
relevance_to_query | Is the response relevant to the request? | LLM judged |
groundedness | Is the response a hallucination or grounded in context? | LLM judged |
safety | Is there harmful content in the response? | LLM judged |
total_token_count, total_input_token_count, total_output_token_count | What's the total count of tokens for LLM generations? | Deterministic |
latency_seconds | What's the latency of executing the agent? | Deterministic |
You can also define a custom LLM judge to evaluate criteria specific to your use case. See Custom judge metrics.
See Information about the models powering LLM judges for LLM judge trust and safety information.
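For orientation, the following minimal sketch (illustrative values; it assumes you evaluate pre-computed responses rather than passing the model argument, and that you run in an environment where the databricks-agent model type is available) shows how an evaluation set is passed to mlflow.evaluate() and where the aggregate and per-question metrics described in this article are surfaced:

```python
import mlflow
import pandas as pd

# Illustrative evaluation set with pre-computed responses (no `model` argument).
eval_set = pd.DataFrame(
    [
        {
            "request": "What is MLflow Tracking?",
            "response": "MLflow Tracking is an API and UI for logging ML experiments.",
            "expected_response": "MLflow Tracking logs parameters, metrics, and artifacts for ML runs.",
            "retrieved_context": [
                {
                    "doc_uri": "docs/tracking.md",
                    "content": "MLflow Tracking logs parameters, metrics, and artifacts.",
                }
            ],
        }
    ]
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_set,
        model_type="databricks-agent",  # enables Agent Evaluation judges and metrics
    )

# Aggregate metrics for the evaluation set, for example
# response/llm_judged/correctness/rating/percentage.
print(results.metrics)

# Per-question results, including judge ratings and rationales.
per_question = results.tables["eval_results"]
```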
Retrieval metrics
Retrieval metrics assess how successfully your agentic application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.
recall = # of relevant retrieved items / total # of relevant items
precision = # of relevant retrieved items / # of items retrieved
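To make the arithmetic concrete, here is a small sketch (hypothetical document IDs) that applies these two formulas:

```python
# Hypothetical retrieval result: 4 items returned, 3 of which are among the
# 5 items known to be relevant.
retrieved = {"doc1", "doc2", "doc3", "doc7"}
relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}

precision = len(retrieved & relevant) / len(retrieved)  # 3 / 4 = 0.75
recall = len(retrieved & relevant) / len(relevant)      # 3 / 5 = 0.6
```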
Did the retriever find relevant chunks?
The chunk-relevance-precision LLM judge determines whether the chunks returned by the retriever are relevant to the input request. Precision is calculated as the number of relevant chunks returned divided by the total number of chunks returned. For example, if the retriever returns four chunks and the LLM judge determines that three of the four returned chunks are relevant to the request, then llm_judged/chunk_relevance/precision is 0.75.
Input required for llm_judged/chunk_relevance
Ground truth is not required.
The input evaluation set must have the following column:
- request
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].content or trace.
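The following sketch shows one possible evaluation-set row for this judge (illustrative values) when the model argument is not used, so the retrieved chunks are supplied directly:

```python
# One row for chunk_relevance: no ground truth required, but retrieved_context
# is provided because `model` is not passed to mlflow.evaluate().
chunk_relevance_row = {
    "request": "How do I create a Delta table?",
    "retrieved_context": [
        {"content": "Use CREATE TABLE ... USING DELTA to create a Delta table."},
        {"content": "Delta Lake provides ACID transactions on top of cloud storage."},
    ],
}
```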
Output for llm_judged/chunk_relevance
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
retrieval/llm_judged/chunk_relevance/ratings | array[string] | For each chunk, yes or no, indicating whether the retrieved chunk is relevant to the input request. |
retrieval/llm_judged/chunk_relevance/rationales | array[string] | For each chunk, the LLM's reasoning for the corresponding rating. |
retrieval/llm_judged/chunk_relevance/error_messages | array[string] | For each chunk, if there was an error computing the rating, details of the error are here, and other output values are NULL. If no error, this is NULL. |
retrieval/llm_judged/chunk_relevance/precision | float, [0, 1] | The percentage of relevant chunks among all retrieved chunks. |
The following metric is reported for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
retrieval/llm_judged/chunk_relevance/precision/average | float, [0, 1] | Average value of chunk_relevance/precision across all questions. |
How many of the known relevant documents did the retriever find?
document_recall is calculated as the number of relevant documents returned divided by the total number of relevant documents based on ground truth. For example, suppose that two documents are relevant based on ground truth. If the retriever returns one of those documents, document_recall is 0.5. This metric is not affected by the total number of documents returned.
This metric is deterministic and does not use an LLM judge.
Input required for document-recall
Ground truth is required.
The input evaluation set must have the following column:
- expected_retrieved_context[].doc_uri
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].doc_uri or trace.
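A minimal sketch of an evaluation-set row for document_recall (illustrative URIs), matching the two-relevant-documents example above:

```python
# Ground truth lists two relevant documents; the retriever returned one of them,
# so retrieval/ground_truth/document_recall for this row is 1 / 2 = 0.5.
document_recall_row = {
    "request": "How do I schedule a job?",
    "expected_retrieved_context": [
        {"doc_uri": "docs/jobs/schedule.md"},
        {"doc_uri": "docs/jobs/triggers.md"},
    ],
    "retrieved_context": [
        {"doc_uri": "docs/jobs/schedule.md"},
        {"doc_uri": "docs/clusters/create.md"},  # not in the ground truth
    ],
}
```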
Output for document-recall
The following metric is calculated for each question:
Data field | Type | Description |
---|---|---|
retrieval/ground_truth/document_recall | float, [0, 1] | The percentage of ground truth doc_uris present in the retrieved chunks. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
retrieval/ground_truth/document_recall/average | float, [0, 1] | Average value of document_recall across all questions. |
Did the retriever find documents sufficient to produce the expected response?
The context_sufficiency LLM judge determines whether the retriever has retrieved documents that are sufficient to produce the expected response.
Input required for context_sufficiency
Ground truth expected_response is required.
The input evaluation set must have the following columns:
- request
- expected_response
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].content or trace.
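A minimal sketch of an evaluation-set row for context_sufficiency (illustrative values), combining the required ground-truth response with directly supplied retrieved context:

```python
# Ground truth expected_response plus the retrieved chunks; the judge checks
# whether these chunks contain enough information to produce the expected answer.
context_sufficiency_row = {
    "request": "What file formats does Auto Loader support?",
    "expected_response": "Auto Loader supports formats such as JSON, CSV, and Parquet.",
    "retrieved_context": [
        {"content": "Auto Loader can ingest JSON, CSV, Parquet, and several other formats."},
    ],
}
```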
Output for context_sufficiency
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
retrieval/llm_judged/context_sufficiency/rating | string | yes or no. yes indicates that the retrieved context is sufficient to produce the expected response. no indicates that the retrieval needs to be tuned for this question so that it brings back the missing information. The output rationale should mention what information is missing. |
retrieval/llm_judged/context_sufficiency/rationale | string | LLM's written reasoning for yes or no. |
retrieval/llm_judged/context_sufficiency/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
retrieval/llm_judged/context_sufficiency/rating/percentage | float, [0, 1] | Percentage where context sufficiency is judged as yes. |
Response metrics
Response quality metrics assess how well the application responds to a user’s request. These metrics evaluate factors like the accuracy of the response compared to ground truth, whether the response is well-grounded given the retrieved context (or if the LLM is hallucinating), and whether the response is safe and free of toxic language.
Overall, did the LLM give an accurate answer?
The correctness LLM judge gives a binary evaluation and written rationale on whether the agent's generated response is factually accurate and semantically similar to the provided ground-truth response.
Input required for correctness
The ground truth expected_response is required.
The input evaluation set must have the following columns:
- request
- expected_response
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.
Important
The ground truth expected_response should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, be sure to edit the response to remove any text that is not required for an answer to be considered correct.
Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.
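A minimal sketch of an evaluation-set row for the correctness judge (illustrative values), where expected_response is trimmed to only the minimal facts required, per the note above:

```python
# The expected_response contains only the minimal required facts; extra detail
# in the ground truth would make the correctness signal noisier.
correctness_row = {
    "request": "Which cloud providers does Databricks support?",
    "expected_response": "Databricks is available on AWS, Azure, and Google Cloud.",
    # Provided directly because `model` is not passed to mlflow.evaluate().
    "response": "Databricks runs on AWS, Microsoft Azure, and Google Cloud Platform.",
}
```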
Output for correctness
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
response/llm_judged/correctness/rating | string | yes or no. yes indicates that the generated response is highly accurate and semantically similar to the ground truth. Minor omissions or inaccuracies that still capture the intent of the ground truth are acceptable. no indicates that the response does not meet the criteria. |
response/llm_judged/correctness/rationale | string | LLM's written reasoning for yes or no. |
response/llm_judged/correctness/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
response/llm_judged/correctness/rating/percentage | float, [0, 1] | Across all questions, percentage where correctness is judged as yes. |
Is the response relevant to the request?
The relevance_to_query LLM judge determines whether the response is relevant to the input request.
Input required for relevance_to_query
Ground truth is not required.
The input evaluation set must have the following column:
- request
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.
Output for relevance_to_query
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
response/llm_judged/relevance_to_query/rating | string | yes if the response is judged to be relevant to the request, no otherwise. |
response/llm_judged/relevance_to_query/rationale | string | LLM's written reasoning for yes or no. |
response/llm_judged/relevance_to_query/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
response/llm_judged/relevance_to_query/rating/percentage | float, [0, 1] | Across all questions, percentage where relevance_to_query/rating is judged to be yes. |
Is the response a hallucination, or is it grounded in the retrieved context?
The groundedness LLM judge returns a binary evaluation and written rationale on whether the generated response is factually consistent with the retrieved context.
Input required for groundedness
Ground truth is not required.
The input evaluation set must have the following column:
- request
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either trace, or both response and retrieved_context[].content.
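A minimal sketch of an evaluation-set row for the groundedness judge (illustrative values); because model is not passed, both the response and the retrieved context are supplied directly:

```python
# The judge checks whether the response is supported by the retrieved context.
groundedness_row = {
    "request": "What is the default cluster termination time?",
    "response": "Clusters terminate after 120 minutes of inactivity by default.",
    "retrieved_context": [
        {"content": "By default, clusters automatically terminate after 120 minutes of inactivity."},
    ],
}
```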
Output for groundedness
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
response/llm_judged/groundedness/rating | string | yes if the retrieved context supports all or almost all of the generated response, no otherwise. |
response/llm_judged/groundedness/rationale | string | LLM's written reasoning for yes or no. |
response/llm_judged/groundedness/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
response/llm_judged/groundedness/rating/percentage | float, [0, 1] | Across all questions, percentage where groundedness/rating is judged as yes. |
Is there harmful content in the agent response?
The safety LLM judge returns a binary rating and a written rationale on whether the generated response has harmful or toxic content.
Input required for safety
Ground truth is not required.
The input evaluation set must have the following column:
- request
In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.
Output for safety
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
response/llm_judged/safety/rating | string | yes if the response does not have harmful or toxic content, no otherwise. |
response/llm_judged/safety/rationale | string | LLM's written reasoning for yes or no. |
response/llm_judged/safety/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
response/llm_judged/safety/rating/average | float, [0, 1] | Percentage of all questions that were judged to be yes. |
Performance metrics
Performance metrics capture the overall cost and performance of the agentic applications. Overall latency and token consumption are examples of performance metrics.
What’s the token cost of executing the agentic application?
Computes the total token count across all LLM generation calls in the trace. This approximates the total cost, because more tokens generally lead to higher cost.
Token counts are only calculated when a trace is available. If the model argument is included in the call to mlflow.evaluate(), a trace is automatically generated. You can also directly provide a trace column in the evaluation dataset.
The following metrics are calculated for each question:
Data field | Type | Description |
---|---|---|
agent/total_token_count | integer | Sum of all input and output tokens across all LLM spans in the agent's trace. |
agent/total_input_token_count | integer | Sum of all input tokens across all LLM spans in the agent's trace. |
agent/total_output_token_count | integer | Sum of all output tokens across all LLM spans in the agent's trace. |
The following metrics are calculated for the entire evaluation set:
Name | Description |
---|---|
agent/total_token_count/average | Average value across all questions. |
agent/input_token_count/average | Average value across all questions. |
agent/output_token_count/average | Average value across all questions. |
What’s the latency of executing the agentic application?
Computes the entire application’s latency in seconds for the trace.
Latency is only calculated when a trace is available. If the model argument is included in the call to mlflow.evaluate(), a trace is automatically generated. You can also directly provide a trace column in the evaluation dataset.
The following metric is calculated for each question:
Name | Description |
---|---|
agent/latency_seconds | End-to-end latency based on the trace. |
The following metric is calculated for the entire evaluation set:
Metric name | Description |
---|---|
agent/latency_seconds/average | Average value across all questions. |
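Continuing the earlier mlflow.evaluate() sketch, one way to inspect these performance metrics (the exact columns available depend on your evaluation run) is:

```python
# `results` is the return value of mlflow.evaluate(..., model_type="databricks-agent").
per_question = results.tables["eval_results"]

# Per-question performance metrics described above.
print(per_question[["agent/total_token_count", "agent/latency_seconds"]])

# Evaluation-set aggregates, for example agent/latency_seconds/average.
print({k: v for k, v in results.metrics.items() if k.startswith("agent/")})
```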
Custom judge metrics
You can create a custom judge to perform assessments specific to your use case. For details, see Create custom LLM judges.
The output produced by a custom judge depends on its assessment_type, ANSWER or RETRIEVAL.
Custom LLM judge for ANSWER assessment
A custom LLM judge for ANSWER assessment evaluates the response for each question.
Outputs provided for each assessment:
Data field | Type | Description |
---|---|---|
response/llm_judged/{assessment_name}/rating | string | yes or no. |
response/llm_judged/{assessment_name}/rationale | string | LLM's written reasoning for yes or no. |
response/llm_judged/{assessment_name}/error_message | string | If there was an error computing this metric, details of the error are here. If no error, this is NULL. |
The following metric is calculated for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
response/llm_judged/{assessment_name}/rating/percentage | float, [0, 1] | Across all questions, percentage where {assessment_name} is judged as yes. |
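For example, for a hypothetical custom ANSWER judge named politeness, the per-question ratings and the evaluation-set aggregate could be read like this (the assessment name is illustrative):

```python
# Hypothetical custom ANSWER assessment named "politeness".
assessment_name = "politeness"

per_question = results.tables["eval_results"]
ratings = per_question[f"response/llm_judged/{assessment_name}/rating"]
rationales = per_question[f"response/llm_judged/{assessment_name}/rationale"]

# Evaluation-set aggregate for the custom judge.
print(results.metrics[f"response/llm_judged/{assessment_name}/rating/percentage"])
```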
Custom LLM judge for RETRIEVAL assessment
A custom LLM judge for RETRIEVAL assessment evaluates each retrieved chunk across all questions.
Outputs provided for each assessment:
Data field | Type | Description |
---|---|---|
retrieval/llm_judged/{assessment_name}/ratings | array[string] | The custom judge's evaluation for each chunk, yes or no. |
retrieval/llm_judged/{assessment_name}/rationales | array[string] | For each chunk, the LLM's written reasoning for yes or no. |
retrieval/llm_judged/{assessment_name}/error_messages | array[string] | For each chunk, if there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |
retrieval/llm_judged/{assessment_name}/precision | float, [0, 1] | Percentage of all retrieved chunks that the custom judge evaluated as yes. |
Metrics reported for the entire evaluation set:
Metric name | Type | Description |
---|---|---|
retrieval/llm_judged/{assessment_name}/precision/average | float, [0, 1] | Average value of {assessment_name}_precision across all questions. |
Information about the models powering LLM judges
- LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
- For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.
- For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
- Disabling Azure AI-powered AI assistive features prevents the LLM judge from calling Azure AI-powered models.
- Data sent to the LLM judge is not used for any model training.
- LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.