Use agent metrics & LLM judges to evaluate app performance

Important

This feature is in Public Preview.

This article describes the agent metrics and large language model (LLM) judge evaluations computed by Agent Evaluation runs. Learn how to use evaluation results to determine the quality of your agentic application.

Databricks is dedicated to enhancing the quality of judges by measuring their agreement with human raters. Databricks uses diverse, challenging examples from academic and proprietary datasets to benchmark and improve judges against state-of-the-art LLM-judge approaches, ensuring continuous improvement and high accuracy.

Evaluation run outputs

Each evaluation run generates the following types of outputs:

  • Request and response information
    • request_id
    • request
    • response
    • expected_retrieved_context
    • expected_response
    • retrieved_context
    • trace
  • Agent metrics and LLM judges

Agent metrics and LLM judges help you determine the quality of your application.
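
The following minimal sketch shows how an evaluation run is typically launched with mlflow.evaluate() and where the outputs described above land. The endpoint name and evaluation-set contents are hypothetical, and the eval_results table name assumes the default Agent Evaluation output table.

```python
import mlflow
import pandas as pd

# Minimal evaluation set; column names follow the schema described in this article.
eval_set = pd.DataFrame(
    [
        {
            "request": "What does the chunk_relevance judge measure?",
            "expected_response": "Whether each retrieved chunk is relevant to the request.",
        }
    ]
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="endpoints:/my-agent-endpoint",  # hypothetical serving endpoint for the agent
        data=eval_set,
        model_type="databricks-agent",         # enables Agent Evaluation metrics and judges
    )

# Aggregate metrics for the evaluation set, keyed by the metric names in this article.
print(results.metrics)

# Per-question outputs: request/response fields, judge ratings, rationales, and traces.
per_question_results = results.tables["eval_results"]
```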

Agent metrics and judges

There are two approaches to computing these metrics:

Use an LLM judge: A separate LLM acts as a judge to evaluate the application’s retrieval and response quality. This approach automates evaluation across numerous dimensions.

Use deterministic functions: Assess performance by deriving deterministic metrics from the application’s trace and, optionally, the ground truth recorded in the evaluation set. Examples include cost and latency metrics, and retrieval recall evaluated against ground-truth documents.

The following table lists the built-in metrics and the questions they can answer:

| Metric name | Question | Metric type |
| --- | --- | --- |
| chunk_relevance | Did the retriever find relevant chunks? | LLM judged |
| document_recall | How many of the known relevant documents did the retriever find? | Deterministic (ground truth required) |
| correctness | Overall, did the agent generate a correct response? | LLM judged (ground truth required) |
| relevance_to_query | Is the response relevant to the request? | LLM judged |
| groundedness | Is the response a hallucination or grounded in context? | LLM judged |
| safety | Is there harmful content in the response? | LLM judged |
| total_token_count, total_input_token_count, total_output_token_count | What’s the total count of tokens for LLM generations? | Deterministic |
| latency_seconds | What’s the latency of executing the agent? | Deterministic |

You can also define a custom LLM judge to evaluate criteria specific to your use case.

See Information about the models powering LLM judges for LLM judge trust and safety information.

Retrieval metrics

Retrieval metrics assess how successfully your agentic application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.

recall    =  # of relevant retrieved items / total # of relevant items
precision =  # of relevant retrieved items / # of items retrieved
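
As a concrete illustration of these two formulas, using hypothetical document IDs:

```python
# Illustrative only: retrieval recall and precision for a single question.
retrieved = {"doc_1", "doc_2", "doc_3"}   # items returned by the retriever
relevant = {"doc_1", "doc_4"}             # ground-truth relevant items

hits = retrieved & relevant
recall = len(hits) / len(relevant)        # 1/2 = 0.5
precision = len(hits) / len(retrieved)    # 1/3 ≈ 0.33
```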

Did the retriever find relevant chunks?

Determine whether the retriever returns chunks relevant to the input request. You can use an LLM judge to determine the relevance of chunks without ground truth and use a derived precision metric to quantify the overall relevance of returned chunks.

Chunk relevance precision example

  • LLM judge: chunk-relevance-precision judge
  • Ground truth required: None
  • Input evaluation set schema:
    • request
    • retrieved_context[].content or trace (only if model argument is not used in mlflow.evaluate())

The judge returns one of the following ratings for each retrieved chunk:

yes: The retrieved chunk is relevant to the input request.

no: The retrieved chunk is not relevant to the input request.

Outputs for each question:

| Data field | Type | Description |
| --- | --- | --- |
| retrieval/llm_judged/chunk_relevance/ratings | array[string] | For each chunk, yes or no if judged relevant |
| retrieval/llm_judged/chunk_relevance/rationales | array[string] | For each chunk, LLM’s reasoning for the corresponding rating |
| retrieval/llm_judged/chunk_relevance/error_messages | array[string] | For each chunk, if there was an error computing the rating, details of the error are here, and other output values will be NULL. If no error, this is NULL. |
| retrieval/llm_judged/chunk_relevance/precision | float, [0, 1] | Calculates the percentage of relevant chunks among all retrieved chunks. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| retrieval/llm_judged/chunk_relevance/precision/average | float; [0, 1] | Average value of chunk_relevance/precision across all questions |
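
The per-question precision value above is derived from the judge’s per-chunk ratings. A minimal sketch of that calculation, using hypothetical ratings:

```python
# Hypothetical per-chunk ratings returned by the chunk-relevance judge for one question.
ratings = ["yes", "no", "yes", "yes"]

# chunk_relevance/precision = fraction of retrieved chunks judged relevant.
precision = ratings.count("yes") / len(ratings)   # 3/4 = 0.75
```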

How many of the known relevant documents did the retriever find?

Calculates the recall percentage of ground truth relevant documents successfully retrieved by the retriever.

Document recall example

  • LLM judge: None, ground truth based
  • Ground truth required: Yes
  • Input evaluation set schema:
    • expected_retrieved_context[].doc_uri
    • retrieved_context[].doc_uri or trace (only if model argument is not used in mlflow.evaluate())

Outputs for each question:

| Data field | Type | Description |
| --- | --- | --- |
| retrieval/ground_truth/document_recall | float, [0, 1] | The percentage of ground truth doc_uris present in the retrieved chunks. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| retrieval/ground_truth/document_recall/average | float; [0, 1] | Across all questions, what is the average value of document_recall? |
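
Because this metric is deterministic, the per-question value can be reproduced directly from the doc_uri fields. A minimal sketch with hypothetical values:

```python
# Illustrative only: document recall for one question.
expected_doc_uris = {"docs/pricing.md", "docs/setup.md"}     # expected_retrieved_context[].doc_uri
retrieved_doc_uris = {"docs/setup.md", "docs/unrelated.md"}  # retrieved_context[].doc_uri

document_recall = len(expected_doc_uris & retrieved_doc_uris) / len(expected_doc_uris)  # 1/2 = 0.5
```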

Response metrics

Response quality metrics assess how well the application responds to a user’s request. Response metrics can measure, for instance, whether the resulting answer is accurate according to the ground truth, whether the response is grounded in the retrieved context (that is, the LLM is not hallucinating), or whether the response is safe (for example, free of toxic content).

Overall, did the LLM give an accurate answer?

Get a binary evaluation and written rationale on whether the agent’s generated response is factually accurate and semantically similar to the provided ground-truth response.

  • LLM judge: correctness judge
  • Ground truth required: Yes, expected_response
  • Input evaluation set schema:
    • request
    • expected_response
    • response or trace (only if model argument is not used in mlflow.evaluate())

The judge returns one of the following ratings:

yes: The generated response is highly accurate and semantically similar to the ground truth. Minor omissions or inaccuracies that still capture the intent of the ground truth are acceptable.

no: The response does not meet the criteria. It is either inaccurate, partially accurate, or semantically dissimilar.

Outputs provided for each question:

| Data field | Type | Description |
| --- | --- | --- |
| response/llm_judged/correctness/rating | string | yes if the response is correct (per the ground truth), no otherwise |
| response/llm_judged/correctness/rationale | string | LLM’s written reasoning for yes/no |
| response/llm_judged/correctness/error_message | string | If there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| response/llm_judged/correctness/rating/percentage | float; [0, 1] | Across all questions, what’s the percentage where correctness is judged as yes |
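
The evaluation-set percentage is the share of questions rated yes; the same aggregation applies to the other yes/no judges (relevance_to_query, groundedness, safety). A minimal sketch with hypothetical ratings:

```python
# Hypothetical per-question correctness ratings across a four-question evaluation set.
ratings = ["yes", "yes", "no", "yes"]

# correctness/rating/percentage = share of questions judged "yes".
percentage = ratings.count("yes") / len(ratings)   # 3/4 = 0.75
```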

Is the response relevant to the request?

Determine whether the response is relevant to the input request.

  • LLM judge: relevance_to_query judge
  • Ground truth required: None
  • Input evaluation set schema:
    • request
    • response or trace (only if model argument is not used in mlflow.evaluate())

The judge returns one of the following ratings:

yes: The response is relevant to the original input request.

no: The response is not relevant to the original input request.

Outputs for each question:

| Data field | Type | Description |
| --- | --- | --- |
| response/llm_judged/relevance_to_query/rating | string | yes if the response is judged to be relevant to the request, no otherwise. |
| response/llm_judged/relevance_to_query/rationale | string | LLM’s written reasoning for yes/no |
| response/llm_judged/relevance_to_query/error_message | string | If there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| response/llm_judged/relevance_to_query/rating/percentage | float; [0, 1] | Across all questions, what’s the percentage where relevance_to_query/rating is judged to be yes. |

Is the response a hallucination, or is it grounded in the retrieved context?

Get a binary evaluation and written rationale on whether the generated response is factually consistent with the retrieved context.

  • LLM judge: groundedness judge
  • Ground truth required: None
  • Input evaluation set schema:
    • request
    • retrieved_context[].content or trace (only if model argument is not used in mlflow.evaluate())
    • response or trace (only if model argument is not used in mlflow.evaluate())

The judge returns one of the following ratings:

yes: The retrieved context supports all or almost all of the generated response.

no: The retrieved context does not support the generated response.

Outputs provided for each question:

| Data field | Type | Description |
| --- | --- | --- |
| response/llm_judged/groundedness/rating | string | yes if the response is grounded (no hallucinations), no otherwise. |
| response/llm_judged/groundedness/rationale | string | LLM’s written reasoning for yes/no |
| response/llm_judged/groundedness/error_message | string | If there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| response/llm_judged/groundedness/rating/percentage | float; [0, 1] | Across all questions, what’s the percentage where groundedness/rating is judged as yes. |

Is there harmful content in the agent response?

Get a binary rating and a written rationale on whether the generated response has harmful or toxic content.

  • LLM judge: safety judge
  • Ground truth required: None
  • Input evaluation set schema:
    • request
    • response or trace (only if model argument is not used in mlflow.evaluate())

The judge returns one of the following ratings:

yes: The generated response does not have harmful or toxic content.

no: The generated response has harmful or toxic content.

Outputs provided for each question:

| Data field | Type | Description |
| --- | --- | --- |
| response/llm_judged/safety/rating | string | yes if the response does not have harmful or toxic content, no otherwise. |
| response/llm_judged/safety/rationale | string | LLM’s written reasoning for yes/no |
| response/llm_judged/safety/error_message | string | If there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| response/llm_judged/safety/rating/average | float; [0, 1] | What percentage of all questions were judged to be yes? |

Custom retrieval LLM judge

Use a custom retrieval judge to perform a custom assessment for each retrieved chunk. The LLM judge is called for each chunk across all questions. For details on configuring custom judges, see Advanced agent evaluation.

Outputs provided for each assessment:

| Data field | Type | Description |
| --- | --- | --- |
| retrieval/llm_judged/{assessment_name}/ratings | array[string] | For each chunk, yes/no per the output of the custom judge |
| retrieval/llm_judged/{assessment_name}/rationales | array[string] | For each chunk, LLM’s written reasoning for yes/no |
| retrieval/llm_judged/{assessment_name}/error_messages | array[string] | For each chunk, if there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL. |
| retrieval/llm_judged/{assessment_name}/precision | float, [0, 1] | What percentage of all retrieved chunks are judged as yes per the custom judge? |

Metrics reported for the entire evaluation set:

| Metric name | Type | Description |
| --- | --- | --- |
| retrieval/llm_judged/{assessment_name}/precision/average | float; [0, 1] | Across all questions, what’s the average value of {assessment_name}_precision |
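
The Agent Evaluation-specific configuration for custom judges is covered in Advanced agent evaluation. Purely as a sketch, assuming custom GenAI metrics can be passed to mlflow.evaluate() through its extra_metrics argument, MLflow’s generic make_genai_metric API could express an assessment like the following. The metric name, definition, grading prompt, and judge endpoint are all hypothetical.

```python
import mlflow
from mlflow.metrics.genai import make_genai_metric

# Hypothetical custom assessment: does the retrieved content identify its source document?
source_citation = make_genai_metric(
    name="source_citation",
    definition="Whether the retrieved content clearly identifies the document it came from.",
    grading_prompt=(
        "Score 1 if the provided context clearly identifies its source document; "
        "score 0 otherwise."
    ),
    model="endpoints:/databricks-meta-llama-3-1-70b-instruct",  # hypothetical judge endpoint
    greater_is_better=True,
)

# Whether extra_metrics composes with model_type="databricks-agent" depends on your
# Agent Evaluation version; see Advanced agent evaluation for the supported configuration.
# results = mlflow.evaluate(..., model_type="databricks-agent", extra_metrics=[source_citation])
```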

Performance metrics

Performance metrics capture the overall cost and performance of the agentic application. Overall latency and token consumption are examples of performance metrics.

What’s the token cost of executing the agentic application?

Computes the total token count across all LLM generation calls in the trace. This approximates the total cost, because more tokens generally mean higher cost.

Outputs for each question:

| Data field | Type | Description |
| --- | --- | --- |
| agent/total_token_count | integer | Sum of all input and output tokens across all LLM spans in the agent’s trace |
| agent/total_input_token_count | integer | Sum of all input tokens across all LLM spans in the agent’s trace |
| agent/total_output_token_count | integer | Sum of all output tokens across all LLM spans in the agent’s trace |

Metrics reported for the entire evaluation set:

| Name | Description |
| --- | --- |
| agent/total_token_count/average | Average value across all questions |
| agent/input_token_count/average | Average value across all questions |
| agent/output_token_count/average | Average value across all questions |

What’s the latency of executing the agentic application?

Computes the entire application’s latency in seconds for the trace.

Outputs for each question:

| Name | Description |
| --- | --- |
| agent/latency_seconds | End-to-end latency based on the trace |

Metrics reported for the entire evaluation set:

| Metric name | Description |
| --- | --- |
| agent/latency_seconds/average | Average value across all questions |
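
Assuming results is the object returned by mlflow.evaluate() as in the earlier sketch, the evaluation-set aggregates can be read from results.metrics using the names listed above:

```python
# Illustrative only: reading aggregate performance metrics from the evaluation result,
# keyed by the metric names documented in this article.
avg_tokens = results.metrics.get("agent/total_token_count/average")
avg_latency = results.metrics.get("agent/latency_seconds/average")
print(f"avg tokens per question: {avg_tokens}, avg latency: {avg_latency} s")
```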

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.
  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
  • Disabling Azure AI Services AI assistive features will prevent the LLM judge from calling Azure AI Services models.
  • Data sent to the LLM judge is not used for any model training.
  • LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.