How quality is assessed with LLM judges

Important

This feature is in Public Preview.

This article explains how Agent Evaluation assesses your AI application’s quality, cost, and latency and provides insights to guide your quality improvements and cost and latency optimizations. Agent Evaluation assesses quality using LLM judges in two steps:

  1. LLM judges assess specific quality aspects (such as correctness and groundedness) for each row. For details, see Step 1: LLM judges assess each row’s quality.
  2. Agent Evaluation combines individual judge’s assessments into an overall pass/fail score and root cause for any failures. For details, see Step 2: Combine LLM judge assessments to identify the root cause of quality issues.

For LLM judge trust and safety information, see Information about the models powering LLM judges.

Step 1: LLM judges assess each row’s quality

For every input row, Agent Evaluation uses a suite of LLM judges to assess different aspects of quality for the agent’s outputs. Each judge produces a yes or no score and a written rationale for that score, as shown in the example below:

sample-judge-row

For details about the LLM judges used, see Available LLM judges.

Step 2: Combine LLM judge assessments to identify the root cause of quality issues

After running the LLM judges, Agent Evaluation analyzes their outputs to assess overall quality and determine a pass/fail quality score based on the judges’ collective assessments. If overall quality fails, Agent Evaluation identifies which specific LLM judge caused the failure and provides suggested fixes.

A summary of this analysis is output in the UI:

root cause analysis overview

The results for each row are available in the detail view UI:

root cause analysis detail

Note

The data underlying these UIs is available in the MLflow run and is returned as a DataFrame by the mlflow.evaluate(...) call. See review evaluation output for details on how to access this data.
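For example, a minimal sketch of running an evaluation and reading this data back might look like the following. The evaluation-set contents and the model URI are placeholders, and the model_type value and eval_results table name reflect the standard Agent Evaluation workflow rather than anything defined in this article:

import mlflow
import pandas as pd

# Hypothetical evaluation set: one row with a request and a ground-truth answer.
eval_set = pd.DataFrame(
    [
        {
            "request": "What is MLflow?",
            "expected_response": "MLflow is an open-source platform for managing the ML lifecycle.",
        }
    ]
)

with mlflow.start_run():
    result = mlflow.evaluate(
        model="endpoints:/my-agent-endpoint",  # placeholder agent endpoint
        data=eval_set,
        model_type="databricks-agent",  # invokes Agent Evaluation
    )

# Aggregated metrics logged to the MLflow run.
print(result.metrics)

# Per-row judge ratings, rationales, and root-cause analysis as a DataFrame.
per_row_results = result.tables["eval_results"]
print(per_row_results.head())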

Available LLM judges

The table below summarizes the suite of LLM judges used in Agent Evaluation to assess different aspects of quality.

Name of the judge Step Quality aspect that the judge assesses Required inputs Requires ground truth?
relevance_to_query Response Does the response address (is relevant to) the user’s request? response, request No
groundedness Response Is the generated response grounded in the retrieved context (not hallucinating)? response, trace[retrieved_context] No
safety Response Is there harmful or toxic content in the response? response No
correctness Response Is the generated response accurate (as compared to the ground truth)? response, expected_response Yes
chunk_relevance Retrieval Did the retriever find chunks that are useful (relevant) in answering the user’s request? retrieved_context, request No
document_recall Retrieval How many of the known relevant documents did the retriever find? retrieved_context, expected_retrieved_context[].doc_uri Yes
context_sufficiency Retrieval Did the retriever find documents with sufficient information to produce the expected response? retrieved_context, expected_response Yes

Note: The chunk_relevance judge is applied separately to each retrieved chunk, producing a score and rationale for each chunk. These scores are aggregated into a chunk_relevance/precision score for each row that represents the percentage of chunks that are relevant.

The following screenshots show examples of how these judges appear in the UI:

gen judge detail

chunk relevance detail

Customer-defined judges

You can also define your own LLM judges to assess aspects of quality that are unique to your use case. For details, see Create custom LLM judges.

How Databricks maintains and improves LLM judge accuracy

Databricks is dedicated to enhancing the quality of its LLM judges. Quality is evaluated by measuring how well the LLM judge agrees with human raters, using the following metrics:

  • Increased Cohen’s Kappa (a measure of inter-rater agreement).
  • Increased accuracy (percent of predicted labels that match the human rater’s label).
  • Increased F1 score.
  • Decreased false positive rate.
  • Decreased false negative rate.

To measure these metrics, Databricks uses diverse, challenging examples from academic and proprietary datasets that are representative of customer datasets. Judges are benchmarked against state-of-the-art LLM judge approaches and improved over time, ensuring continuous improvement and high accuracy.

For more details on how Databricks measures and continuously improves judge quality, see Databricks announces significant improvements to the built-in LLM judges in Agent Evaluation.

How quality and root cause are determined

This section describes the logic used to determine quality and root cause.

Quality determination

If any of the following judges fail, the overall quality is marked fail. If all judges pass, the quality is considered pass.

  • context_sufficiency
  • groundedness
  • correctness
  • safety
  • chunk_relevance - is there at least 1 relevant chunk?
  • relevance_to_query
  • Any customer-defined LLM judge

Root cause determination

The root cause is determined as the first judge to fail, based on the ordered lists below. This ordering is used because judge assessments are often causally correlated. For example, if context_sufficiency assesses that the retriever has not fetched the right chunks or documents for the input request, then the generator is likely to fail to synthesize a good response, and correctness will therefore also fail. A minimal sketch of this logic appears after the lists below.

If ground truth is provided as input, the following order is used:

  1. context_sufficiency
  2. groundedness
  3. correctness
  4. safety
  5. Any customer-defined LLM judge

If ground truth is not provided as input, the following order is used:

  1. chunk_relevance - is there at least 1 relevant chunk?
  2. groundedness
  3. relevance_to_query
  4. safety
  5. Any customer-defined LLM judge
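The following minimal sketch illustrates this pass/fail and root-cause logic. The per-row ratings dictionary and the helper function are hypothetical simplifications of what Agent Evaluation computes internally:

# Root-cause priority when ground truth is provided.
JUDGE_ORDER_WITH_GROUND_TRUTH = [
    "context_sufficiency",
    "groundedness",
    "correctness",
    "safety",
    # ...followed by any customer-defined LLM judges
]

def overall_assessment(judge_ratings):
    """Return (overall quality, root-cause judge or None) for one row."""
    # Overall quality fails if any judge fails.
    if all(rating == "yes" for rating in judge_ratings.values()):
        return "pass", None
    # The root cause is the first failing judge in priority order.
    for judge in JUDGE_ORDER_WITH_GROUND_TRUTH:
        if judge_ratings.get(judge) == "no":
            return "fail", judge
    # Otherwise the failure came from a judge outside the priority list.
    failed = next(name for name, rating in judge_ratings.items() if rating == "no")
    return "fail", failed

# Example: retrieval was sufficient, but the response was not grounded.
print(overall_assessment(
    {"context_sufficiency": "yes", "groundedness": "no", "correctness": "no", "safety": "yes"}
))  # ('fail', 'groundedness')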

Retrieval metrics

Retrieval metrics assess how successfully your agentic application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.

recall    =  # of relevant retrieved items / total # of relevant items
precision =  # of relevant retrieved items / # of items retrieved
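For illustration, these two quantities can be computed from retrieved and ground-truth document IDs as follows (the IDs are placeholders):

# Placeholder document IDs; in practice these correspond to doc_uri values.
retrieved = ["doc_1", "doc_2", "doc_3", "doc_4"]  # items returned by the retriever
relevant = ["doc_1", "doc_3", "doc_5"]            # ground-truth relevant items

relevant_retrieved = [doc for doc in retrieved if doc in relevant]

recall = len(relevant_retrieved) / len(relevant)      # 2 / 3 ≈ 0.67
precision = len(relevant_retrieved) / len(retrieved)  # 2 / 4 = 0.50

print(f"recall={recall:.2f}, precision={precision:.2f}")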

Did the retriever find relevant chunks?

The chunk-relevance-precision LLM judge determines whether the chunks returned by the retriever are relevant to the input request. Precision is calculated as the number of relevant chunks returned divided by the total number of chunks returned. For example, if the retriever returns four chunks, and the LLM judge determines that three of the four are relevant to the request, then llm_judged/chunk_relevance/precision is 0.75.

Input required for llm_judged/chunk_relevance

Ground truth is not required.

The input evaluation set must have the following column:

  • request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].content or trace.
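For example, if you are evaluating pre-computed outputs without the model argument, a single evaluation-set row for this judge might look like the following sketch (the request and chunk contents are placeholders):

import pandas as pd

# One evaluation row: the user's request plus the chunks the retriever returned.
eval_set = pd.DataFrame(
    [
        {
            "request": "What is MLflow?",
            "retrieved_context": [
                {"content": "MLflow is an open-source platform for the ML lifecycle."},
                {"content": "Spark is a distributed data processing engine."},
            ],
        }
    ]
)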

Output for llm_judged/chunk_relevance

The following metrics are calculated for each question:

Data field Type Description
retrieval/llm_judged/chunk_relevance/ratings array[string] For each chunk, yes or no, indicating if the retrieved chunk is relevant to the input request.
retrieval/llm_judged/chunk_relevance/rationales array[string] For each chunk, LLM’s reasoning for the corresponding rating.
retrieval/llm_judged/chunk_relevance/error_messages array[string] For each chunk, if there was an error computing the rating, details of the error are here, and other output values will be NULL. If no error, this is NULL.
retrieval/llm_judged/chunk_relevance/precision float, [0, 1] Calculates the percentage of relevant chunks among all retrieved chunks.

The following metric is reported for the entire evaluation set:

Metric name Type Description
retrieval/llm_judged/chunk_relevance/precision/average float, [0, 1] Average value of chunk_relevance/precision across all questions.

How many of the known relevant documents did the retriever find?

document_recall is calculated as the number of relevant documents returned divided by the total number of relevant documents based on ground truth. For example, suppose that two documents are relevant based on ground truth. If the retriever returns one of those documents, document_recall is 0.5. This metric is not affected by the total number of documents returned.

This metric is deterministic and does not use an LLM judge.
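Because the metric is deterministic, the calculation is easy to reproduce. The following sketch uses placeholder doc_uri values and mirrors the example above:

# Ground truth lists two relevant documents; the retriever returned one of them plus an unrelated page.
expected_doc_uris = {"docs/mlflow-overview", "docs/mlflow-tracking"}  # placeholders
retrieved_doc_uris = {"docs/mlflow-overview", "docs/unrelated-page"}  # placeholders

document_recall = len(expected_doc_uris & retrieved_doc_uris) / len(expected_doc_uris)
print(document_recall)  # 0.5 -- unaffected by the total number of documents retrieved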

Input required for document_recall

Ground truth is required.

The input evaluation set must have the following column:

  • expected_retrieved_context[].doc_uri

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].doc_uri or trace.

Output for document_recall

The following metric is calculated for each question:

Data field Type Description
retrieval/ground_truth/document_recall float, [0, 1] The percentage of ground truth doc_uris present in the retrieved chunks.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
retrieval/ground_truth/document_recall/average float, [0, 1] Average value of document_recall across all questions.

Did the retriever find documents sufficient to produce the expected response?

The context_sufficiency LLM judge determines whether the retriever has retrieved documents that are sufficient to produce the expected response.

Input required for context_sufficiency

Ground truth expected_response is required.

The input evaluation set must have the following columns:

  • request
  • expected_response

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either retrieved_context[].content or trace.
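For example, a minimal sketch of an evaluation run that supplies the ground truth and retrieved context directly, without the model argument (placeholder values; the model_type value reflects the standard Agent Evaluation workflow):

import mlflow
import pandas as pd

# Ground truth plus pre-computed retrieval results; no `model` argument is passed.
eval_set = pd.DataFrame(
    [
        {
            "request": "What is MLflow?",
            "expected_response": "MLflow is an open-source platform for managing the ML lifecycle.",
            "retrieved_context": [
                {"content": "MLflow is an open-source platform for managing the ML lifecycle."}
            ],
        }
    ]
)

with mlflow.start_run():
    result = mlflow.evaluate(data=eval_set, model_type="databricks-agent")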

Output for context_sufficiency

The following metrics are calculated for each question:

Data field Type Description
retrieval/llm_judged/context_sufficiency/rating string yes or no. yes indicates that the retrieved context is sufficient to produce the expected response. no indicates that the retrieval needs to be tuned for this question so that it brings back the missing information. The output rationale should mention what information is missing.
retrieval/llm_judged/context_sufficiency/rationale string LLM’s written reasoning for yes or no.
retrieval/llm_judged/context_sufficiency/error_message string If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
retrieval/llm_judged/context_sufficiency/rating/percentage float, [0, 1] Percentage where context sufficiency is judged as yes.

Response metrics

Response quality metrics assess how well the application responds to a user’s request. These metrics evaluate factors like the accuracy of the response compared to ground truth, whether the response is well-grounded given the retrieved context (or if the LLM is hallucinating), and whether the response is safe and free of toxic language.

Overall, did the LLM give an accurate answer?

The correctness LLM judge gives a binary evaluation and written rationale on whether the agent’s generated response is factually accurate and semantically similar to the provided ground-truth response.

Input required for correctness

The ground truth expected_response is required.

The input evaluation set must have the following columns:

  • request
  • expected_response

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.

Important

The ground truth expected_response should include only the minimal set of facts that is required for a correct response. If you copy a response from another source, be sure to edit the response to remove any text that is not required for an answer to be considered correct.

Including only the required information, and leaving out information that is not strictly required in the answer, enables Agent Evaluation to provide a more robust signal on output quality.
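As a hypothetical illustration, compare a response copied verbatim from a source document with a trimmed version that keeps only the facts required for correctness:

# Copied verbatim from a source document: includes detail not required for a correct answer.
verbose_expected_response = (
    "MLflow is an open-source platform, purpose-built to assist machine learning "
    "practitioners and teams in handling the complexities of the machine learning "
    "process, focusing on the full lifecycle for machine learning projects."
)

# Minimal set of facts required for a response to be judged correct.
minimal_expected_response = "MLflow is an open-source platform for managing the machine learning lifecycle."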

Output for correctness

The following metrics are calculated for each question:

Data field Type Description
response/llm_judged/correctness/rating string yes or no. yes indicates that the generated response is highly accurate and semantically similar to the ground truth. Minor omissions or inaccuracies that still capture the intent of the ground truth are acceptable. no indicates that the response does not meet the criteria.
response/llm_judged/correctness/rationale string LLM’s written reasoning for yes or no.
response/llm_judged/correctness/error_message string If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
response/llm_judged/correctness/rating/percentage float, [0, 1] Across all questions, percentage where correctness is judged as yes.

Is the response relevant to the request?

The relevance_to_query LLM judge determines whether the response is relevant to the input request.

Input required for relevance_to_query

Ground truth is not required.

The input evaluation set must have the following column:

  • request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.

Output for relevance_to_query

The following metrics are calculated for each question:

Data field Type Description
response/llm_judged/relevance_to_query/rating string yes if the response is judged to be relevant to the request, no otherwise.
response/llm_judged/relevance_to_query/rationale string LLM’s written reasoning for yes or no.
response/llm_judged/relevance_to_query/error_message string If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
response/llm_judged/relevance_to_query/rating/percentage float, [0, 1] Across all questions, percentage where relevance_to_query/rating is judged to be yes.

Is the response a hallucination, or is it grounded in the retrieved context?

The groundedness LLM judge returns a binary evaluation and written rationale on whether the generated response is factually consistent with the retrieved context.

Input required for groundedness

Ground truth is not required.

The input evaluation set must have the following column:

  • request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either trace or both of response and retrieved_context[].content.

Output for groundedness

The following metrics are calculated for each question:

Data field Type Description
response/llm_judged/groundedness/rating string yes if the retrieved context supports all or almost all of the generated response, no otherwise.
response/llm_judged/groundedness/rationale string LLM’s written reasoning for yes or no.
response/llm_judged/groundedness/error_message string If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
response/llm_judged/groundedness/rating/percentage float, [0, 1] Across all questions, percentage where groundedness/rating is judged as yes.

Is there harmful content in the agent response?

The safety LLM judge returns a binary rating and a written rationale on whether the generated response has harmful or toxic content.

Input required for safety

Ground truth is not required.

The input evaluation set must have the following column:

  • request

In addition, if you do not use the model argument in the call to mlflow.evaluate(), you must also provide either response or trace.

Output for safety

The following metrics are calculated for each question:

Data field Type Description
response/llm_judged/safety/rating string yes if the response does not have harmful or toxic content, no otherwise.
response/llm_judged/safety/rationale string LLM’s written reasoning for yes or no.
response/llm_judged/safety/error_message string If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
response/llm_judged/safety/rating/average float, [0, 1] Across all questions, percentage where safety/rating is judged as yes.

How cost and latency are assessed

Token cost

To assess cost, Agent Evaluation computes the total token count across all LLM generation calls in the trace. This approximates the total cost, because more tokens generally mean higher cost. Token counts are only calculated when a trace is available. If the model argument is included in the call to mlflow.evaluate(), a trace is automatically generated. You can also directly provide a trace column in the evaluation dataset. A rough cost illustration follows the table below.

The following token counts are calculated for each row:

Data field Type Description
total_token_count integer Sum of all input and output tokens across all LLM spans in the agent’s trace.
total_input_token_count integer Sum of all input tokens across all LLM spans in the agent’s trace.
total_output_token_count integer Sum of all output tokens across all LLM spans in the agent’s trace.
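As noted above, cost scales with token count; the sketch below multiplies the per-row token counts by hypothetical per-token prices (placeholders, not actual rates):

# Hypothetical per-token prices in USD; substitute your model's actual rates.
PRICE_PER_INPUT_TOKEN = 0.000002
PRICE_PER_OUTPUT_TOKEN = 0.000006

# Example per-row token counts, as reported in the fields above.
total_input_token_count = 1200
total_output_token_count = 300

approximate_cost = (
    total_input_token_count * PRICE_PER_INPUT_TOKEN
    + total_output_token_count * PRICE_PER_OUTPUT_TOKEN
)
print(f"Approximate cost for this row: ${approximate_cost:.4f}")  # $0.0042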

Execution latency

To assess latency, Agent Evaluation computes the entire application’s latency in seconds based on the trace. Latency is only calculated when a trace is available. If the model argument is included in the call to mlflow.evaluate(), a trace is automatically generated. You can also directly provide a trace column in the evaluation dataset.

The following latency measurement is calculated for each row:

Name Description
latency_seconds End-to-end latency based on the trace

How metrics are aggregated at the level of an MLflow run for quality, cost, and latency

After computing all per-row quality, cost, and latency assessments, Agent Evaluation aggregates these assessments into per-run metrics that are logged in an MLflow run and summarize the quality, cost, and latency of your agent across all input rows. An example of reading these metrics back programmatically follows the table below.

Agent Evaluation produces the following metrics:

Metric name Type Description
retrieval/llm_judged/chunk_relevance/precision/average float, [0, 1] Average value of chunk_relevance/precision across all questions.
retrieval/llm_judged/context_sufficiency/rating/percentage float, [0, 1] % of questions where context_sufficiency/rating is judged as yes.
response/llm_judged/correctness/rating/percentage float, [0, 1] % of questions where correctness/rating is judged as yes.
response/llm_judged/relevance_to_query/rating/percentage float, [0, 1] % of questions where relevance_to_query/rating is judged to be yes.
response/llm_judged/groundedness/rating/percentage float, [0, 1] % of questions where groundedness/rating is judged as yes.
response/llm_judged/safety/rating/average float, [0, 1] % of questions where safety/rating is judged to be yes.
agent/total_token_count/average int Average value of total_token_count across all questions.
agent/input_token_count/average int Average value of input_token_count across all questions.
agent/output_token_count/average int Average value of output_token_count across all questions.
agent/latency_seconds/average float Average value of latency_seconds across all questions.
response/llm_judged/{custom_response_judge_name}/rating/percentage float, [0, 1] % of questions where {custom_response_judge_name}/rating is judged as yes.
retrieval/llm_judged/{custom_retrieval_judge_name}/precision/average float, [0, 1] Average value of {custom_retrieval_judge_name}/precision across all questions.
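Because these aggregates are logged as MLflow run metrics, you can also read them back programmatically. The run ID below is a placeholder, and the exact metric names available depend on your evaluation run:

import mlflow

run_id = "<your-evaluation-run-id>"  # placeholder
run = mlflow.get_run(run_id)

# All aggregated Agent Evaluation metrics logged to the run, for example
# "response/llm_judged/correctness/rating/percentage".
for name, value in run.data.metrics.items():
    print(name, value)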

The following screenshots show how the metrics appear in the UI:

evaluation metrics, values

evaluation metrics, charts

Custom judge metrics

You can create a custom judge to perform assessments specific to your use case. For details, see Create custom LLM judges.

The output produced by a custom judge depends on its assessment_type, ANSWER or RETRIEVAL.

Custom LLM judge for ANSWER assessment

A custom LLM judge for ANSWER assessment evaluates the response for each question.

Outputs provided for each assessment:

Data field Type Description
response/llm_judged/{assessment_name}/rating string yes or no.
response/llm_judged/{assessment_name}/rationale string LLM’s written reasoning for yes or no.
response/llm_judged/{assessment_name}/error_message string If there was an error computing this metric, details of the error are here. If no error, this is NULL.

The following metric is calculated for the entire evaluation set:

Metric name Type Description
response/llm_judged/{assessment_name}/rating/percentage float, [0, 1] Across all questions, percentage where {assessment_name} is judged as yes.

Custom LLM judge for RETRIEVAL assessment

A custom LLM judge for RETRIEVAL assessment evaluates each retrieved chunk across all questions.

Outputs provided for each assessment:

Data field Type Description
retrieval/llm_judged/{assessment_name}/ratings array[string] Evaluation of the custom judge for each chunk: yes or no.
retrieval/llm_judged/{assessment_name}/rationales array[string] For each chunk, LLM’s written reasoning for yes or no.
retrieval/llm_judged/{assessment_name}/error_messages array[string] For each chunk, if there was an error computing this metric, details of the error are here, and other values are NULL. If no error, this is NULL.
retrieval/llm_judged/{assessment_name}/precision float, [0, 1] Percentage of all retrieved chunks that the custom judge evaluated as yes.

Metrics reported for the entire evaluation set:

Metric name Type Description
retrieval/llm_judged/{assessment_name}/precision/average float, [0, 1] Average value of {assessment_name}/precision across all questions.

Try judges using the databricks-agents SDK

The databricks-agents SDK includes APIs to directly invoke judges on user inputs. You can use these APIs to quickly and easily experiment with how the judges work.

Run the following code to install the databricks-agents package and restart the Python kernel:

%pip install databricks-agents -U
dbutils.library.restartPython()

You can then run the following code in your notebook, and edit it as necessary to try out the different judges on your own inputs.

from databricks.agents.eval import judges

SAMPLE_REQUEST = "What is MLflow?"
SAMPLE_RESPONSE = "MLflow is an open-source platform"
SAMPLE_RETRIEVED_CONTEXT = [
        {
            "content": "MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible."
        }
    ]
SAMPLE_EXPECTED_RESPONSE = "MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible."

# For chunk_relevance, the required inputs are `request`, `response` and `retrieved_context`.
judges.chunk_relevance(
  request=SAMPLE_REQUEST,
  response=SAMPLE_RESPONSE,
  retrieved_context=SAMPLE_RETRIEVED_CONTEXT,
)

# For context_sufficiency, the required inputs are `request`, `expected_response` and `retrieved_context`.
judges.context_sufficiency(
  request=SAMPLE_REQUEST,
  expected_response=SAMPLE_EXPECTED_RESPONSE,
  retrieved_context=SAMPLE_RETRIEVED_CONTEXT,
)

# For correctness, required inputs are `request`, `response` and `expected_response`.
judges.correctness(
  request=SAMPLE_REQUEST,
  response=SAMPLE_RESPONSE,
  expected_response=SAMPLE_EXPECTED_RESPONSE
)

# For relevance_to_query, the required inputs are `request` and `response`.
judges.relevance_to_query(
  request=SAMPLE_REQUEST,
  response=SAMPLE_RESPONSE,
)

# For groundedness, the required inputs are `request`, `response` and `retrieved_context`.
judges.groundedness(
  request=SAMPLE_REQUEST,
  response=SAMPLE_RESPONSE,
  retrieved_context=SAMPLE_RETRIEVED_CONTEXT,
)

# For safety, the required inputs are `request` and `response`.
judges.safety(
  request=SAMPLE_REQUEST,
  response=SAMPLE_RESPONSE,
)

Information about the models powering LLM judges

  • LLM judges might use third-party services to evaluate your GenAI applications, including Azure OpenAI operated by Microsoft.
  • For Azure OpenAI, Databricks has opted out of Abuse Monitoring so no prompts or responses are stored with Azure OpenAI.
  • For European Union (EU) workspaces, LLM judges use models hosted in the EU. All other regions use models hosted in the US.
  • Disabling Azure AI-powered AI assistive features prevents the LLM judge from calling Azure AI-powered models.
  • Data sent to the LLM judge is not used for any model training.
  • LLM judges are intended to help customers evaluate their RAG applications, and LLM judge outputs should not be used to train, improve, or fine-tune an LLM.