
How to Evaluate a Custom RAG Pipeline in Azure AI Foundry (Not Built-in)?

Parul Paul 20 Reputation points
2026-02-23T10:35:40.2633333+00:00

Hi,

I’m working with a custom Retrieval-Augmented Generation (RAG) pipeline that is not built using the built-in features of Azure AI Foundry. I would like to evaluate its performance using the evaluation tools available in AI Foundry.

However, I’m a bit confused about the correct approach:

Should I select Agent, Model, or Dataset for evaluation?

Since my RAG pipeline includes retrieval + generation logic externally, what is the recommended way to plug it into the evaluation workflow?

How can I properly structure inputs (queries, context, responses) for accurate evaluation?

Are there any best practices for evaluating custom RAG systems (e.g., groundedness, relevance, faithfulness)?

Any guidance, examples, or documentation references would be really helpful. Thanks

Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


1 answer

Sort by: Most helpful
  1. Alex Burlachenko 19,465 Reputation points Volunteer Moderator
    2026-02-27T08:44:53.6233333+00:00

    Parul Paul hi,

If your RAG pipeline is fully custom and not built with Foundry's built-in RAG or Agent features, you should not select Agent or Model for evaluation; those options are for assets deployed inside Foundry. In your case, use Dataset-based evaluation. The correct pattern is: run your full custom RAG pipeline externally (retrieval + generation), collect the outputs, and then upload a structured evaluation dataset into Foundry.

Each line of your JSONL dataset could be structured like this:

    { "question": "What is the SLA for premium tier?", "context": "Retrieved chunks that were passed to the model...", "answer": "Model’s generated response", "ground_truth": "Expected correct answer" }
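Producing that file can be sketched with a few lines of Python. This is a minimal example; the records shown are made-up placeholders, and `rag_eval_dataset.jsonl` is an assumed file name — substitute your pipeline's real outputs.

```python
import json

# Hypothetical outputs collected from your external RAG pipeline.
# Field names match the schema shown above; the row content is illustrative.
results = [
    {
        "question": "What is the SLA for premium tier?",
        "context": "Retrieved chunks that were passed to the model...",
        "answer": "The premium tier SLA is 99.9% uptime.",
        "ground_truth": "Premium tier guarantees 99.9% availability.",
    },
]

# Write one JSON object per line (JSONL), the shape expected
# for an uploaded evaluation dataset.
with open("rag_eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in results:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Each record stays on a single line, which is what makes the file valid JSONL rather than a pretty-printed JSON array.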

Then use Foundry evaluators such as:

Groundedness (checks answer vs. retrieved context)

Relevance (checks answer vs. question)

Similarity/correctness (checks answer vs. ground_truth)
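Before uploading, you can sanity-check your dataset locally. The token-overlap F1 below is only a crude stand-in for a correctness check, not Foundry's actual similarity evaluator (which is model-based); it is just a quick way to spot records where the answer and ground truth clearly disagree.

```python
from collections import Counter

def token_f1(answer: str, ground_truth: str) -> float:
    """Rough token-overlap F1 between a generated answer and its
    ground truth. A crude local proxy only; Foundry's similarity and
    correctness evaluators are model-based and far more nuanced."""
    a = answer.lower().split()
    g = ground_truth.lower().split()
    if not a or not g:
        return 0.0
    # Count tokens appearing in both, respecting multiplicity.
    overlap = sum((Counter(a) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A record scoring near 0 here is worth a manual look before you trust any downstream evaluation of it.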

Foundry does not need to execute your pipeline. It only needs question, context, answer, and (if available) ground_truth.

As far as I know, the best practices for custom RAG evaluation are: always log the exact retrieved context used during generation; store the final model output exactly as returned; keep the evaluation dataset separate from training data; and use multiple evaluators, not just similarity.

So the workflow for your case would be: run your RAG pipeline > export results > convert to JSONL > upload the dataset > run evaluation on the dataset. Give it a try.

Foundry evaluates your outputs; it does not orchestrate your custom pipeline.

    rgds,

    Alex

