
Request for an example of a multi-grader with a custom grader for reinforcement fine-tuning in AI Foundry

Oliver Su (Artech Consulting LLC) 60 Reputation points Microsoft Employee
2025-10-24T18:52:49.33+00:00

Hi there. My custom grader works fine on its own, but when I combine it with a built-in grader, it always fails. The website tutorial has no template for a multi-grader that includes a custom grader. Could you give an example, please?

This is what I have:

{
  "name": "sample_multi_grader",
  "type": "multi",
  "graders": {
    "ext_text_similarity": {
      "name": "ext_text_similarity",
      "type": "text_similarity",
      "input": "{{sample.output_json.ext_text}}",
      "reference": "{{item.ext_text}}",
      "evaluation_metric": "fuzzy_match"
    },
    "custom_check": {
      "type": "python",
      "source": "{import re ....}",
    }
  },
  "calculate_output": "0.5 * ext_text_similarity + 0.5 * custom_check"
}
Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


2 answers

  1. Anonymous
    2025-10-28T11:53:19.8433333+00:00

    Hi Oliver Su (Artech Consulting LLC)

    Here is an example of how to define the input and reference for a custom grader in the multi-grader scenario:

    {
      "model": "gpt-5-mini-2025-08-07",
      "method": {
        "type": "reinforcement",
        "reinforcement": {
          "hyperparameters": {
            "n_epochs": 3,
            "batch_size": 8,
            "eval_interval": 1,
            "eval_samples": 5
          },
          "grader": {
            "name": "summary_quality_multigrader",
            "type": "multi",
            "graders": {
              "text_sim": {
                "name": "text_similarity_grader",
                "type": "text_similarity",
                "input": "{{sample.output_json.response}}",
                "reference": "{{item.reference.answer}}",
                "evaluation_metric": "fuzzy_match"
              },
              "custom_quality": {
                "name": "custom_summary_quality",
                "type": "python",
                "input": "{{sample.output_json.response}}",
                "reference": "{{item.reference.answer}}",
                "source": "def grade(sample_text: str, reference_text: str) -> float:\n    # Reward mention of 'AI Foundry' and brevity (< 20 words)\n    if not sample_text:\n        return 0.0\n    score = 0.0\n    if 'AI Foundry' in sample_text:\n        score += 0.5\n    if len(sample_text.split()) < 20:\n        score += 0.5\n    return min(score, 1.0)"
              }
            },
            "calculate_output": "0.6 * text_sim + 0.4 * custom_quality",
            "invalid_grade": 0.0
          }
        }
      }
    }
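A multi-line `source` string is easy to get wrong when escaped by hand, so one option is to keep the grader as ordinary Python text and let `json.dumps` produce the escapes. The following is only a minimal sketch: the field names are copied from the example above and have not been checked against the service schema, and `GRADER_SOURCE` and `build_multi_grader` are hypothetical names introduced for illustration.

```python
import json
import textwrap

# Grader body kept as normal Python text so it can be read and tested locally.
# The signature is copied from the example above, not from official docs.
GRADER_SOURCE = textwrap.dedent('''\
    def grade(sample_text: str, reference_text: str) -> float:
        # Reward mention of 'AI Foundry' and brevity (< 20 words)
        if not sample_text:
            return 0.0
        score = 0.0
        if 'AI Foundry' in sample_text:
            score += 0.5
        if len(sample_text.split()) < 20:
            score += 0.5
        return min(score, 1.0)
''')


def build_multi_grader(source: str) -> dict:
    """Assemble the multi-grader block from the example as a plain dict."""
    return {
        "name": "summary_quality_multigrader",
        "type": "multi",
        "graders": {
            "text_sim": {
                "name": "text_similarity_grader",
                "type": "text_similarity",
                "input": "{{sample.output_json.response}}",
                "reference": "{{item.reference.answer}}",
                "evaluation_metric": "fuzzy_match",
            },
            "custom_quality": {
                "name": "custom_summary_quality",
                "type": "python",
                "input": "{{sample.output_json.response}}",
                "reference": "{{item.reference.answer}}",
                "source": source,  # json.dumps escapes the newlines
            },
        },
        "calculate_output": "0.6 * text_sim + 0.4 * custom_quality",
        "invalid_grade": 0.0,
    }


# Serialize once; the result is valid JSON with "\n" escapes inside "source".
grader_json = json.dumps(build_multi_grader(GRADER_SOURCE), indent=2)
print(grader_json)
```

Generating the block this way also means the same `GRADER_SOURCE` text can be executed locally before it is submitted.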
    

    Dataset: JSONL with clear split of input (what the model sees) and reference (what graders use).

    Bindings: In each grader, set "input" to model output path and "reference" to ground truth path.

    Custom grader: Python function returning a score in [0,1] (optionally a dict with score/reason if supported).

    Aggregation: Use a weighted expression like "0.6 * text_sim + 0.4 * custom_quality".

    Validation: Provide an invalid_grade fallback for edge cases.
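The points above can be dry-run locally before a job is submitted. This is a rough sketch under stated assumptions: `text_sim` here is a stand-in built on `difflib.SequenceMatcher`, not the service's `fuzzy_match` metric, and `combined_score` with its `invalid_grade` fallback is a hypothetical helper that only mimics how the weighted expression might be evaluated.

```python
from difflib import SequenceMatcher


def text_sim(sample_text: str, reference_text: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the service's fuzzy_match metric."""
    return SequenceMatcher(None, sample_text, reference_text).ratio()


def custom_quality(sample_text: str, reference_text: str) -> float:
    """Same logic as the custom grader in the example above."""
    if not sample_text:
        return 0.0
    score = 0.0
    if "AI Foundry" in sample_text:
        score += 0.5
    if len(sample_text.split()) < 20:
        score += 0.5
    return min(score, 1.0)


def combined_score(sample_text, reference_text, invalid_grade: float = 0.0) -> float:
    """Mirror "0.6 * text_sim + 0.4 * custom_quality" with a fallback value."""
    try:
        return (0.6 * text_sim(sample_text, reference_text)
                + 0.4 * custom_quality(sample_text, reference_text))
    except Exception:
        # Assumed behaviour: any grader error yields the fallback grade.
        return invalid_grade


# Hypothetical (model output, reference) pairs for a quick sanity check.
rows = [
    ("AI Foundry hosts models.", "AI Foundry hosts models."),
    ("", "AI Foundry hosts models."),
]
for sample_text, reference_text in rows:
    print(round(combined_score(sample_text, reference_text), 3))
```

Checking a handful of rows this way shows whether the custom grader stays within [0, 1] and whether the weights produce the ranking you expect.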

    I hope this helps.

    Thank you!


  2. Azar 31,720 Reputation points MVP Volunteer Moderator
    2025-10-24T19:19:45.3366667+00:00

    Hi there Oliver Su (Artech Consulting LLC)

    Thanks for using the Q&A platform.

    The multi-grader setup in Azure AI Foundry is a bit picky when combining built-in and custom graders. The main thing to check is that each grader inside your graders block explicitly defines both input and reference, even the custom Python grader. Also, make sure the names you use in calculate_output exactly match the grader keys. For example, you can structure it like this: one grader for text similarity and another for your custom check, then combine them with something like "calculate_output": "0.5 * ext_text_similarity + 0.5 * custom_check". The custom grader's source should return a numeric value (like 0 or 1). Once you align those details, it should work fine.
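The checks described in this answer can be automated with a small lint pass over the grader config before submitting it. This is only a sketch: `lint_multi_grader` is a hypothetical helper, the required-field list follows this answer rather than a verified schema, and the set of math function names allowed in the formula is an assumption.

```python
import re

# Assumed function names that may appear in calculate_output; adjust as needed.
ALLOWED_FUNCTIONS = {"min", "max", "abs", "floor", "ceil", "exp", "sqrt", "log"}


def lint_multi_grader(config: dict) -> list[str]:
    """Return a list of human-readable problems found in a multi-grader dict."""
    problems = []
    graders = config.get("graders", {})
    formula = config.get("calculate_output", "")

    # Identifiers used in the formula (numbers such as 0.5 are not matched).
    used = set(re.findall(r"\b[A-Za-z_]\w*", formula)) - ALLOWED_FUNCTIONS
    keys = set(graders)

    for name in sorted(used - keys):
        problems.append(f"calculate_output references unknown grader '{name}'")
    for name in sorted(keys - used):
        problems.append(f"grader '{name}' is never used in calculate_output")

    # Per this answer, every sub-grader should define these fields.
    for key, grader in graders.items():
        for field in ("name", "type", "input", "reference"):
            if field not in grader:
                problems.append(f"grader '{key}' is missing '{field}'")
    return problems


# The config from the question, as a Python dict.
question_config = {
    "name": "sample_multi_grader",
    "type": "multi",
    "graders": {
        "ext_text_similarity": {
            "name": "ext_text_similarity",
            "type": "text_similarity",
            "input": "{{sample.output_json.ext_text}}",
            "reference": "{{item.ext_text}}",
            "evaluation_metric": "fuzzy_match",
        },
        "custom_check": {"type": "python", "source": "{import re ....}"},
    },
    "calculate_output": "0.5 * ext_text_similarity + 0.5 * custom_check",
}

for problem in lint_multi_grader(question_config):
    print(problem)
```

Run against the config from the question, this flags the missing `name`, `input`, and `reference` on `custom_check`, while the names in `calculate_output` already match the grader keys.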

    If this helps, kindly accept the answer.
