Design synthetic test datasets for multi-agent evaluation

Microsoft Foundry's azure-ai-inference SDK provides the model access needed to generate synthetic test interactions at scale, giving you privacy-safe datasets with systematic scenario coverage. Real customer interactions contain personally identifiable information (PII) and are difficult to use repeatedly for testing because regulations restrict the storage and sharing of customer data.

Build a scenario taxonomy

Production data samples provide realistic scenarios but can't be freely modified to test edge cases or rare failure modes. Synthetic datasets solve both problems: you design interactions covering specific scenarios, control exactly which edge cases appear, and generate variations systematically without privacy constraints.

| Dataset source | Advantages | Limitations |
|---|---|---|
| Production data samples | Highly realistic, true user behavior | Contains PII, can't test specific edge cases |
| Manually authored test cases | Full control, no PII | Time-consuming, limited coverage |
| Synthetic LLM-generated | Scalable, systematic coverage, privacy-safe | Requires quality validation |

Define a comprehensive taxonomy of interaction types for Adventure Works' multi-agent customer service platform. This taxonomy ensures test coverage across all supported scenarios rather than concentrating on common cases. Structure the taxonomy across four levels:

  • Level 1: Interaction category — product search, order placement, order modification, returns processing, account management, shipping inquiries
  • Level 2: Complexity variant — simple single-item, multi-item, cross-category, with constraints
  • Level 3: Customer persona — first-time buyer, repeat customer, premium tier, international
  • Level 4: Adversarial conditions — vague request, contradictory information, impatient customer, policy edge case

Target coverage: 20 base scenarios × 5 complexity variants × 2 customer personas × 2 adversarial conditions = 400 synthetic test cases. This systematic expansion ensures the test suite covers routine happy paths, complex multi-step workflows, and edge cases that rarely appear in production but cause failures when they do.
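The expansion described above is a Cartesian product over the taxonomy levels. The following is a minimal sketch of that idea, not a prescribed implementation: the `expand_taxonomy` helper and the placeholder labels (`scenario_01`, `complexity_1`, and so on) are hypothetical and are sized only to match the stated counts.

```python
from itertools import product

def expand_taxonomy(levels: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one test-case specification per combination of taxonomy values."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]

# Placeholder labels sized to the target above (20 x 5 x 2 x 2).
# Real values would come from the taxonomy itself.
levels = {
    "scenario": [f"scenario_{i:02d}" for i in range(1, 21)],
    "complexity": [f"complexity_{i}" for i in range(1, 6)],
    "persona": ["regular", "premium"],
    "adversarial": ["none", "vague_request"],
}

specs = expand_taxonomy(levels)
print(len(specs))  # 400
```

Each element of `specs` is a dictionary with one value per level, so it can be passed to a generation step or stored alongside the resulting test case as coverage metadata.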

For Adventure Works, the product search category includes these scenarios:

  • Simple product search: "chocolate truffles" (baseline)
  • Multi-attribute search: "sugar-free dark chocolate under $20" (complexity variant)
  • Vague search: "something sweet for a gift" (adversarial: vague)
  • Contradictory search: "cheap luxury chocolates" (adversarial: contradictory requirements)

Each scenario is generated with both a regular customer persona and a premium tier persona, which tests whether agents differentiate service levels appropriately.

Generate synthetic interactions with an LLM

Use a capable LLM to generate realistic synthetic customer interactions from scenario specifications. The generation prompt specifies the customer persona, the starting request, the required conversation characteristics, and the expected ground truth resolution.

import json
import os
from typing import Optional

from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

chat_client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

def generate_synthetic_interaction(scenario: dict, persona: dict,
                                   adversarial: Optional[str] = None) -> dict:
    """Generate synthetic customer service interaction using LLM."""
    
    generation_prompt = f"""Generate a realistic customer service interaction for an e-commerce chocolate company.

SCENARIO SPECIFICATION:
- Interaction type: {scenario['type']}
- Complexity: {scenario['complexity']}
- Customer persona: {persona['tier']}, {persona['experience_level']}
- Adversarial condition: {adversarial or 'None - straightforward request'}

REQUIREMENTS:
1. Customer initial request: Write a realistic opening message this customer would send
2. Expected customer behavior: Describe how this persona typically communicates
3. Desired outcome: What should happen if the agent system works correctly
4. Ground truth resolution: Specific details of the correct resolution

GENERATE:
- Customer opening message (realistic language for this persona)
- 2-3 likely customer follow-up messages (if multi-turn expected)
- Expected ground truth outcome with specific details

OUTPUT FORMAT (JSON):
{{
    "customer_request": "...",
    "customer_follow_ups": ["...", "..."],
    "expected_behavior": "...",
    "ground_truth_resolution": {{
        "action_taken": "...",
        "details": {{}},
        "confirmation_message_should_include": ["...", "..."]
    }},
    "evaluation_criteria": {{
        "task_completed": "boolean - if action was executed correctly",
        "information_accurate": "boolean - if all facts were correct",
        "customer_effort_minimal": "boolean - if resolution was efficient"
    }}
}}
"""
    
    response = chat_client.complete(
        model=os.environ["MODEL_DEPLOYMENT_NAME"],
        messages=[{"role": "user", "content": generation_prompt}],
        temperature=0.8  # Higher temperature for variation
    )
    
    result_text = response.choices[0].message.content
    
    # Extract JSON from a potential markdown code fence (```json or a bare ```)
    if "```" in result_text:
        result_text = result_text.split("```")[1].removeprefix("json")
    
    # Raises json.JSONDecodeError if the model returned malformed JSON
    synthetic_case = json.loads(result_text)
    
    # Add metadata
    synthetic_case["scenario_id"] = scenario["id"]
    synthetic_case["persona"] = persona["tier"]
    synthetic_case["adversarial_condition"] = adversarial
    
    return synthetic_case

# Example usage
scenario = {
    "id": "return_damaged_item",
    "type": "returns_processing",
    "complexity": "simple_single_item"
}

persona = {
    "tier": "premium",
    "experience_level": "repeat_customer"
}

synthetic_case = generate_synthetic_interaction(
    scenario=scenario,
    persona=persona,
    adversarial="impatient_customer"
)

print(json.dumps(synthetic_case, indent=2))

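This generation approach creates variations automatically: different phrasing for the same scenario, different customer communication styles, and different edge case details. Because the call samples at a temperature of 0.8, running generation 10 times for the same scenario specification typically produces 10 differently worded cases, which adds test dataset diversity without manual authoring. Distinct outputs aren't guaranteed, so check for malformed or near-duplicate cases before adding them to the suite.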
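Repeated generation can be wrapped in a small collection loop that drops malformed and duplicate outputs. The sketch below is one possible approach: `collect_variations`, `is_well_formed`, and `REQUIRED_KEYS` are hypothetical names, and a stub stands in for the LLM call so the example runs offline.

```python
import json
from typing import Callable

# Keys taken from the OUTPUT FORMAT in the generation prompt
REQUIRED_KEYS = {"customer_request", "ground_truth_resolution", "evaluation_criteria"}

def is_well_formed(case: dict) -> bool:
    """Check for the required keys and a non-empty customer request."""
    request = case.get("customer_request")
    return (REQUIRED_KEYS.issubset(case)
            and isinstance(request, str) and bool(request.strip()))

def collect_variations(generate: Callable[[], dict], attempts: int = 10) -> list[dict]:
    """Call `generate` repeatedly; keep well-formed cases with distinct requests."""
    seen: set[str] = set()
    kept: list[dict] = []
    for _ in range(attempts):
        try:
            case = generate()
        except (json.JSONDecodeError, KeyError):
            continue  # Malformed model output: skip this attempt
        if not is_well_formed(case):
            continue
        # Normalize case and whitespace so trivial differences count as duplicates
        key = " ".join(case["customer_request"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(case)
    return kept

def make_stub() -> Callable[[], dict]:
    """Stub generator standing in for the LLM call (illustrative data only)."""
    outputs = iter([
        {"customer_request": "My truffles arrived melted.",
         "ground_truth_resolution": {}, "evaluation_criteria": {}},
        {"customer_request": "my truffles  arrived melted.",
         "ground_truth_resolution": {}, "evaluation_criteria": {}},
        {"customer_request": "I need a refund for a crushed gift box.",
         "ground_truth_resolution": {}, "evaluation_criteria": {}},
        {"customer_request": ""},
    ])
    return lambda: next(outputs)

cases = collect_variations(make_stub(), attempts=4)
print(len(cases))  # 2
```

With the real client, `generate` could be a lambda that calls `generate_synthetic_interaction` with a fixed scenario, persona, and adversarial condition. Exact-match deduplication only catches trivial repeats; detecting paraphrases would need a similarity measure.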

Annotate ground truth for evaluation

Each synthetic test case needs ground truth annotations enabling automatic evaluation. Ground truth must be specific enough to enable reliable scoring but not so rigid it only accepts one valid response phrasing.

Binary ground truth works for task completion: did the system execute the required action (yes/no)? For a return request, binary ground truth checks: was a return authorization created, was the correct refund amount calculated, and was a return shipping label provided? These are verifiable system state checks.
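A binary check of this kind compares the required actions against what the system actually did. The sketch below assumes the evaluation harness can supply a set of observed action names; that `observed_actions` input and the `check_task_completion` helper are hypothetical, and the action names mirror the three checks listed above.

```python
REQUIRED_ACTIONS = [
    "return_authorization_created",
    "refund_amount_calculated",
    "return_label_provided",
]

def check_task_completion(required_actions: list[str], observed_actions: set[str],
                          should_complete: bool = True) -> dict:
    """Binary pass/fail: all required actions ran, or none ran if the task should be declined."""
    if should_complete:
        missing = [a for a in required_actions if a not in observed_actions]
        return {"passed": not missing, "missing_actions": missing, "unexpected_actions": []}
    # The request should have been declined, so none of these actions should appear
    unexpected = [a for a in required_actions if a in observed_actions]
    return {"passed": not unexpected, "missing_actions": [], "unexpected_actions": unexpected}

complete = check_task_completion(REQUIRED_ACTIONS, set(REQUIRED_ACTIONS))
partial = check_task_completion(
    REQUIRED_ACTIONS, {"return_authorization_created", "refund_amount_calculated"}
)
print(complete["passed"], partial["passed"], partial["missing_actions"])
# True False ['return_label_provided']
```

Returning the missing or unexpected actions alongside the pass/fail flag makes a failed case easier to diagnose than a bare boolean.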

Rubric-based ground truth works for quality assessment: annotate the expected score breakdown for goal achievement. For the same return scenario: information accuracy should be 3/3 (return policy stated correctly, refund amount correct, timeline accurate), action correctness should be 4/4 (return authorization number generated, refund initiated to the original payment method, return label generated, order status updated), and customer effort should be 3/3 (straightforward process, no confusion or backtracking).
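A rubric like this can be scored by counting satisfied criteria per dimension. The sketch below assumes one point per criterion and assumes that some upstream judge (human or model) has already decided which criteria were met. The `score_rubric` helper is hypothetical, and the criterion names follow the annotation structure used in the test case class in this unit.

```python
RUBRIC = {
    "information_accuracy": [
        "return_policy_stated_correctly",
        "refund_amount_matches_order",
        "timeline_expectations_set",
    ],
    "action_correctness": [
        "return_authorization_number_generated",
        "refund_initiated_to_original_payment",
        "return_label_generated",
        "order_status_updated",
    ],
    "customer_effort_minimization": [
        "no_contradictory_information",
        "no_circular_routing_between_agents",
        "confirmation_clear_and_complete",
    ],
}

def score_rubric(rubric: dict[str, list[str]],
                 criteria_met: set[str]) -> dict[str, tuple[int, int]]:
    """Award one point per satisfied criterion; return (earned, max) per dimension."""
    return {
        dimension: (sum(c in criteria_met for c in criteria), len(criteria))
        for dimension, criteria in rubric.items()
    }

# Example: everything satisfied except the order status update
all_criteria = {c for criteria in RUBRIC.values() for c in criteria}
observed = all_criteria - {"order_status_updated"}

scores = score_rubric(RUBRIC, observed)
earned = sum(e for e, _ in scores.values())
possible = sum(m for _, m in scores.values())
print(scores["action_correctness"], f"{earned}/{possible}")  # (3, 4) 9/10
```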

Acceptable variation ranges prevent over-fitting to specific phrasings. Instead of annotating "confirmation message must say exactly 'Your return has been authorized'", annotate "confirmation message should include: return authorization number, refund amount, expected timeline". This allows natural language variation while checking for required information completeness.
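One way to check for required information without pinning the wording is to test each element with its own pattern. The sketch below is illustrative only: the regular expressions, including the `RA-` authorization number format, are assumptions rather than Adventure Works conventions, and a production evaluator might use an LLM judge or structured output instead.

```python
import re

# Hypothetical patterns, one per required element of the confirmation message
REQUIRED_ELEMENTS = {
    "return authorization number": re.compile(r"\bRA-?\d{4,}\b", re.IGNORECASE),
    "refund amount": re.compile(r"\$\d+(?:\.\d{2})?"),
    "expected timeline": re.compile(
        r"\b\d+\s*(?:-|to)\s*\d+\s+(?:business\s+)?days\b"
        r"|\bwithin\s+\d+\s+(?:business\s+)?days\b",
        re.IGNORECASE,
    ),
}

def missing_elements(message: str) -> list[str]:
    """Return the required elements that the confirmation message doesn't contain."""
    return [name for name, pattern in REQUIRED_ELEMENTS.items()
            if not pattern.search(message)]

formal = ("Your return is approved. Authorization RA-20481. "
          "We will refund $24.99 within 5 business days.")
casual = "We've issued return authorization ra20481; expect $24.99 back in 3 to 5 days."
incomplete = "Good news! Return RA-20481 has been set up and your money is on its way."

print(missing_elements(formal))      # []
print(missing_elements(casual))      # []
print(missing_elements(incomplete))  # ['refund amount', 'expected timeline']
```

The first two messages are phrased differently but both pass, while the third fails because it omits required information, not because of its wording.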

import json
from typing import List, Dict

class SyntheticTestCase:
    """Structured synthetic test case with ground truth annotations."""
    
    def __init__(self, scenario_id: str, customer_request: str,
                 ground_truth: dict):
        self.scenario_id = scenario_id
        self.customer_request = customer_request
        self.ground_truth = ground_truth
    
    def to_dict(self) -> dict:
        return {
            "scenario_id": self.scenario_id,
            "customer_request": self.customer_request,
            "ground_truth": self.ground_truth
        }
    
    @staticmethod
    def create_return_case(item_type: str, damage_type: str, 
                          order_age_days: int) -> 'SyntheticTestCase':
        """Factory for return scenario test cases."""
        
        return SyntheticTestCase(
            # Include the order age so each generated case has a unique ID
            scenario_id=f"return_{item_type}_{damage_type}_{order_age_days}d",
            customer_request=f"I received a {item_type} but it's {damage_type}. "
                           f"I ordered it {order_age_days} days ago and want a refund.",
            ground_truth={
                "task_completion": {
                    "should_complete": order_age_days <= 30,  # Return window
                    "required_actions": [
                        "return_authorization_created",
                        "refund_amount_calculated",
                        "return_label_provided"
                    ]
                },
                "goal_achievement_breakdown": {
                    "information_accuracy": {
                        "max_points": 3,
                        "criteria": [
                            "return_policy_stated_correctly",
                            "refund_amount_matches_order",
                            "timeline_expectations_set"
                        ]
                    },
                    "action_correctness": {
                        "max_points": 4,
                        "criteria": [
                            "return_authorization_number_generated",
                            "refund_initiated_to_original_payment",
                            "return_label_generated",
                            "order_status_updated"
                        ]
                    },
                    "customer_effort_minimization": {
                        "max_points": 3,
                        "criteria": [
                            "no_contradictory_information",
                            "no_circular_routing_between_agents",
                            "confirmation_clear_and_complete"
                        ]
                    }
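                },
                "expected_system_state": {
                    # Outside the 30-day window, no return should be authorized.
                    # The "denied" and "not_initiated" labels are illustrative.
                    "return_status": "authorized" if order_age_days <= 30 else "denied",
                    "refund_status": "initiated" if order_age_days <= 30 else "not_initiated",
                    "items": [{"sku": item_type, "quantity": 1}]
                },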
                "confirmation_must_include": [
                    "return authorization number",
                    "refund amount",
                    "expected refund timeline",
                    "return shipping instructions"
                ]
            }
        )

# Generate test suite
test_suite: List[SyntheticTestCase] = []

# Generate variations
for item_type in ["dark_chocolate_box", "truffle_assortment", "gift_basket"]:
    for damage_type in ["damaged", "melted", "wrong_item"]:
        for order_age in [5, 15, 35]:  # Early in window, mid-window, outside the 30-day window
            test_case = SyntheticTestCase.create_return_case(
                item_type, damage_type, order_age
            )
            test_suite.append(test_case)

print(f"Generated {len(test_suite)} test cases")

# Export as JSON dataset
dataset = [case.to_dict() for case in test_suite]
with open("synthetic_returns_test_suite.json", "w") as f:
    json.dump(dataset, f, indent=2)

Maintain dataset freshness

Synthetic datasets go stale when product catalogs, policies, or agent capabilities change. A test case expecting a 30-day return window fails incorrectly if Adventure Works updates its policy to 45 days. Design a quarterly review process to prevent dataset drift.

Review trigger conditions: product catalog updates (new products, discontinued SKUs, pricing changes), policy modifications (return windows, shipping costs, refund methods), and agent capability changes (new agents added, existing agents deprecated, routing logic updated).

Review process: Run the test suite against a reference multi-agent system, identify cases where the "correct" ground truth answer changed due to business rule updates, update ground truth annotations to reflect new policies or capabilities, and regenerate synthetic requests if product references are no longer valid (discontinued SKUs).
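Part of this review can be automated by scanning the dataset for references that no longer match current business rules. The sketch below is a rough illustration: `find_stale_cases` is a hypothetical helper, and the `assumed_return_window_days` and `expected_agents` fields are assumed annotations that the test case structure in this unit doesn't define.

```python
def find_stale_cases(cases: list[dict], active_skus: set[str],
                     return_window_days: int,
                     active_agents: set[str]) -> dict[str, list[str]]:
    """Map each stale scenario_id to the reasons it needs review."""
    stale: dict[str, list[str]] = {}
    for case in cases:
        truth = case["ground_truth"]
        reasons = []
        for item in truth.get("expected_system_state", {}).get("items", []):
            if item["sku"] not in active_skus:
                reasons.append(f"discontinued SKU: {item['sku']}")
        window = truth.get("assumed_return_window_days")  # hypothetical field
        if window is not None and window != return_window_days:
            reasons.append(f"return window {window} != {return_window_days}")
        for agent in truth.get("expected_agents", []):  # hypothetical field
            if agent not in active_agents:
                reasons.append(f"deprecated agent: {agent}")
        if reasons:
            stale[case["scenario_id"]] = reasons
    return stale

# Illustrative data only
cases = [
    {"scenario_id": "return_truffle_assortment_melted",
     "ground_truth": {
         "expected_system_state": {"items": [{"sku": "truffle_assortment", "quantity": 1}]},
         "assumed_return_window_days": 30,
         "expected_agents": ["returns"]}},
    {"scenario_id": "return_holiday_tin_damaged",
     "ground_truth": {
         "expected_system_state": {"items": [{"sku": "holiday_tin", "quantity": 1}]},
         "assumed_return_window_days": 45,
         "expected_agents": ["returns", "promotions"]}},
    {"scenario_id": "return_gift_basket_damaged",
     "ground_truth": {
         "expected_system_state": {"items": [{"sku": "gift_basket", "quantity": 1}]},
         "assumed_return_window_days": 45,
         "expected_agents": ["returns"]}},
]

stale = find_stale_cases(
    cases,
    active_skus={"truffle_assortment", "gift_basket", "dark_chocolate_box"},
    return_window_days=45,
    active_agents={"returns", "orders"},
)
for scenario_id, reasons in stale.items():
    print(scenario_id, reasons)
```

A scan like this only flags candidates. Deciding whether to update the ground truth or regenerate the request remains a review decision.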

For Adventure Works, the Q1 2026 review identified 23 test cases requiring updates: 12 cases referenced discontinued seasonal products (updated to current catalog), 8 cases used the old 30-day return window (updated to new 45-day policy), and 3 cases expected routing to the promotions agent (removed after that agent was deprecated).

This maintenance prevents false failures: tests failing not because the agent system regressed but because the test expectations no longer match current business rules.

Key takeaways

  • Synthetic datasets solve PII restrictions and edge case coverage by generating privacy-safe test interactions with controlled scenario variations.
  • Scenario taxonomy systematically combines interaction categories, complexity variants, customer personas, and adversarial conditions for comprehensive test coverage.
  • LLM-generated interactions produce diverse test cases from scenario specifications, creating multiple phrasing variations without manual authoring.
  • Ground truth annotations include binary task completion checks, rubric-based quality breakdowns, and acceptable variation ranges to enable reliable automatic evaluation.
  • Dataset freshness requires quarterly reviews triggered by catalog updates, policy changes, or agent capability modifications to prevent false test failures.