Apply Git-based workflows to optimization experiments
Optimization experiments require systematic organization to track which changes were tested and what results they produced. Git-based workflows enable you to test agent variants safely, document evaluation results, and compare experiments to identify which configuration performs best.
- Create branch: Create an experiment branch for each variant
- Add test prompts: Store test prompts in the experiment folder
- Run evaluation script: Deploy the agent version, run the test prompts, and capture the responses
- Score responses: Manually evaluate the responses against your quality metrics
- Compare and decide: Review results across branches and merge successful experiments
Create experiment branches
Each optimization experiment lives in its own branch, keeping experimental changes separate from your production agent. Create one branch per experiment variant so that each branch isolates a single change: a new prompt, a different model, or a configuration adjustment. This controlled approach lets you attribute performance changes to a specific modification, which isn't possible when several changes are mixed in one branch.
With the Adventure Works Trail Guide Agent, you create experiment branches to test different variants:
main # Production baseline (prompt v1)
experiment/prompt-v2-concise # Test shorter, more focused prompt
experiment/prompt-v2-detailed # Test enhanced prompt with examples
experiment/gpt4o-mini-model # Test GPT-4o-mini model
experiment/token-optimization # Reduce token usage
When an experiment proves successful through evaluation, you merge it to main. For failed experiments, you can either keep the branch as documentation of what didn't work (preventing future teams from repeating unsuccessful approaches) or delete the branch to remove clutter (if the evaluation results are already committed and documented).
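If you create experiment branches often, the naming convention above can be scripted. The following is a minimal sketch, not part of the Adventure Works repository: the helper names `branch_name` and `create_experiment_branch` are hypothetical, and the only Git command it relies on is `git checkout -b <branch> <base>`.

```python
import re
import subprocess
from typing import Callable


def branch_name(variant: str) -> str:
    """Turn a free-text variant label into an experiment/<slug> branch name."""
    # Lowercase the label and collapse anything that isn't a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", variant.lower()).strip("-")
    if not slug:
        raise ValueError("variant must contain at least one letter or digit")
    return f"experiment/{slug}"


def create_experiment_branch(
    variant: str,
    base: str = "main",
    run: Callable[..., object] = subprocess.run,
) -> str:
    """Create and check out a new experiment branch that starts from `base`."""
    name = branch_name(variant)
    # Equivalent to running: git checkout -b experiment/<slug> main
    run(["git", "checkout", "-b", name, base], check=True)
    return name
```

Calling `create_experiment_branch("Prompt v2 concise")` inside the repository would create and switch to experiment/prompt-v2-concise. The `run` parameter exists only so the Git call can be swapped out when testing the helper.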
Store test prompts and run evaluation script
Each experiment branch organizes files in a consistent structure that separates code, prompts, and evaluation data:
adventure-works-agent/
├── agent.py  # Agent creation script
├── run-agent.py  # Script to run agent with test prompts
├── prompts/
│   ├── system-prompt-v1.txt  # Production prompt
│   └── system-prompt-v2-concise.txt  # Experimental variant
├── test-prompts/
│   ├── scottish-highlands-march.txt  # Digital nomad weekend hike
│   ├── family-london-trails.txt  # Family with teenagers
│   ├── five-day-backpacking.txt  # Experienced hiker extended trip
│   ├── ambiguous-hiking-gear.txt  # Edge case: vague request
│   └── incomplete-scotland-trip.txt  # Edge case: missing details
└── experiments/
    ├── prompt-v2-concise/
    │   ├── agent-responses.json  # Raw agent outputs
    │   └── evaluation.csv  # Manual quality scores and observations
    ├── gpt4o-mini-model/
    │   ├── agent-responses.json
    │   └── evaluation.csv
    └── token-optimization/
        ├── agent-responses.json
        └── evaluation.csv
The prompts/ folder stores different prompt versions as .txt files that agent.py loads when creating agent versions. The test-prompts/ folder contains individual .txt files for each test scenario, with descriptive names that indicate what user need they represent. The run-agent.py script loads these test prompt files, calls the agent for each one, and captures responses. Each experiment has its own folder in experiments/ containing only its results.
The test prompt files contain your 5-10 test scenarios from Unit 2. The run-agent.py script automates the testing workflow:
- Check out experiment branch: git checkout experiment/prompt-v2-concise
- Deploy agent version: python agent.py (creates agent version in Microsoft Foundry)
- Run evaluation: python run-agent.py (loads test prompts, calls agent for each prompt, captures responses, saves to agent-responses.json)
The script saves the responses it receives from the API to agent-responses.json in the experiment's folder. You then create an evaluation.csv file in which you manually score each response, using the same format that the Microsoft Foundry portal uses for evaluation exports.
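The contents of run-agent.py aren't shown in this unit. As a rough sketch of the loop it describes, the function below reads every .txt file in a prompt folder, passes each prompt to an agent, and writes the results to a JSON file. The function name `run_test_prompts`, the `call_agent` callable (a stand-in for the real agent call), and the JSON layout (prompt name mapped to response text) are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def run_test_prompts(
    prompt_dir: Path,
    output_file: Path,
    call_agent: Callable[[str], str],
) -> dict[str, str]:
    """Send every test prompt to the agent and save the responses as JSON."""
    responses: dict[str, str] = {}
    # Sort the files so results appear in a stable order across runs.
    for prompt_file in sorted(prompt_dir.glob("*.txt")):
        prompt = prompt_file.read_text(encoding="utf-8").strip()
        # Key each response by the prompt's filename without the extension.
        responses[prompt_file.stem] = call_agent(prompt)

    # Create the experiment folder if it doesn't exist yet, then write the file.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    output_file.write_text(json.dumps(responses, indent=2), encoding="utf-8")
    return responses
```

Under these assumptions, a call such as `run_test_prompts(Path("test-prompts"), Path("experiments/prompt-v2-concise/agent-responses.json"), call_agent)` would produce one entry per test prompt file. Keeping the agent call behind a plain callable also lets you dry-run the loop with a stub before spending tokens on a real agent.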
Score responses manually
Review the agent responses captured in agent-responses.json. For quick manual testing, a best practice is to choose three to five evaluation criteria that matter most for your use case, plus an optional open field for additional comments. Create an evaluation.csv file with these columns to match the portal's export format:
| Test Prompt | Agent Response | Intent Resolution | Relevance | Groundedness | Comments |
|---|---|---|---|---|---|
| scottish-highlands-march | For hiking in the Scottish Highlands in March... | 5 | 5 | 4 | Excellent gear recommendations |
| family-london-trails | For easy trails near London with teenagers... | 4 | 4 | 5 | Good beginner advice |
| five-day-backpacking | For a five-day backpacking trip... | 5 | 5 | 5 | Comprehensive list |
| ambiguous-hiking-gear | What type of hiking are you planning... | 3 | 3 | 4 | Asked clarifying questions |
| incomplete-scotland-trip | For Scotland hiking, I'd recommend... | 4 | 4 | 4 | Made reasonable assumptions |
Each row includes the test prompt filename, an excerpt of the agent response, your quality scores (on a 1-5 scale), and comments about response quality.
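To avoid typing the first two columns by hand, you could generate an evaluation.csv skeleton from the captured responses and then fill in the scores yourself. This is a sketch only: the function name `write_evaluation_template`, the 80-character excerpt length, and the assumption that responses are available as a name-to-text mapping are illustrative choices, and the column names are copied from the table above.

```python
import csv
from pathlib import Path

COLUMNS = [
    "Test Prompt",
    "Agent Response",
    "Intent Resolution",
    "Relevance",
    "Groundedness",
    "Comments",
]


def write_evaluation_template(
    responses: dict[str, str],
    csv_path: Path,
    excerpt_chars: int = 80,
) -> None:
    """Write one row per response, leaving score and comment cells empty."""
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # restval="" fills the score and comment columns with empty strings.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
        writer.writeheader()
        for name, response in responses.items():
            # Store only a short excerpt so the CSV stays easy to scan.
            if len(response) > excerpt_chars:
                response = response[:excerpt_chars] + "..."
            writer.writerow({"Test Prompt": name, "Agent Response": response})
```

The scoring itself stays manual: open the generated file in a spreadsheet or editor and enter a 1-5 score for each criterion.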
Tip
Align your evaluation format with the metrics that the Microsoft Foundry portal and automated evaluations can produce. Consistent evaluation criteria and file formats across manual testing, portal evaluations, and automated testing make it easy to consolidate test results from different team members and evaluation methods.
Compare experiments and decide
After completing evaluations across multiple experiment branches, use your CSV data to compare performance and make evidence-based decisions. Check out each experiment branch and review its evaluation.csv to see how it performed. Note the key findings from each branch, then create a comparison to identify which variant meets your success criteria.
For the Adventure Works experiments, you might document your comparison:
| Experiment branch | Key observations | Meets criteria? |
|---|---|---|
| main (baseline) | Solid responses, some verbosity | Yes (4.2 avg) |
| prompt-v2-concise | Maintains quality, more focused | Yes (4.4 avg) |
| gpt4o-mini-model | Lower quality on complex prompts | No (4.1 avg, below 4.2 threshold) |
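An average like the ones in the last column can be computed directly from each branch's evaluation.csv. The sketch below assumes the average is taken over all three score columns across all rows, which this unit doesn't specify, and the function names `average_score` and `meets_threshold` are hypothetical. The 4.2 default mirrors the threshold mentioned in the comparison.

```python
import csv
from pathlib import Path
from statistics import mean

# Score columns from the evaluation.csv format used in this unit.
SCORE_COLUMNS = ["Intent Resolution", "Relevance", "Groundedness"]


def average_score(csv_path: Path) -> float:
    """Average every score cell (all rows, all score columns) in one file."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        scores = [
            float(row[column])
            for row in csv.DictReader(handle)
            for column in SCORE_COLUMNS
        ]
    return mean(scores)


def meets_threshold(csv_path: Path, threshold: float = 4.2) -> bool:
    """Return True when the file's average score is at or above the threshold."""
    return average_score(csv_path) >= threshold
```

Because each branch holds its own results, you would run this once per checked-out experiment branch, or point it at copies of each branch's evaluation.csv, and record the numbers in your comparison table.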
If prompt-v2-concise meets your quality threshold and improves conciseness, use Git to merge the winning experiment:
git checkout main
git merge experiment/prompt-v2-concise
git tag promoted-to-prod-2026-02-17
git push origin main --tags
For experiments that don't meet criteria, document why before deciding whether to keep or delete the branch: "gpt4o-mini-model: Quality dropped below 4.2 threshold on complex trip planning prompts. Not recommended for production."
With Git workflows established for organizing experiments, you're ready to execute the actual evaluations by running agents against test prompts and systematically scoring the results.