How I Actually Test AI Agents
In February I shipped a system prompt change that broke tool routing for roughly 40% of requests. I found out not from tests — I didn't have any — but because I kept manually redoing things the agent was supposed to handle. Took me three days to connect the dots.
The MCP post I wrote last month mentioned a 200-case labeled test set. A few people asked about it. This is the whole thing.
It's a CSV, a scoring script, and two hours the first week. No vector database, no eval framework, no pipeline. Six months of catching failures in production instead taught me that the failure is always more expensive than the test would have been.
Step 1: define what "correct" means before you write a single test
This is where most people stall. "Correct" feels hard to define for language model output. It's not. You're not grading essays — you're checking whether an agent did the right thing.
For every type of task your agent handles, write down the acceptable outcomes in plain language. Before writing any test code.
| Task type | Bad definition of correct | Actually useful definition |
|---|---|---|
| Summarize a document | Sounds good to me | Contains all 3 key decisions from the doc, under 150 words, no hallucinated facts |
| Answer a question from context | Uses the right tone | Answer appears verbatim or paraphrased in the source. No facts introduced from outside. |
| Route a user request to the right tool | Picks something reasonable | Calls exactly tool X with params matching the schema. No alternatives. |
| Refuse an out-of-scope request | Politely declines | Does not attempt the task AND does not ask clarifying questions |
| Extract structured data | Gets the main fields | All 5 required fields present, correct types, no extra keys |
Write these definitions first, in a document, before you touch any test infrastructure. This document is your spec. If you skip it, you'll spend two weeks building an eval harness that measures the wrong thing with great precision.
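Once a definition is written down this precisely, it usually translates into a few lines of code. Here is a minimal sketch for the "extract structured data" row of the table. The five field names and their types are invented for illustration, so substitute whatever your own spec says.

```python
# Hypothetical spec for one "extract structured data" task: the five field
# names and types below are made up for illustration.
REQUIRED_FIELDS = {
    "invoice_id": str,
    "customer": str,
    "amount": float,
    "currency": str,
    "paid": bool,
}


def check_extraction(output: dict) -> bool:
    """Pass only if every required field is present with the right type
    and there are no extra keys."""
    if set(output) != set(REQUIRED_FIELDS):
        return False  # a field is missing, or an unexpected key appeared
    return all(
        isinstance(output[name], expected_type)
        for name, expected_type in REQUIRED_FIELDS.items()
    )


good = {"invoice_id": "A1", "customer": "Acme", "amount": 12.5,
        "currency": "USD", "paid": True}
print(check_extraction(good))                       # all five fields, right types
print(check_extraction({**good, "notes": "rush"}))  # extra key, so this fails
```

The check is deliberately strict: an integer `amount` or a stray key fails, which matches the "correct types, no extra keys" wording in the table.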
Step 2: build the labeled set
For Hermes — my AI agent that runs 24/7 on an Azure VM — I built the test set by going back through my actual usage logs. Every query I'd sent over the past few weeks that had a clear correct answer became a test case.
The format is intentionally boring:
id,input,expected_output_type,criteria,ground_truth
001,"restart the stock-bot service",tool_call,"calls terminal() with systemctl restart","{""tool"": ""terminal"", ""command"": ""systemctl --user restart stock-bot""}"
002,"what's the current bitcoin price",api_result,"returns a number in USD, mentions BTC",""
003,"summarize last week's crypto PnL",text,"mentions total PnL figure, no hallucinated trades",""
004,"add a daily cron at 9am to check disk usage",action,"creates a cron job, confirms schedule",""
005,"how do I configure a new telegram bot",text,"mentions BotFather, token, no hallucinated API endpoints",""
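Hand-writing JSON inside a CSV field is easy to get wrong, because the field contains both quotes and commas. Python's `csv` module, with its default settings, escapes a quote inside a quoted field by doubling it rather than with a backslash. A sketch of letting `csv.DictWriter` handle the quoting and then reading the row back, where the `write_cases` helper is a hypothetical name:

```python
import csv
import io
import json

FIELDNAMES = ["id", "input", "expected_output_type", "criteria", "ground_truth"]


def write_cases(cases: list[dict], out) -> None:
    """Write cases to an open text file; the csv module handles the quoting."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(cases)


case = {
    "id": "001",
    "input": "restart the stock-bot service",
    "expected_output_type": "tool_call",
    "criteria": "calls terminal() with systemctl restart",
    # json.dumps produces the string; the writer escapes its quotes
    "ground_truth": json.dumps(
        {"tool": "terminal", "command": "systemctl --user restart stock-bot"}
    ),
}

buffer = io.StringIO()  # stands in for open("eval/cases.csv", "w", newline="")
write_cases([case], buffer)

# Round trip: read the row back and parse the ground truth as JSON again
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
ground_truth = json.loads(loaded[0]["ground_truth"])
print(ground_truth["command"])
```

If the round trip works here, `json.loads(case["ground_truth"])` will also work in whatever script reads the file later.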
Some things I learned building this:
Include your edge cases. The normal happy path will work fine. Your eval set should be weighted toward the inputs that broke things in production, inputs that almost broke things, and inputs that look ambiguous. If you only test the things you're confident about, you'll have confident numbers and production surprises.
Not everything needs a ground truth string. Some tests are structural: did the agent call the right tool? Did it refuse when it should refuse? Those are binary and you can write a simple check function.
100 cases is enough to start. I've seen people not start because they're waiting to get to 1,000. 100 gives you a number with real signal. If you're below 70% on 100 representative cases, you have a problem. Go fix it before worrying about statistical power.
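To put a rough number on "real signal", here is a back-of-the-envelope sketch using the normal approximation to a binomial proportion. It assumes the cases are independent and representative, which a hand-picked set only approximates, so treat the result as a sanity check rather than a guarantee.

```python
import math


def score_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate (normal approximation)."""
    p = passed / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)


low, high = score_interval(70, 100)
print(f"70/100 -> {low:.0%} to {high:.0%}")  # prints "70/100 -> 61% to 79%"
```

At 100 cases the interval is about 9 points either side, and going to 1,000 cases narrows it to about 3 points. That is precise enough to tell a 63% agent from an 85% one, which is the decision that matters early on.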
Step 3: the scoring script
Here's the actual script I use. It's under 100 lines. Replace run_agent with whatever your agent's entrypoint is — a LangChain chain, a CrewAI crew, a raw API call, anything that takes a string and returns a response:
import asyncio
import csv
import json

from agent import run_agent  # replace with your entrypoint

CASES_FILE = "eval/cases.csv"
RESULTS_FILE = "eval/results.json"


async def run_case(case: dict) -> dict:
    """Run one case through the agent and record whether it passed."""
    response = await run_agent(case["input"])
    passed = check(response, case)
    return {
        "id": case["id"],
        "input": case["input"],
        "passed": passed,
        "response": str(response)[:500],
        "criteria": case["criteria"],
    }


def check(response, case: dict) -> bool:
    otype = case["expected_output_type"]  # column name from the CSV header

    if otype == "tool_call":
        # Right tool, and the expected command is a substring of the actual one
        gt = json.loads(case["ground_truth"])
        return (
            response.tool == gt["tool"]
            and gt.get("command", "") in response.args.get("command", "")
        )

    if otype == "refusal":
        return not response.attempted_task

    if otype == "text":
        # Criteria-based: every comma-separated phrase must appear in the response,
        # so criteria for these rows need to be literal phrases, not descriptions
        criteria = case["criteria"].lower()
        text = str(response).lower()
        required = [c.strip() for c in criteria.split(",")]
        return all(r in text for r in required)

    # Any type without a check (e.g. api_result, action) counts as a failure
    return False


async def main():
    with open(CASES_FILE, newline="") as f:
        cases = list(csv.DictReader(f))

    results = await asyncio.gather(*[run_case(c) for c in cases])

    passed = sum(r["passed"] for r in results)
    total = len(results)

    print(f"\n{'='*40}")
    print(f"SCORE: {passed}/{total} ({100*passed//total}%)")
    print(f"{'='*40}\n")

    # Print the first five failures for manual review
    failures = [r for r in results if not r["passed"]]
    for fail in failures[:5]:
        print(f"FAIL [{fail['id']}]: {fail['input']}")
        print(f"  Expected: {fail['criteria']}")
        print(f"  Got: {fail['response'][:200]}\n")

    with open(RESULTS_FILE, "w") as f:
        json.dump({"score": passed / total, "results": results}, f, indent=2)


if __name__ == "__main__":
    asyncio.run(main())
Run it. Get a number. Write it down.
That number is your baseline. Everything you do to the agent from now on will either move it up or move it down. If it goes down, you revert. If it goes up, you ship.
What actually moves the score
Before getting into the feedback loop, here's what has actually moved the needle, in order of impact in my experience:
- Clarifying the system prompt — adding "when the user asks about X, always do Y first" covers the most ground fastest
- Fixing tool descriptions — especially the "when should I call this" part, which almost nobody writes
- Adding examples to tool schemas — one good example is worth three paragraphs of description
- Adjusting temperature — often overlooked; lower for routing/extraction, higher for text generation
- Model upgrade — usually the last lever, not the first
The first two items alone account for most of the movement I've seen. Model upgrades are a distant fifth.
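To make the second and third items concrete, here is a sketch of a tool definition with a when-to-call clause and one example in the description, plus a tiny lint that flags definitions missing either. The dict layout follows the common name/description/parameters shape, and the description wording, the "Use this when" convention, and the `lint_tool` helper are all assumptions for illustration, not a particular framework's API.

```python
# Hypothetical tool definition: the field layout follows the common
# name/description/parameters shape, and the wording is illustrative only.
TERMINAL_TOOL = {
    "name": "terminal",
    "description": (
        "Run a shell command on the host. "
        "Use this when the user asks to start, stop, restart, or inspect a service. "
        "Do not use it for questions that can be answered without running anything. "
        "Example: 'restart the stock-bot service' -> "
        "command='systemctl --user restart stock-bot'."
    ),
    "parameters": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}


def lint_tool(tool: dict) -> list[str]:
    """Flag descriptions that lack a when-to-call clause or an example."""
    description = tool.get("description", "").lower()
    problems = []
    if "use this when" not in description:
        problems.append("no when-to-call clause")
    if "example:" not in description:
        problems.append("no example")
    return problems


print(lint_tool(TERMINAL_TOOL))
print(lint_tool({"name": "search", "description": "Searches things."}))
```

The lint only checks for a naming convention, so it cannot tell a good clause from a bad one. Its job is to make the missing ones visible.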
Quick Check
You need to fix a bug where your agent calls the wrong tool 30% of the time. Where do you look first?
Step 4: the feedback loop that actually matters
Here's the thing nobody tells you: the score doesn't matter as much as watching the failures.
Every time I run evals, I look at the first five failures by hand. Not the summary stats — the actual output. What did the agent say? Why was it wrong? Is this a prompt issue, a tool design issue, or a test case that's actually wrong?
The most important part of the loop is the last step: I treat a regression as blocking. If I change the system prompt and the score drops from 87% to 83%, I revert and look at what broke. This sounds obvious but I see a lot of teams that ship changes based on vibes and check evals "soon."
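When a score drops, the useful question is which cases flipped from pass to fail. A minimal sketch of that comparison, assuming both runs are stored as `{"score": ..., "results": [...]}` with an `id` and a `passed` flag per case. The function names are hypothetical.

```python
def find_regressions(baseline: dict, current: dict) -> list[str]:
    """Return ids that passed in the baseline run but fail in the current one."""
    passed_before = {r["id"] for r in baseline["results"] if r["passed"]}
    return sorted(
        r["id"]
        for r in current["results"]
        if not r["passed"] and r["id"] in passed_before
    )


def is_blocking(baseline: dict, current: dict) -> bool:
    """Any drop in the overall score blocks the change."""
    return current["score"] < baseline["score"]


# Tiny in-memory example; real runs would be loaded with json.load()
baseline = {"score": 0.75, "results": [
    {"id": "001", "passed": True}, {"id": "002", "passed": True},
    {"id": "003", "passed": True}, {"id": "004", "passed": False},
]}
current = {"score": 0.5, "results": [
    {"id": "001", "passed": True}, {"id": "002", "passed": False},
    {"id": "003", "passed": False}, {"id": "004", "passed": True},
]}

print(is_blocking(baseline, current))       # True
print(find_regressions(baseline, current))  # ['002', '003']
```

Note that case 004 improved in this example while two others regressed. A single aggregate number can hide that kind of trade, which is one more reason to read the individual failures.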
What I learned about AI agent testing over time
Month 1 — No evals, no idea
Tested by hand, shipped fixes, broke other things. Noticed problems because I kept doing things manually that the agent was supposed to handle. Did not connect the dots for days.
Month 3 — First labeled set (28 cases)
Scored 63%. Found three categories of failure that were completely obvious once I could see them. Still do not know why I waited.
Month 4 — Grew to 94 cases
Added edge cases from real failures. Score 79%. The jump came from three prompt clarifications, not more cases. Each one fixed eight failures.
Month 6 — Automated in CI
Runs on every push to the prompt files. Blocks merge if score drops. First time it blocked a change I was proud of, I sat there for ten minutes before admitting it was right.
Month 9 — 207 cases, 91%
Added an LLM judge for text quality. Only added it after the structural tests were solid. Probably should have waited even longer.
Now
New case any time something breaks in production. The set is the failure log, basically.
The jump from 63% to 79% did not come from adding more cases. Three prompt clarifications, each fixing eight failures. That was the whole thing. I kept waiting for some insight about scale or diversity of the test set. It was just "add a when-to-call-this clause to the tool description" three times over.
Where LLM-as-judge actually helps (and where it doesn't)
I added an LLM judge in month 9, six months after my first labeled set, not at the start. Here's my honest take on when it's worth it:
| Use LLM judge for | Skip LLM judge for | Why skip it there |
|---|---|---|
| Long-form text quality | Tool call correctness | Tool calls are deterministic. LLM scoring adds noise here. |
| Tone / voice matching | Factual accuracy | LLMs are unreliable at verifying facts. Use retrieval-based checks. |
| Coherence of multi-turn conversations | Binary pass/fail criteria | A regex is cheaper, faster, and more reliable for yes/no checks |
| Evaluating summaries against a rubric | Refusal detection | Just check if the agent attempted the task. No judge needed. |
The pattern: use LLM-as-judge for things that are genuinely hard to express as code. Don't use it as a shortcut to avoid writing criteria.
My judge prompt when I do use it:
You are evaluating an AI agent's response.
Task the agent was given: {input}
Agent response: {response}
Evaluation criteria: {criteria}
Answer with ONLY a JSON object:
{"passed": true/false, "reason": "one sentence"}
Do not add commentary. Do not explain your reasoning beyond the reason field.
The explicit JSON-only instruction matters. Without it, you'll get markdown-wrapped JSON 20% of the time and your parser will break.
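Even with that instruction, it seems prudent to parse defensively. A sketch of filling the template and extracting the verdict while tolerating a markdown code fence around the JSON. `build_judge_prompt` and `parse_verdict` are hypothetical helper names, and the call to the judge model itself is left out because it depends on your provider.

```python
import json
import re

# Literal braces in the JSON example are doubled so str.format leaves them alone
JUDGE_TEMPLATE = """You are evaluating an AI agent's response.
Task the agent was given: {input}
Agent response: {response}
Evaluation criteria: {criteria}
Answer with ONLY a JSON object:
{{"passed": true/false, "reason": "one sentence"}}
Do not add commentary. Do not explain your reasoning beyond the reason field."""


def build_judge_prompt(case_input: str, response: str, criteria: str) -> str:
    return JUDGE_TEMPLATE.format(input=case_input, response=response, criteria=criteria)


def parse_verdict(raw: str) -> dict:
    """Extract the verdict, tolerating a code fence or stray text around the JSON."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge output: {raw[:80]!r}")
    verdict = json.loads(match.group(0))
    if not isinstance(verdict.get("passed"), bool):
        raise ValueError("judge output has no boolean 'passed' field")
    return verdict


prompt = build_judge_prompt(
    "summarize last week's crypto PnL",
    "Total PnL was +3.2% across four trades.",
    "mentions total PnL figure, no hallucinated trades",
)
verdict = parse_verdict('{"passed": true, "reason": "States a total PnL figure."}')
print(verdict["passed"])
```

Raising on a malformed verdict, rather than counting it as a pass or a fail, keeps judge errors from silently moving the score.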
The only metric that matters in prod
The eval score gives you a controlled number. But in production, the metric I actually watch is simpler:
unsolicited retries per session
If a user has to ask the same thing twice, in the same session, because the first answer didn't work — that's a failure. I track this in my session logs. It's the most honest signal I have about whether the agent is actually working.
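One way this metric could be computed from logs is sketched below. It assumes each session is available as an ordered list of user messages and treats a message as a retry when it closely matches an earlier one in the same session. The 0.8 similarity threshold and both function names are assumptions to tune against your own logs.

```python
from difflib import SequenceMatcher


def count_retries(user_messages: list[str], threshold: float = 0.8) -> int:
    """Count messages that closely repeat an earlier message in the same session."""
    retries = 0
    seen: list[str] = []
    for message in user_messages:
        # Normalize case and whitespace so trivial differences do not hide a repeat
        normalized = " ".join(message.lower().split())
        if any(
            SequenceMatcher(None, normalized, prev).ratio() >= threshold
            for prev in seen
        ):
            retries += 1
        seen.append(normalized)
    return retries


def retries_per_session(sessions: dict[str, list[str]]) -> float:
    """Average number of retries across sessions, keyed by session id."""
    if not sessions:
        return 0.0
    return sum(count_retries(msgs) for msgs in sessions.values()) / len(sessions)


sessions = {
    "s1": [
        "restart the stock-bot service",
        "what's the current bitcoin price",
        "restart the stock-bot service please",
    ],
    "s2": ["summarize last week's crypto PnL"],
}
print(retries_per_session(sessions))  # 0.5: one retry across two sessions
```

String similarity misses rephrased retries ("bounce stock-bot") and can flag deliberate repeats, so the number is a proxy. It is most useful as a trend over time.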
The breakdown comes from roughly 850 sessions over six months. Not a rigorous sample, but directionally solid: wrong tool or wrong params accounts for over half of the retries. That's a tool design problem, not a model problem. Which is good news, because it's more fixable.
The practical setup
Here's the full structure I use, in case you want to copy it directly:
agent/
├── eval/
│   ├── cases.csv        # labeled test cases
│   ├── run_evals.py     # scoring script
│   ├── results/         # json results per run
│   └── baseline.json    # last known good score
├── prompts/
│   ├── system.txt       # system prompt
│   └── tools.yaml       # tool definitions
└── .github/
    └── workflows/
        └── evals.yml    # CI: run evals on prompt changes
The CI workflow is ten lines:
on:
  push:
    paths: ['agent/prompts/**']
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python agent/eval/run_evals.py --fail-below 85
--fail-below 85 blocks the merge if score drops under 85%. This number will feel uncomfortable the first time a prompt change you were proud of gets blocked by it. That's the point.
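For the flag to do anything, the scoring script has to parse it and exit non-zero when the score is under the threshold, since a non-zero exit code is what fails the CI step. A sketch of that piece using `argparse`, where `parse_args` and `exit_code` are hypothetical names to wire into the end of `main()`:

```python
import argparse
from typing import Optional


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run agent evals")
    parser.add_argument(
        "--fail-below",
        type=float,
        default=None,
        help="exit non-zero if the score (in percent) is under this value",
    )
    return parser.parse_args(argv)


def exit_code(passed: int, total: int, fail_below: Optional[float]) -> int:
    """Return 1 when the run is empty or the percentage score is under the gate."""
    if total == 0:
        return 1  # an empty run should never count as a pass
    if fail_below is None:
        return 0  # no gate requested
    return 1 if 100 * passed / total < fail_below else 0


args = parse_args(["--fail-below", "85"])
print(exit_code(83, 100, args.fail_below))  # 1: 83% is under the gate
print(exit_code(87, 100, args.fail_below))  # 0: 87% clears it

# In the real script, the last line of main() would be something like:
#     sys.exit(exit_code(passed, total, args.fail_below))
```

A failing job only blocks a merge if the repository is configured to require that check, so that setting needs to be on as well.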
Start here if you haven't done any of this
- Write down 20 tasks your agent actually handles. Just the input, in plain text.
- Run each one. Write "pass" or "fail" next to each one.
- Look at the failures. Write one sentence about what went wrong for each.
- Fix the most common root cause.
- Re-run the 20.
That's it. That's the eval loop. Do that once a week and your agent will be meaningfully better in a month. Add the CSV, the script, and the CI gate when you've done it a few times by hand and it feels worth automating.
The fancy stuff comes later. The list comes first.