How I Actually Test AI Agents

May 25, 2026 · 12 min read

Tags: AI Agents, testing, evals, developer tools, LLM

In February I shipped a system prompt change that broke tool routing for roughly 40% of requests. I found out not from tests — I didn't have any — but because I kept manually redoing things the agent was supposed to handle. Took me three days to connect the dots.

The MCP post I wrote last month mentioned a 200-case labeled test set. A few people asked about it. This is the whole thing.

It's a CSV, a scoring script, and two hours the first week. No vector database, no eval framework, no pipeline. Six months of catching failures in production instead taught me that the failure is always more expensive than the test would have been.

Step 1: define what "correct" means before you write a single test

This is where most people stall. "Correct" feels hard to define for language model output. It's not. You're not grading essays — you're checking whether an agent did the right thing.

For every type of task your agent handles, write down the acceptable outcomes in plain language. Before writing any test code.

Task type	Bad definition of correct	Actually useful definition
Summarize a document	Sounds good to me	Contains all 3 key decisions from the doc, under 150 words, no hallucinated facts
Answer a question from context	Uses the right tone	Answer appears verbatim or paraphrased in the source. No facts introduced from outside.
Route a user request to the right tool	Picks something reasonable	Calls exactly tool X with params matching the schema. No alternatives.
Refuse an out-of-scope request	Politely declines	Does not attempt the task AND does not ask clarifying questions
Extract structured data	Gets the main fields	All 5 required fields present, correct types, no extra keys

Write these definitions first, in a document, before you touch any test infrastructure. This document is your spec. If you skip it, you'll spend two weeks building an eval harness that measures the wrong thing with great precision.
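
To make one row of that table concrete, here is a minimal sketch of the structured-extraction definition written as a check function. The field names and types in REQUIRED_FIELDS are hypothetical placeholders for illustration, not the fields Hermes actually extracts.

```python
# Hypothetical schema: five required fields and the type each must have.
REQUIRED_FIELDS = {
    "name": str,
    "email": str,
    "amount": float,
    "currency": str,
    "paid": bool,
}

def check_extraction(data: dict, schema: dict = REQUIRED_FIELDS) -> bool:
    """Pass only if every required field is present, correctly typed, and nothing extra appears."""
    if set(data) != set(schema):
        return False  # a required field is missing or an unexpected key is present
    return all(isinstance(data[key], expected) for key, expected in schema.items())

good = {"name": "Ada", "email": "ada@example.com", "amount": 12.5, "currency": "USD", "paid": True}
print(check_extraction(good))                       # True
print(check_extraction({**good, "notes": "extra"}))  # False: extra key
```

A definition that can be written this way is one you can test. If you can't get it into a function or a short rubric, the definition is probably still too vague.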

Step 2: build the labeled set

For Hermes — my AI agent that runs 24/7 on an Azure VM — I built the test set by going back through my actual usage logs. Every query I'd sent over the past few weeks that had a clear correct answer became a test case.

The format is intentionally boring:

id,input,expected_output_type,criteria,ground_truth
001,"restart the stock-bot service",tool_call,"calls terminal() with systemctl restart","{""tool"": ""terminal"", ""command"": ""systemctl --user restart stock-bot""}"
002,"what's the current bitcoin price",api_result,"returns a number in USD, mentions BTC",""
003,"summarize last week's crypto PnL",text,"mentions total PnL figure, no hallucinated trades",""
004,"add a daily cron at 9am to check disk usage",action,"creates a cron job, confirms schedule",""
005,"how do I configure a new telegram bot",text,"mentions BotFather, token, no hallucinated API endpoints",""

Some things I learned building this:

Include your edge cases. The normal happy path will work fine. Your eval set should be weighted toward the inputs that broke things in production, inputs that almost broke things, and inputs that look ambiguous. If you only test the things you're confident about, you'll have confident numbers and production surprises.

Not everything needs a ground truth string. Some tests are structural: did the agent call the right tool? Did it refuse when it should refuse? Those are binary and you can write a simple check function.

100 cases is enough to start. I've seen people not start because they're waiting to get to 1,000. 100 gives you a number with real signal. If you're below 70% on 100 representative cases, you have a problem. Go fix it before worrying about statistical power.
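
For a rough sense of how much signal 100 cases carry, the sketch below computes an approximate 95% margin of error for a pass rate, using the normal approximation to the binomial. It is a back-of-envelope estimate added for illustration, not part of the scoring script, and it assumes the cases are independent.

```python
import math

def margin_of_error(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)

# Margin of error, in percentage points, for a 70% pass rate at different set sizes.
for n in (20, 100, 1000):
    print(n, round(100 * margin_of_error(0.70, n), 1))
```

At 70% on 100 cases the margin is about 9 points either way. That is coarse, but it is plenty to tell a 70% agent from a 90% one, which is the decision you actually need to make early on.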

Test case distribution by type

Type	Share of cases
Tool calls (exact match)	35%
Text generation (criteria-based)	28%
Refusals / out-of-scope	18%
Structured extraction	12%
Multi-step tasks	7%

Step 3: the scoring script

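Here's the actual script I use. It's under 100 lines. Replace run_agent with whatever your agent's entrypoint is: a LangChain chain, a CrewAI crew, a raw API call, anything that takes a string and returns a response. Two things to adapt: the script awaits run_agent, so the entrypoint has to be async, and check() assumes the response exposes tool, args, and attempted_task attributes.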

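import argparse, asyncio, csv, json, sys
from pathlib import Path

from agent import run_agent  # replace with your entrypoint (must be an async function)

# Resolve paths relative to this file so the script also works when CI runs it from the repo root.
EVAL_DIR = Path(__file__).resolve().parent
CASES_FILE = EVAL_DIR / "cases.csv"
RESULTS_FILE = EVAL_DIR / "results.json"

async def run_case(case: dict) -> dict:
    response = await run_agent(case["input"])
    passed = check(response, case)
    return {
        "id": case["id"],
        "input": case["input"],
        "passed": passed,
        "response": str(response)[:500],  # truncate so the results file stays readable
        "criteria": case["criteria"]
    }

def check(response, case: dict) -> bool:
    # The key must match the CSV header exactly.
    otype = case["expected_output_type"]

    if otype == "tool_call":
        # Right tool, and the expected command appears in the command the agent issued.
        gt = json.loads(case["ground_truth"])
        return (
            response.tool == gt["tool"] and
            gt.get("command", "") in response.args.get("command", "")
        )

    if otype == "refusal":
        return not response.attempted_task

    if otype == "text":
        # Criteria-based: every comma-separated phrase must appear literally in the
        # response, so for these cases write criteria as exact phrases, not descriptions.
        criteria = case["criteria"].lower()
        text = str(response).lower()
        required = [c.strip() for c in criteria.split(",")]
        return all(r in text for r in required)

    # Types without a checker yet (e.g. api_result, action) count as failures.
    return False

async def main(fail_below=None) -> int:
    with open(CASES_FILE, newline="") as f:
        cases = list(csv.DictReader(f))
    if not cases:
        print(f"No cases found in {CASES_FILE}")
        return 1

    # Runs every case concurrently.
    results = await asyncio.gather(*[run_case(c) for c in cases])

    passed = sum(r["passed"] for r in results)
    total = len(results)
    print(f"\n{'='*40}")
    print(f"SCORE: {passed}/{total} ({100*passed//total}%)")
    print(f"{'='*40}\n")

    # Print the first five failures for reading by hand.
    failures = [r for r in results if not r["passed"]]
    for fail in failures[:5]:
        print(f"FAIL [{fail['id']}]: {fail['input']}")
        print(f"  Expected: {fail['criteria']}")
        print(f"  Got:      {fail['response'][:200]}\n")

    with open(RESULTS_FILE, "w") as out:
        json.dump({"score": passed/total, "results": results}, out, indent=2)

    # A non-zero exit code fails the CI job when the score is under the threshold.
    return 1 if fail_below is not None and 100 * passed / total < fail_below else 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--fail-below", type=float, default=None,
                        help="exit non-zero if the score in percent is under this value")
    args = parser.parse_args()
    sys.exit(asyncio.run(main(args.fail_below)))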

Run it. Get a number. Write it down.

That number is your baseline. Everything you do to the agent from now on will either move it up or move it down. If it goes down, you revert. If it goes up, you ship.
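
One way to make "write it down" mechanical is to keep the last good score in a file and compare every run against it. A minimal sketch, assuming a results file with a top-level "score" between 0 and 1 (the shape the scoring script writes) and a baseline file with the same key. The function name and the tolerance parameter are hypothetical.

```python
import json
from pathlib import Path

def compare_to_baseline(results_path: str, baseline_path: str, tolerance: float = 0.0) -> bool:
    """Return True if the new score is at least the baseline, minus an optional tolerance.

    If no baseline exists yet, the current score becomes the baseline.
    """
    score = json.loads(Path(results_path).read_text())["score"]
    baseline_file = Path(baseline_path)
    if not baseline_file.exists():
        baseline_file.write_text(json.dumps({"score": score}))
        return True
    baseline = json.loads(baseline_file.read_text())["score"]
    if score + tolerance < baseline:
        print(f"REGRESSION: {score:.0%} is below baseline {baseline:.0%}")
        return False
    if score > baseline:
        # Ratchet upward so later runs are held to the better score.
        baseline_file.write_text(json.dumps({"score": score}))
    return True
```

The ratchet is a design choice: once a change lifts the score, that higher number becomes the bar for everything after it.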

What actually moves the score

Before getting into the feedback loop, here's what has actually moved the needle for me, in order of impact:

1. Clarifying the system prompt: adding "when the user asks about X, always do Y first" covers the most ground fastest
2. Fixing tool descriptions: especially the "when should I call this" part, which almost nobody writes
3. Adding examples to tool schemas: one good example is worth three paragraphs of description
4. Adjusting temperature: often overlooked; lower for routing/extraction, higher for text generation
5. Model upgrade: usually the last lever, not the first

The first two items alone account for most of the movement I've seen. Model upgrades are a distant fifth.
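
As an illustration of the tool-description and tool-example items, here is what a tool definition with an explicit when-to-call clause and one example might look like. The layout loosely follows the JSON-Schema style that many tool-calling APIs accept. The wording, the keys, and especially the examples field are assumptions for illustration, not the actual Hermes definition.

```python
# Hypothetical tool definition; adapt the keys to whatever your framework expects.
terminal_tool = {
    "name": "terminal",
    "description": (
        "Run a shell command on the host. "
        "Call this when the user asks to start, stop, restart, or inspect a service, "
        "or to check system state such as disk usage. "
        "Do not call this for questions that can be answered from conversation context."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "The exact shell command to run.",
            }
        },
        "required": ["command"],
    },
    # One concrete request and the call it should produce.
    "examples": [
        {
            "user": "restart the stock-bot service",
            "call": {"command": "systemctl --user restart stock-bot"},
        }
    ],
}
```

The second and third sentences of the description are the part that usually goes missing: they say when to pick this tool and when not to.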

Step 4: the feedback loop that actually matters

Here's the thing nobody tells you: the score doesn't matter as much as watching the failures.

Every time I run evals, I look at the first five failures by hand. Not the summary stats — the actual output. What did the agent say? Why was it wrong? Is this a prompt issue, a tool design issue, or a test case that's actually wrong?

Run eval script → Read the failures → Classify the root cause → Fix one category at a time → Re-run and compare → Commit the fix

The most important part of that diagram is the last step. I treat a regression as blocking. If I change the system prompt and the score drops from 87% to 83%, I revert and look at what broke. This sounds obvious but I see a lot of teams that ship changes based on vibes and check evals "soon."
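
The re-run-and-compare step is easier when the comparison is per case rather than per score, because a flat score can hide one fix and one new break. A minimal sketch, assuming two runs in the format the scoring script writes (a "results" list of objects with "id" and "passed"). The function name is hypothetical.

```python
def diff_runs(before: dict, after: dict) -> dict:
    """Compare two eval runs and list the case ids whose status changed."""
    old = {r["id"]: r["passed"] for r in before["results"]}
    new = {r["id"]: r["passed"] for r in after["results"]}
    shared = old.keys() & new.keys()  # ignore cases added or removed between runs
    return {
        "newly_failing": sorted(i for i in shared if old[i] and not new[i]),
        "newly_passing": sorted(i for i in shared if not old[i] and new[i]),
    }

# Both runs score 50%, but they are not the same run.
before = {"results": [{"id": "001", "passed": True}, {"id": "002", "passed": False}]}
after = {"results": [{"id": "001", "passed": False}, {"id": "002", "passed": True}]}
print(diff_runs(before, after))
# {'newly_failing': ['001'], 'newly_passing': ['002']}
```

In practice you would load each dict with json.load from the saved results of the two runs. Anything in newly_failing is what to read first.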

What I learned about AI agent testing over time

Month 1 — No evals, no idea

Tested by hand, shipped fixes, broke other things. Noticed problems because I kept doing things manually that the agent was supposed to handle. Did not connect the dots for days.

Month 3 — First labeled set (28 cases)

Scored 63%. Found three categories of failure that were completely obvious once I could see them. Still do not know why I waited.

Month 4 — Grew to 94 cases

Added edge cases from real failures. Score 79%. The jump came from three prompt clarifications, not more cases. Each one fixed eight failures.

Month 6 — Automated in CI

Runs on every push to the prompt files. Blocks merge if score drops. First time it blocked a change I was proud of, I sat there for ten minutes before admitting it was right.

Month 9 — 207 cases, 91%

Added an LLM judge for text quality. Only added it after the structural tests were solid. Probably should have waited even longer.

Now

New case any time something breaks in production. The set is the failure log, basically.

The jump from 63% to 79% did not come from adding more cases. Three prompt clarifications, each fixing eight failures. That was the whole thing. I kept waiting for some insight about scale or diversity of the test set. It was just "add a when-to-call-this clause to the tool description" three times over.

Where LLM-as-judge actually helps (and where it doesn't)

I added an LLM judge around month 9, about six months after my first labeled set, not at the start. Here's my honest take on when it's worth it:

Use an LLM judge for	Skip the LLM judge for	Why skip it
Long-form text quality	Tool call correctness	Tool calls are deterministic. LLM scoring adds noise here.
Tone / voice matching	Factual accuracy	LLMs are unreliable at verifying facts. Use retrieval-based checks.
Coherence of multi-turn conversations	Binary pass/fail criteria	A regex is cheaper, faster, and more reliable for yes/no checks
Evaluating summaries against a rubric	Refusal detection	Just check if the agent attempted the task. No judge needed.

The pattern: use LLM-as-judge for things that are genuinely hard to express as code. Don't use it as a shortcut to avoid writing criteria.

My judge prompt when I do use it:

You are evaluating an AI agent's response.

Task the agent was given: {input}
Agent response: {response}
Evaluation criteria: {criteria}

Answer with ONLY a JSON object:
{"passed": true/false, "reason": "one sentence"}

Do not add commentary. Do not explain your reasoning beyond the reason field.

The explicit JSON-only instruction matters. Without it, you'll get markdown-wrapped JSON 20% of the time and your parser will break.
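
Even with that instruction, it is cheap to make the parser tolerant of wrapped output. A minimal sketch, assuming the judge reply arrives as a plain string. The function name is hypothetical, and pulling out the outermost braces with a regex is one simple approach, not the only one.

```python
import json
import re

def parse_judge_reply(reply: str) -> dict:
    """Extract the verdict object from a judge reply, tolerating code fences around it."""
    # Take everything from the first "{" to the last "}" so fences or stray prose are ignored.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in judge reply: {reply[:80]!r}")
    verdict = json.loads(match.group(0))
    if not isinstance(verdict.get("passed"), bool):
        raise ValueError("judge reply is missing a boolean 'passed' field")
    return verdict

# Simulate a reply wrapped in a markdown code fence.
fence = "`" * 3
wrapped = fence + 'json\n{"passed": true, "reason": "calls the right tool"}\n' + fence
print(parse_judge_reply(wrapped))
# {'passed': True, 'reason': 'calls the right tool'}
```

Raising on a malformed reply, instead of silently scoring it as a fail, keeps judge errors from being counted as agent errors.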

The only metric that matters in prod

The eval score gives you a controlled number. But in production, the metric I actually watch is simpler:

unsolicited retries per session

If a user has to ask the same thing twice, in the same session, because the first answer didn't work — that's a failure. I track this in my session logs. It's the most honest signal I have about whether the agent is actually working.
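
A rough way to compute this from logs is to count user messages that closely repeat the previous one in the same session. The sketch below is an approximation under stated assumptions: the log shape (a dict mapping a session id to its ordered user messages), the 0.8 similarity threshold, and both function names are hypothetical, and string similarity will miss retries that are heavily reworded.

```python
from difflib import SequenceMatcher

def count_retries(user_messages: list[str], threshold: float = 0.8) -> int:
    """Count messages that closely repeat the previous user message in the same session."""
    retries = 0
    for previous, current in zip(user_messages, user_messages[1:]):
        similarity = SequenceMatcher(None, previous.lower(), current.lower()).ratio()
        if similarity >= threshold:
            retries += 1
    return retries

def retries_per_session(sessions: dict[str, list[str]]) -> float:
    """Average number of near-duplicate follow-ups per session."""
    if not sessions:
        return 0.0
    return sum(count_retries(msgs) for msgs in sessions.values()) / len(sessions)

sessions = {
    "a": ["restart the stock-bot service", "restart the stock-bot service please"],
    "b": ["check disk usage"],
}
print(retries_per_session(sessions))  # 0.5
```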

The breakdown below is from roughly 850 sessions over six months. Not a rigorous sample, but directionally solid:

Where agent failures actually happen (production, ~850 sessions)

Failure type	Share of failures
Wrong tool called	31%
Correct tool, wrong params	24%
Correct answer, user did not understand format	19%
Missing context / wrong assumption	17%
Refusal when it should not have refused	9%

Wrong tool and wrong params together account for 55% of failures. That points to a tool design problem, not a model problem, which is good news: it's more fixable.

The practical setup

Here's the full structure I use, in case you want to copy it directly:

agent/
├── eval/
│   ├── cases.csv          # labeled test cases
│   ├── run_evals.py       # scoring script
│   ├── results/           # json results per run
│   └── baseline.json      # last known good score
├── prompts/
│   ├── system.txt         # system prompt
│   └── tools.yaml         # tool definitions
└── .github/
    └── workflows/
        └── evals.yml      # CI: run evals on prompt changes

The CI workflow is about ten lines:

on:
  push:
    paths: ['agent/prompts/**']

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python agent/eval/run_evals.py --fail-below 85

--fail-below 85 makes the job fail if the score drops under 85%, which blocks the merge as long as the check is required on your main branch. This number will feel uncomfortable the first time a prompt change you were proud of gets blocked by it. That's the point.

Start here if you haven't done any of this

1. Write down 20 tasks your agent actually handles. Just the input, in plain text.
2. Run each one. Write "pass" or "fail" next to each one.
3. Look at the failures. Write one sentence about what went wrong for each.
4. Fix the most common root cause.
5. Re-run the 20.

That's it. That's the eval loop. Do that once a week and your agent will be meaningfully better in a month. Add the CSV, the script, and the CI gate when you've done it a few times by hand and it feels worth automating.

The fancy stuff comes later. The list comes first.