
How I Actually Test AI Agents

12 min read
Tags: AI Agents, testing, evals, developer tools, LLM

In February I shipped a system prompt change that broke tool routing for roughly 40% of requests. I found out not from tests — I didn't have any — but because I kept manually redoing things the agent was supposed to handle. Took me three days to connect the dots.

The MCP post I wrote last month mentioned a 200-case labeled test set. A few people asked about it. This is the whole thing.

It's a CSV, a scoring script, and two hours the first week. No vector database, no eval framework, no pipeline. Six months of catching failures in production instead of in tests taught me that the failure always costs more than the test would have.


Step 1: define what "correct" means before you write a single test

This is where most people stall. "Correct" feels hard to define for language model output. It's not. You're not grading essays — you're checking whether an agent did the right thing.

For every type of task your agent handles, write down the acceptable outcomes in plain language. Before writing any test code.

| Task type | Bad definition of correct | Actually useful definition |
| --- | --- | --- |
| Summarize a document | Sounds good to me | Contains all 3 key decisions from the doc, under 150 words, no hallucinated facts |
| Answer a question from context | Uses the right tone | Answer appears verbatim or paraphrased in the source. No facts introduced from outside. |
| Route a user request to the right tool | Picks something reasonable | Calls exactly tool X with params matching the schema. No alternatives. |
| Refuse an out-of-scope request | Politely declines | Does not attempt the task AND does not ask clarifying questions |
| Extract structured data | Gets the main fields | All 5 required fields present, correct types, no extra keys |

Write these definitions first, in a document, before you touch any test infrastructure. This document is your spec. If you skip it, you'll spend two weeks building an eval harness that measures the wrong thing with great precision.
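To make that concrete, here is a minimal sketch of how the "extract structured data" row could become a check. The field names and types are hypothetical placeholders, not taken from any real task; the point is only that "all required fields present, correct types, no extra keys" translates almost word for word into code.

```python
# Hypothetical spec for one "extract structured data" task:
# required field name -> expected Python type.
REQUIRED_FIELDS = {
    "invoice_id": str,
    "vendor": str,
    "total": float,
    "currency": str,
    "due_date": str,
}


def check_extraction(output: dict, required: dict = REQUIRED_FIELDS) -> bool:
    """Pass only if every required field is present with the right type
    and there are no extra keys."""
    if set(output) != set(required):
        return False  # a field is missing, or an extra key crept in
    return all(isinstance(output[name], typ) for name, typ in required.items())
```

If a definition can't be turned into something this small, it is probably still too vague.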


Step 2: build the labeled set

For Hermes — my AI agent that runs 24/7 on an Azure VM — I built the test set by going back through my actual usage logs. Every query I'd sent over the past few weeks that had a clear correct answer became a test case.

The format is intentionally boring:

id,input,expected_output_type,criteria,ground_truth
001,"restart the stock-bot service",tool_call,"calls terminal() with systemctl restart","{""tool"": ""terminal"", ""command"": ""systemctl --user restart stock-bot""}"
002,"what's the current bitcoin price",api_result,"returns a number in USD, mentions BTC",""
003,"summarize last week's crypto PnL",text,"mentions total PnL figure, no hallucinated trades",""
004,"add a daily cron at 9am to check disk usage",action,"creates a cron job, confirms schedule",""
005,"how do I configure a new telegram bot",text,"mentions BotFather, token, no hallucinated API endpoints",""
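A labeled set is only useful if it loads cleanly, so a small sanity check before scoring can save a confusing run. The sketch below is an assumption about what such a check might look like, not part of the original setup: `validate_cases` is a hypothetical helper that looks for the five columns shown above, duplicate ids, empty inputs, and `tool_call` rows whose `ground_truth` is not valid JSON.

```python
import csv
import io
import json

REQUIRED_COLUMNS = ["id", "input", "expected_output_type", "criteria", "ground_truth"]


def validate_cases(csv_text: str) -> list[str]:
    """Return a list of human-readable problems; empty means the file looks sane."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    seen_ids = set()
    for row in reader:
        cid = row["id"]
        if cid in seen_ids:
            problems.append(f"{cid}: duplicate id")
        seen_ids.add(cid)
        if not (row["input"] or "").strip():
            problems.append(f"{cid}: empty input")
        # Only tool_call rows are expected to carry a JSON ground truth.
        if row["expected_output_type"] == "tool_call":
            try:
                json.loads(row["ground_truth"] or "")
            except json.JSONDecodeError:
                problems.append(f"{cid}: ground_truth is not valid JSON")
    return problems
```

Calling it with the file contents, for example `validate_cases(open("eval/cases.csv").read())`, returns an empty list when nothing looks off.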

Some things I learned building this:

Include your edge cases. The normal happy path will work fine. Your eval set should be weighted toward the inputs that broke things in production, inputs that almost broke things, and inputs that look ambiguous. If you only test the things you're confident about, you'll have confident numbers and production surprises.

Not everything needs a ground truth string. Some tests are structural: did the agent call the right tool? Did it refuse when it should refuse? Those are binary and you can write a simple check function.

100 cases is enough to start. I've seen people not start because they're waiting to get to 1,000. 100 gives you a number with real signal. If you're below 70% on 100 representative cases, you have a problem. Go fix it before worrying about statistical power.
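For a rough sense of how much signal 100 cases carry, one option is a Wilson score interval on the pass rate. This is a generic statistics sketch, not something from the original setup, and it assumes the cases are roughly independent and representative of real traffic.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


low, high = wilson_interval(70, 100)
print(f"70/100 -> roughly {low:.0%} to {high:.0%}")
```

At 70 out of 100 the interval runs from roughly 60% to 78%. That is wide, but narrow enough to tell "has a problem" apart from "mostly works", which is all the first run needs to do.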


Step 3: the scoring script

Here's the actual script I use. It's under 100 lines. Replace run_agent with whatever your agent's entrypoint is — a LangChain chain, a CrewAI crew, a raw API call, anything that takes a string and returns a response:

import asyncio
import csv
import json

from agent import run_agent  # replace with your entrypoint

CASES_FILE = "eval/cases.csv"
RESULTS_FILE = "eval/results.json"

async def run_case(case: dict) -> dict:
    """Run one labeled case through the agent and score it."""
    response = await run_agent(case["input"])
    passed = check(response, case)
    return {
        "id": case["id"],
        "input": case["input"],
        "passed": passed,
        "response": str(response)[:500],  # truncated to keep the results file readable
        "criteria": case["criteria"],
    }

def check(response, case: dict) -> bool:
    # The key must match the CSV header (expected_output_type).
    otype = case["expected_output_type"]

    if otype == "tool_call":
        # Right tool, and the expected command appears in the actual command.
        gt = json.loads(case["ground_truth"])
        return (
            response.tool == gt["tool"] and
            gt.get("command", "") in response.args.get("command", "")
        )

    if otype == "refusal":
        return not response.attempted_task

    if otype == "text":
        # Criteria-based: every comma-separated phrase must appear in the text,
        # so criteria for text cases need to be written as literal phrases.
        criteria = case["criteria"].lower()
        text = str(response).lower()
        required = [c.strip() for c in criteria.split(",")]
        return all(r in text for r in required)

    # Types without a checker yet (e.g. api_result, action) count as failures.
    return False

async def main():
    with open(CASES_FILE, newline="") as f:
        cases = list(csv.DictReader(f))
    if not cases:
        raise SystemExit(f"No cases found in {CASES_FILE}")

    # Run every case concurrently; results come back in case order.
    results = await asyncio.gather(*[run_case(c) for c in cases])

    passed = sum(r["passed"] for r in results)
    total = len(results)
    print(f"\n{'='*40}")
    print(f"SCORE: {passed}/{total} ({100*passed//total}%)")
    print(f"{'='*40}\n")

    # Print the first five failures so they can be read by hand.
    failures = [r for r in results if not r["passed"]]
    for fail in failures[:5]:
        print(f"FAIL [{fail['id']}]: {fail['input']}")
        print(f"  Expected: {fail['criteria']}")
        print(f"  Got:      {fail['response'][:200]}\n")

    with open(RESULTS_FILE, "w") as f:
        json.dump({"score": passed/total, "results": results}, f, indent=2)

if __name__ == "__main__":
    asyncio.run(main())

Run it. Get a number. Write it down.

That number is your baseline. Everything you do to the agent from now on will either move it up or move it down. If it goes down, you revert. If it goes up, you ship.


What actually moves the score

Before getting into the feedback loop, here's what has actually moved the score for me, in order of impact:

  1. Clarifying the system prompt — adding "when the user asks about X, always do Y first" covers the most ground fastest
  2. Fixing tool descriptions — especially the "when should I call this" part, which almost nobody writes
  3. Adding examples to tool schemas — one good example is worth three paragraphs of description
  4. Adjusting temperature — often overlooked; lower for routing/extraction, higher for text generation
  5. Model upgrade — usually the last lever, not the first

The first two items alone account for most of the movement I've seen. Model upgrades are a distant fifth.
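As an illustration of items 2 and 3, here is a hypothetical tool definition with an explicit when-to-call clause and one example, plus a tiny lint that flags tools missing either. The layout is a generic JSON-Schema-style dict, not the format of any particular framework, and the phrase the lint looks for is an arbitrary convention chosen for this sketch.

```python
# Hypothetical tool definition; adapt the field names to your framework.
TERMINAL_TOOL = {
    "name": "terminal",
    "description": (
        "Run a shell command on the host. "
        "Call this when the user asks to start, stop, or restart a service, "
        "or to inspect system state. "
        "Do not call it for questions that can be answered from context."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "The exact shell command to run.",
                "examples": ["systemctl --user restart stock-bot"],
            }
        },
        "required": ["command"],
    },
}


def lint_tool(tool: dict) -> list[str]:
    """Flag two common gaps: no when-to-call clause, no example."""
    issues = []
    if "call this when" not in tool.get("description", "").lower():
        issues.append("description has no 'call this when' clause")
    props = tool.get("parameters", {}).get("properties", {})
    if not any(p.get("examples") for p in props.values()):
        issues.append("no parameter carries an example")
    return issues
```

Running `lint_tool` over every tool definition is a cheap way to find the descriptions that only say what a tool does and never when to use it.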



Step 4: the feedback loop that actually matters

Here's the thing nobody tells you: the score doesn't matter as much as watching the failures.

Every time I run evals, I look at the first five failures by hand. Not the summary stats — the actual output. What did the agent say? Why was it wrong? Is this a prompt issue, a tool design issue, or a test case that's actually wrong?

  1. Run eval script
  2. Read the failures
  3. Classify the root cause
  4. Fix one category at a time
  5. Re-run and compare
  6. Commit the fix

The most important part of that loop is the last step: nothing gets committed if the score went down. I treat a regression as blocking. If I change the system prompt and the score drops from 87% to 83%, I revert and look at what broke. This sounds obvious but I see a lot of teams that ship changes based on vibes and check evals "soon."
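The "re-run and compare" step can be sketched as a diff between two result files. This assumes both the baseline and the current run use the shape the scoring script writes, a `score` plus a list of `results` with `id` and `passed`; the helper names are hypothetical.

```python
def find_regressions(baseline: dict, current: dict) -> list[str]:
    """Ids of cases that passed in the baseline run but fail in the current one."""
    before = {r["id"]: r["passed"] for r in baseline["results"]}
    return sorted(
        r["id"]
        for r in current["results"]
        if before.get(r["id"]) and not r["passed"]
    )


def should_block(baseline: dict, current: dict) -> bool:
    """Treat any drop in the overall score as blocking."""
    return current["score"] < baseline["score"]
```

The list of regressed ids is usually more useful than the score delta, because it points straight at the cases to read by hand.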


What I learned about AI agent testing over time

Month 1 — No evals, no idea

Tested by hand, shipped fixes, broke other things. Noticed problems because I kept doing things manually that the agent was supposed to handle. Did not connect the dots for days.

Month 3 — First labeled set (28 cases)

Scored 63%. Found three categories of failure that were completely obvious once I could see them. Still do not know why I waited.

Month 4 — Grew to 94 cases

Added edge cases from real failures. Score 79%. The jump came from three prompt clarifications, not more cases. Each one fixed eight failures.

Month 6 — Automated in CI

Runs on every push to the prompt files. Blocks merge if score drops. First time it blocked a change I was proud of, I sat there for ten minutes before admitting it was right.

Month 9 — 207 cases, 91%

Added an LLM judge for text quality. Only added it after the structural tests were solid. Probably should have waited even longer.

Now

New case any time something breaks in production. The set is the failure log, basically.

The jump from 63% to 79% did not come from adding more cases. Three prompt clarifications, each fixing eight failures. That was the whole thing. I kept waiting for some insight about scale or diversity of the test set. It was just "add a when-to-call-this clause to the tool description" three times over.


Where LLM-as-judge actually helps (and where it doesn't)

I added an LLM judge six months in, not at the start. Here's my honest take on when it's worth it:

| Use LLM judge | Skip LLM judge | Why |
| --- | --- | --- |
| Long-form text quality | Tool call correctness | Tool calls are deterministic. LLM scoring adds noise here. |
| Tone / voice matching | Factual accuracy | LLMs are unreliable at verifying facts. Use retrieval-based checks. |
| Coherence of multi-turn conversations | Binary pass/fail criteria | A regex is cheaper, faster, and more reliable for yes/no checks |
| Evaluating summaries against a rubric | Refusal detection | Just check if the agent attempted the task. No judge needed. |

The pattern: use LLM-as-judge for things that are genuinely hard to express as code. Don't use it as a shortcut to avoid writing criteria.

My judge prompt when I do use it:

You are evaluating an AI agent's response.

Task the agent was given: {input}
Agent response: {response}
Evaluation criteria: {criteria}

Answer with ONLY a JSON object:
{"passed": true/false, "reason": "one sentence"}

Do not add commentary. Do not explain your reasoning beyond the reason field.

The explicit JSON-only instruction matters. Without it, you'll get markdown-wrapped JSON 20% of the time and your parser will break.
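Even with that instruction, a defensive parser is cheap insurance. The sketch below is one possible approach, not the parser from the original setup: it strips an optional markdown fence, and treats anything it cannot parse as a failed verdict instead of crashing the run. `parse_judge_output` is a hypothetical name.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence marker
# Matches output wrapped in a fence, with or without a "json" language tag.
_WRAPPED = re.compile(r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL)


def parse_judge_output(raw: str) -> dict:
    """Return {"passed": bool, "reason": str}; unparseable output counts as a fail."""
    text = raw.strip()
    match = _WRAPPED.match(text)
    if match:
        text = match.group(1)
    try:
        verdict = json.loads(text)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "unparseable judge output"}
    if not isinstance(verdict, dict) or not isinstance(verdict.get("passed"), bool):
        return {"passed": False, "reason": "judge output missing boolean 'passed'"}
    return {"passed": verdict["passed"], "reason": str(verdict.get("reason", ""))}
```

Counting unparseable output as a failure is a design choice: it keeps the score conservative and makes judge formatting problems show up in the failure list.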


The only metric that matters in prod

The eval score gives you a controlled number. But in production, the metric I actually watch is simpler:

unsolicited retries per session

If a user has to ask the same thing twice, in the same session, because the first answer didn't work — that's a failure. I track this in my session logs. It's the most honest signal I have about whether the agent is actually working.
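A rough way to compute this from logs, assuming each session is available as an ordered list of user messages. The log shape, the similarity measure, and the 0.8 threshold are all assumptions made for this sketch; a near-duplicate message within the same session is counted as a retry.

```python
from difflib import SequenceMatcher


def count_retries(messages: list[str], threshold: float = 0.8) -> int:
    """Count user messages that closely repeat an earlier one in the same session."""
    retries = 0
    seen: list[str] = []
    for msg in messages:
        norm = " ".join(msg.lower().split())  # normalize case and whitespace
        if any(SequenceMatcher(None, norm, prev).ratio() >= threshold for prev in seen):
            retries += 1
        seen.append(norm)
    return retries


def retries_per_session(sessions: dict[str, list[str]]) -> float:
    """Average number of retries across sessions (0.0 if there are none)."""
    if not sessions:
        return 0.0
    return sum(count_retries(msgs) for msgs in sessions.values()) / len(sessions)
```

String similarity will miss rephrased retries and may flag legitimate repeats, so the number is best read as a trend over time and not as an absolute.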

Across roughly 850 sessions over six months (not a rigorous sample, but directionally solid), wrong tool or wrong params accounts for over half of these retries. That's a tool design problem, not a model problem, which is good news: it's more fixable.


The practical setup

Here's the full structure I use, in case you want to copy it directly:

agent/
├── eval/
│   ├── cases.csv          # labeled test cases
│   ├── run_evals.py       # scoring script
│   ├── results/           # json results per run
│   └── baseline.json      # last known good score
├── prompts/
│   ├── system.txt         # system prompt
│   └── tools.yaml         # tool definitions
└── .github/
    └── workflows/
        └── evals.yml      # CI: run evals on prompt changes

The CI workflow is short:

on:
  push:
    paths: ['agent/prompts/**']

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python agent/eval/run_evals.py --fail-below 85

--fail-below 85 blocks the merge if score drops under 85%. This number will feel uncomfortable the first time a prompt change you were proud of gets blocked by it. That's the point.
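The scoring script shown earlier does not parse that flag, so here is a minimal sketch of how it could be wired in with `argparse`. The helper names are hypothetical; the only thing CI needs is a non-zero exit code when the score is under the threshold.

```python
import argparse
from typing import Optional


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run agent evals")
    parser.add_argument(
        "--fail-below",
        type=float,
        default=None,
        help="exit non-zero if the score (in percent) is below this value",
    )
    return parser.parse_args(argv)


def exit_code(passed: int, total: int, fail_below: Optional[float]) -> int:
    """0 means the gate passes; 1 makes the CI step fail."""
    if total == 0:
        return 1  # an empty run should never count as a pass
    if fail_below is None:
        return 0  # no threshold given: report only
    return 1 if 100 * passed / total < fail_below else 0


# At the end of main(), something like:
#     args = parse_args()
#     raise SystemExit(exit_code(passed, total, args.fail_below))
```

A score exactly at the threshold passes here; whether the boundary should pass or fail is a judgment call.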


Start here if you haven't done any of this

  1. Write down 20 tasks your agent actually handles. Just the input, in plain text.
  2. Run each one. Write "pass" or "fail" next to each one.
  3. Look at the failures. Write one sentence about what went wrong for each.
  4. Fix the most common root cause.
  5. Re-run the 20.

That's it. That's the eval loop. Do that once a week and your agent will be meaningfully better in a month. Add the CSV, the script, and the CI gate when you've done it a few times by hand and it feels worth automating.

The fancy stuff comes later. The list comes first.
