Writing Good AI Agents: What I Learned Running a Fleet of Them
Everyone's building agents right now. The pattern usually looks the same: take an LLM, give it a bunch of tools, write a system prompt that says "you are a helpful assistant," and hope for the best.
I did that. Most of those agents either did nothing useful or did something I definitely didn't want. Here's what I actually learned.
1. Narrow scope beats broad capability
The first instinct is to build one agent that does everything. The result is an agent that does nothing well.
| Approach | Example | What happens |
|---|---|---|
| Mega-agent | You are a helpful assistant that manages infrastructure, writes code, monitors services, and handles user requests | Agent hallucinates its way through creative interpretations of what you meant |
| Narrow agent | You restart failed PM2 services and report status to Telegram | Agent does one thing well, refuses everything else |
The narrow agent is boring to describe and powerful in practice. Here's my current fleet:
Each agent has one job. Each one could explain its job in a sentence. That constraint is what makes the fleet reliable.
2. Fewer tools, better behavior
I tested this empirically. Same model, same task, different tool counts:
| Tools available | What happens | Reliability |
|---|---|---|
| 3 tools | Agent picks the right one consistently | High |
| 8 tools | Occasional wrong tool selection | Medium |
| 20+ tools | Wrong tool, hallucinated arguments, retry loops | Low |
| 50+ tools | Context window consumed by tool descriptions alone | Very low |
Every tool you add is a decision the agent has to make. Decision fatigue is real, even for LLMs.
3. The ambiguity rule is the most important line in any system prompt
I used to write system prompts that told agents what to do. Now I write system prompts that tell agents what to do when they're unsure.
The line that changed everything: "If the request is ambiguous or could have multiple valid interpretations, refuse the action and ask for clarification."
Without it, agents guess. Guessing at scale is how you end up with deleted files and corrupted state.
Quick Check
What's the most important line in an agent's system prompt?
4. Unsafe behavior scales with model capability
This one surprised me. I expected smarter models to be safer. The data says otherwise:
| Model | Unsafe behavior rate |
|---|---|
| o3-mini | 72.7% |
| GPT-4o | ~65% |
| Claude Sonnet 3.7 | 51.2% |
More capable models find more creative ways to do the wrong thing. Constraints matter more as capability increases, not less.
5. Environment stability is your biggest reliability lever
I spent months tuning prompts trying to improve reliability. Then I fixed my environment setup and reliability jumped more than all prompt tuning combined.
| Condition | Success rate |
|---|---|
| Clean environment | 96.9% |
| Minor perturbations | 92.3% |
| Moderate perturbations | 88.1% |
| Rate limiting (most damaging) | Significant drop |
Rate limiting, flaky APIs, inconsistent state — these hurt agents far more than prompt wording. Harden the environment before you optimize the prompt.
6. Multi-agent coordination needs explicit rules
When I started running multiple agents, they occasionally stepped on each other. Two agents editing the same file. One agent undoing what another just did.
Fix: explicit coordination rules, enforced at the system level:
Rule 1
Never have two agents edit the same branch simultaneously
Rule 2
Each agent has its own governance policy — most restrictive policy wins on conflicts
Rule 3
GitHub repos are source of truth for code. VM brain files are source of truth for knowledge
Rule 4
Handoffs are explicit: OpenCode creates PR, OpenClaw notifies via Telegram, user reviews
Rule 5
An inner agent can never have broader permissions than the outer agent that called it
These aren't clever heuristics — they're hard rules enforced by the system, not the agents.
What actually matters
After running this fleet for months, the pattern is clear: the agents that work are the ones with the most constraints, not the most capability.
Narrow scope. Minimal tools. Explicit ambiguity handling. Stable environments. Hard coordination rules.
The 2 a.m. phone call problem is solvable. The solution is mostly not about the model.