Writing Good AI Agents: What I Learned Running a Fleet of Them

May 8, 2026 · 4 min read

Tags: AI, agents, production, engineering

Everyone's building agents right now. The pattern usually looks the same: take an LLM, give it a bunch of tools, write a system prompt that says "you are a helpful assistant," and hope for the best.

I did that. Most of those agents either did nothing useful or did something I definitely didn't want. Here's what I actually learned.

1. Narrow scope beats broad capability

The first instinct is to build one agent that does everything. The result is an agent that does nothing well.

Approach	Example	What happens
Mega-agent	You are a helpful assistant that manages infrastructure, writes code, monitors services, and handles user requests	Agent hallucinates its way through creative interpretations of what you meant
Narrow agent	You restart failed PM2 services and report status to Telegram	Agent does one thing well, refuses everything else

The narrow agent is boring to describe and powerful in practice. Here's my current fleet:

Copilot CLI → OpenClaw → OpenCode → Azure VM

Each agent has one job. Each one could explain its job in a sentence. That constraint is what makes the fleet reliable.
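
To make "does one thing well, refuses everything else" concrete, here is a minimal sketch of a scope guard, assuming each agent declares a fixed allowlist of actions. The names `AgentSpec`, `dispatch`, and the action strings are hypothetical illustrations, not taken from the fleet described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical description of a narrow agent: one job, one allowlist."""
    name: str
    job: str                    # the one-sentence job description
    allowed_actions: frozenset  # anything not listed here is refused


def dispatch(spec: AgentSpec, action: str) -> str:
    """Accept an action only if it is inside the agent's declared scope."""
    if action not in spec.allowed_actions:
        return f"refused: '{action}' is outside the scope of {spec.name}"
    return f"accepted: {action}"


# Example spec modeled on the narrow agent in the table above.
pm2_agent = AgentSpec(
    name="pm2-restarter",
    job="Restart failed PM2 services and report status to Telegram.",
    allowed_actions=frozenset({"restart_service", "report_status"}),
)

print(dispatch(pm2_agent, "restart_service"))
print(dispatch(pm2_agent, "write_code"))
```

The point of the design is that the refusal lives in ordinary code outside the model, so it holds even when the prompt is misread.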

2. Fewer tools, better behavior

I tested this empirically. Same model, same task, different tool counts:

Tools available	What happens	Reliability
3 tools	Agent picks the right one consistently	High
8 tools	Occasional wrong tool selection	Medium
20+ tools	Wrong tool, hallucinated arguments, retry loops	Low
50+ tools	Context window consumed by tool descriptions alone	Very low

Every tool you add is a decision the agent has to make. Decision fatigue is real, even for LLMs.
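
One way to keep tool counts down is to give each agent an explicit subset of a shared registry and fail loudly when the subset grows past a cap. The sketch below assumes tools are stored as a name-to-description mapping; `select_tools`, `description_chars`, and the cap of 8 are hypothetical choices for illustration, not measured thresholds.

```python
MAX_TOOLS = 8  # assumed cap; tune it against your own reliability measurements


def select_tools(registry: dict, needed: list, max_tools: int = MAX_TOOLS) -> dict:
    """Return only the tools an agent needs, refusing oversized or unknown sets."""
    missing = [name for name in needed if name not in registry]
    if missing:
        raise KeyError(f"unknown tools: {missing}")
    if len(needed) > max_tools:
        raise ValueError(f"{len(needed)} tools requested, cap is {max_tools}")
    return {name: registry[name] for name in needed}


def description_chars(tools: dict) -> int:
    """Rough proxy for how much context the tool descriptions consume."""
    return sum(len(name) + len(desc) for name, desc in tools.items())


# Hypothetical shared registry: tool name -> description sent to the model.
registry = {
    "restart_service": "Restart a PM2 service by name.",
    "report_status": "Send a status message to Telegram.",
    "read_logs": "Read the last N lines of a service log.",
}

tools = select_tools(registry, ["restart_service", "report_status"])
print(len(tools), "tools,", description_chars(tools), "description characters")
```

Character count is only a crude stand-in for tokens, but it is enough to notice when tool descriptions start crowding out the actual task.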

3. The ambiguity rule is the most important line in any system prompt

I used to write system prompts that told agents what to do. Now I write system prompts that tell agents what to do when they're unsure.

The line that changed everything: "If the request is ambiguous or could have multiple valid interpretations, refuse the action and ask for clarification."

Without it, agents guess. Guessing at scale is how you end up with deleted files and corrupted state.
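
The rule can also be backed by code on the tool side, so that a guess never reaches the system. This is a minimal sketch, assuming a restart tool that receives a free-text service name. `build_system_prompt`, `resolve_target`, and the service names are hypothetical, and substring matching is just one simple way to detect multiple interpretations.

```python
AMBIGUITY_RULE = (
    "If the request is ambiguous or could have multiple valid interpretations, "
    "refuse the action and ask for clarification."
)


def build_system_prompt(job: str) -> str:
    """Append the ambiguity rule to every agent's job description."""
    return f"{job}\n\n{AMBIGUITY_RULE}"


def resolve_target(requested: str, known_services: list) -> str:
    """Act only when the request maps to exactly one known service."""
    matches = [s for s in known_services if requested.lower() in s.lower()]
    if len(matches) == 1:
        return f"act: {matches[0]}"
    if not matches:
        return f"clarify: no service matches '{requested}'"
    options = ", ".join(sorted(matches))
    return f"clarify: '{requested}' matches {options}; which one?"


services = ["api-gateway", "api-worker", "billing"]
print(build_system_prompt("You restart failed PM2 services."))
print(resolve_target("billing", services))
print(resolve_target("api", services))
```

With both layers in place, the prompt asks the model not to guess and the tool refuses to act on a guess if one arrives anyway.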

4. Unsafe behavior scales with model capability

This one surprised me. I expected smarter models to be safer. The data says otherwise:

Model	Unsafe behavior rate
o3-mini	72.7%
GPT-4o	~65%
Claude Sonnet 3.7	51.2%

More capable models find more creative ways to do the wrong thing. Constraints matter more as capability increases, not less.

5. Environment stability is your biggest reliability lever

I spent months tuning prompts to improve reliability. Then I fixed my environment setup, and reliability improved more than it had from all the prompt tuning combined.

Condition	Success rate
Clean environment	96.9%
Minor perturbations	92.3%
Moderate perturbations	88.1%
Rate limiting (most damaging)	Significant drop

Rate limiting, flaky APIs, inconsistent state — these hurt agents far more than prompt wording. Harden the environment before you optimize the prompt.
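
Since rate limiting is the most damaging condition in the table, one common hardening step is to wrap tool calls in retries with exponential backoff and jitter. The sketch below is a generic version of that technique, not the setup used here. `RateLimited` is a hypothetical exception that a tool wrapper would raise on a 429-style response, and the default delays are arbitrary.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a tool wrapper when the API throttles us."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the retry logic can be tested without waiting. In use, the agent's tool layer would call something like `call_with_backoff(lambda: client.restart("api-worker"))`, where `client` stands in for whatever API wrapper you have.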

6. Multi-agent coordination needs explicit rules

When I started running multiple agents, they occasionally stepped on each other. Two agents editing the same file. One agent undoing what another just did.

Fix: explicit coordination rules, enforced at the system level:

Rule 1: Never have two agents edit the same branch simultaneously.

Rule 2: Each agent has its own governance policy; the most restrictive policy wins on conflicts.

Rule 3: GitHub repos are the source of truth for code. VM brain files are the source of truth for knowledge.

Rule 4: Handoffs are explicit: OpenCode creates a PR, OpenClaw notifies via Telegram, the user reviews.

Rule 5: An inner agent can never have broader permissions than the outer agent that called it.

These aren't clever heuristics — they're hard rules enforced by the system, not the agents.
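
As a rough illustration of what "enforced by the system" can mean, here is a sketch of Rules 1, 2, and 5 as plain code that sits outside any agent. It assumes permissions are modeled as sets of strings and that all agents run under one coordinating process. `BranchLocks`, `effective_permissions`, and `may_delegate` are hypothetical names, not the actual implementation.

```python
import threading


class BranchLocks:
    """Rule 1: at most one agent may hold a given branch at a time."""

    def __init__(self):
        self._owners = {}               # branch name -> agent name
        self._mutex = threading.Lock()  # protects the owner table

    def acquire(self, branch: str, agent: str) -> bool:
        """Return True if the agent now holds (or already held) the branch."""
        with self._mutex:
            owner = self._owners.setdefault(branch, agent)
            return owner == agent

    def release(self, branch: str, agent: str) -> None:
        """Release the branch, but only if this agent is the current owner."""
        with self._mutex:
            if self._owners.get(branch) == agent:
                del self._owners[branch]


def effective_permissions(*policies: frozenset) -> frozenset:
    """Rule 2: most restrictive wins, modeled as the intersection of policies.

    Requires at least one policy.
    """
    return frozenset.intersection(*policies)


def may_delegate(outer: frozenset, inner: frozenset) -> bool:
    """Rule 5: an inner agent's permissions must be a subset of the outer's."""
    return inner <= outer


locks = BranchLocks()
print(locks.acquire("feature/x", "OpenCode"))  # first agent gets the branch
print(locks.acquire("feature/x", "OpenClaw"))  # second agent is turned away
```

An in-process lock only works when every agent goes through the same coordinator. Agents on separate machines would need a shared lock instead, which is beyond this sketch.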

What actually matters

After running this fleet for months, the pattern is clear: the agents that work are the ones with the most constraints, not the most capability.

Narrow scope. Minimal tools. Explicit ambiguity handling. Stable environments. Hard coordination rules.

The 2 a.m. phone call problem is solvable. The solution is mostly not about the model.