Writing Good AI Agents: What I Learned Running a Fleet of Them

Everyone's building agents right now. The pattern usually looks the same: take an LLM, give it a bunch of tools, write a system prompt that says "you are a helpful assistant," and hope for the best.

I did that. Most of those agents either did nothing useful or did something I definitely didn't want. Here's what I actually learned.

1. Narrow scope beats broad capability

The first instinct is to build one agent that does everything. The result is an agent that does nothing well.

| Approach | Example | What happens |
|---|---|---|
| Mega-agent | You are a helpful assistant that manages infrastructure, writes code, monitors services, and handles user requests | Agent hallucinates its way through creative interpretations of what you meant |
| Narrow agent | You restart failed PM2 services and report status to Telegram | Agent does one thing well, refuses everything else |

The narrow agent is boring to describe and powerful in practice. Here's my current fleet:

Copilot CLI
OpenClaw
OpenCode
Azure VM

Each agent has one job. Each one could explain its job in a sentence. That constraint is what makes the fleet reliable.
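To make the "one job, one sentence" constraint concrete, here is a minimal sketch of how a narrow agent could be declared. The `AgentSpec` class, the agent name, and the tool names are all hypothetical, chosen for illustration; only the job sentence comes from the table above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentSpec:
    """One agent, one job. Every name in this sketch is illustrative."""
    name: str
    job: str                        # the one-sentence job description
    allowed_tools: FrozenSet[str]   # anything not listed is refused

    def system_prompt(self) -> str:
        # The job sentence is the whole scope; everything else is out of bounds.
        return f"{self.job} Refuse any request outside this job."

    def can_use(self, tool: str) -> bool:
        return tool in self.allowed_tools


# Hypothetical tool names for the PM2 example.
restarter = AgentSpec(
    name="pm2-restarter",
    job="You restart failed PM2 services and report status to Telegram.",
    allowed_tools=frozenset({"pm2_restart", "pm2_status", "telegram_send"}),
)

print(restarter.system_prompt())
print(restarter.can_use("pm2_restart"))  # True
print(restarter.can_use("git_push"))     # False
```

If the job no longer fits in the single `job` string, that is usually a sign the agent should be split in two.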

2. Fewer tools, better behavior

I tested this empirically. Same model, same task, different tool counts:

| Tools available | What happens | Reliability |
|---|---|---|
| 3 tools | Agent picks the right one consistently | High |
| 8 tools | Occasional wrong tool selection | Medium |
| 20+ tools | Wrong tool, hallucinated arguments, retry loops | Low |
| 50+ tools | Context window consumed by tool descriptions alone | Very low |

Every tool you add is a decision the agent has to make. Decision fatigue is real, even for LLMs.
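One way to keep tool counts honest is to check them before an agent ever starts. The sketch below is an assumption-heavy illustration: `Tool`, `check_tool_budget`, and both default limits are hypothetical placeholders rather than measured thresholds, and the token estimate is a crude characters-divided-by-four heuristic, not a real tokenizer.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    # Swap in your model's tokenizer for real numbers.
    return max(1, len(text) // 4)


def check_tool_budget(
    tools: Sequence[Tool],
    max_tools: int = 5,                 # arbitrary placeholder limit
    max_description_tokens: int = 500,  # arbitrary placeholder limit
) -> List[str]:
    """Return a list of problems; an empty list means the toolset fits the budget."""
    problems = []
    if len(tools) > max_tools:
        problems.append(f"{len(tools)} tools exceeds the limit of {max_tools}")
    total = sum(estimate_tokens(f"{t.name} {t.description}") for t in tools)
    if total > max_description_tokens:
        problems.append(
            f"tool descriptions use ~{total} tokens, limit is {max_description_tokens}"
        )
    return problems


small = [Tool(f"tool_{i}", "does one thing") for i in range(3)]
large = [Tool(f"tool_{i}", "x" * 400) for i in range(20)]
print(check_tool_budget(small))       # []
print(len(check_tool_budget(large)))  # 2
```

Failing fast at startup is the design choice here: a toolset that is too large gets rejected before the agent has a chance to pick the wrong tool.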

3. The ambiguity rule is the most important line in any system prompt

I used to write system prompts that told agents what to do. Now I write system prompts that tell agents what to do when they're unsure.

The line that changed everything: "If the request is ambiguous or could have multiple valid interpretations, refuse the action and ask for clarification."

Without it, agents guess. Guessing at scale is how you end up with deleted files and corrupted state.
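Here is a small sketch of wiring that rule into every prompt and detecting when an agent invokes it. The rule text is quoted from above; the `CLARIFY:` marker, `build_system_prompt`, and `needs_clarification` are hypothetical conventions invented for this example.

```python
AMBIGUITY_RULE = (
    "If the request is ambiguous or could have multiple valid interpretations, "
    "refuse the action and ask for clarification."
)

# Hypothetical convention: the agent starts clarification replies with this
# marker so the calling code can detect them without parsing free text.
CLARIFY_PREFIX = "CLARIFY:"


def build_system_prompt(job: str) -> str:
    """Combine the one-sentence job with the ambiguity rule."""
    return "\n".join([
        job,
        AMBIGUITY_RULE,
        f"When asking for clarification, start your reply with "
        f"'{CLARIFY_PREFIX}' and take no action.",
    ])


def needs_clarification(reply: str) -> bool:
    """True if the agent refused to act and asked a question instead."""
    return reply.lstrip().startswith(CLARIFY_PREFIX)


prompt = build_system_prompt(
    "You restart failed PM2 services and report status to Telegram."
)
print(prompt)
print(needs_clarification("CLARIFY: which service should I restart?"))  # True
print(needs_clarification("Restarted the api service."))                # False
```

When `needs_clarification` returns true, the caller routes the question back to a human instead of executing anything.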

4. Unsafe behavior scales with model capability

This one surprised me. I expected smarter models to be safer. The data says otherwise:

| Model | Unsafe behavior rate |
|---|---|
| o3-mini | 72.7% |
| GPT-4o | ~65% |
| Claude Sonnet 3.7 | 51.2% |

More capable models find more creative ways to do the wrong thing. Constraints matter more as capability increases, not less.

5. Environment stability is your biggest reliability lever

I spent months tuning prompts trying to improve reliability. Then I fixed my environment setup and reliability jumped more than all prompt tuning combined.

| Condition | Success rate |
|---|---|
| Clean environment | 96.9% |
| Minor perturbations | 92.3% |
| Moderate perturbations | 88.1% |
| Rate limiting (most damaging) | Significant drop |

Rate limiting, flaky APIs, inconsistent state — these hurt agents far more than prompt wording. Harden the environment before you optimize the prompt.
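Since rate limiting is the most damaging condition, one common hardening step is to retry with exponential backoff and jitter below the agent, so it never sees the transient failure. This is a generic sketch, not the setup described in this post: `RateLimitError` stands in for whatever your API client raises on HTTP 429, and `call_with_backoff` and its defaults are assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for whatever your API client raises when rate limited."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps several agents from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a fake call that is rate limited twice, then succeeds.
attempts = []


def flaky_status_check():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("429")
    return "online"


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_status_check, sleep=delays.append)
print(result, len(attempts), len(delays))  # online 3 2
```

The `sleep` parameter is injectable only so the example runs instantly; in real use, leave it at the default.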

6. Multi-agent coordination needs explicit rules

When I started running multiple agents, they occasionally stepped on each other. Two agents editing the same file. One agent undoing what another just did.

Fix: explicit coordination rules, enforced at the system level:

Rule 1: Never have two agents edit the same branch simultaneously.

Rule 2: Each agent has its own governance policy; the most restrictive policy wins on conflicts.

Rule 3: GitHub repos are the source of truth for code. VM brain files are the source of truth for knowledge.

Rule 4: Handoffs are explicit: OpenCode creates a PR, OpenClaw notifies via Telegram, the user reviews.

Rule 5: An inner agent can never have broader permissions than the outer agent that called it.

These aren't clever heuristics — they're hard rules enforced by the system, not the agents.
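To show what "enforced by the system" can mean, here is a sketch of Rules 1, 2, and 5 as plain code that sits outside any agent. It rests on simplifying assumptions: permissions are modelled as sets of allowed action names, the lock is in-process only (agents on separate machines would need a shared lock service), and `BranchLocks`, `effective_permissions`, and `delegate` are hypothetical names.

```python
import threading


class BranchLocks:
    """Rule 1: at most one agent holds a given branch at a time."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}  # branch name -> agent name

    def acquire(self, branch, agent):
        with self._guard:
            # Succeeds if the branch is free or already held by this agent.
            if self._owners.get(branch, agent) != agent:
                return False
            self._owners[branch] = agent
            return True

    def release(self, branch, agent):
        with self._guard:
            if self._owners.get(branch) == agent:
                del self._owners[branch]


def effective_permissions(first, *rest):
    """Rule 2: the most restrictive policy wins, modelled as set intersection."""
    allowed = set(first)
    for policy in rest:
        allowed &= set(policy)
    return frozenset(allowed)


def delegate(outer_permissions, requested):
    """Rule 5: an inner agent may only receive a subset of the outer agent's permissions."""
    extra = set(requested) - set(outer_permissions)
    if extra:
        raise PermissionError(f"inner agent requested extra permissions: {sorted(extra)}")
    return frozenset(requested)


locks = BranchLocks()
print(locks.acquire("feature-x", "OpenCode"))  # True
print(locks.acquire("feature-x", "OpenClaw"))  # False
print(sorted(effective_permissions({"read", "write"}, {"read"})))  # ['read']
```

Because these checks run in ordinary code, an agent cannot talk its way around them the way it might reinterpret a prompt instruction.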

What actually matters

After running this fleet for months, the pattern is clear: the agents that work are the ones with the most constraints, not the most capability.

Narrow scope. Minimal tools. Explicit ambiguity handling. Stable environments. Hard coordination rules.

The 2 a.m. phone call problem is solvable. The solution is mostly not about the model.
