
How to Write a Good Skill for an AI Agent

10 min read
Tags: AI Agents, MCP, developer tools, LLM, automation

Last year I gave one of my agents a "deploy" skill. Vague instructions, broad scope, optimistic vibes. It deployed alright. The wrong branch. To production. At 11pm.

That's when I stopped thinking about skills as prompts I write once and forget, and started treating them like contracts. Precise ones. With error handling.

I've built over 50 skills for Hermes, my personal AI agent running 24/7 on an Azure VM. Most of the early ones were embarrassingly bad — not because the agent was dumb, but because I was writing for the wrong reader. I was writing for myself. The agent needed something different.

Here's what actually works.


First, let's get the naming sorted

The ecosystem has a naming problem. Tools. Skills. Plugins. Actions. MCP servers. Everyone means roughly the same thing and calls it something different.

| Term | Framework | What it actually is |
| --- | --- | --- |
| Tool | LangChain, CrewAI, smolagents | A function the LLM can call by name |
| Skill | Hermes, Semantic Kernel | Markdown instructions + optional scripts the agent loads on demand |
| Plugin | Semantic Kernel, ChatGPT | A tool bundled with a manifest |
| MCP Server | Anthropic MCP, Claude | A process that exposes tools over a standard protocol |
| Action | Composio, Arcade | A pre-built tool with auth already handled |

The implementation differs. The design problems don't. Whether you're writing a LangChain tool, an MCP server, or a SKILL.md file — the same mistakes will sink you.


The thing nobody tells you about LLMs as callers

We've spent years learning how to design good APIs. Small functions. Composable. One responsibility each. Callers chain them together.

That works when the caller is a developer. A developer reads your docs. Knows the data model. Handles errors like an adult.

An LLM does none of that reliably.

| Assumption | Human developer | LLM caller |
| --- | --- | --- |
| Knows the data model | Yes, reads docs first | No, guesses from the parameter name |
| Handles multi-step flows | Composes calls in order | Skips steps, forgets IDs, hallucinates params |
| Reads error messages | As debugging info | As literal instructions to follow |
| Retries on failure | With backoff and reasoning | Immediately, same params, same result |
| Knows what it does not know | Usually | Almost never |

That last one is what gets you. The agent is not a more demanding client. It's an honest one — it follows your spec exactly and can't compensate for the parts you left vague. Every gap in your description is a decision point the agent will handle wrong.


What the ecosystem looks like right now

Before getting into how to write skills, it's worth knowing where things stand. The tool-use world has moved fast.

2023 — Function calling lands

OpenAI ships function calling. Agents can finally take real actions. Everyone wraps their APIs in a weekend.

Nov 2024 — MCP launches

Anthropic publishes the Model Context Protocol. ~210 servers at launch. Most people ignore it.

Mar 2025 — OpenAI adopts MCP

OpenAI adds MCP support. Monthly downloads jump from 5M to 22M overnight. Now everyone pays attention.

Q2 2025 — Protocol war over

ChatGPT ships MCP. Cursor, Windsurf, Zed all add support in the same quarter.

Q1 2026 — 9400+ servers

92% of new agent frameworks ship MCP support by default. 78% of enterprise AI teams have it in production.

MCP won. Not because it's technically better — it's actually ~10% slower than native function calling. It won because writing one MCP server takes 4 hours instead of 18. Boring wins.

Repos worth bookmarking:

  • modelcontextprotocol/servers (⭐ 60k) — Anthropic's reference implementations. Start here if you're building MCP tools.
  • ComposioHQ/composio (⭐ 25k) — 1,000+ pre-built integrations with OAuth already handled. If your tool needs auth, Composio probably does it already.
  • e2b-dev/e2b (⭐ 17k) — sandboxed code execution for agents. If your tool runs code, use this. Don't let agents run arbitrary code in your real environment.
  • huggingface/smolagents (⭐ 15k) — the most interesting take: tools are just Python functions, no JSON schema required. Worth reading even if you never use it.

Five things that actually make a skill work

1. Flatten the call chain

The most common mistake I made early on: building composable micro-tools because that's what good software looks like.

# Feels clean. Breaks constantly.
get_user_id(email)          # -> "user_abc123"
get_user_prefs(user_id)     # -> {"theme": "dark"}
update_pref(user_id, key, value)

The agent calls get_user_id, gets an ID back — then forgets it by the time it calls update_pref. Or it hallucinates an ID that looks plausible. Or it skips the middle call entirely because the goal already felt achieved.

# One call. No forgetting.
update_user_preference_by_email(email, key, value)

[Diagram: the agent receives a task; the bad path is multi-step, the good path is a single tool; a confirmation is returned.]

Every step you add is a chance for the agent to go off-script. Fewer steps, fewer chances.

2. Make errors tell the agent what to do next

// Useless
{"error": "invalid_request", "status": 400}

// Actually helpful
{
  "error": "date_range_invalid",
  "detail": "start_date must be before end_date",
  "received": {"start": "2026-06-01", "end": "2026-05-01"},
  "recovery": "swap the two dates and retry"
}

Agents read error messages as instructions. Literally. If your error says "invalid request," the agent will either retry with identical params or invent a creative fix. Give it the actual path forward instead.
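
One way to produce that kind of payload is to build it at the point of validation. The sketch below is an illustration, not a library API: `validate_date_range` and the `date_format_invalid` error code are names I made up, and it assumes dates arrive as `YYYY-MM-DD` strings.

```python
from datetime import date


def validate_date_range(start_date: str, end_date: str) -> dict:
    """Return {"ok": True} or an error payload that names the recovery step."""
    received = {"start": start_date, "end": end_date}
    try:
        start = date.fromisoformat(start_date)
        end = date.fromisoformat(end_date)
    except ValueError:
        return {
            "error": "date_format_invalid",
            "detail": "dates must be formatted as YYYY-MM-DD",
            "received": received,
            "recovery": "reformat both dates as YYYY-MM-DD and retry",
        }
    if start > end:
        return {
            "error": "date_range_invalid",
            "detail": "start_date must not be after end_date",
            "received": received,
            "recovery": "swap the two dates and retry",
        }
    return {"ok": True}
```

Each failure branch gets its own recovery text. A single generic message for both cases would put the agent back to guessing.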

3. Name parameters like you're writing for someone who can't Google anything

The agent has no schema to look up. It infers what to pass from the parameter name alone. user_id looks like a UUID — the agent doesn't have one. user_email_address looks like something it already knows.

| Bad name | Good name | Why |
| --- | --- | --- |
| user_id | user_email_address | Agent has the email. It never has the UUID. |
| store_id | store_location_name | "downtown branch" works. A UUID never will. |
| msg | message_body_plain_text | Removes any ambiguity about format |
| ts | event_timestamp_iso8601 | Agent knows ISO 8601. Not your epoch. |
| type | notification_channel | email or sms is obvious. type 2 is not. |
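
Put together, a tool signature using the good names could look roughly like this. `schedule_notification` is a hypothetical function invented for the example, as are its error codes; the sketch only shows that the names carry the format and the validation backs them up.

```python
from datetime import datetime

ALLOWED_NOTIFICATION_CHANNELS = ("email", "sms")


def schedule_notification(
    user_email_address: str,
    message_body_plain_text: str,
    event_timestamp_iso8601: str,
    notification_channel: str,
) -> dict:
    """Hypothetical tool: each parameter name says what to pass and in what form."""
    if notification_channel not in ALLOWED_NOTIFICATION_CHANNELS:
        return {
            "error": "notification_channel_invalid",
            "recovery": "use one of: " + ", ".join(ALLOWED_NOTIFICATION_CHANNELS),
        }
    try:
        event_time = datetime.fromisoformat(event_timestamp_iso8601)
    except ValueError:
        return {
            "error": "event_timestamp_invalid",
            "recovery": "pass an ISO 8601 timestamp such as 2026-06-01T09:30:00+00:00",
        }
    return {
        "scheduled": True,
        "to": user_email_address,
        "channel": notification_channel,
        "event_time": event_time.isoformat(),
        "characters": len(message_body_plain_text),
    }
```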

4. Write descriptions that answer four questions

The description is what the agent reads when deciding whether to call your tool. It needs to cover:

  • What does this do?
  • When should I call it — not just what, but when?
  • What comes back?
  • What are the failure cases?

# Answers nothing useful
description: "Sends a notification to a user"

# Answers all four
description: >
  Send a notification to a specific user by email address.
  Use this when a task requires alerting a user about a change or completed action.
  Returns {success: true, delivered_at: ISO8601} on success.
  Fails with user_not_found if the email does not exist. If that happens,
  call list_users() first to confirm the address before retrying.

The "when should I call it" part is the one people skip. It's the most important part.
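
If you want to make the four questions hard to skip, one option is to force them into separate fields and only then join them. This is a sketch of that idea, not how any framework does it; `ToolDescription` and its field names are made up for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToolDescription:
    what_it_does: str
    when_to_call: str
    what_comes_back: str
    failure_cases: str

    def render(self) -> str:
        """Join the four answers, refusing to render if any is blank."""
        parts = []
        for field in fields(self):
            text = getattr(self, field.name).strip()
            if not text:
                raise ValueError(f"description is missing: {field.name}")
            parts.append(text)
        return "\n".join(parts)


description = ToolDescription(
    what_it_does="Send a notification to a specific user by email address.",
    when_to_call="Use this when a task requires alerting a user about a change or completed action.",
    what_comes_back="Returns {success: true, delivered_at: ISO8601} on success.",
    failure_cases="Fails with user_not_found if the email does not exist.",
).render()
```

A blank `when_to_call` raises before the tool is ever registered, which is earlier than finding out from an agent that never calls it.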

5. Make mutations idempotent — agents retry constantly

WildToolBench (a 2026 benchmark across 57 models) found 15–30% of all tool calls are retries. If your tool creates a record on every call, you'll end up with duplicates. Every time.

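import hashlib
from typing import Optional

# `cache` is any key-value store with get/set and TTL support (assumed, not shown).
def send_email(to: str, subject: str, body: str, idempotency_key: Optional[str] = None):
    key = idempotency_key or hashlib.sha256(f"{to}\n{subject}\n{body}".encode()).hexdigest()
    if cache.get(key):
        return {"status": "already_sent", "skipped": True}
    # ... actually send it
    cache.set(key, True, ttl=86400)  # remember the send for 24 hours
    return {"status": "sent", "skipped": False}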

Hash the inputs. Cache the result. Skip on duplicate. Three lines that save a lot of cleanup.
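
To see the retry behavior end to end, here is a self-contained sketch with a tiny in-memory cache. `TTLCache`, `send_email_once`, and `sent_log` are hypothetical names for the demo; a real setup would more likely use Redis or a database table. This version hashes the body as well as the recipient and subject, which is my assumption about what should count as a duplicate.

```python
import hashlib
import time


class TTLCache:
    """Tiny in-memory cache with per-key expiry; a stand-in for a real store."""

    def __init__(self) -> None:
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._entries[key] = (value, time.monotonic() + ttl)


cache = TTLCache()
sent_log = []  # records real sends so the effect of a retry is visible


def send_email_once(to, subject, body, idempotency_key=None):
    key = idempotency_key or hashlib.sha256(
        f"{to}\n{subject}\n{body}".encode()
    ).hexdigest()
    if cache.get(key):
        return {"status": "already_sent", "skipped": True}
    sent_log.append((to, subject))  # placeholder for the real send
    cache.set(key, True, ttl=86400)
    return {"status": "sent", "skipped": False}
```

Calling `send_email_once` twice with the same arguments appends to `sent_log` once; the second call returns the `already_sent` response instead of sending again.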


How I actually write skills for Hermes

My setup uses Markdown files. Each SKILL.md has a frontmatter block, plain-English numbered steps, and a ## Pitfalls section. The agent loads it on demand — only when the description matches the task at hand.

---
name: deploy-service
description: "Deploy a systemd service. Use when the user asks to start,
  stop, restart, or update a long-running process on the VM."
version: 1.2.0
---

# Deploy a service

1. Check if service exists: `systemctl --user status {name}.service`
2. If updating: pull latest code, run tests, then restart
3. After restart: wait 5s, check logs for errors
4. Report back: service name, current status, last 3 log lines

## Pitfalls
- Never use `systemctl` without the `--user` flag on this VM; it will fail silently
- If restart fails, check port conflicts first: `ss -tlnp | grep {port}`

The Pitfalls section is the most valuable part. Every entry in it came from something that actually went wrong. It's where the real knowledge lives.

[Diagram: task arrives → agent scans skill list → loads SKILL.md → executes → verifies.]

The key thing: the agent loads the skill at decision time, not at startup. Skills are lazy. This means the description has to be precise enough to trigger on the right tasks — not so broad that it fires constantly, not so narrow that it never fires.
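
The lazy part can be sketched in a few lines of Python. To be clear about the assumptions: this is not how Hermes is implemented, a real agent would let the model pick the skill rather than count shared keywords, and `SkillEntry`, `pick_skill`, `run_task`, and the stopword list are all invented for the illustration. What it does show is the shape: only names and descriptions are held up front, and the body is read after a match.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Words too common to signal a match; a hypothetical, hand-picked list.
STOPWORDS = {"a", "an", "the", "to", "on", "or", "for", "by", "of",
             "use", "when", "user", "asks"}


@dataclass
class SkillEntry:
    name: str
    description: str
    load_body: Callable[[], str]  # invoked only after the skill is chosen


def _keywords(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def pick_skill(task: str, skills: List[SkillEntry]) -> Optional[SkillEntry]:
    """Return the skill whose description shares the most keywords with the task."""
    task_words = _keywords(task)
    best, best_score = None, 0
    for skill in skills:
        score = len(task_words & _keywords(skill.description))
        if score > best_score:
            best, best_score = skill, score
    return best


def run_task(task: str, skills: List[SkillEntry]) -> Optional[str]:
    """Scan descriptions first; read the full skill body only on a match."""
    skill = pick_skill(task, skills)
    return skill.load_body() if skill else None
```

Even in this toy version the trade-off is visible: a description stuffed with generic words matches everything, and one with too few concrete words matches nothing.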


The uncomfortable benchmark result

WildToolBench tested 57 models across realistic multi-turn tool-use scenarios. The finding: no model breaks 15% session accuracy. Not GPT-4o. Not Claude 4.

The bottleneck isn't the framework. It isn't even model capability in isolation.

It's tool design.

Models that were fine-tuned specifically for tool use performed worse than general-purpose models on novel tool combinations. Specialization creates brittleness. The best results came from general models paired with well-designed tools.



Where to actually start

  1. Pre-built integrations: Composio. 1,000+ services, OAuth handled. Don't re-implement auth.
  2. Cross-platform protocol: MCP. Write once, works in Claude, ChatGPT, Cursor, everything.
  3. Code execution: E2B. Sandboxed Python/JS for agents that run code. Not optional.
  4. Design inspiration: smolagents. Tools as plain Python functions. No schema tax. Worth an hour of your time.

The pattern that works: fewer tools, flatter calls, richer descriptions, recoverable errors.

Every abstraction you add is a step the agent can fail on. Every vague parameter is a UUID the agent doesn't have. Every terse error message is an instruction to retry the exact same wrong thing.

Write for the honest reader that follows specs literally and can't ask follow-up questions.

Once you do that, the tools actually start working.
