Ponytail: The Best Code Is the Code Your Agent Never Writes

June 27, 2026 · Shingo Nakamura · AI

Ask a coding agent for a date picker and you can watch it talk itself into work nobody requested: install flatpickr, write a wrapper component, add a stylesheet, open a discussion about timezones. Four hundred lines later you have a date input the browser has shipped natively since 2014. The agent was trying to be helpful, and “helpful” came out as a dependency, an abstraction, and a config layer for a problem that didn’t have one.

Ponytail is a small open-source skill built to stop exactly that. It injects a “lazy senior developer” ruleset into the agent’s context at session start — the archetype with the long ponytail and oval glasses who has been at the company longer than version control, looks at your fifty lines, says nothing, and replaces them with one. The README’s own one-liner: “He says nothing. He writes one line. It works.”

This post covers what Ponytail is, how its decision ladder works, how to install it across the many agents it supports, what independent reviewers actually found when they tested it, and the real before/after numbers from a benchmark that got more honest because a critic forced it to. It also covers where it falls short — because a skill about cutting the unnecessary doesn’t get to skip its own caveats.

What it is

Ponytail is a portable ruleset — roughly a 100-line Markdown skill plus per-agent adapter plumbing — not a model, a wrapper, or a tool that runs your code. It changes the behaviour of the agent writing the code: before producing anything, the agent is told to question whether the work needs to exist, whether the standard library or the platform already covers it, and only then to write the minimum that works. It’s MIT-licensed, created by DietrichGebert, and at the time of writing sits around 60k GitHub stars after a steep, viral climb.

It is aimed at people who live in a coding agent daily — Claude Code, Codex, Copilot CLI, Cursor, Gemini CLI and others — and who are tired of reviewing AI-generated bloat. The substance is old and uncontroversial: it’s the YAGNI principle from the 1990s, packaged as an always-on reflex for an agent instead of a habit you have to remember to enforce.

Why it matters

The pitch is not “make the agent write worse code.” It’s “stop the agent padding the solution.” The benefits, as the project frames them:

  • Less code to own. The biggest, most defensible win is fewer lines committed — and the lines you never wrote can’t carry bugs, CVEs, or maintenance.
  • Lower token cost, as a side effect. Fewer output lines mean fewer output tokens, and less code in context means fewer input tokens on later turns. On models that follow the rules, that compounds into measurable cost and latency savings. (This is a side effect, not the headline — see the benchmark caveats below.)
  • Safety is explicitly protected. “Lazy” is scoped to mean efficient, not negligent: trust-boundary validation, error handling that prevents data loss, security, and accessibility are declared off-limits for deletion. That distinction is what separates it from a blunt “write one-liners” instruction.
  • Portability. The same ruleset ships in many forms (hooks, skills, plugins, plain rules files), so you aren’t locked to one agent.

How it works

Before writing code, the agent climbs a ladder and stops at the first rung that holds:

1. Does this need to exist?   → no: skip it (YAGNI)
2. Already in this codebase?  → reuse it, don't rewrite
3. Stdlib does it?            → use it
4. Native platform feature?   → use it
5. Installed dependency?      → use it
6. One line?                  → one line
7. Only then: the minimum that works
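Read literally, the ladder is a first-match lookup. The sketch below is only an illustration of that reading: the real skill is prose guidance interpreted by a model, not code, and the names `LADDER`, `climb` and the fact keys are hypothetical. The `facts` dictionary stands in for what the agent has learned by reading the task and the code it touches.

```python
# Hypothetical sketch of the ladder as a first-match lookup.
# Rung names and wording follow the list above; everything else is assumed.
LADDER = [
    ("does_not_need_to_exist", "skip it (YAGNI)"),
    ("already_in_codebase", "reuse it, don't rewrite"),
    ("stdlib_covers_it", "use the stdlib"),
    ("native_platform_feature", "use the native feature"),
    ("installed_dependency", "use the installed dependency"),
    ("fits_in_one_line", "write one line"),
]
FALLBACK = "write the minimum that works"


def climb(facts: dict[str, bool]) -> str:
    """Return the action for the first rung whose condition holds."""
    for condition, action in LADDER:
        if facts.get(condition, False):
            return action
    # No rung applied, so rung 7 is all that is left.
    return FALLBACK


# A date picker in a repo with no component library: the browser covers it.
print(climb({"native_platform_feature": True}))
```

The useful property is that the function cannot answer until `facts` has been filled in, which mirrors the skill's insistence on reading before climbing.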

The crucial detail, and the one that decides whether it helps or hurts, is that the ladder runs after the agent understands the problem, not instead of reading the code. The skill’s instruction is to read the task and the code it touches, trace the real flow, and only then climb. Lazy about the solution, never about reading. Rung order matters too: an already-installed dependency (rung 5) is meant to win over a native element (rung 4), so in a project with a real design system the agent should reach for your component, not a bare <input>, but only if it actually reads package.json first.

Delivery differs per host: Claude Code and Codex use two small Node.js lifecycle hooks (a SessionStart hook injects the ruleset every session); OpenCode appends it to the system prompt each turn; Gemini loads it as always-on context; Cursor/Windsurf/Cline/Kiro use plain rules files you copy in. There’s also a standalone ponytail-mcp server. Every adapter reads from one canonical SKILL.md, and a CI check fails if any copy drifts.
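A drift check of the kind described above can be very small. This is a sketch under a strong assumption, namely that each adapter copy is meant to be byte-identical to the canonical file; the project's real CI check may compare something looser, and `digest` and `drifted_copies` are hypothetical names.

```python
import hashlib
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def drifted_copies(canonical: Path, copies: list[Path]) -> list[Path]:
    """Return every copy whose content differs from the canonical file."""
    expected = digest(canonical)
    return [copy for copy in copies if digest(copy) != expected]
```

In CI the script would exit non-zero whenever `drifted_copies` returns a non-empty list, which is what turns "one canonical file" from a convention into a guarantee.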

Getting started

On Claude Code it’s a two-step plugin install (sent as two separate prompts):

/plugin marketplace add DietrichGebert/ponytail
/plugin install ponytail@ponytail

It’s then active every session at the full level by default. Three intensity levels exist — lite (suggests the lazier path, lets you decide), full (enforces the ladder), and ultra (YAGNI extremist, “for when the codebase has wronged you personally”) — plus off. Set a default with the PONYTAIL_DEFAULT_MODE env var or ~/.config/ponytail/config.json. The plugin also adds commands: /ponytail-review (scan the current diff for over-engineering and hand back a delete-list), /ponytail-audit (the whole repo), and /ponytail-debt (track the shortcuts you deferred).
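To make the levels concrete, here is a minimal sketch of how a default mode might be resolved from the environment. Only the variable name, the four level names and the full default come from the description above. The case-insensitive matching and the fallback to full on an unrecognised value are assumptions, `resolve_mode` is a hypothetical name, and the config file is left out because its format is not described here.

```python
import os

MODES = ("off", "lite", "full", "ultra")
DEFAULT_MODE = "full"


def resolve_mode(env=None) -> str:
    """Pick the intensity level from PONYTAIL_DEFAULT_MODE, else the default."""
    env = os.environ if env is None else env
    requested = env.get("PONYTAIL_DEFAULT_MODE", "").strip().lower()
    # Assumption: an unknown or empty value falls back to the default level.
    return requested if requested in MODES else DEFAULT_MODE


print(resolve_mode({"PONYTAIL_DEFAULT_MODE": "lite"}))
```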

The desktop app has no /plugin command — you add it through the UI (Customize → add a plugin from the repository). Codex, Copilot CLI, Gemini/Antigravity CLI, OpenCode, Pi, Devin, OpenClaw and several editors each have their own one-liner or rules-file path; the repo’s agent-portability doc maps which file goes where. A practical note from one reviewer’s install: the package pulls in no external dependencies and audits clean, which is consistent with its own philosophy.

In practice

The canonical example is the date picker: with Ponytail the agent emits <input type="date"> and a one-line comment instead of a library wrapper — 404 lines down to 23 in the project’s own run, and a color picker from 287 to 23, because it reaches for native <input> types instead of components.

Independent hands-on testing backs the basic effect. Security/DevOps reviewer Mehdi Rahmani ran it through an OpenAI-compatible proxy on a real task — “build an accessible React date picker” — and watched the output drop from 31 lines to 15, with the agent choosing the native, accessible element rather than a calendar library. He also ran part of the repo’s suite (19 integration tests passing) and confirmed the hooks aren’t decorative. His framing of the real value: Ponytail “formalizes a hygiene that many teams ask for without ever writing it down” — delete before adding, prefer native, reject speculative abstractions.
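It helps to turn those line counts into percentages before comparing them with any headline figure. The helper below is plain arithmetic on the numbers quoted above; `reduction_pct` is a name made up for this post.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of lines removed, rounded to one decimal place."""
    if before <= 0:
        raise ValueError("before must be a positive line count")
    return round(100 * (before - after) / before, 1)


print(reduction_pct(404, 23))  # date picker, project's own run: 94.3
print(reduction_pct(287, 23))  # color picker, project's own run: 92.0
print(reduction_pct(31, 15))   # Rahmani's date picker test: 51.6
```

The two project runs land above 90%, while the independent test lands near 52%. They are different tasks measured in different setups, so treat the spread as a reminder that the reduction depends heavily on how much the baseline over-builds.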

Where it gets interesting is the case the benchmark didn’t cover. As Yash Desai’s DEV writeup and a Hacker News thread both flagged: if your project already ships shadcn/ui or another component library, the right answer is your styled <DatePicker /> (rung 5), not the native <input> (rung 4) — and the public benchmark was run on a repo with no component library, so the “native wins” result was correct there but says nothing about a mature design system. Whether Ponytail makes the right call then depends entirely on the agent actually reading your installed dependencies before it climbs, which is probabilistic, not guaranteed. The practical fix is to pair it with an explicit project rule in your AGENTS.md naming your design system.

How it compares

The benchmark’s most useful feature is that it pits Ponytail against the two obvious “do I even need this?” alternatives: caveman (a terse-prose skill — the control for “is this just brevity?”) and a seven-word prompt, “Follow YAGNI principles, and prefer one-liner solutions” (the control for “does a whole skill beat a one-liner?”). Measured against the same no-skill agentic baseline (Haiku 4.5):

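| vs no-skill baseline | LOC | tokens | cost | time | safe |
|---|---|---|---|---|---|
| Ponytail | −54% | −22% | −20% | −27% | 100% |
| caveman (terse-prose control) | −20% | +7% | +3% | +2% | 100% |
| “YAGNI + one-liners” prompt | −33% | −14% | −21% | −30% | 95% |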

Be fair to the alternatives: the seven-word prompt is genuinely close on code, cost and time — that’s the whole point of the comparison, and it’s why the safety column matters. Ponytail is the only arm that cuts every metric and stays at 100% on the adversarial safety tier; the bare one-liner prompt dropped a guard (a path-traversal check) in one case, scoring 95%. caveman makes prose terse but builds normally, so it barely touches code and slightly raises tokens. The honest read: Ponytail’s edge over “just prompt YAGNI” is modest on size and real on keeping the guards in.
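For readers who have not written one, this is roughly what a path-traversal guard looks like. It is a generic illustration, not the benchmark's code (which this post does not show), and `safe_join` is a hypothetical name. It needs Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path


def safe_join(base_dir: str, user_path: str) -> Path:
    """Resolve user_path inside base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # The guard: without this check, "../../etc/passwd" walks out of base_dir.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate
```

The one-line version is `Path(base_dir) / user_path`. It is shorter, it passes a happy-path test, and it is exactly the kind of cut that the safety column exists to catch.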

Performance and benchmarks

The numbers above are worth trusting partly because they were rebuilt under fire. The original Ponytail benchmark was single-shot — one prompt, one completion, count the lines — and reported 80–94% less code. Colin Eberhardt, CTO of Scott Logic, took it apart fairly: the no-skill baseline was a chatty model padding answers with prose and multiple options, so “lines of the answer” counted commentary, not code — which inflated the gap. He then showed his seven-word YAGNI prompt beat Ponytail on its own benchmark, and argued the project was “little more than YAGNI” riding a hype wave.

To the author’s credit, the response was to fix it rather than argue. The benchmark was rebuilt as a real agentic test: a headless Claude Code session (claude -p, the actual product, not a bare API model) editing tiangolo/full-stack-fastapi-template — a real FastAPI + React repo — scored on the git diff it leaves, across 12 feature tickets, n=4, with each arm isolated in its own process. They also caught and disclosed a contamination bug in an earlier agentic run where the SessionStart hook had fired on the baseline arm too (so the baseline was secretly running Ponytail), fixed it, and published anyway. Colin acknowledged the correction publicly. That self-correction is the reason to trust the rest.
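Scoring on the diff rather than on the chat transcript is what removes the prose-padding problem. As a rough sketch of the idea, the function below totals added lines from the output of `git diff --numstat`. The exact metric the project uses is not spelled out here, so counting added lines is an assumption, and both `added_lines` and the sample file paths are made up for illustration.

```python
def added_lines(numstat: str) -> int:
    """Sum the 'added' column of `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)  # added, deleted, path
        if len(parts) != 3:
            continue
        added = parts[0]
        # Binary files report "-" instead of a count, so skip them.
        if added.isdigit():
            total += int(added)
    return total


# Illustrative input only; these paths and counts are not benchmark data.
sample = (
    "12\t3\tbackend/app/api/routes/items.py\n"
    "-\t-\tfrontend/public/logo.png\n"
    "8\t0\tfrontend/src/components/ItemDate.tsx\n"
)
print(added_lines(sample))  # 20
```

Because the input is whatever the agent left in the working tree, commentary in the agent's reply never reaches the score.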

Three honest caveats the project and its reviewers are upfront about:

  • The flashy 80–94% is superseded. Against a fair agentic baseline that’s the per-task ceiling (where an agent badly over-builds, like the date picker), not the average. The defensible average is ~54% less code.
  • Token savings are not a law. The rule was never “fewest tokens” — it’s “write only what’s needed.” Lower cost and latency follow on models that obey the ladder, but the project notes a terse reasoning model that burns thinking tokens deliberating the rungs can go the other way (it cites GPT-5.5 doing exactly that). Don’t promise yourself −22% on every model.
  • It needs a model that follows instructions. Run on a small local model (llama3.2 3B), the result was noise — one run under baseline, the next 50% over. The project published that too. Ponytail is tuned for frontier models.

You can reproduce the single-shot run with npx promptfoo eval -c benchmarks/promptfooconfig.yaml, and the full agentic method, per-task tables and limitations are in the writeup.

Ponytail vs caveman

It’s tempting to lump Ponytail and caveman together — both are one-file skills, both promise “less,” both turn up in the same token-efficiency conversation. But they fix different problems, and seeing the split tells you which one (or both) you actually want.

The core difference: caveman trims the words, Ponytail trims the work. caveman is a communication-style skill — it makes the agent answer like a caveman, dropping articles, filler and pleasantries, while leaving the code it writes untouched. Ponytail is a solution-design skill — it makes the agent choose the simplest implementation (reuse, stdlib, a native <input>, one line), while leaving how it talks untouched. caveman makes the mouth smaller; Ponytail makes the diff smaller.

| Dimension | Ponytail | caveman |
|---|---|---|
| What it changes | the code that ships (the solution) | the agent’s prose (the messages) |
| Mechanism | a YAGNI decision ladder applied before coding | a terseness instruction that strips filler |
| Primary metric | lines of code / diff size | output tokens |
| Token effect | ~−22%, as a side effect of less code | the whole point: ~65–75% on explanation-heavy tasks |
| On a coding diff | big wins where the agent over-builds | little to none; it builds normally, can even add tokens |
| Safety | never cuts validation, security, accessibility | code unchanged, so safety is unaffected |
| Best on | feature/coding work that invites over-engineering | chat, explanations, code reviews, Q&A |

That table also resolves the apparent contradiction in Ponytail’s own benchmark, where caveman posted +7% tokens: that test scored a coding diff, and caveman doesn’t shorten code — its terse prose has little to bite on there, and the steering overhead can nudge tokens up. Put caveman on an explanation-heavy task instead (its home turf) and it cuts 65–75% of output tokens. Neither figure is wrong; they measure different jobs.

When to reach for caveman: the agent is verbose — long explanations, padded reviews, chatty Q&A — and you want lower output-token cost and answers you can read at a glance. It shines on explanation-heavy work and barely touches the code.

When to reach for Ponytail: the agent over-builds — pulls in a dependency, writes a wrapper, adds an abstraction for a one-line problem. You want a smaller diff, less to review and maintain, and the cost/latency drop that follows from writing less. It shines on feature work and does nothing on code that’s already minimal.

The honest answer is usually “both.” They don’t overlap (one shrinks what the agent says, the other shrinks what it builds), so running them together compounds: caveman keeps the chatter short while Ponytail keeps the code small. The one caveat is practical: on a terse reasoning model that spends thinking tokens deliberating Ponytail’s ladder, total tokens can rise (the project cites GPT-5.5), so pair them deliberately and measure on your own model instead of trusting either headline number.

Tradeoffs

The honest cons:

  • It’s an instruction layer, not magic. Mehdi Rahmani’s verdict is the one to keep in mind: “Ponytail is still an instruction layer” — if your agent ignores context, your host doesn’t load skills, or your model has poor code discipline, it won’t fix that. It raises the probability of simplicity; it doesn’t replace a technical review.
  • The design-system blind spot. In a mature codebase with a component library, the correctness of rung 5 over rung 4 hinges on the agent reading your dependencies. Untested in the public benchmark; mitigate with an explicit project rule.
  • ultra can over-cut. Pushed hard, the laziness can argue away an abstraction you actually need. The levels exist precisely so you take back control; you have to know when.
  • The “thin repo” criticism is real. Skeptics on Hacker News called it a giant repo for a ~100-line prompt and asked, only half-joking, whether it’s “the new leftpad.” Fair: the logic is small and the rest is multi-agent plumbing. Whether that’s elegant or overblown is a judgment call — and a slightly ironic one for a project about doing less.

Takeaway

Ponytail is a low-risk experiment with a clear job: bias your agent toward the simplest thing that works, with the safety guards nailed down so “simple” doesn’t become “unsafe.” Reach for it if you live in Claude Code, Codex, OpenCode or Gemini CLI, your agents over-build, and you’re on per-token billing — especially on greenfield or backend work. Treat it with more care in a codebase with an established design system, pair it with explicit project rules, and don’t expect the headline 80–94% or guaranteed token savings on every model. The one thing to remember: it shifts probabilities, it doesn’t replace your review — and the best line of code it produces is still the one it talked your agent out of writing.