Ralph: Put a Coding Agent in a While Loop and Walk Away

The cheapest agentic technique anyone has shipped this year is also the dumbest, and that is the point. Ralph is a Bash loop: feed a coding agent the same prompt file, over and over, and let it grind on your task until it declares itself done. No orchestration framework, no multi-agent mesh, no clever planner. As Geoffrey Huntley, who coined it, puts it: “Ralph is a Bash loop.”

It is named after Ralph Wiggum from The Simpsons — not because the output is stupid, but because the strategy is: persist through failure, try again, don’t overthink. Huntley’s own framing is that the technique is “deterministically bad in an undeterministic world” — the failures are predictable, and predictable failures can be engineered around.

This post covers what Ralph is, how the loop actually works, and the two implementations most people will reach for (Anthropic’s ralph-wiggum plugin and snarktank/ralph), including where they diverge in a way that matters. Because the marketing around Ralph leans hard on a few screenshots, it ends with a critical review that separates the technique from the anecdotes.

What it is

Ralph is a methodology, not a product: run an autonomous AI coding agent in a loop where each pass gets the same instructions and works on the codebase, and stop only when a completion condition is met or an iteration cap is hit. The memory between passes lives on disk rather than in the model’s context: the files the agent already wrote, the git history, and a running notes/plan file. The agent re-reads its own past work each loop and pushes the project one step further.

It is aimed at greenfield projects with automatic verification — tasks where a test suite, type checker, or linter can tell the agent (and you) whether the last pass made things better or worse. It is explicitly not aimed at production debugging or design work that needs human judgment.

Why it matters

The appeal is that it collapses a lot of agent tooling into something you can reason about in one sentence:

It removes orchestration complexity. No agent-to-agent protocol, no scheduler service. Huntley’s argument is direct: non-deterministic agents wired together like microservices are “a red hot mess.” Ralph is monolithic — one process, one task per loop.
It turns “walk away” into a real workflow. With clear completion criteria and verification, you can start a loop and come back to a finished (or honestly-blocked) task.
It makes failure cheap and informative. Each broken pass is data: you read what went wrong and add a “sign” to the prompt so the next pass avoids it. The work is in tuning the prompt, not babysitting each step.
The cost can be absurdly low when it works. Tokens are the only input; for well-scoped greenfield work the wins can be large (see the benchmarks section — and its caveats).

How it works

The core is a feedback loop with three moving parts: a fixed prompt, a verification gate, and persistent on-disk memory. In its purest form it’s a shell one-liner Huntley published:

while :; do cat PROMPT.md | claude-code ; done

Every iteration the agent reads the prompt, does one thing, runs the verification (tests/typecheck), and writes its progress to disk. The discipline that makes it work, per Huntley:

One item per loop. You have roughly 170k tokens of usable context; spend as little as possible. Big tasks overflow context and produce garbage, so each pass tackles a single right-sized story.
Back-pressure is mandatory. The loop is only as good as what can reject bad output — a type system, a test run, a static analyser, CI staying green. Without it, broken code compounds across iterations.
Notes for future loops. Because each pass may start fresh, the agent leaves itself breadcrumbs — a fix_plan.md/progress.txt, and updates to AGENTS.md so the next pass inherits discovered conventions and gotchas.
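The three rules above can be sketched as a slightly more defensive version of the one-liner. This is a minimal sketch under stated assumptions, not Huntley's script: `ralph_loop`, its argument order, and the `progress.txt` default are illustrative names, and the agent, verification, and completion commands are placeholders you would replace with your own.

```shell
# Hypothetical sketch of a capped Ralph loop (all names are illustrative).
# Usage: ralph_loop MAX AGENT_CMD VERIFY_CMD DONE_CMD [PROMPT_FILE] [NOTES_FILE]
ralph_loop() {
  local max="$1"
  local agent_cmd="$2"
  local verify_cmd="$3"
  local done_cmd="$4"
  local prompt_file="${5:-PROMPT.md}"
  local notes_file="${6:-progress.txt}"
  local i=1

  while [ "$i" -le "$max" ]; do
    # One item per loop: a fresh process gets the same prompt on stdin.
    sh -c "$agent_cmd" < "$prompt_file" || true

    # Back-pressure: the checks decide whether this pass is acceptable.
    if sh -c "$verify_cmd"; then
      # Notes for future loops: the result is written to disk.
      echo "iteration $i: checks passed" >> "$notes_file"
      # Completion only counts when the checks are green.
      if sh -c "$done_cmd"; then
        return 0
      fi
    else
      echo "iteration $i: checks FAILED" >> "$notes_file"
    fi

    i=$((i + 1))
  done

  # Iteration cap reached without meeting the completion condition.
  return 1
}
```

A call might look like `ralph_loop 20 'your-agent-command' 'npm test' 'test -f DONE'`, where all three command strings are stand-ins for whatever agent CLI, test runner, and completion signal your project actually uses.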

The two popular implementations realise this differently, and the difference is the single most important thing to understand before picking one (see How it compares).

Getting started

The fastest on-ramp inside Claude Code is Anthropic’s plugin. You run one command to start the loop in your current session; a Stop hook then blocks the agent from exiting and re-feeds the prompt until the completion phrase appears or the iteration cap is hit:

/ralph-loop "Build a REST API for todos. CRUD, input validation, tests with >80% coverage, README. Output <promise>COMPLETE</promise> when done." --completion-promise "COMPLETE" --max-iterations 50

--max-iterations is your safety net against infinite loops on impossible tasks; the plugin docs recommend always setting it. /cancel-ralph stops an active loop.
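If you write your own loop instead of using the plugin, the two stop conditions described above (completion phrase seen, or cap reached) reduce to a small check. The sketch below illustrates that logic only; it is not the plugin's Stop hook, and `should_stop` and its output-file argument are hypothetical.

```shell
# Hypothetical helper, not part of the ralph-wiggum plugin.
# Usage: should_stop ITERATION MAX_ITERATIONS OUTPUT_FILE PHRASE
# Exit status 0 means "stop the loop"; 1 means "feed the prompt again".
should_stop() {
  local iteration="$1"
  local max_iterations="$2"
  local output_file="$3"
  local phrase="$4"

  # Safety net: never run past the cap, even on an impossible task.
  if [ "$iteration" -ge "$max_iterations" ]; then
    return 0
  fi

  # Completion promise: look for the literal tag in the agent's output.
  # -F matches the phrase as a fixed string rather than a regex.
  if grep -qF "<promise>${phrase}</promise>" "$output_file"; then
    return 0
  fi

  return 1
}
```

One design note: scan only what the agent wrote, not the prompt. The prompt itself contains the promise tag, so a transcript that echoes the prompt would trigger a false stop.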

snarktank/ralph is the other route — an external bash loop that spawns a fresh agent each iteration. Note the two pieces are installed separately: the marketplace gives you the spec skills, but the loop runner (ralph.sh) is a file you copy from a clone of the repo.

# 1. get the loop script into your repo (from a clone of snarktank/ralph)
mkdir -p scripts/ralph
cp /path/to/ralph/ralph.sh scripts/ralph/
chmod +x scripts/ralph/ralph.sh

# 2. (optional) inside a Claude Code session, add the PRD/ralph spec skills via the marketplace
/plugin marketplace add snarktank/ralph
/plugin install ralph-skills@ralph-marketplace

# 3. after generating prd.json, run the loop with Claude Code for up to 20 iterations (the default is 10)
./scripts/ralph/ralph.sh --tool claude 20

In practice

A realistic snarktank/ralph workflow is spec-first. You don’t hand it a vague goal — you have a conversation that produces a Product Requirements Document, then convert that to prd.json, a list of small user stories each with a passes: false flag. The loop then, per the project’s README: creates a feature branch, picks the highest-priority unfinished story, implements that single story, runs typecheck and tests, commits only if they pass, flips the story to passes: true, appends what it learned to progress.txt, and repeats until every story passes or it runs out of iterations.
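To make the `passes` flag concrete, here is a rough sketch of how a wrapper script might tell whether any stories remain. It is not taken from ralph.sh: the `remaining_stories` and `write_sample_prd` helpers are hypothetical, the sample file's `userStories`, `id`, `title`, and `priority` names are illustrative rather than a confirmed schema, and the grep-based count assumes a pretty-printed file with one `passes` field per line. A real JSON parser would be more robust.

```shell
# Hypothetical helper: count stories still marked "passes": false.
# Assumes pretty-printed JSON with one "passes" field per line.
remaining_stories() {
  local prd_file="$1"
  # grep -c prints 0 but exits non-zero on no match, so mask the status.
  grep -cE '"passes"[[:space:]]*:[[:space:]]*false' "$prd_file" || true
}

# Illustrative prd.json. Only the passes flag comes from the workflow
# described above; every other field name here is an assumption.
write_sample_prd() {
  cat > "$1" <<'EOF'
{
  "userStories": [
    {
      "id": "US-001",
      "title": "Add a database column and migration",
      "priority": 1,
      "passes": true
    },
    {
      "id": "US-002",
      "title": "Add a filter dropdown to a list",
      "priority": 2,
      "passes": false
    }
  ]
}
EOF
}
```

A loop could call `remaining_stories prd.json` after each pass and exit early once it prints 0, instead of burning the rest of its iteration budget.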

The right granularity is the whole game. “Add a database column and migration” or “add a filter dropdown to a list” is a good Ralph story. “Build the entire dashboard” or “add authentication” is not — it overflows one context window and the pass produces poor code. The skill you’re actually exercising is decomposition and writing verifiable acceptance criteria, not prompting tricks.

How it compares

Both popular implementations descend from Huntley’s technique, but they make opposite choices about context — and that choice is the thing to get right.

| Dimension | `ralph-wiggum` (Anthropic plugin) | `snarktank/ralph` (Ryan Carson) | Pure Huntley loop |
| --- | --- | --- | --- |
| Loop mechanism | Stop hook inside one live session | External bash loop, new process per pass | External bash loop |
| Context per iteration | One live session (not reset per loop) | Fresh each pass (clean context) | Fresh each pass |
| Memory between passes | Files + git in the running session | Git history + `progress.txt` + `prd.json` | Files + git |
| Task structuring | Free-form prompt + completion promise | Spec → PRD → `prd.json` stories | Free-form prompt + `fix_plan.md` |
| Agents supported | Claude Code | Claude Code or Amp | Any uncapped agent |
| Setup cost | One slash command | Copy scripts, generate a spec | Write your own loop |

The honest divergence: the official plugin keeps the loop inside one continuous session rather than spawning a fresh process each pass. That cuts against the clean-context-per-loop principle the rest of Huntley’s writing insists on (“the more you use the context window, the worse the outcomes”) — even granting that Claude Code’s auto-compaction will eventually trim a long session, a re-fed prompt in a running session is not the same as the fresh instance Huntley describes. He himself flags, in a video linked from his Ralph article, that the Claude Code plugin “isn’t it.” He doesn’t spell out the objection in text, so I won’t put words in his mouth — but the in-session design is the clearest architectural reason the plugin diverges from the technique as he describes it. If you want Ralph the way its author means it, the fresh-process loop (snarktank/ralph or your own bash one-liner) is closer.

Performance and benchmarks

There are no controlled benchmarks for Ralph — there’s no leaderboard, no reproducible suite. What exists is a handful of anecdotes its proponents circulate, and they should be read as anecdotes, not measurements:

“6 repos overnight” at a Y Combinator hackathon — this traces to a third-party field report (repomirror), not a metric Huntley measured himself. It’s a demo result, single occurrence.
A $50k contract delivered for $297 in API cost — sourced from an iMessage screenshot a contractor shared with Huntley, using Amp. One data point, self-reported, with no breakdown of failed runs or the engineer’s own time tuning it.
The “CURSED” programming language, built largely by Ralph over months — impressive as an existence proof, but it took continuous expert tuning and, by Huntley’s own account, the repo was “full of garbage, temporary files, and binaries” along the way.

The honest read: these show the ceiling is surprisingly high for greenfield work in skilled hands. None of them establish a median outcome, a failure rate, or a cost distribution that includes the runs that didn’t ship. Treat the numbers as “this is possible,” not “this is what you’ll get.”

Tradeoffs

Greenfield only — and the creator means it. Huntley: “There’s no way in heck would I use Ralph in an existing code base.” He pegs it at getting greenfield projects ~90% of the way. Most working engineers spend their days in brownfield codebases, where Ralph is the wrong tool.
No tests, no Ralph. The entire safety mechanism is your verification loop. Weak tests mean broken code compounds silently across iterations — the failure mode is worse than one bad commit.
Cost is unbounded by default. “All you need are tokens” cuts both ways: an unguided loop on a poorly-specified task burns money producing nothing. The iteration cap is a band-aid, not a budget.
It is not hands-off, despite the framing. The “walk away” pitch hides the operator skill required — writing specs, designing back-pressure, spotting bad patterns and adding “signs.” Huntley is explicit that senior expertise is non-negotiable: anyone claiming a tool does 100% without an engineer is “peddling horseshit.”
Non-determinism is the Achilles’ heel. The agent runs ripgrep, wrongly concludes something isn’t implemented, and builds a duplicate. Huntley names this directly as Ralph’s core weakness; it’s tunable, not eliminable.
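On the cost point above: an iteration cap bounds the number of passes, not the spend. One way to get closer to a real budget is to track cumulative cost and stop once it crosses a limit. The sketch below assumes you can obtain a per-pass dollar figure from your agent or billing data, which none of this post's sources specify; `add_cost` and `over_budget` are hypothetical helpers.

```shell
# Hypothetical budget helpers; how you obtain each pass's cost is an
# assumption and depends on your agent and billing setup.

# add_cost TOTAL PASS_COST -> prints the new running total (2 decimals).
add_cost() {
  awk -v total="$1" -v pass="$2" 'BEGIN { printf "%.2f\n", total + pass }'
}

# over_budget TOTAL BUDGET -> exit status 0 when TOTAL has reached BUDGET.
over_budget() {
  awk -v total="$1" -v budget="$2" 'BEGIN { exit !(total >= budget) }'
}
```

Inside a loop this would read something like `total=$(add_cost "$total" "$pass_cost")` followed by `over_budget "$total" 25.00 && break`, with the iteration cap kept as a second, independent limit.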

Takeaway

Use Ralph when you have a greenfield task, real automatic verification, and the willingness to write a good spec and tune the loop — then it can do a startling amount of work for very little money. Don’t use it on a production codebase, on tasks without a hard pass/fail signal, or expecting the headline anecdotes to be your average case. And if you want the technique as its author intends it, prefer the fresh-context loop over the in-session plugin. The one thing to remember: Ralph doesn’t make the agent smart — it makes persistence cheap, and shifts all the skill into your specs and your tests.

Critical review

Stripping the framing: Ralph is a real, useful idea wrapped in a marketing layer that’s easy to mistake for evidence.

The technique is sound and the honesty in Huntley’s own writing is its best feature — he tells you it’s greenfield-only, that it’s deterministically bad, that you’ll wake up to broken builds, that engineers are still required. That candor is more trustworthy than most tool pitches. But the numbers doing the persuading ($297 for $50k, six repos overnight) are screenshots and field reports with n≈1 and obvious survivorship bias: you hear about the contract that worked, not the loops that burned tokens and shipped nothing. Anyone citing those figures as proof of ROI is over-reading them.

The sharpest practical trap is the one most Claude Code users will walk into: the official ralph-wiggum plugin is the path of least resistance, and it’s the implementation that most contradicts the technique. Keeping the loop in one continuous session fights the fresh-context principle that makes Ralph work — and the technique’s own creator says the plugin “isn’t it.” If you try Ralph via the plugin, get mediocre results, and conclude “Ralph doesn’t work,” you may have just tested the weakest version of it.

Finally, the economics deserve scrutiny the boosters skip. “Cheap” is true only conditional on a good spec and strong tests; absent those, the loop is a token furnace. And the whole thing rests on senior judgment to write specs, design back-pressure, and recognize when Ralph is off the rails — which means it’s less “replace engineers” and more “a very fast junior who needs a very good lead.” My read: a legitimately good technique for a narrow slice of work (greenfield, well-specified, well-tested), badly served by hype that implies it generalizes. Worth learning. Not worth believing the screenshots.

Sources

Geoffrey Huntley — “Ralph Wiggum as a software engineer” (original technique, the while : loop, the $50k/$297 and CURSED anecdotes, the “greenfield only” and “engineers still needed” statements)
anthropics/claude-code — ralph-wiggum plugin README (Stop-hook mechanism, /ralph-loop, --completion-promise, --max-iterations, “real-world results”)
snarktank/ralph README (fresh-context-per-iteration design, PRD/prd.json workflow, Amp/Claude Code support, marketplace install)
repomirror field report (the YC hackathon “6 repos overnight” claim, via Huntley)