
What Is a Harness? The Scaffolding That Turns a Model Into an Agent

June 2, 2026 · Shingo Nakamura · AI

A large language model, on its own, does exactly one thing: it takes some text and predicts more text. It can’t open a file, run a command, search the web, or remember what it did thirty seconds ago. So how does a chat assistant edit your code, or an agent book a flight, or a benchmark run the same model across a thousand tasks?

The answer is the harness: the layer of code wrapped around the model that turns a text-prediction function into something that can actually do things. If you have ever used a chat assistant or a coding agent, you have used one. This post explains what a harness is in AI, the two common senses of the word, how it works under the hood, and when the term matters.

What it is

A harness is the software scaffolding around an AI model that connects it to the outside world and drives it in a loop. The model is the engine; the harness is the chassis, transmission, and controls that make it a drivable car. The same engine in different harnesses produces very different behavior.

The word shows up in two related senses. An agent harness is the runtime that lets a model use tools and work toward a goal over multiple steps — what powers a coding assistant or an autonomous agent. An evaluation harness (or “eval harness”) is the framework that runs a model over a standardized set of test tasks and scores the results. Both are “the stuff around the model,” just pointed at different jobs: getting work done versus measuring how well it works.

Why it matters

  • The model alone is inert. Every capability you associate with “AI agents” — reading files, running code, calling APIs, browsing — lives in the harness, not the model weights. No harness, no actions.
  • Same model, very different results. Drop the same frontier model into two harnesses and you can get wildly different quality, because the harness decides which tools exist, what goes into the context window, and how errors are handled. This is why leaderboard scores are really model + harness scores, not the model in isolation.
  • It’s where the engineering happens. Prompt construction, tool design, context management, retries, safety limits — the unglamorous plumbing that separates a flaky demo from a reliable agent is harness work.
  • It’s the part you can change. You usually can’t retrain the model, but you can swap or extend the harness: add a tool, change the loop, tighten permissions. Most practical “make the agent better” work is harness work.

How it works

At its core a harness runs an agent loop. The model never acts directly; it emits text saying which tool it wants to use, and the harness does the acting. The cycle:

  1. The harness builds a prompt — system instructions, the conversation so far, the list of available tools, and any prior tool results.
  2. The model responds, either with a final answer or a tool call (“read config.py”, “run pytest”).
  3. The harness executes that tool, captures the result, and feeds it back into the context.
  4. Repeat until the model produces a final answer or a limit is hit.

The shape that matters is the cycle — the model proposes, the harness acts, the result comes back, and round it goes until there’s an answer:

[Diagram: User request → Harness: build prompt + tools → Model: generate. A tool call goes to Harness: run tool, and its result goes back into the context for the next turn. A final answer goes to Return answer.]
The agent loop: the model never acts directly — it asks, the harness acts, and the result feeds the next turn.

In rough pseudocode, the whole idea fits in a few lines:

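def run_agent(model, system_prompt, user_request, available_tools, run_tool, max_steps=50):
    context = [system_prompt, user_request]
    for _ in range(max_steps):                  # step 4: stop when a limit is hit
        response = model.generate(context, tools=available_tools)
        if response.is_final():
            return response.text
        result = run_tool(response.tool_call)   # the harness does the real work
        context.append(response)
        context.append(result)
    raise RuntimeError("step limit reached without a final answer")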
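
The pseudocode leaves the model and the tools abstract. To watch the loop actually run without an API key, here is a minimal sketch that replaces the model with a scripted stand-in. Everything in it (`Response`, `ScriptedModel`, `run_agent`, the `word_count` tool, the default of 10 steps) is a hypothetical name chosen for illustration, not the API of any real harness.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    """One model turn: either final text or a request to call a tool."""
    text: str = ""
    tool_call: Optional[tuple[str, str]] = None  # (tool name, argument)

    def is_final(self) -> bool:
        return self.tool_call is None


class ScriptedModel:
    """Stand-in for a real model: replays a fixed list of responses."""

    def __init__(self, script: list[Response]):
        self._script = iter(script)

    def generate(self, context: list, tools: dict) -> Response:
        return next(self._script)


def run_agent(model, user_request: str,
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> str:
    context: list = ["You are a helpful agent.", user_request]
    for _ in range(max_steps):
        response = model.generate(context, tools=tools)
        if response.is_final():
            return response.text
        name, arg = response.tool_call
        result = tools[name](arg)        # the harness, not the model, acts
        context.append(response)         # remember what the model asked for
        context.append(result)           # and what came back
    raise RuntimeError("step limit reached without a final answer")


tools = {"word_count": lambda text: str(len(text.split()))}
model = ScriptedModel([
    Response(tool_call=("word_count", "the harness does the acting")),
    Response(text="The sentence has 5 words."),
])
answer = run_agent(model, "How many words are in this sentence?", tools)
print(answer)
```

The scripted model only ever returns data describing what it wants. The single line that calls `tools[name](arg)` is the only place anything is executed, and it belongs to the harness.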

Everything else a harness does is making that loop robust and useful: deciding what to keep in the context window (and compacting it when it overflows), defining the tools and parsing the model’s calls, enforcing permissions and step limits, retrying on failure, and tracking cost and tokens. An evaluation harness runs a similar loop but swaps the goal: instead of one open-ended task, it iterates over a dataset of test cases, runs the model on each, and scores the output against an expected answer or check.
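
One of those chores, keeping the context within a budget, can be sketched in a few lines. This is a simple truncation strategy, assumed here for illustration: keep the first message (the system prompt), drop the oldest turns, and leave a marker where they were. Real harnesses may instead ask the model to summarize the dropped turns. The word-count "tokenizer" and the names `approx_tokens` and `compact` are assumptions of this sketch.

```python
def approx_tokens(message: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(message.split())


def compact(context: list[str], budget: int) -> list[str]:
    """Keep the first message plus as many of the newest ones as fit.

    If the first message and the marker alone exceed the budget, they are
    still returned, so the result can be over budget in that edge case.
    """
    if sum(approx_tokens(m) for m in context) <= budget:
        return context
    head, rest = context[0], context[1:]
    marker = "[earlier messages omitted]"
    used = approx_tokens(head) + approx_tokens(marker)
    kept: list[str] = []
    for message in reversed(rest):       # walk from newest to oldest
        cost = approx_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return [head, marker] + kept[::-1]   # restore chronological order
```

A harness would call something like `compact(context, budget)` before each `model.generate` call, so the loop itself does not need to know about the limit.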

Getting started

You rarely build a harness from scratch — you use one. The fastest way to feel what a harness is: run a terminal coding agent, which is a harness you can watch work. pi, for example, is a deliberately minimal one:

npm install -g --ignore-scripts @earendil-works/pi-coding-agent
export ANTHROPIC_API_KEY=sk-ant-...
pi -p "Summarize this codebase"

When you run that, the model isn’t reading your files — the harness is. It exposes read, bash, grep and friends to the model, runs the loop above, and shows you each tool call as it happens. Switch the model with a flag and the same harness drives a different engine. That swap is the clearest demonstration of where the model ends and the harness begins.
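
To make "the harness is reading your files" concrete, here is a rough sketch of what tools named read, grep, and bash could look like on the harness side. This is not pi's implementation; the signatures, the literal-substring matching in `grep`, and the 30-second timeout are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def read(path: str) -> str:
    """Return the text of a file."""
    return Path(path).read_text(encoding="utf-8")


def grep(pattern: str, path: str) -> str:
    """Return the lines of a file containing a literal substring, numbered."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(
        f"{n}:{line}" for n, line in enumerate(lines, 1) if pattern in line
    )


def bash(command: str, timeout: float = 30.0) -> str:
    """Run a shell command and return its stdout followed by its stderr."""
    done = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return done.stdout + done.stderr


# The registry the harness advertises to the model. The model only ever
# sees the names and descriptions; the functions run on your machine.
TOOLS = {"read": read, "grep": grep, "bash": bash}
```

Each function returns a plain string because that is what gets appended to the context for the model's next turn.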

In practice

A few concrete things harnesses are:

  • Coding agents. Claude Code, pi, and similar CLIs are agent harnesses specialized for software work: their tools are file edits, shell commands, and search, and their loop is tuned for “change code, run tests, repeat.”
  • Agent frameworks. Libraries like LangGraph (part of the LangChain ecosystem) are harnesses you assemble yourself — you define the tools, the state, and the control flow, and the framework runs the loop. Reach for these when you’re building an agent rather than using one.
  • Evaluation harnesses. EleutherAI’s lm-evaluation-harness is the de facto standard for benchmarking language models across hundreds of tasks; OpenAI Evals is a framework plus registry for the same job. For agents, Terminal-Bench (terminal tasks) and the SWE-bench family (software engineering tasks) run the agent-plus-model through real tasks and score completion.

The throughline: in all of these, the interesting design decisions — which tools, what context, how to score — live in the harness.

How it compares

It helps to place the harness against the words it’s often confused with.

| Term | What it is | Relationship to the harness |
| --- | --- | --- |
| Model | The trained network that predicts text | The engine the harness drives |
| Harness | The code that runs the model in a loop with tools | The thing this post is about |
| Framework | A library for building a harness (LangGraph, etc.) | A toolkit you use to assemble one |
| Agent | A model + harness aimed at a goal | What you get when a harness drives a model |

And the two senses of “harness,” side by side:

| Dimension | Agent harness | Evaluation harness |
| --- | --- | --- |
| Goal | Get a task done | Measure how well a model does tasks |
| Loop runs over | One open-ended request | A dataset of test cases |
| Output | The completed work | A score / report |
| Examples | Claude Code, pi, LangGraph apps | lm-evaluation-harness, OpenAI Evals |
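
The evaluation-harness side of that comparison can be sketched just as briefly: loop over a dataset, run the model on each case, score the output. The exact-match scoring, the `evaluate` function, the dataset layout, and the toy uppercase "model" below are assumptions for illustration; they do not reflect how lm-evaluation-harness or OpenAI Evals are implemented.

```python
from typing import Callable


def evaluate(run_model: Callable[[str], str], dataset: list[dict]) -> dict:
    """Run every test case through the model and score by exact match."""
    results = []
    for case in dataset:
        output = run_model(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "passed": output.strip() == case["expected"],
        })
    passed = sum(r["passed"] for r in results)
    return {
        "total": len(results),
        "passed": passed,
        "accuracy": passed / len(results) if results else 0.0,
        "results": results,
    }


def toy_model(prompt: str) -> str:
    """Toy stand-in for a model: it just uppercases its input."""
    return prompt.upper()


dataset = [
    {"input": "abc", "expected": "ABC"},
    {"input": "Hi", "expected": "hi"},   # the toy model fails this case
]
report = evaluate(toy_model, dataset)
print(f"{report['passed']}/{report['total']} passed")  # prints "1/2 passed"
```

Passing an agent loop in as `run_model` instead of a bare model call is what turns this into an agent benchmark: the score then reflects the model and its harness together.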

Performance and benchmarks

The most important thing to understand about harnesses and numbers: a benchmark score measures the model and the harness together, never either one alone. When a leaderboard like Terminal-Bench says “Agent X + Model Y scored 80%,” swapping either the agent or the model changes the number. A better harness (sharper tool design, smarter context management, better error recovery) can lift a fixed model’s score noticeably, which is exactly why teams invest in the scaffolding and not just the model choice. The flip side: the spread between competing agents driving the same frontier model is often smaller than the marketing suggests, because the model is doing most of the heavy lifting. Read leaderboards as pairings, not as rankings of either piece on its own.

Tradeoffs

“Harness” is a concept, not a product, so the tradeoffs are about the design choices it forces:

  • More capable means more surface area. A harness that can run shell commands and edit files can also break things. Capability and risk grow together, which is why permissions, sandboxing, and step limits are core harness concerns, not afterthoughts.
  • Minimal vs batteries-included. A thin harness is predictable and easy to reason about but makes you assemble the workflow you want; a heavy one works on day one but hides what’s happening and is harder to debug. Neither is “right” — it depends on whether you want control or convenience.
  • It can mask the model. Because the harness shapes so much, a great score or a bad failure can be hard to attribute. Was it the model or the scaffolding? Good harnesses make their tool calls and context visible so you can tell.
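
Two of these tradeoffs, limiting what the agent may do and keeping its actions visible, can live in one small wrapper around the tool registry. The sketch below is one possible shape under stated assumptions: the names `GuardedTools` and `PermissionDenied`, the allowlist approach, and the default limit of 20 calls are all hypothetical.

```python
from typing import Callable


class PermissionDenied(Exception):
    """Raised when the model asks for something the harness will not do."""


class GuardedTools:
    """Wrap a tool registry with an allowlist, a call limit, and a trace."""

    def __init__(self, tools: dict[str, Callable[[str], str]],
                 allowed: set[str], max_calls: int = 20):
        self._tools = tools
        self._allowed = allowed
        self._max_calls = max_calls
        # One (tool name, argument, result) entry per attempted call.
        self.trace: list[tuple[str, str, str]] = []

    def call(self, name: str, arg: str) -> str:
        # Denied attempts are traced too, so they count toward the limit.
        if len(self.trace) >= self._max_calls:
            raise PermissionDenied("call limit reached")
        if name not in self._allowed or name not in self._tools:
            self.trace.append((name, arg, "DENIED"))
            raise PermissionDenied(f"tool {name!r} is not allowed")
        result = self._tools[name](arg)
        self.trace.append((name, arg, result))
        return result


tools = {"read": lambda path: "contents", "delete": lambda path: "deleted"}
guarded = GuardedTools(tools, allowed={"read"})
print(guarded.call("read", "a.txt"))
```

After a run, `guarded.trace` is the record that lets you tell whether a failure came from the model's choices or from the scaffolding.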

Takeaway

A harness is the scaffolding that turns a text-predicting model into something that takes actions or gets measured — the loop, the tools, the context management, and the guardrails around the engine. Remember the two senses: an agent harness does work, an evaluation harness scores it. And remember the one rule that trips people up most: whenever you see an AI agent’s capabilities or benchmark numbers, you’re looking at the model and its harness together — change the harness, and you change what the same model can do.