Headroom: A Compression Layer That Shrinks Everything Your Agent Reads
Most of what an AI coding agent reads is filler. A code search returns a hundred near-identical hits, a log dump is 90% timestamps, a JSON tool response repeats the same keys ten thousand times — and you pay for every token of it on the way into the model. The agent only needed the one FATAL line, but it was billed for the whole haystack.
Headroom attacks that directly. It’s a context-compression layer that sits between your agent and the LLM and shrinks everything flowing in — tool outputs, logs, RAG results, files, conversation history — before it reaches the provider. Its claim is “same answers, fraction of the tokens,” and unlike most compressors it runs locally and keeps the originals so the model can pull back the full text on demand.
This post covers what Headroom is, how its route-then-compress design works, the several ways you can drop it into a stack, what its benchmarks do and don’t prove, and the honest costs — because a tool that touches every token your agent reads is also a tool that can break in a lot of places.
What it is
Headroom is an open-source (Apache 2.0) compression layer for AI agents, published as headroom-ai on PyPI and npm and developed by @chopratejas, now under the headroomlabs-ai org. It is not a model or an agent — it’s middleware that intercepts the text going to an LLM, detects what kind of content it is, compresses it with a matching algorithm, and forwards a smaller prompt. Compression is reversible: originals are cached locally and the model can retrieve them through a tool call if it turns out it needs the full version.
The category is “context engineering at the transport layer.” It targets heavy agent users — people running Claude Code, Codex, Cursor and friends all day — who want token savings without rewriting their app, and who often work across several agents and want a shared memory between them. It requires Python 3.10+. The compression itself runs on your machine and the cached originals stay local — only the smaller, compressed prompt goes on to the LLM provider, never your context to a third-party compressor.
Why it matters
The benefits, as the project frames them and as the design supports:
- It compresses the bulk, not your prompt. The savings come from tool outputs, logs, RAG chunks and history — the high-volume content you don’t hand-write — which is where the redundancy lives. Compression is still lossy, so “you don’t lose what matters” is a goal, not a guarantee (see Tradeoffs).
- Zero code changes, if you want. A drop-in proxy means any language or agent can route through it without touching application code.
- Reversible by design. Because originals are cached (the project calls this CCR), aggressive compression is safer: if the model needs the dropped detail, it asks for it.
- Local-first. Compression runs on your machine, which matters for private codebases — contrast with hosted compression APIs that require sending your context to a third party.
- It also trims what the model writes back. Output tokens cost more than input on frontier models; Headroom can optionally shape responses to cut ceremony and skip deep reasoning on routine turns (see In practice).
How it works
Headroom’s core is a router followed by specialized compressors. When content arrives, a ContentRouter detects its type and dispatches it to the right algorithm rather than applying one lossy trick to everything:
agent / app
│ prompts · tool outputs · logs · RAG · files
▼
CacheAligner → ContentRouter → ┬ SmartCrusher (JSON)
├ CodeCompressor (AST: Python/JS/Go/Rust/Java/C++)
└ Kompress-base (prose, a HuggingFace model)
│ + CCR (originals cached locally, retrievable on demand)
▼
LLM provider (Anthropic · OpenAI · Bedrock · …)
Each piece has a job. SmartCrusher collapses repetitive JSON (arrays of dicts, nested objects). CodeCompressor is AST-aware, so it shrinks source by structure rather than blindly truncating. Kompress-base is the project’s own trained model for prose (the shipped model card is Kompress-v2-base). CacheAligner stabilizes the prompt prefix so the provider’s KV cache actually hits — important, because naive compression can break prompt caching and cost you more. And CCR stores the originals locally; if the model decides it needs the uncompressed text, it calls a retrieval tool to get it back, within a configured time-to-live.
Those are the main text/code/JSON paths, but not the whole inventory — the project advertises “6 algorithms” and also ships image compression (a trained ML router, claimed 40–90% reduction), an IntelligentContext score-based context fitter, and a SharedContext for passing compressed state between agents in a multi-agent workflow. The router-plus-compressors core above is what drives the headline token savings; treat the list here as the load-bearing subset rather than the full component map.
You can consume all of this in four ways: as a library (compress(messages) inline in Python or TypeScript), as a proxy (zero code changes, any OpenAI-compatible client), as an agent wrap (one command around Claude Code, Codex, Cursor, Aider, Copilot), or as an MCP server exposing headroom_compress, headroom_retrieve and headroom_stats.
Getting started
Install the package with the extras you want, then pick a mode. The fastest payoff is wrapping a coding agent:
pip install "headroom-ai[all]" # Python (npm install headroom-ai for Node)
headroom wrap claude # wrap Claude Code through the compressor
headroom perf # see the savings
If you’d rather not wrap the agent, run the proxy and point any OpenAI-compatible client at it — no application code changes:
headroom proxy --port 8787
One install caveat worth knowing up front: the package builds a Rust core (via maturin) and fetches two runtime assets over TLS — an ONNX runtime and the kompress-base model from HuggingFace. In corporate networks with SSL inspection this can fail with CERTIFICATE_VERIFY_FAILED; the README documents the workarounds (install Rust first, or use a prebuilt wheel with --only-binary, and trust the corporate CA for the asset downloads). It’s all surmountable, but it tells you this is heavier than a single script.
In practice
Day to day in wrap or proxy mode the compression is invisible — you work normally and pay for fewer input tokens. Two features are worth turning on deliberately.
Output shaping. Everything above shrinks what you send; this trims what the model writes back, which on Opus-class models costs several times more per token. It appends a short “be terse, don’t restate context” note to the system prompt (so the cache still hits) and dials thinking effort down when a turn is just the model resuming after a tool result. It’s off by default:
export HEADROOM_OUTPUT_SHAPER=1
headroom proxy --port 8787
What’s notably honest here is the measurement. Output savings are counterfactual — you can’t see what the model would have written — so Headroom reports an estimate with a confidence interval rather than a clean made-up number, and lets you leave 10% of traffic unshaped as a real control group (HEADROOM_OUTPUT_HOLDOUT=0.1) to get a measured figure instead.
Failure learning. headroom learn mines your failed sessions and writes corrections into CLAUDE.md / AGENTS.md / GEMINI.md, so recurring mistakes become standing instructions. Combined with the cross-agent memory store (shared across Claude, Codex and Gemini with auto-dedup), it’s aiming to be more than a compressor.
How it compares
Context compression has a small but real field, and Headroom’s pitch is breadth: it covers every content type, runs locally, and is reversible, where most alternatives pick one lane. Be fair about what that costs in simplicity, though — the narrow tools are narrow on purpose.
| Tool | Scope | Local | Reversible | Setup |
|---|---|---|---|---|
| Headroom | All context — tools, RAG, logs, files, history | Yes | Yes | Library · proxy · wrap · middleware · MCP |
| RTK | CLI command outputs only | Yes | No | CLI wrapper |
| lean-ctx | CLI commands, MCP tools, editor rules | Yes | No | CLI wrapper · MCP |
| Hosted (Compresr, Token Co.) | Text sent to their API | No | No | Hosted API call |
| OpenAI Compaction | Conversation history | No | No | Provider-native |
The honest reading: RTK and lean-ctx are simpler and do one thing well — and Headroom actually ships RTK’s binary and can use lean-ctx as its CLI tool, so they’re complements as much as competitors. Provider-native compaction (OpenAI’s) is zero-setup but single-provider and not reversible. Hosted services are the only ones that send your context off-machine, which is exactly the tradeoff Headroom’s local-first design avoids. If all you want is leaner output on a single agent, a lightweight tool like caveman is far less machinery; Headroom is for when you want input compression, cross-agent memory, and reversibility together.
Performance and benchmarks
Headroom publishes numbers, and they’re the most important thing to read carefully, because they are self-reported — run by the project, with a reproduce script, but not independently audited. Two tables matter.
Savings on the project’s own agent workloads range widely by task — roughly 92% on a 100-result code search and on an SRE incident-debugging trace, 73% on GitHub issue triage, and 47% on codebase exploration. The spread is the point: compression pays off hugely on repetitive bulk (search results, logs) and far less on already-dense content (exploring varied source files). Worth noting their own table undercuts their own headline: the advertised range is “60–95%,” yet the codebase-exploration row lands at 47% — below their stated floor. Read the 60–95% as a best-case band, not a typical one.
On accuracy, the project reports its compression is close to lossless on small standard benchmarks (N=100 each): GSM8K math unchanged (±0.000), TruthfulQA slightly up (+0.030), and SQuAD v2 and BFCL tool-calling at 97% with 19% and 32% compression respectively. You can reproduce these with python -m headroom.evals suite --tier 1.
Read honestly: N=100 is a small sample, and the headline 60–95% range comes from the workloads most favorable to compression. The accuracy numbers are reassuring as a sanity check that the compressor isn’t obviously destroying signal, not proof it never will on your data. The reproduce script is a genuine point in its favor — claims you can run yourself are worth more than claims you can’t — but “self-reported and reproducible” is not the same as “independently verified.”
Tradeoffs
- Large surface area, many failure points. A Rust core, an ONNX runtime, a HuggingFace model download, a proxy, MCP plumbing, a memory store. That’s a lot to install, keep updated, and debug — and the README’s own long section on corporate SSL failures is a tell. Compare a single-file tool that captures much of the output-side win with none of the infrastructure.
- Compression is lossy; reversibility has a cost. “Same answers” holds until it doesn’t. CCR mitigates by caching originals, but retrieval is an extra tool round-trip and latency, and it only helps if the model realizes it’s missing something. Adversarial or detail-dense tasks are where to watch for quality loss.
- The numbers are the vendor’s. Until there’s third-party benchmarking, treat the savings and accuracy figures as directional and verify on your own workload before trusting them in production.
- A proxy in the path is a dependency in the path. Routing every model call through a local service means its uptime, version, and config are now part of your critical path; a bad upgrade affects every agent at once.
- Won’t run everywhere. It needs to run local processes, so sandboxed or locked-down environments are out — the project says as much.
Takeaway
Reach for Headroom if you run agents heavily, want input and output token savings without rewriting your code, work across multiple agents and want shared memory, and value keeping compression local and reversible. Skip it if you only use one provider and its native compaction is enough, if you want the smallest possible moving-parts footprint, or if you can’t run local services. The one thing to remember: Headroom is the maximalist option in this space — broad, local, reversible, and genuinely thoughtful about honest measurement — but that breadth is infrastructure, and its impressive numbers are its own until someone else reproduces them. Try it on your real workload, watch the savings and the answer quality, and keep it only if both hold.
Sources
- headroomlabs-ai/headroom README (modes, ContentRouter and compressors, CCR reversibility, agent compatibility, output shaping,
headroom learn, the Proof and accuracy tables, install and corporate-SSL notes, the “Compared to” table) — note: the repository’s internal badges and clone URL referencechopratejas/headroom, the apparent original location - Headroom documentation (architecture, CCR, benchmarks methodology, limitations)
- Kompress-v2-base model card (the text-compression model)
- RTK and lean-ctx (the narrower CLI-compression alternatives referenced in the comparison)