I let coding agents do most of my work for two months and measured what each shipped change actually cost. The model with the highest token price shipped my commits for the least money — and the cheapest tokens were, by a wide margin, the most expensive thing I used.
Token dashboards answer "how much did I burn?" That's the wrong question. The one a team actually has is "what did we get for it?": cost per unit of shipped work, with the human time included, comparable across models. So for every agent session I logged tokens, prompts, and what actually shipped, then divided the cost by the commits.
The ranking came out backwards from the invoice. Here's what the number is, and why it flips.
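The metric is true cost per commit: compute, plus the cost of the interactive prompts, plus the cost of active engineering time, divided by the commits shipped.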
Compute is real, cache-aware, API-equivalent dollars from the token counts. The other two terms are the part every other tool ignores: the human cost. A model that's cheap per token but needs ten prompts and constant supervision to land a change isn't cheap — it's spending your time, which is the most expensive thing in the room.
The driver values are yours to set, because the honest answer is "it depends on you." For the numbers below I used $12 per interactive prompt (context-switch + review) and $200 per active engineering hour. Your numbers will differ — so I also show the compute-only figure and the prompts-per-commit, which don't depend on any of that. The ranking is carried by those objective columns; the dollar totals just put a price on them.
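A minimal sketch of that calculation, assuming the three terms simply add and the total is divided by commits. The function and field names are hypothetical, not the tool's API, and the example inputs are illustrative rather than taken from my logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drivers:
    """Human-cost driver values; the defaults are the ones used in this post."""
    usd_per_prompt: float = 12.0   # context switch + review, per interactive prompt
    usd_per_hour: float = 200.0    # active engineering time


def true_cost_per_commit(compute_usd: float, prompts: int, active_hours: float,
                         commits: int, drivers: Drivers = Drivers()) -> float:
    """Assumed form: (compute + prompt cost + active-time cost) / commits."""
    if commits <= 0:
        raise ValueError("need at least one commit to normalize by")
    human_usd = prompts * drivers.usd_per_prompt + active_hours * drivers.usd_per_hour
    return (compute_usd + human_usd) / commits


# Illustrative inputs only, not figures from my logs.
example = true_cost_per_commit(compute_usd=50.0, prompts=20, active_hours=2.0, commits=10)
print(f"${example:.2f} per commit")  # $69.00 per commit
```

Swapping in your own `Drivers` changes the dollar total but not the compute-only or prompts-per-commit figures, which is why I report those separately.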
One developer (me), my real work, 23 Apr → 18 Jun 2026 (~8 weeks). Mostly Claude Code, with some Codex in the mix.
The headline gap, $6.90 of compute versus $81 of true cost, is the point. The tokens are the cheap part. Here's the per-model breakdown for the models I shipped real volume with:
| Model (primary) | Commits | Compute $/commit | Prompts/commit | True $/commit |
|---|---|---|---|---|
| Claude Fable 5 (cheapest to ship) | 96 | $9.40 | 1.0 | $57.85 |
| Claude Opus 4.7 | 317 | $6.51 | 1.7 | $74.61 |
| Claude Opus 4.8 | 314 | $7.50 | 2.2 | $94.36 |
| GPT‑5.5 (priciest to ship) | 38 | $4.78 | 10.6 | $182.54 |
Models with ≥10 commits. GPT‑5.5 is a small sample (38 commits via Codex) — held loosely, shown because the direction is stark. I also run Sonnet 4.6 as the automation lane (cron jobs, recaps), where there's no human in the loop; it lands those at ~$1.45 each — a different job, not part of this comparison.
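To see where each row's dollars go, here is a rough decomposition of the table, assuming true cost is compute plus prompts at $12 plus active time at $200 per hour. The active-time figure is not in the table, so it is backed out as a residual and is only as good as that assumption.

```python
USD_PER_PROMPT = 12.0   # driver value used in this post
USD_PER_HOUR = 200.0    # driver value used in this post

# (model, compute $/commit, prompts/commit, true $/commit), copied from the table
ROWS = [
    ("Claude Fable 5", 9.40, 1.0, 57.85),
    ("Claude Opus 4.7", 6.51, 1.7, 74.61),
    ("Claude Opus 4.8", 7.50, 2.2, 94.36),
    ("GPT-5.5", 4.78, 10.6, 182.54),
]


def decompose(compute: float, prompts: float, true_cost: float):
    """Split a row into prompt dollars, residual time dollars, and implied minutes."""
    prompt_usd = prompts * USD_PER_PROMPT
    time_usd = true_cost - compute - prompt_usd   # residual, assumed to be active time
    minutes = time_usd / USD_PER_HOUR * 60
    return prompt_usd, time_usd, minutes


for model, compute, prompts, true_cost in ROWS:
    prompt_usd, time_usd, minutes = decompose(compute, prompts, true_cost)
    print(f"{model:16s} compute ${compute:6.2f}  prompts ${prompt_usd:7.2f}  "
          f"time ${time_usd:6.2f} (~{minutes:.0f} min)")
```

Under those assumptions, prompts account for about $127 of GPT‑5.5's $182.54, and the residual comes to roughly 11 to 18 minutes of active time per commit across the four rows.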
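Read the two ends of that table. GPT‑5.5 had the lowest compute per commit ($4.78) but needed 10.6 prompts per commit, and landed at $182.54. Claude Fable 5 had the highest compute per commit ($9.40), needed 1.0 prompt per commit, and landed at $57.85.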
The cheap-token model was the expensive one. The expensive-token model was the cheap one. Every ranking that stops at the invoice gets this exactly backwards.
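One way to check how robust the flip is: find the per-prompt price at which the two end rows would tie. This sketch uses only the compute and prompts-per-commit columns and deliberately ignores the active-hours term, which the table does not break out per model; the helper name is hypothetical.

```python
def break_even_prompt_price(compute_a: float, prompts_a: float,
                            compute_b: float, prompts_b: float) -> float:
    """Per-prompt price at which compute + prompts * price is equal for a and b."""
    if prompts_a == prompts_b:
        raise ValueError("equal prompts per commit: no break-even price exists")
    return (compute_a - compute_b) / (prompts_b - prompts_a)


# (compute $/commit, prompts/commit) from the table
fable = (9.40, 1.0)
gpt = (4.78, 10.6)

price = break_even_prompt_price(*fable, *gpt)
print(f"break-even: ${price:.2f} per prompt")  # break-even: $0.48 per prompt
```

On those two columns alone, GPT‑5.5 only comes out cheaper if an interactive prompt costs you less than about $0.48, far below any plausible value for a context switch plus a review.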
This is also why the new vendor analytics don't answer the question. Anthropic now ships a Claude Code analytics API that reports commits, estimated cost, and tool-acceptance — useful, and a sign the category is real. But it's single-vendor (Claude only — it can't see the GPT‑5.5 row above), it's raw counts plus estimated spend with no human-attention term, and there's no normalization — no $/commit you can compare across models, and nothing to compare you against anyone else.
It can tell you what you burned on Claude. It can't tell you the pricier model was the cheaper one — and it can't tell you whether your number is any good. That last question — is my $/commit good? — is the one thing a single machine can never answer, because the answer lives across many teams. That's the piece we're building next: anonymized, aggregates-only, k‑anonymous baselines across orgs. This post is the single-developer version of the question that benchmark answers at scale.
The tool behind these numbers runs locally. It reads the logs your agents already write (Claude Code, Codex, Gemini) and computes the table above for your work. No code, prompts, or transcripts ever leave your machine; there's no account and no proxy. The source is available under FSL‑1.1‑MIT, which converts to MIT after two years.
Point it at a month of your own sessions and check the ranking. My bet: the model you think is expensive is carrying you, and the cheap one is quietly costing you a fortune in prompts.