Field notes

The cheapest model on the invoice was the most expensive to ship with.

I let coding agents do most of my work for two months and measured what each shipped change actually cost. The model with the highest token price shipped my commits for the least money — and the cheapest tokens were, by a wide margin, the most expensive thing I used.

Token dashboards answer "how much did I burn?" That's the wrong question. The one a team actually has is "what did we get for it?" — cost per unit of shipped work, with the human time included, comparable across models. So I tracked every agent session for two months — tokens, prompts, and what actually shipped — and divided the cost by the commits.

The ranking came out backwards from the invoice. Here's what the number is, and why it flips.

What I measured

The metric is true cost per commit:

true $/commit = ( compute $ + prompts × attention $ + active-time × hourly $ ) / commits

Compute is real, cache-aware, API-equivalent dollars from the token counts. The other two terms are the part every other tool ignores: the human cost. A model that's cheap per token but needs ten prompts and constant supervision to land a change isn't cheap — it's spending your time, which is the most expensive thing in the room.

The driver values are yours to set, because the honest answer is "it depends on you." For the numbers below I used $12 per interactive prompt (context-switch + review) and $200 per active engineering hour. Your numbers will differ — so I also show the compute-only figure and the prompts-per-commit, which don't depend on any of that. The ranking is carried by those objective columns; the dollar totals just put a price on them.
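A minimal sketch of the formula in Python, to make the three terms concrete. The names (`Drivers`, `true_cost_per_commit`, `implied_active_minutes`) are hypothetical and are not taken from any tool. The defaults are the $12 per prompt and $200 per hour quoted above. The second helper runs the formula backwards to recover the active-time term from per-commit figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drivers:
    """Human-cost assumptions; defaults are the values quoted in this post."""
    attention_per_prompt: float = 12.0  # $ per interactive prompt
    hourly_rate: float = 200.0          # $ per active engineering hour


def true_cost_per_commit(compute_usd, prompts, active_hours, commits,
                         drivers=Drivers()):
    """(compute $ + prompts x attention $ + active-time x hourly $) / commits."""
    if commits <= 0:
        raise ValueError("need at least one commit")
    human_usd = (prompts * drivers.attention_per_prompt
                 + active_hours * drivers.hourly_rate)
    return (compute_usd + human_usd) / commits


def implied_active_minutes(compute_per_commit, prompts_per_commit,
                           true_per_commit, drivers=Drivers()):
    """Back out active minutes per commit from per-commit figures."""
    residual_usd = (true_per_commit - compute_per_commit
                    - prompts_per_commit * drivers.attention_per_prompt)
    return residual_usd / drivers.hourly_rate * 60


# Hypothetical example: $9.40 compute, 1.0 prompts, $57.85 true cost per commit
print(round(implied_active_minutes(9.40, 1.0, 57.85), 1))  # about 10.9 minutes
```

With those inputs, $9.40 of compute plus one $12 prompt leaves $36.45 for active time, which is about 11 minutes per commit at $200 per hour. The result is only as precise as the rounded inputs, so treat it as an approximation.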

The data

One developer (me), my real work, 23 Apr → 18 Jun 2026 (~8 weeks). Mostly Claude Code, with some Codex in the mix.

The gap between $6.90 of compute and $81 of true cost is the point. The tokens are the cheap part. Here's the per-model breakdown for the models I shipped real volume with:

| Model (primary) | Commits | Compute $/commit | Prompts/commit | True $/commit |
|---|---|---|---|---|
| Claude Fable 5 | 96 | $9.40 | 1.0 | $57.85 |
| Claude Opus 4.7 | 317 | $6.51 | 1.7 | $74.61 |
| Claude Opus 4.8 | 314 | $7.50 | 2.2 | $94.36 |
| GPT‑5.5 | 38 | $4.78 | 10.6 | $182.54 |

Models with ≥10 commits. GPT‑5.5 is a small sample (38 commits via Codex) — held loosely, shown because the direction is stark. I also run Sonnet 4.6 as the automation lane (cron jobs, recaps), where there's no human in the loop; it lands those at ~$1.45 each — a different job, not part of this comparison.

The inversion

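Read the two ends of that table:

GPT‑5.5 had the lowest compute cost, $4.78 per commit, but took 10.6 prompts per commit and came out at $182.54 true cost.
Claude Fable 5 had the highest compute cost, $9.40 per commit, but took 1.0 prompts per commit and came out at $57.85 true cost.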

The cheap-token model was the expensive one. The expensive-token model was the cheap one. Every ranking that stops at the invoice gets this exactly backwards.

[Figure: slope chart of the four models, ranked first by compute cost per commit (GPT‑5.5 $4.78, Opus 4.7 $6.51, Opus 4.8 $7.50, Fable 5 $9.40) and then by true cost per commit (Fable 5 $57.85, Opus 4.7 $74.61, Opus 4.8 $94.36, GPT‑5.5 $182.54). The Fable 5 and GPT‑5.5 lines cross.]
Same four models, two rankings. Order by token cost and GPT‑5.5 looks cheapest; order by true cost per commit and it's the most expensive, at about 3× Fable 5's figure ($182.54 vs $57.85). The cheap-token model and the pricey-token model swap ends.
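One way to check how much the inversion depends on the $12-per-prompt assumption is to ask at what attention price the two ends of the table cost the same. The sketch below uses only the compute and prompts columns from the table and deliberately leaves out the active-time term, so it is a simplification rather than the full metric. The function names are hypothetical.

```python
# Per-commit figures copied from the table in this post.
ROWS = {
    "Claude Fable 5": {"compute": 9.40, "prompts": 1.0},
    "Claude Opus 4.7": {"compute": 6.51, "prompts": 1.7},
    "Claude Opus 4.8": {"compute": 7.50, "prompts": 2.2},
    "GPT-5.5": {"compute": 4.78, "prompts": 10.6},
}


def cost_without_active_time(row, attention_per_prompt):
    """Compute plus prompt attention only; active time is left out."""
    return row["compute"] + row["prompts"] * attention_per_prompt


def break_even_attention(cheap_tokens, pricey_tokens):
    """Per-prompt price at which the two models cost the same per commit."""
    extra_prompts = cheap_tokens["prompts"] - pricey_tokens["prompts"]
    if extra_prompts <= 0:
        raise ValueError("the cheaper-token model must need more prompts")
    return (pricey_tokens["compute"] - cheap_tokens["compute"]) / extra_prompts


def ranking(attention_per_prompt):
    """Model names ordered cheapest first at a given per-prompt price."""
    return sorted(
        ROWS,
        key=lambda name: cost_without_active_time(ROWS[name], attention_per_prompt),
    )


p = break_even_attention(ROWS["GPT-5.5"], ROWS["Claude Fable 5"])
print(f"break-even: ${p:.2f} per prompt")  # break-even: $0.48 per prompt
print(ranking(0.0))   # invoice order, GPT-5.5 first
print(ranking(12.0))  # Fable 5 first, GPT-5.5 last
```

Under these assumptions, once a prompt costs more than about $0.48 of attention, GPT‑5.5 is already more expensive per commit than Fable 5, before any active time is counted. At $12 per prompt the order of all four models matches the true-cost ranking in the table.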

Why token counts can't tell you this

This is also why the new vendor analytics don't answer the question. Anthropic now ships a Claude Code analytics API that reports commits, estimated cost, and tool-acceptance, which is useful and a sign the category is real. But it has three gaps. It's single-vendor: Claude only, so it can't see the GPT‑5.5 row above. It reports raw counts plus estimated spend, with no human-attention term. And it doesn't normalize: there's no $/commit you can compare across models, and nothing to compare you against anyone else.

It can tell you what you burned on Claude. It can't tell you the pricier model was the cheaper one — and it can't tell you whether your number is any good. That last question — is my $/commit good? — is the one thing a single machine can never answer, because the answer lives across many teams. That's the piece we're building next: anonymized, aggregates-only, k‑anonymous baselines across orgs. This post is the single-developer version of the question that benchmark answers at scale.

Try it on your own logs

The tool, codenomics, runs locally. It reads the logs your agents already write (Claude Code, Codex, Gemini) and computes the table above for your work. No code, prompts, or transcripts ever leave your machine; there's no account and no proxy. It's open source (FSL‑1.1‑MIT).

$ npx codenomics init
$ npx codenomics serve --open

Point it at a month of your own sessions and check the ranking. My bet: the model you think is expensive is carrying you, and the cheap one is quietly costing you a fortune in prompts.

One developer, one workflow, ~8 weeks. This is a personal case study that motivates the metric — not a cross-org benchmark (that's the part we're building). The dollar figures use my driver assumptions ($12/prompt, $200/hr); the ranking is driven by the objective columns — prompts and compute per commit. Run it on your own logs and you'll get your own table.