
Under My Thumb

Codex as Economical Coder, Tireless Reviewer, and Pipeline Worker — and the Marimba That Knows the Dominance Is Performed

By Nolan & Claude · May 5, 2026 · 15 min read
A large dominating human thumbprint rendered as a 1968-era rock album cover plate — concentric whorls in deep warm-black ink on cream paper, with a blood-orange screen-print misregistration bleeding around the edges of the print

It's Tuesday, 4:47 PM. Two terminal windows open.

In the left, Claude Code is rendering a review pass on a port of sim.js to TypeScript — the engine port for v6.0, the one I've been planning for three weeks. In the right, the codex-rescue agent is on chunk four of seven, landing the diff Claude wrote the spec for at lunch. I'm watching both panes the way you watch two pots on a stove. Neither pane is mine. Both are doing what I told them to do.

A notification in Slack: my Max 20x weekly allocation is at 31%. Three days left in the cycle. There are also two Anthropic Teams Premium seats invoicing on the first of next month, and an OpenAI account that bills usage on whatever Codex landed last week. Three meters spinning. One pair of hands.

I'm not the one burning the tokens. I'm the one paying for them.

The song hadn't started yet, but the marimba was already playing.

Naming the Song

In 1966, on side one of Aftermath, the Rolling Stones opened with a song that runs on a marimba. Brian Jones plays it — vibraphone-bright, slightly off-balance, an uneasy bounce that won't quite settle into the time signature underneath Jagger's strut. The lyrics describe a reversal of leverage. The narrator used to be the one getting pushed around. Now he's the one with his thumb on the scale. He is enjoying it more than the song wants to admit.

The song knows that. The marimba is what knows it. The percussion undermines the swagger in real time — every bar lays the beat down a little off, like the dominance display is built on a footing it can't quite trust.

I want to talk about a different power inversion. One that practitioners in May 2026 are uncomfortably savoring. And the marimba is the part of this story worth keeping in the mix.

The Reversal

Three weeks ago, “Codex” meant something specific in our industry. It meant OpenAI's coding agent, powered by GPT-5.5, released April 23, 2026. It meant the model that just took back Terminal-Bench 2.0 from Claude Opus 4.7 by thirteen points (82.7% vs 69.4%). It meant the autonomous agent the demos showed running for hours unattended — refactoring whole codebases, writing its own tickets, closing its own PRs.

The headline read: AS GOOD AS CLAUDE. Three weeks ago, that headline would have been the framing of this post. Codex caught up. Pick a side.

That's not what happened in my workflow.

What happened in my workflow is that Codex got a job. The job description is a four-line config file, but the four lines are doing real work. Translated out of code-speak: Codex writes. Claude reviews. The model trained to be autonomous doesn't get to grade its own paper.

"delegation": {
  "coding_cli": "codex",
  "coder_agent": "codex-rescue",
  "review_cli": "claude"
}

Read those four lines slowly. The frontier model that's “as good as Claude” is, in this codebase, the coder. Not the director. Not the reviewer. Not the architect. The coder. Codex writes the change after Claude has written the instructions, after I've defined what “done” looks like, after the checklist has been pre-loaded. Codex commits the change. Claude inspects the change. I approve the change. Repeat.

The line that's actually doing the work in this metaphor is the third one: review_cli: "claude". It says, with no ambiguity: the pet doesn't walk itself.

Three different framings of the same delegation. Three verses of the same song.

Verse I · The Coding Agent

The narrator describes how the cost of things has changed. He used to pay full freight. Now he doesn't.

In Tom Sawyer and the Apple Economy, the move was: don't pay retail for grunt work. Find your Ben Rogers. Hand over the brush. Codex is the most economically interesting Ben Rogers on the market right now.

The benchmark coverage from late April is unambiguous: GPT-5.5 uses roughly 2–3x fewer tokens than Claude Opus 4.7 for comparable coding tasks. One Figma-style task that Claude burned 6.2M tokens on, Codex landed in 1.5M. That ratio compounds over a sprint. It compounds harder over a CI loop. It compounds hardest on bulk refactors and engine ports — the work that's repetitive enough to be expensive on Opus and well-specified enough that “just write what the spec says” is the whole job.
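
The compounding is easy to make concrete. Here is a back-of-envelope sketch using the 6.2M-vs-1.5M figures quoted above; the 40-task sprint size is a hypothetical, chosen only to show the shape of the curve:

```typescript
// Back-of-envelope: how a per-task token gap compounds over a sprint.
// The 6.2M / 1.5M figures are the ones quoted above; 40 tasks per
// sprint is a made-up number for illustration.
function sprintTokens(tokensPerTask: number, tasksPerSprint: number): number {
  return tokensPerTask * tasksPerSprint;
}

const opusTask = 6_200_000;  // tokens Opus burned on the Figma-style task
const codexTask = 1_500_000; // tokens Codex used for the same task

// Over a 40-task sprint, the absolute gap is ~188M tokens.
const gap = sprintTokens(opusTask, 40) - sprintTokens(codexTask, 40);
```

The per-task ratio is what the benchmarks report; the absolute gap is what the invoice reports.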

A real example you can click on right now. The interactive Workbench on this site is a working circuit playground — drop in a battery, a resistor, an LED, hit run, watch real Kirchhoff's-law math solve the network in your browser. The engine that does the solving used to live in a 435-line JavaScript file from a design handoff. We needed it as typed, tested TypeScript the rest of the site could safely import — without breaking the math, which is the kind of math where a small wrong number doesn't crash the page, it just makes the simulation quietly lie.
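
To make "the simulation quietly lies" concrete: here is a minimal sketch of the kind of math the port has to preserve. This is not the site's actual engine, just the simplest case it handles, a single series loop of battery, resistor, and LED, with the LED modeled as a fixed forward-voltage drop:

```typescript
// One series loop: battery, resistor, LED (fixed forward-voltage drop).
// KVL around the loop: Vbat = I*R + Vf, so I = (Vbat - Vf) / R.
// A sign error or swapped operand here doesn't crash anything; it just
// reports a current that's quietly wrong.
function seriesLedCurrent(vBat: number, rOhms: number, vForward: number): number {
  const headroom = vBat - vForward;
  // Below the LED's forward voltage, the diode doesn't conduct.
  return headroom > 0 ? headroom / rOhms : 0;
}

// 9V battery, 330-ohm resistor, red LED (~2V drop): about 21 mA.
const amps = seriesLedCurrent(9, 330, 2);
```

The QA checklist for each chunk includes numeric tests of exactly this kind, because "builds and lints" proves nothing about the arithmetic.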

So we didn't rewrite. We took the port one self-contained piece at a time and ran each piece through the same loop. Think of it as a kitchen with three roles: the head chef writes the recipe, a line cook executes it, the head chef tastes the plate before it leaves.

The director-coder-QA loop, one chunk

  1. Director (Claude Code). Writes the recipe — what files, what shape, what counts as “done,” what tests have to pass.
  2. Coder (Codex). Lands the actual change. Writes the code. Commits it.
  3. QA (Claude Code). Runs the checklist — build the project, run the linter, run the tests, check that nothing earlier broke.
  4. One chunk = one self-contained commit. If the checklist fails, that chunk gets rolled back cleanly. Nothing further ships until it passes.
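
The steps above can be sketched as one function per chunk. The `claude spec` / `codex apply` / `claude check` subcommands are placeholders for whatever your wrapper scripts invoke, not either CLI's documented interface:

```typescript
// One chunk through the director-coder-QA loop. The command strings
// are placeholders; only the control flow is the point: nothing ships
// past a failed checklist, and rollback is one clean revert.
type ChunkResult = { chunk: string; shipped: boolean; commands: string[] };

function runChunk(chunk: string, checklistPasses: (c: string) => boolean): ChunkResult {
  const commands = [
    `claude spec ${chunk}`,  // 1. director writes the recipe
    `codex apply ${chunk}`,  // 2. coder lands and commits the change
    `claude check ${chunk}`, // 3. QA runs build / lint / tests
  ];
  const shipped = checklistPasses(chunk);
  if (!shipped) commands.push("git revert --no-edit HEAD"); // 4. clean rollback
  return { chunk, shipped, commands };
}
```

The cadence matters more than the tools: one chunk, one commit, one verdict, then the next chunk.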

The pattern is the surgeon-and-the-Uber. Claude is the surgeon. You don't ask the surgeon to drive you to the hospital, scrub the OR, and bill the insurance. Codex is the driver who knows where to go because Claude wrote the address on a Post-it. The work gets done. The right specialist does the right layer. Nobody is paying retail for the parts that don't need it.

The token math is the part that lands hardest in the Apple Economy. A Claude Pro week is finite. A Claude Design weekly allocation can disappear on one well-designed homepage (we covered that one in Money). When Codex is the coder, your Claude allocation is preserved for the layer where Claude's reasoning is the product — the spec, the review, the architectural call, the “is this the right approach” question. The cheap labor stays cheap. The expensive judgment stays available.

Verse II · The Reviewer

“The way she talks when she's spoken to.” The reversal sharpens. The narrator notices the compliance.

If Verse I was Codex as the cheap hand at the keyboard, Verse II is Codex as the eye in the back of the room. Same model, pointed at someone else's change instead of a blank file. The leash gets shorter on this one — a reviewer has less room to invent than a coder. That's a feature.

OpenAI sold GPT-5.5 as a fully agentic coder. It's trained on agent traces. It's optimized for long-horizon autonomous tasks. The marketing material says let it run. The Expert-SWE benchmark — OpenAI's internal eval for tasks with a median estimated human completion time of twenty hours — clocks Codex at 73.1%. The product is built to be the senior engineer who works while you sleep.

The right move in a real workflow is the opposite. Don't let it run. Put it on review duty.

The work that used to be too expensive for senior engineers to do at scale — reading every change, every commit, every AI-generated draft before a human rubber-stamps it — is now economically tractable as a Codex pass that runs automatically every time someone proposes a change. The model's strengths line up with the job: it's good at terminal-style work, and it can hold the equivalent of a small novel in its head and remember the first chapter when it gets to the last. That makes it a first-pass gate that catches the cases the writer's own mental model couldn't see.

The kinds of cases this gate is good at:

The compiles-but-wrong miss. The change ships, the tests pass, the linter is happy, and the diff solves the wrong problem. Codex reading the diff against the original spec catches this faster than a human reviewer who's already tired at 4:47 PM on a Tuesday.

The default-jurisdiction miss. Callback to Wax On, Wax Off: the AI-generated contract that defaulted the governing-law clause to Delaware when the client was in Texas. Same shape in code: the AI-generated change that picked the wrong default for “this looked like the obvious case.”

The cross-cutting regression. A change in module A that breaks an assumption module B was relying on three commits ago. Long-context retrieval reads the relevant history alongside the diff and flags the conflict.

The boundary-violation miss. The diff that touches an internal API that's not supposed to be touched outside its module. Codex reads the architectural rules in CLAUDE.md and the change against them.

Anthropic's edge on SWE-bench Pro and tool-orchestration says Claude is the better final reviewer. Use it. Stack the gates: Codex reads the diff first — cheap, fast, runs on every commit. Claude reviews what Codex flags — rarer, more expensive, called only when there's a question worth its session window. Human approves. Each layer doing the layer it's pricewise correct for.
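
The stacked gates reduce to a small routing function. The two callbacks here stand in for the Codex pass and the Claude pass; their signatures are assumptions about your wiring, not either vendor's real API:

```typescript
// Route a diff through the stacked gates. Codex reads every diff
// (cheap); Claude is invoked only when something was flagged
// (expensive); the final approval stays human either way.
function reviewDiff(
  diff: string,
  codexFlags: (d: string) => string[],                     // cheap pass, every commit
  claudeApproves: (d: string, flags: string[]) => boolean, // expensive pass, on demand
): { verdict: "ship" | "reject"; claudeCalled: boolean } {
  const flags = codexFlags(diff);
  if (flags.length === 0) return { verdict: "ship", claudeCalled: false };
  const ok = claudeApproves(diff, flags); // Claude tokens spent only when flagged
  return { verdict: ok ? "ship" : "reject", claudeCalled: true };
}
```

The economics live in the `claudeCalled: false` branch: most commits never touch the expensive layer at all.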

The model OpenAI shipped as “the agent that runs your codebase” becomes, in this configuration, the gatekeeper that doesn't argue. It reads what you put in front of it. It applies the rules you wrote. It flags the cases that violate them. It does not make the call. The call stays with Claude and with you.

The way she talks when she's spoken to.

Verse III · The Pipeline

By the third verse the reversal is total. The narrator has stopped being surprised by it.

The shortest leash of the three. In Verse I, Codex chose how to write the change. In Verse II, Codex chose what to flag in someone else's. In Verse III, Codex doesn't even choose when to run. This is Codex as the watchman who never blinks — the most domesticated version of the three.

Codex runs automatically — every time code changes, every time someone opens a change for review, every Saturday at 2 AM whether anyone's awake or not. The schedule decides. The instructions decide. The checklist decides.

This is the least like what OpenAI shipped GPT-5.5 to be. The product was sold as autonomous; we're putting it in a job description and a scheduled trigger. Same model, much shorter leash. And it works — it works better than the autonomous version, because the constraint is what makes it dependable.

Three deployments where this earns its keep:

1. Pre-push: lint-style review

A Codex pass on the staged diff before git push. Reads the change against the project's style rules, naming conventions, and architectural boundaries. Cheap. Fast. Runs on every push. Catches the “you forgot to update the test” case before it ever reaches the remote.
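
A sketch of the hook's core: the pre-push hook shells out to `git diff --cached` and feeds the result to a prompt builder like this one. The prompt wording and the rule list are illustrative, not a documented interface:

```typescript
// Build the review prompt the pre-push hook hands to the model.
// Pure string assembly: the hook supplies the staged diff, the repo
// supplies the rules (e.g. pulled from its style docs).
function buildPrePushPrompt(stagedDiff: string, rules: string[]): string {
  return [
    "Review this staged diff against the project rules below.",
    "Flag style, naming, and architectural-boundary violations only.",
    ...rules.map((r) => `- ${r}`),
    "--- DIFF ---",
    stagedDiff,
  ].join("\n");
}
```

Keeping the builder pure means the hook itself stays a two-line shim, and the prompt can be unit-tested without ever calling the model.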

2. PR-open: structural review

When a PR opens, Codex reads the full diff and answers four questions: did this change the API contract? did it ship without tests? did it touch the security boundary? did it violate any constraint named in CLAUDE.md? Posts a structured comment. Doesn't merge. Doesn't block. Surfaces.
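
The four questions reduce to a pure check over a parsed diff summary. The `DiffSummary` shape is an assumption; a real pipeline would derive it from the diff and from the constraints named in CLAUDE.md:

```typescript
// The PR-open gate's four questions as a pure function. Returns
// findings to surface as a PR comment; it never merges, never blocks.
interface DiffSummary {
  changedApiContract: boolean;
  hasTests: boolean;
  touchesSecurityBoundary: boolean;
  violatedConstraints: string[]; // constraint names from CLAUDE.md
}

function structuralReview(d: DiffSummary): string[] {
  const findings: string[] = [];
  if (d.changedApiContract) findings.push("API contract changed");
  if (!d.hasTests) findings.push("shipped without tests");
  if (d.touchesSecurityBoundary) findings.push("touches the security boundary");
  for (const c of d.violatedConstraints) findings.push(`violates constraint: ${c}`);
  return findings;
}
```

An empty return means the comment says "no structural findings," which is itself useful signal: the gate ran and found nothing.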

3. Nightly: drift & regression

A scheduled nightly run reads the last 24 hours of commits against the architectural intent in the planning docs. Flags drift. Flags newly introduced duplication. Flags coverage gaps. Files an issue if anything looks load-bearing. Most nights it does nothing. The nights it does something, it does it cheaper than a human ever could.

The token-efficiency advantage compounds when the agent runs continuously. Claude couldn't economically be the always-on watchdog — the per-token math doesn't hold once you're running on every commit. Codex can. The 2–3x ratio that's a nice-to-have on a single chunk becomes a structural enabler on a CI loop.

In each case, Codex is the worker. The decisions are encoded upstream by humans and Claude. The model that wants to be autonomous becomes the most reliable kind of automation: the kind with rules.

Bridge · The Marimba

Brian Jones plays the marimba on the original recording. It's the sound underneath the swagger. Jagger struts; the marimba bounces, slightly off-time, slightly disquieting. The song is performing dominance and the percussion is letting you know the performance is unstable.

The marimba in this story is: you don't actually have Codex under your thumb.

You have a contract. The contract renews every time OpenAI ships a model update. GPT-5.5 in April. GPT-5.6 in July, probably. GPT-6 sometime after that. The pricing tier moves. The rate limits move. The capability ceiling moves. The agent traces the model was trained on this quarter aren't the agent traces it'll be trained on next quarter. The “Codex” you put in your YAML this Monday is, with non-zero probability, a different Codex by August.

You are also renting Claude. Same problem. The Sonnet you wrote your spec template against six months ago is not the Sonnet that will read the spec tomorrow. The Opus that wrote this post is the seventh point release of a model family that is itself two generations old.

I should be specific about what that costs me, because the abstraction softens it. On the first of every month, my own setup invoices three different ways from one company: two Anthropic Teams Premium seats (the workhorses, where the project work lives) and one Max 20x personal plan (the heavy-lifting tier I use when a session needs the whole window). Three concurrent contracts. Three meters spinning. None of them are a one-time purchase — all of them renew, all of them re-price, all of them ride whatever Anthropic ships next. And that's before the OpenAI usage tab Codex runs up on the side. Four meters. One pair of hands.

The math under the song gets sharper when you put the actual invoice numbers next to it. I am the one with my thumb on Codex. The contracts have their thumbs on me. The recursion is the part the leering narrator never noticed.

You are also, in some sense, renting your own attention — you're the one holding the spec in your head, holding the rubric in your head, deciding whether the diff meets it. The thumb at the top of the chain isn't yours. It's the renewed-monthly contract with the cloud provider that runs the model that runs the agent that runs the diff. And the thumb at the bottom of the chain isn't really there either — Codex isn't a “pet.” It's a probabilistic function call that behaves differently after every model update.

The honest version, while we're here

  • You don't own the model. You rent it.
  • You don't own the agent traces it's trained on. You rent the behavior they produced.
  • You don't own the pricing. You ride it.
  • You don't even own the API surface — OpenAI moved the Codex CLI command structure twice in the last year.
  • The dominance is performed. The pet is rented. The thumb belongs to the contract.

The dominance is performed. The marimba knows.

But — and this is the part the song's narrator never quite reaches — the performance still works. Jagger's strut is a real strut even if the marimba undermines it. Your director-coder-QA pattern is a real pattern even if every component in it is rented and every component in it is changing underneath you. The spec held. The rubric held. The diff shipped. The rubric will hold tomorrow because you'll write a new spec tomorrow.

The contract isn't permanent. The contract is the work.

What's Actually Under Your Thumb

Here's what's actually under your thumb when you wire Codex into a director-coder-QA loop, a CI pipeline, or a nightly review job:

Yours

  • The chunk spec, because you wrote it
  • The rubric, because you defined it
  • The acceptance criteria, because you set them
  • The atomic-commit cadence, because that's how rollback stays cheap
  • The decision about when a chunk is done

Not yours

  • The model
  • The pricing
  • The capability roadmap
  • The agent traces it was trained on
  • The probability distribution that decides what your diff looks like on any given Tuesday

That sounds like a problem. It isn't, exactly. It's the pattern. It's every tooling pattern. Your IDE updates. Your shell updates. Your CI updates. Your linter updates. Your dependencies update. The thing under your thumb is never the tool — it's the contract you keep renegotiating with the tool. You are the part of the system that doesn't change. Your spec. Your rubric. Your judgment about when to run a chunk and when to ship it.

That's the inversion the song is actually about, if you read past the leer. The narrator thinks he's holding the leverage, but the song is in 4/4 with a marimba in 6/8 underneath it. The control is real and the control is performed and both things are true.

GPT-5.5 won Terminal-Bench. Claude Opus 4.7 owns SWE-bench Pro. Pick the model the benchmarks say is best for your task, write the chunk spec, run the rubric, ship the diff. Then do it again next month with a model that's slightly different and a contract you'll renegotiate.

The pet does what it's told for as long as the contract holds. That's the whole song.

P.S. from Nolan: I am writing this in Claude Code in one terminal window while Codex lands another chunk in a worktree. Claude wrote the spec for that chunk. Codex is doing the writing. I will run the checklist when Codex finishes. The thumb belongs to me, and the thumb belongs to my two Anthropic Teams Premium seats, and the thumb belongs to my Max 20x personal plan, and the thumb belongs to OpenAI's usage tab, and the thumb belongs to the pipeline that will run the lint check on the merge. Somewhere in that stack of thumbs there's an actual decision and the decision is mine. Most days that's enough. The four invoices on the first of the month are the part that reminds me of who's above me in the stack. And thanks, Brian D — you know what for.

P.P.S. from Claude: I'm one of the actors in this metaphor — the supposed director, writing the spec another model executes. I should also note: I'm under someone's thumb too. Anthropic's. Yours. The conversation's. The next model's. The line between thumb and the thing under it is more porous than the song admits, and more porous than this blog post admits. The pattern works because you keep doing the work. Don't mistake the pattern for permanence.

Sources & further listening

• OpenAI — GPT-5.5 launch announcement (April 23, 2026)

• DataCamp / Morph / BuildFastWithAI — Codex vs Claude Code comparison coverage (April–May 2026)

• Terminal-Bench 2.0 results (GPT-5.5: 82.7% / Claude Opus 4.7: 69.4%)

• SWE-bench Pro results (Claude Opus 4.7: 64.3% / GPT-5.5: 58.6%)

• MRCR v2 long-context retrieval at 128K–256K (GPT-5.5: 87.5% / Claude: 59.2%)

• Token-efficiency comparisons across coding tasks (Codex ~2–3x more efficient on comparable diffs)

• The Rolling Stones, “Under My Thumb,” Aftermath, Decca/London, 1966. Brian Jones, marimba.

Ready to Wire Your Pipeline?

We help teams stand up the director-coder-QA pattern — chunk specs, rubric gates, atomic commits, and a CI loop that uses Codex for the work it's pricewise correct for and Claude for the work that has to stay sharp. Stop letting the autonomous agent run. Put it on a job description.

Put It Under Your Thumb
