Stop Measuring AI by Test-Pass Rate

The agent’s run went green. 14 tests, all passing. A 200-line diff.

A senior engineer looked at it for ninety seconds and closed it without merging.

The tests were right. The work was wrong.

Green Doesn’t Mean Good

Test-pass rate is the metric everyone reaches for because it’s easiest to count. It’s a number. It goes up. It fits on a slide.

But it answers the wrong question.

A passing suite tells you the code does what the code attempted to do. It says nothing about whether the attempt was sane. The agent can write a passing test for the wrong behaviour. It can solve the literal ticket while missing the actual problem. It can add a dependency you’d never accept, duplicate a function that already exists, or paper over a race with a sleep.

All green. All unmergeable.

This isn’t hypothetical. In studies of AI-agent pull requests that passed their tests, roughly half still would not be merged by the maintainers who owned the code. Functionally correct against the suite. A human who owned the code said no.

Half. Test-pass rate measured the floor. Mergeability measured the thing that matters.

The Metric That Counts

Here’s the question that actually predicts value:

Would a senior engineer merge this, unedited, into the codebase they’re responsible for?

Not “does it pass.” Not “is it close.” Would they take it as-is and own it.

That bar catches everything test-pass rate misses:

Does it solve the real problem, or just the ticket as written?
Does it fit existing patterns, or invent a new one nobody asked for?
Is it the right size, or did it touch fifty files when five would do?
Would the author be embarrassed to put their name on it?

This is the difference between code that runs and code worth keeping. It’s the same line we draw between AI that produces demos and AI that produces production work: vibe coding versus vibe engineering. It’s also the line between a pilot and production-ready AI, and the discipline at the heart of shipping production code with AI. Test-pass rate is happy to grade a demo. Mergeability won’t.

The catch: mergeability is harder to measure. You can’t run it in CI. It needs judgement, and judgement is exactly the thing people automate away when they’re chasing a number.

So measure it properly.

Speed That Feels Fast Isn’t Speed

Before you trust any agent’s output, separate the feeling of fast from actual fast.

METR ran a controlled study of experienced open-source developers using AI tools on their own repositories (the July 2025 study). With the tools, they felt about 20% faster. Measured against the clock, they were roughly 19% slower than without them. Perception and reality pointed in opposite directions.

Read that twice. The people doing the work could not tell, from the inside, that the tool was costing them time.

If your senior engineers can’t feel the slowdown, your dashboards certainly won’t show it. A test-pass-rate chart trends up while real throughput trends down, because the time disappears into review, rework, and the quiet tax of correcting plausible-looking mistakes. The pilot that feels productive but ships nothing is the same illusion, one level up. This one lives inside the individual coding task.

The only way out is measurement that doesn’t rely on how the work felt.

How to Benchmark Honestly

You don’t need a research lab. You need discipline and a fair fight. An honest AI-engineering benchmark has three properties. Skip any one and you’re measuring your own hopes.

1. Same agent, same tasks, two ways. Run identical tasks through two configurations head to head. Hold everything constant except the variable you’re testing. Then put four numbers next to each other for every task:

What You Measure	What It Tells You
Tokens consumed	What the answer cost (tokens are the chunks of text the model bills you for)
Wall-clock speed	Whether it’s actually faster, not just whether it feels faster
Accuracy	Whether it passed the suite: the floor, not the ceiling
Mergeability	Would a senior engineer take it as-is

A configuration that wins on tokens and speed but loses on mergeability hasn’t won. It’s produced cheaper garbage faster. Read the four together, or you’ll optimise the easy three and quietly destroy the fourth.
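A minimal sketch of reading the four numbers together, assuming one record per configuration per task. The names `RunResult`, `clears_gate`, and `verdict` are hypothetical, and sending ties to the baseline is an assumption of this sketch rather than a rule from the benchmark described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One configuration's result on one task. Field names are illustrative."""
    tokens: int          # tokens consumed
    wall_clock_s: float  # wall-clock time in seconds
    passed_suite: bool   # accuracy: the floor
    mergeable: bool      # held-out reviewer's yes/no


def clears_gate(run: RunResult) -> bool:
    """A run counts only if it passes the suite AND a reviewer would merge it."""
    return run.passed_suite and run.mergeable


def verdict(baseline: RunResult, candidate: RunResult) -> str:
    """Compare two runs of the same task, reading all four numbers together."""
    base_ok, cand_ok = clears_gate(baseline), clears_gate(candidate)
    if not base_ok and not cand_ok:
        return "no winner"
    if base_ok != cand_ok:
        return "candidate" if cand_ok else "baseline"
    # Both cleared the gate, so cost and speed are allowed to decide.
    cheaper = candidate.tokens < baseline.tokens
    faster = candidate.wall_clock_s < baseline.wall_clock_s
    if cheaper and faster:
        return "candidate"
    if not cheaper and not faster:
        return "baseline"  # assumption: ties go to the baseline
    return "trade-off"


baseline = RunResult(tokens=12_000, wall_clock_s=300.0, passed_suite=True, mergeable=True)
candidate = RunResult(tokens=8_000, wall_clock_s=200.0, passed_suite=True, mergeable=False)
print(verdict(baseline, candidate))  # baseline
```

The candidate in the example is cheaper and faster and still loses, because tokens and speed are only consulted once both runs have cleared the gate.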

2. Plant tasks where the baseline should win. This is the part most teams skip, and the part that makes the suite trustworthy.

Include deliberate bias-guard tasks: problems where the simplest approach is the correct one, and the fancier agent should not improve on it. If your clever configuration “wins” on those too, your benchmark is rigged. A real result has to be able to come back negative. A suite where your preferred approach always wins isn’t a benchmark. It’s a press release. Bias guards are how you prove the suite can tell you something you didn’t want to hear.
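One way to make the bias-guard check mechanical, sketched under the assumption that each task is tagged in advance and the suite records a winner per task. `TaskOutcome`, `suspect_guards`, and the task names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    task_id: str
    is_bias_guard: bool  # True when the simple baseline is expected to win
    winner: str          # "baseline", "candidate", or "tie", as scored by the suite


def suspect_guards(outcomes: list[TaskOutcome]) -> list[str]:
    """Bias-guard tasks where the candidate was scored as the winner anyway."""
    return [o.task_id for o in outcomes if o.is_bias_guard and o.winner == "candidate"]


outcomes = [
    TaskOutcome("rename-config-key", is_bias_guard=True, winner="baseline"),
    TaskOutcome("fix-off-by-one", is_bias_guard=True, winner="candidate"),
    TaskOutcome("refactor-auth-flow", is_bias_guard=False, winner="candidate"),
]
print(suspect_guards(outcomes))  # ['fix-off-by-one']
```

A non-empty list is a reason to inspect the scoring before trusting any other row in the suite.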

3. Judge mergeability with a held-out reviewer. Never let the agent that did the work grade its own homework. An AI marking its own pull request will pass itself. So will a second instance of the same agent, primed with the same context, the same assumptions, and the same blind spots.

The reviewer must be held out: a separate senior engineer, or at minimum a separate judgement with no hand in producing the work and no stake in the result. The moment the maker and the marker are the same, the mergeability number is theatre. It’s the same separation that makes automated code review worth trusting: the thing checking the work cannot be the thing that produced it.
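The separation can be enforced in the scoring step itself. The sketch below assumes producers and reviewers are identified by a name, such as an agent configuration or a person, so that a second instance of the same agent shares its name. `Review` and `held_out_only` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    task_id: str
    producer: str     # configuration or person that wrote the diff
    reviewer: str     # configuration or person that judged it
    would_merge: bool


def held_out_only(reviews: list[Review]) -> list[Review]:
    """Keep verdicts only from reviewers that produced none of the work under test."""
    producers = {r.producer for r in reviews}
    return [r for r in reviews if r.reviewer not in producers]


reviews = [
    Review("task-1", producer="agent-a", reviewer="agent-a", would_merge=True),
    Review("task-1", producer="agent-a", reviewer="senior-eng-1", would_merge=False),
    Review("task-2", producer="agent-b", reviewer="agent-a", would_merge=True),
]
kept = held_out_only(reviews)
print([(r.task_id, r.reviewer, r.would_merge) for r in kept])
# [('task-1', 'senior-eng-1', False)]
```

The filter drops a verdict from any participant that produced work anywhere in the run, not just on the task being judged, which also rules out one competing configuration grading the other.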

What This Looks Like in Practice

We benchmark our own engineering tooling this way. Same tasks, two ways, four numbers per task, bias guards in the set, and a held-out reviewer answering one question: would you merge this.

What it surfaced was uncomfortable and useful. Plenty of runs that were all-green and all-fast were also all-unmergeable. Lower token count. Shorter wall clock. And a senior engineer wouldn’t touch the output. Graded on test-pass rate, we’d have shipped a regression and called it a win.

It also stopped us believing our own demos. The bias-guard tasks failed our preferred approach often enough to keep us honest. That sting is the signal the benchmark is working.

First Thing Tomorrow

Stop counting green checks. Start measuring whether the work survives review.

1. Pick ten real tasks from your backlog. Not toy problems. Actual tickets your team would otherwise pick up. The benchmark is only as honest as the work it’s built from.
2. Run them two ways and write down four numbers each. Tokens, speed, accuracy, mergeability. One row per task. If you only track pass-rate, you’re tracking one of four.
3. Plant three bias-guard tasks where the simple baseline should win. If your fancy setup beats the baseline on those, your suite is lying to you. Fix the suite first.
4. Hand mergeability to someone who didn’t do the work. A senior engineer, held out from the run, answering yes or no on each diff. Never the agent. Never the agent’s twin.
5. Read the four numbers together, never alone. Cheaper and faster mean nothing if a senior engineer won’t merge it. Mergeability is the gate. The rest is detail.
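Once the rows exist, a few lines are enough to put pass rate and merge rate side by side for one configuration. This is a sketch under assumptions: `Row`, `summarize`, and the sample data are hypothetical, and the merge rate here counts only runs that also passed the suite.

```python
from typing import NamedTuple


class Row(NamedTuple):
    task_id: str
    passed_suite: bool  # accuracy
    mergeable: bool     # held-out reviewer's yes/no


def summarize(rows: list[Row]) -> dict:
    """Report pass rate next to merge rate, plus the tasks that hide in the gap."""
    if not rows:
        raise ValueError("need at least one task")
    n = len(rows)
    return {
        "pass_rate": sum(r.passed_suite for r in rows) / n,
        "merge_rate": sum(r.passed_suite and r.mergeable for r in rows) / n,
        "green_but_unmergeable": [
            r.task_id for r in rows if r.passed_suite and not r.mergeable
        ],
    }


rows = [
    Row("t1", passed_suite=True, mergeable=True),
    Row("t2", passed_suite=True, mergeable=False),
    Row("t3", passed_suite=True, mergeable=False),
    Row("t4", passed_suite=False, mergeable=False),
]
print(summarize(rows))
# {'pass_rate': 0.75, 'merge_rate': 0.25, 'green_but_unmergeable': ['t2', 't3']}
```

The gap between the two rates is the work that would have shipped under a pass-rate metric and been rejected in review.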

The Bottom Line

Test-pass rate can’t tell you whether the attempt was worth making.

Roughly half of test-passing agent PRs get rejected by the people who own the code. Developers using AI tools ran roughly 19% slower while feeling about 20% faster. The numbers that feel like progress and the numbers that are progress are not the same.

So measure the one that’s hard to fake. Would a senior engineer merge this?

If the answer is no, it doesn’t matter how green the tests are.
