# Stop Measuring AI by Test-Pass Rate — Pilot to Production

> Green tests prove the AI did what it tried to do, not that it was worth trying. The real metric is whether a senior engineer would merge the change, and this episode covers how to benchmark that.

Canonical: https://thegrowthproject.com/podcast/would-a-human-merge/

*Pilot to Production*, the Growth Project podcast — hosted by Sam and Maya.

- Listen: https://thegrowthproject.com/podcast/would-a-human-merge/
- Read the article: https://thegrowthproject.com/blog/would-a-human-merge/
- Audio: https://thegrowthproject.com/audio/podcast/would-a-human-merge.m4a?v=dbfe9cd2

## Transcript

**Sam:** An AI agent just shipped a change. Fourteen tests, all green. Two hundred lines of clean diff.

**Maya:** A senior engineer looked at it for ninety seconds and closed it. Didn't merge a single line.

**Sam:** Welcome to Pilot to Production, from the Growth Project. I'm Sam.

**Maya:** And I'm Maya. Today: why a green test suite is the most flattering, and most useless, number in AI engineering.

**Sam:** Okay, defend that. Green is green. The tests passed. What does the senior see that the suite doesn't?

**Maya:** That the tests prove the code did what it tried to do. They say nothing about whether it was worth trying. The agent solved the ticket as written, not the actual problem.

**Sam:** So it can be perfectly correct and still completely wrong.

**Maya:** All green, all unmergeable. It added a dependency you'd never accept, or duplicated a function that already exists, or papered over a race condition with a sleep. The suite is delighted. The person who owns the code is not.

**Sam:** How often does that really happen, though? Feels like an edge case.

**Maya:** It isn't. In studies of AI-agent pull requests that passed their tests, roughly half still wouldn't be merged by the maintainers who owned the code. Half.

**Sam:** Half. So test-pass rate is measuring the floor.

**Maya:** The floor. Not the thing that matters. And the thing that matters is one question: would a senior engineer merge this, unedited, into the codebase they're responsible for.

**Sam:** Here's the part that always gets me, because people swear the AI makes them faster.

**Maya:** They feel faster. There's a controlled study from METR, from July 2025, of experienced open-source developers working on their own repos. With the AI tools they felt about twenty percent faster. They were actually around nineteen percent slower.

**Sam:** Wait. Felt twenty up, were nineteen down. They couldn't tell from the inside?

**Maya:** They could not feel the slowdown. Which is exactly why your dashboard won't show it either. The time just disappears into review, rework, and quietly fixing plausible-looking mistakes.

**Sam:** So if I can't trust the green checks, and I can't trust the feeling, what do I actually measure?

**Maya:** You run an honest benchmark. Same agent, same tasks, two ways, and you put four numbers side by side for every task. Tokens, wall-clock speed, accuracy, and mergeability.

**Sam:** And mergeability is the one nobody tracks.

**Maya:** Because it's the only one you can't run in CI. It needs judgement. A setup that wins on tokens and speed but loses on mergeability hasn't won. It's just produced cheaper garbage, faster.
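The four-number comparison Maya describes could be recorded as one small row per task and setup. The sketch below is only an illustration under stated assumptions: `TaskResult`, `summarize`, `winner`, the setup labels, and every sample number are hypothetical, and the rule that mergeability is compared before tokens is one possible reading of "hasn't won", not something the episode specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    setup: str           # hypothetical labels, e.g. "baseline" or "agentic"
    tokens: int
    wall_clock_s: float
    accurate: bool       # did the change pass its tests?
    mergeable: bool      # held-out reviewer's answer to "would you merge this?"


def summarize(results):
    """Average the four numbers over one setup's results."""
    n = len(results)
    return {
        "tokens": sum(r.tokens for r in results) / n,
        "wall_clock_s": sum(r.wall_clock_s for r in results) / n,
        "accuracy": sum(r.accurate for r in results) / n,
        "mergeability": sum(r.mergeable for r in results) / n,
    }


def winner(a, b):
    """Assumed rule: mergeability decides first; tokens only break a tie."""
    sa, sb = summarize(a), summarize(b)
    if sa["mergeability"] != sb["mergeability"]:
        return a[0].setup if sa["mergeability"] > sb["mergeability"] else b[0].setup
    if sa["tokens"] != sb["tokens"]:
        return a[0].setup if sa["tokens"] < sb["tokens"] else b[0].setup
    return "tie"


# Made-up sample data: the agentic setup is cheaper and faster on both tasks,
# but one of its two changes would not be merged.
baseline = [
    TaskResult("task-1", "baseline", 12000, 310.0, True, True),
    TaskResult("task-2", "baseline", 9000, 250.0, True, True),
]
agentic = [
    TaskResult("task-1", "agentic", 5000, 120.0, True, True),
    TaskResult("task-2", "agentic", 4000, 95.0, True, False),
]

print(summarize(agentic)["mergeability"])  # 0.5
print(winner(baseline, agentic))           # baseline
```

With this made-up data the agentic setup wins on tokens and speed and ties on accuracy, yet still loses the comparison because half of its output is unmergeable.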

**Sam:** You also mentioned rigging the test on purpose. Bias guards?

**Maya:** Plant tasks where the simple baseline should win. Problems where the fancy agent should not be able to improve on the obvious approach. If your clever setup wins those too, your benchmark is lying to you. A real result has to be able to come back negative.
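The bias-guard idea can be checked mechanically once each task has a per-task verdict. This is a minimal sketch, assuming a hypothetical `verdicts` mapping from task ID to the name of the setup that won that task; the function name, task IDs, and setup labels are invented for illustration.

```python
def bias_guard_failures(verdicts, guard_task_ids, baseline="baseline"):
    """Return the planted guard tasks that the simple baseline did NOT win.

    verdicts: dict mapping task_id -> name of the setup judged the winner.
    A guard task with no recorded verdict is also reported, since the
    guard cannot vouch for the benchmark if it was never scored.
    """
    return [t for t in guard_task_ids if verdicts.get(t) != baseline]


# Made-up verdicts: the clever setup "won" a task planted for the baseline.
verdicts = {
    "guard-1": "baseline",
    "guard-2": "agentic",
    "guard-3": "baseline",
    "task-7": "agentic",
}
failures = bias_guard_failures(verdicts, ["guard-1", "guard-2", "guard-3"])
print(failures)  # ['guard-2']
```

A non-empty result is the signal to distrust the benchmark before reading any of its other numbers.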

**Sam:** I like that. A benchmark that can't disappoint you isn't a benchmark.

**Maya:** It's a press release. And the last rule: whoever judges mergeability has to be held out. Never the agent that wrote the code. Never a second copy of it carrying the same blind spots.

**Sam:** No marking your own homework.

**Maya:** The moment the maker and the marker are the same, the number is theatre. A separate senior engineer, or at minimum a separate judgement with no stake in the result, answering one question: would you merge this.
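The held-out rule can also be enforced in the bookkeeping, so that a self-graded verdict never reaches the mergeability number. The sketch below assumes a hypothetical `Review` record and a hypothetical `family` mapping that groups an agent with copies of itself; none of these names come from the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    task_id: str
    author: str       # who or what produced the change
    judge: str        # who answered "would you merge this?"
    would_merge: bool


def is_held_out(review, family):
    """True if the judge is neither the author nor in the author's family.

    family: dict mapping a name to a shared group label, used to catch
    "a second copy carrying the same blind spots". Names that are not
    listed are treated as their own one-member family.
    """
    author_group = family.get(review.author, review.author)
    judge_group = family.get(review.judge, review.judge)
    return author_group != judge_group


def merge_rate(reviews, family):
    """Mergeability over held-out reviews only; None if there are none."""
    valid = [r for r in reviews if is_held_out(r, family)]
    if not valid:
        return None
    return sum(r.would_merge for r in valid) / len(valid)


# Made-up example: two copies of one agent, plus a human reviewer.
family = {"agent-a": "model-x", "agent-a-copy": "model-x"}
reviews = [
    Review("task-1", "agent-a", "agent-a", True),        # self-graded: dropped
    Review("task-2", "agent-a", "agent-a-copy", True),   # same family: dropped
    Review("task-3", "agent-a", "senior-eng", False),    # held out: counted
    Review("task-4", "agent-a", "senior-eng", True),     # held out: counted
]
print(merge_rate(reviews, family))  # 0.5
```

Counting all four reviews would report 0.75; counting only the held-out ones reports 0.5.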

**Sam:** Give me the Monday-morning version. I've got agents writing code right now.

**Maya:** Pick ten real tasks off your backlog, not toy problems. Run them two ways. Write down four numbers each. Plant three bias guards where the simple approach should win. And hand mergeability to someone who didn't do the work.

**Sam:** And if all I've got is a green check?

**Maya:** Then you know the floor, and nothing above it. Remember the number: half of green agent pull requests get rejected by the people who own the code. So measure the one that's hard to fake.

**Sam:** Would a senior engineer merge this. That's the one to sit with this week.

**Maya:** The full write-up, with the METR study and the whole benchmark playbook, is on the blog. We'll link it.

**Sam:** This has been Pilot to Production, from the Growth Project. If your agents are green but your seniors won't merge, that's the gap we close, at thegrowthproject.com.

**Maya:** Thanks for listening. See you next time.
