# Our Journey to Automated Code Review (We're Still Figuring It Out) — Pilot to Production

> An honest account of automating code review with AI. What works, what backfired, and the one rule we won't break: no self-clearing.

Canonical: https://thegrowthproject.com/podcast/automating-code-review/

*Pilot to Production*, the Growth Project podcast — hosted by Sam and Maya.

- Listen: https://thegrowthproject.com/podcast/automating-code-review/
- Read the article: https://thegrowthproject.com/blog/automating-code-review/
- Audio: https://thegrowthproject.com/audio/podcast/automating-code-review.m4a?v=6f1a8601

## Transcript

**Sam:** We reverted a feature last week. Not a bug. A whole approach.

**Maya:** We made our AI reviewer a blocking gate. It jammed the pipeline. We turned it back into advice.

**Sam:** Welcome to Pilot to Production, from the Growth Project. I'm Sam.

**Maya:** And I'm Maya. Today: why automated code review isn't a switch you flip, it's a thing you earn one reliable check at a time.

**Sam:** Okay, but most posts say this is solved. You're telling me it isn't?

**Maya:** It isn't. Ours isn't. Anyone telling you otherwise is selling something.

**Sam:** So why automate review at all? What's the actual problem?

**Maya:** Agents write code faster than humans can read it. If every change waits for a human to read line by line, the human is the bottleneck. You've automated the cheap part and left the expensive part untouched.

**Sam:** Right. So where do you start? The clever reviewer?

**Maya:** No. The dumbest one. Before any AI looks at a change, deterministic checks run. Tests. Type checks. Linting. The build.

**Sam:** Deterministic meaning, no opinions.

**Maya:** They pass or they don't. There's no hallucination in a failing test. A change that breaks the build was never going to merge, and you don't need a model to tell you that.

**Sam:** Huh. So a lot of "review" is just that.

**Maya:** A surprising amount. Get the deterministic layer airtight and you've removed most of the noise before judgement even enters.

**Sam:** And then?

**Maya:** Then you tier by risk. A copy tweak is not a change to how money moves. So low-risk changes, fully covered by deterministic checks, merge on their own. Higher-risk ones stop and wait for a human.

**Sam:** Define higher-risk.

**Maya:** Anything that touches money, auth, or data. Anything that changes the rules everything else depends on. Anything hard to roll back. That always waits.

**Sam:** So the tiering is the whole game.

**Maya:** The tiering is the whole game. It lets you grant autonomy where it's earned without pretending you've earned it everywhere.

**Sam:** Now the rule. You've got one line you won't move.

**Maya:** An agent must never approve its own work. Not directly. Not by spinning up its own sub-agent review panel to bless what it just did.

**Sam:** Why is that so tempting to break?

**Maya:** Because the agent that wrote the code already has the context. It's right there, it's fast. And it'll tell you, confidently, that the work is good.

**Sam:** Of course it will. It wrote it.

**Maya:** Exactly. If the same intelligence that made the change signs off on it, you don't have a review. You have a rubber stamp wearing a lab coat.

**Sam:** Okay. Now the honest part. What backfired?

**Maya:** The reviewer hallucinated. Same false finding, over and over, on unrelated changes. Confident, detailed, wrong. A human who cried wolf that often loses all credibility in a week. The model never noticed.

**Sam:** And you made it a gate anyway.

**Maya:** That was the real mistake. We decided its findings were important and made them blocking. So it started objecting to good changes. Good changes stopped merging. The pipeline jammed.

**Sam:** You handed a second opinion a veto it hadn't earned.

**Maya:** So we reverted it. Blocking gate back to advisory. It still runs, still comments, but it can't stop a change on its own. Its findings are input, not law.

**Sam:** And that gave you the principle.

**Maya:** Gate what is reliable. Advise what is not. A check earns the right to block only when a block almost always means something real. A flaky AI reviewer doesn't clear that bar. A red build does.

**Sam:** There's a number that backs the gap, right? Why deterministic alone isn't enough.

**Maya:** In one study of AI-agent pull requests, roughly half of the changes that passed their tests still would not have been merged by human maintainers. Passing the tests is not the same as being mergeable.

**Sam:** So the build catches broken. It doesn't catch wrong.

**Maya:** Technically correct and still wrong. Wrong shape, wrong approach, wrong thing to build. That gap is where human review still lives, and it's the gap we can't yet measure our way across.

**Sam:** So what does someone do first thing tomorrow?

**Maya:** Don't buy the cleverest reviewer. Make the deterministic layer airtight first, tests, types, lint, build. Then tier your changes by risk and write it down. Forbid self-clearing. Demote any reviewer that cries wolf to advisory. And start tracking the false-positive rate, even crudely.

**Sam:** Grant autonomy.

**Maya:** Never assume it. Auto-merge is a privilege a category of change earns by being reliably safe, not a default you flip because the demo looked good.

**Sam:** This has been Pilot to Production, from the Growth Project. If you're hitting the review bottleneck with AI agents, scars and reverts included, that's the conversation we have at thegrowthproject.com.

**Maya:** Thanks for listening. See you next time.
