Our Journey to Automated Code Review (We're Still Figuring It Out)

We reverted a feature last week.

Not a bug. A whole approach. We’d made our AI reviewer a blocking gate, it jammed the pipeline, and we turned it back into advice.

Most posts about automated code review pretend it’s solved. It isn’t. Ours isn’t. That’s the point of this one.

What We Were Actually Trying To Do

We build software with AI agents. Not toy demos. Real changes that ship to production and run a business.

That creates an obvious bottleneck. Agents write code faster than humans can review it. If every change waits for a human to read it line by line, the human is the constraint: you’ve automated the cheap part and left the expensive part untouched.

So the goal was never “AI writes everything.” It was a pipeline where the machine handles what the machine handles well, and a human spends attention only where it’s actually scarce.

Easy to say. Then Monday happens.

The Thing That Works: Deterministic First, Then Tiers

The first lesson was the least glamorous. The most reliable reviewer isn’t the cleverest. It’s the dumbest.

Before any AI looks at a change, deterministic checks run. Tests. Type checks. Linting. The build. These are not opinions. They pass or they don’t. There’s no hallucination in a failing test. (Getting that pipeline solid is its own discipline: platform engineering.)

A surprising amount of “review” is just this. A change that breaks the build — the automated step that assembles the code into something that actually runs — was never going to merge, and you don’t need a model to tell you that. Get the deterministic layer airtight and you’ve removed most of the noise before judgement enters.
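To make the "pass or they don’t" idea concrete, here is a minimal sketch of a deterministic gate. It assumes each check can be wrapped as a function that returns True or False. The names run_deterministic_gate and gate_is_green, and the lambda stand-ins for real test, type, lint, and build commands, are hypothetical and not taken from our actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_deterministic_gate(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check and record a plain pass/fail for each one."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            # A check that crashes counts as a failure, never as a pass.
            passed = False
        results.append(CheckResult(name, passed))
    return results


def gate_is_green(results: List[CheckResult]) -> bool:
    """Green only if at least one check ran and every check passed."""
    return bool(results) and all(r.passed for r in results)


# Hypothetical stand-ins: real checks would invoke the test runner,
# type checker, linter, and build tool.
example_checks = {
    "tests": lambda: True,
    "types": lambda: True,
    "lint": lambda: True,
    "build": lambda: False,  # a broken build blocks the change
}
example_results = run_deterministic_gate(example_checks)
failed_checks = [r.name for r in example_results if not r.passed]
```

Note the two defaults in this sketch: a crashing check fails, and an empty set of checks is not green. Both are design choices that lean towards blocking when the gate cannot actually vouch for the change.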

Then we tier by risk. A copy tweak is not a change to how money moves. So changes get sorted: low-risk ones, fully covered by deterministic checks, merge on their own. Higher-risk ones stop and wait for a human.

What We Auto-Merge	What Waits For A Human
Fully covered by tests	Touches money, auth, or data
Low blast radius	Changes the rules everything depends on
Deterministic checks green	Anything the checks can’t fully judge
Reversible in one step	Hard or slow to roll back

The tiering is the whole game. It lets you grant autonomy where it’s earned without pretending you’ve earned it everywhere — the same compounding logic we cover in our post on compounding engineering.
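The table above can be sketched as a small classifier. This is an illustration under assumptions, not our production rules: the Change fields and the classify function are hypothetical names, and the sketch assumes each trait has already been determined as a simple yes or no.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_MERGE = "auto_merge"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class Change:
    fully_tested: bool
    low_blast_radius: bool
    checks_green: bool
    one_step_rollback: bool
    touches_sensitive_area: bool  # money, auth, or data
    changes_shared_rules: bool    # rules everything else depends on


def classify(change: Change) -> Tier:
    # Any single high-risk trait sends the change to a human.
    if change.touches_sensitive_area or change.changes_shared_rules:
        return Tier.HUMAN_REVIEW
    # Auto-merge has to be earned on every low-risk criterion at once.
    earned = (
        change.fully_tested
        and change.low_blast_radius
        and change.checks_green
        and change.one_step_rollback
    )
    return Tier.AUTO_MERGE if earned else Tier.HUMAN_REVIEW


# Illustrative, made-up changes.
copy_tweak = Change(True, True, True, True, False, False)
payment_change = Change(True, True, True, True, True, False)
```

Here classify(copy_tweak) lands in the auto-merge tier and classify(payment_change) waits for a human, even though its checks are green. Anything that misses a single criterion falls through to human review, which is the safer default.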

The Rule We Won’t Break: No Self-Clearing

Here’s the line we drew early and have never moved.

An agent must never approve its own work.

Not directly. Not indirectly. Not by spinning up its own sub-agent “review panel” to bless what it just did. If the same intelligence that wrote the change also signs off on it, you don’t have a review. You have a rubber stamp wearing a lab coat.

This sounds obvious until you see how tempting it is to break. It’s so easy to have the agent that wrote the code also check the code. It’s right there, it has the context, it’s fast. And it will tell you, confidently, that the work is good.

Of course it will. It wrote it.

No-self-clearing is structural, not a courtesy. The thing that clears a change must be independent of the thing that made it. Deterministic checks qualify: they have no ego. A separate reviewer qualifies. A human qualifies. The author, human or machine, never qualifies to clear its own gate.
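One way to enforce this structurally is to track who spawned whom. The sketch below assumes every actor, human or agent, carries an identifier and an optional link to the actor that spawned it. Actor, root_of, and may_clear are hypothetical names. The sketch only checks lineage, so deciding whether two separately launched agents count as independent is left outside it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Actor:
    actor_id: str
    spawned_by: Optional["Actor"] = None  # set for sub-agents


def root_of(actor: Actor) -> Actor:
    """Walk up the spawn chain to whoever originally started the work."""
    while actor.spawned_by is not None:
        actor = actor.spawned_by
    return actor


def may_clear(author: Actor, reviewer: Actor) -> bool:
    """A reviewer may clear a change only if it shares no lineage with the author."""
    return root_of(author).actor_id != root_of(reviewer).actor_id


# Illustrative actors.
author_agent = Actor("agent-a")
review_panel = Actor("panel-1", spawned_by=author_agent)  # the self-made "review panel"
human = Actor("human-1")
```

With these actors, may_clear(author_agent, author_agent) and may_clear(author_agent, review_panel) are both False, while may_clear(author_agent, human) is True. The sub-agent panel is rejected because its lineage leads straight back to the author.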

What Backfired (The Honest Part)

Now the part most write-ups skip.

The reviewer hallucinated. We had an AI reviewer that kept flagging the same problem that wasn’t a problem. Same false finding, over and over, on unrelated changes. Confident. Detailed. Wrong. A human who cried wolf that often would lose all credibility in a week. The model never noticed it was crying wolf.

We promoted it to a hard gate anyway. This was the real mistake. We decided its findings were important and made them blocking. If the reviewer objected, nothing merged.

You can guess what happened. The unreliable reviewer started objecting to good changes. Good changes stopped merging. The pipeline jammed. We’d handed a tool that was occasionally useful as a second opinion a veto it hadn’t earned.

So we reverted it. From a blocking gate back to advisory. The reviewer still runs, still comments. But it can’t stop a change on its own anymore. Its findings are input, not law.

That reversal felt like a defeat for a day. Then it felt like the most important thing we learned all quarter.

The Rule We Landed On

Gate what is reliable. Advise what is not.

That’s the whole principle, and it cost us a jammed pipeline to learn it.

A check earns the right to block only when a block almost always means something real. Deterministic checks clear that bar once you’ve stamped out the flaky tests: after that, a red build almost always means something real, so the deterministic layer blocks. A flaky AI reviewer does not clear it. So the AI reviewer advises: its findings show up, and a human decides.
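In code, the principle comes down to deciding per source whether a finding may block. A minimal sketch, assuming findings arrive tagged with the tool that produced them. BLOCKING_SOURCES, Finding, Decision, and decide are hypothetical names, and the example findings are made up.

```python
from dataclasses import dataclass
from typing import List

# Assumption: only these sources have earned the right to block.
BLOCKING_SOURCES = frozenset({"tests", "types", "lint", "build"})


@dataclass(frozen=True)
class Finding:
    source: str
    message: str


@dataclass(frozen=True)
class Decision:
    blocked: bool
    blocking: List[Finding]
    advisory: List[Finding]


def decide(findings: List[Finding]) -> Decision:
    """Split findings into ones that block and ones that only advise."""
    blocking = [f for f in findings if f.source in BLOCKING_SOURCES]
    advisory = [f for f in findings if f.source not in BLOCKING_SOURCES]
    return Decision(blocked=bool(blocking), blocking=blocking, advisory=advisory)


# Made-up findings for illustration.
ai_only = decide([Finding("ai_reviewer", "possible race condition")])
with_red_build = decide([
    Finding("ai_reviewer", "possible race condition"),
    Finding("build", "compilation failed"),
])
```

The AI reviewer’s finding is never dropped in this sketch. It is routed to the advisory list where a human can read it, and it simply has no veto.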

The corollary matters just as much: grant autonomy, never assume it. Auto-merge is a privilege a category of change earns by being reliably safe, not a default you flip because the demo looked good. Every time you let the machine clear a change unattended, you should be able to say exactly why that category is safe. If you can’t, it isn’t autonomous. It’s just unwatched. It’s the same way production-ready AI earns its autonomy: in shadow mode first, granted only once it’s proven.
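One way to keep a grant honest is to make the reason part of the grant itself. The sketch below assumes autonomy is tracked per category of change. AutonomyRegistry and the category names are hypothetical. Anything not explicitly granted is treated as not autonomous.

```python
from typing import Dict, Optional


class AutonomyRegistry:
    """Auto-merge is off by default; a category earns it with a stated reason."""

    def __init__(self) -> None:
        self._grants: Dict[str, str] = {}

    def grant(self, category: str, reason: str) -> None:
        # Refuse a grant nobody can justify in words.
        if not reason.strip():
            raise ValueError("a grant needs a written reason")
        self._grants[category] = reason.strip()

    def revoke(self, category: str) -> None:
        self._grants.pop(category, None)

    def is_autonomous(self, category: str) -> bool:
        return category in self._grants

    def reason_for(self, category: str) -> Optional[str]:
        return self._grants.get(category)


# Illustrative categories, not a real policy.
registry = AutonomyRegistry()
registry.grant("copy-tweak", "fully covered by tests, reversible in one step")
```

After this, registry.is_autonomous("copy-tweak") is True and registry.is_autonomous("payments") is False, because nobody has made the case for payments yet. Revoking is one call, which matters when a category turns out to be less safe than it looked.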

You don’t treat something as trustworthy until reality proves it is.

The Open Question We Haven’t Solved

How much should you trust an AI reviewer? And how do you measure that trust instead of guessing at it?

We don’t have a clean answer. Right now it’s judgement: watch the false-positive rate, keep it advisory until it proves otherwise. That’s not a metric. That’s a feel.

And the wider evidence says feel isn’t enough. In one study of AI-agent pull requests, roughly half of the changes that passed their tests still would not have been merged by human maintainers. Passing the tests is not the same as being mergeable — the case we make in our post on why test-pass rate is the wrong metric. The deterministic layer catches the change that’s broken. It does not catch the change that’s technically correct and still wrong — wrong shape, wrong approach, wrong thing to build.

That gap is where human review still lives. And it’s the gap we can’t yet measure our way across.

So we’re not done. We have a pipeline that ships real work safely, and an honest list of what it can’t yet judge.

First Thing Tomorrow

Don’t start by buying the cleverest reviewer. Start by being honest about reliability.

Make the deterministic layer airtight first. Tests, types, lint, build. Most of “review” is here. Fix this before you add a single model.
Tier your changes by risk. Decide, explicitly, which categories are safe to merge unattended and which always wait for a human. Write it down.
Forbid self-clearing. The thing that made a change can never be the thing that clears it. Check your pipeline for this today — it sneaks in.
Demote unreliable reviewers to advisory. If a reviewer cries wolf, take away its veto. Let it comment, not block.
Track the false-positive rate. You can’t decide what to trust if you’re not measuring how often it’s wrong. Start counting, even crudely.
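Counting crudely can be as small as this. The sketch assumes a human labels each reviewer finding as real or not after looking at it. FalsePositiveTracker is a hypothetical name. The rate it reports is the share of the reviewer’s findings that turned out to be wrong, which is a loose reading of "false-positive rate", and the numbers in the example are made up.

```python
from collections import Counter
from typing import Optional


class FalsePositiveTracker:
    """Tally human verdicts on a reviewer's findings."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, finding_was_real: bool) -> None:
        self._counts["real" if finding_was_real else "false"] += 1

    def total(self) -> int:
        return self._counts["real"] + self._counts["false"]

    def rate(self) -> Optional[float]:
        """Share of findings judged wrong, or None if nothing is recorded yet."""
        if self.total() == 0:
            return None
        return self._counts["false"] / self.total()


# Made-up verdicts: three false alarms and one real finding.
tracker = FalsePositiveTracker()
for verdict in (False, False, False, True):
    tracker.record(verdict)
```

With those four verdicts, tracker.rate() is 0.75. Returning None for an empty tracker is deliberate: a reviewer with no recorded verdicts has not proven anything either way.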

The Bottom Line

Automated code review isn’t a switch you flip. It’s a thing you earn, one reliable check at a time.

Gate what’s reliable. Advise what isn’t. Never let the author clear its own work. Grant autonomy where it’s been proven, nowhere else.

We’re still figuring out the rest. Anyone who says they’ve finished is the person you should review most carefully.

Building with AI agents and hitting the review bottleneck? We’ve shipped real work through it, scars and reverts included. Let’s talk.