We Were the Bottleneck: What 3 Days of Data Taught Us About Our Own Automation

We built an engineering pipeline that was supposed to run itself.

It worked. It also felt slow.

So before we built more of it, we did something we tell every client to do and rarely do ourselves. We measured it.

The Feeling That Lied

The pipeline was automated end to end. On paper, work flowed through without us. In practice, it felt like wading.

That feeling matters, because the feeling is what gets acted on. Something feels slow, so you queue work to speed it up. We had the queue ready: fifteen tasks of hardening, the kind of list that looks responsible and busy at the same time.

But feelings about your own throughput are unreliable. METR ran a controlled study where experienced developers used AI tools on real tasks. They felt about 20% faster. They were actually about 19% slower. The gap between the feeling and the measurement was nearly 40 percentage points.

We didn’t trust the queue. We measured first.

What We Actually Measured

No new instrumentation. No dashboards. We used what was already lying around: three days of session transcripts and the full git history for the same window. Every CI run, every retry, every rebase, timestamped.

The point wasn’t to count what we did. It was to count what we redid. Effort that produced nothing. Motion mistaken for progress.
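A minimal sketch of that kind of count, assuming you can export one change identifier per CI run from whatever run records you already have. The function name, the input shape, and the sample data are illustrative assumptions, not our actual tooling or numbers.

```python
from collections import Counter


def rerun_stats(change_ids):
    """Summarise re-runs, given one change identifier per CI run.

    Assumption: every run after a change's first run counts as a re-run.
    """
    runs_per_change = Counter(change_ids)
    total = sum(runs_per_change.values())
    reruns = sum(count - 1 for count in runs_per_change.values())
    worst_change, worst_runs = max(
        runs_per_change.items(), key=lambda item: item[1], default=(None, 0)
    )
    return {
        "total_runs": total,
        "reruns": reruns,
        "rerun_share": reruns / total if total else 0.0,
        "worst_change": worst_change,
        "worst_runs": worst_runs,
    }


# Made-up sample for illustration only: three changes, twelve CI runs.
sample = ["change-a"] * 9 + ["change-b"] + ["change-c"] * 2
stats = rerun_stats(sample)
print(stats)
```

The useful output is the share, not the raw count: it tells you how much of your CI spend went to work you had already done once.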

Three days is a small window. That’s the point. You don’t need a quarter of telemetry to see a structural problem. You need one honest slice and the willingness to read it. The numbers came back fast. They weren’t flattering.

The Numbers Didn’t Match the Story

Here’s the gap between what we believed and what the data showed.

What We Believed	What 3 Days Showed
CI was thorough	60% of CI runs were wasted re-runs
Each change ran the suite once or twice	One change ran the full suite nine times
Friction was spread across many causes	~70% of it came from rebase and worktree churn
The pipeline was autonomous	A human was silently doing the merge it couldn’t
We needed a 15-task hardening plan	We needed two moves

Sixty percent of CI runs were re-runs. Not new work. The same change going through the same suite again because something upstream had shifted underneath it.

One change ran the full CI suite nine times before it landed. Nine. Eight of those runs produced nothing but a green tick on a state that no longer existed by the time the run finished.

And about 70% of the total friction traced back to a single mechanical cause: rebase churn and the disk-and-worktree shuffle around it. Not flaky tests. Not slow builds. Not the model. Mechanical thrash, compounding.

The Bottleneck Wasn’t Where We Pointed

We assumed the slow part was the work.

The slow part was the re-work.

A change would be ready. Before it could land, the base it sat on had moved. So it rebased — replayed its edits on top of the new, shifted starting point. Rebasing meant re-running CI, because the change was now technically different. While that ran, the base moved again. Rebase again. Run again.

Every loop was individually reasonable. Together they were a treadmill. The change was sprinting and the ground kept sliding back.

This is the trap with pipelines. Each step looks correct in isolation. The cost lives in the interaction between steps, and you only see interaction when you measure the whole thing running, not the parts in theory. It’s the same lesson late-night production fixes teach you: the architecture diagram shows clean boxes, but reality lives in the hidden dependencies between them.
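The treadmill can be sketched as a toy model. This is a simplification under stated assumptions, not a description of our pipeline: CI takes a fixed number of minutes, the base branch moves at known times, and any move during a run forces a rebase and a fresh run. The function name and the timings are hypothetical.

```python
def runs_until_landed(ci_minutes, base_moves, start=0.0):
    """Count CI runs needed before a change lands.

    Toy model: a run that starts at time t finishes at t + ci_minutes.
    If the base moved at any point in (t, finish], the result is stale,
    so the change rebases and starts a new run at `finish`.
    """
    moves = sorted(base_moves)
    t = start
    runs = 0
    while True:
        runs += 1
        finish = t + ci_minutes
        if not any(t < move <= finish for move in moves):
            return runs  # base stayed still for a whole run: the change lands
        t = finish  # rebase and go around again


# Illustrative timings in minutes, not measured data.
churning = runs_until_landed(ci_minutes=10, base_moves=[5, 14, 22, 38])
quiet = runs_until_landed(ci_minutes=10, base_moves=[])
print(churning, quiet)  # 5 1
```

No single step in the model is slow. The run count comes entirely from how the base moves relative to the length of a run, which is the interaction a per-step view never shows.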

We had been pointing at the work. The data pointed at the gaps between it.

The Human Was a Symptom, Not a Step

This was the finding that changed how we think.

There was a human in the loop. The founder. Stepping in often enough that it felt like part of the process. We’d half-rationalised it as a sensible approval gate. A human checking things before they ship. Prudent.

The gate wasn’t designed. We had already built automatic merging, meant to land finished work with no one touching it. The data showed it was firing far less than it should have, because the churn kept invalidating the conditions it needed to fire. The treadmill moved the ground so often the automation could never find stable footing to act.

So a human kept catching the work and pushing it through by hand. Not as a designed safeguard. As a fallback that a hidden failure kept triggering.

That reframes the whole thing. The human wasn’t the bottleneck. The human was the alarm for the real bottleneck, and we’d been reading the alarm as a feature. Every manual merge was the automation silently failing and a person quietly covering for it.

This is the quiet danger of automation you can’t observe. It doesn’t fail loudly. It fails into a human who absorbs the failure, keeps the lights on, and makes the system look like it’s working. The fallback masked the failure. The masking masked the cause.

Two Moves, Not Fifteen

The fifteen-task plan was built for the problem we imagined. The data described a different, smaller one. Once you see that 70% of the friction is one mechanical cause and the human is a symptom of one silent failure, the fix narrows hard.

Move one: stop the churn at its source. Kill the treadmill. If the ground stops sliding while a change is in flight, the re-runs collapse and most of that 60% of wasted CI disappears with them. You don’t optimise the loop. You remove the reason the loop exists. We dug into the specific tooling that does this in a separate piece — and, just as honestly, what it doesn’t fix.

Move two: make the automation observable, then let it do its job. The auto-merge was already built. It didn’t need replacing. It needed the churn removed so it could find stable footing, and a metric on it so we’d know whether it was firing. The moment you can see the automation succeed or fail, the human stops being the silent fallback.
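One way to put a metric on auto-merge, sketched under assumptions: each landed change is labelled "auto" or "manual" depending on how it merged, and the 90% target rate is a placeholder, not a recommendation. The function and field names are hypothetical.

```python
from collections import Counter


def automerge_report(merge_methods, min_auto_rate=0.9):
    """Summarise how often landed changes merged without a human.

    merge_methods: one label per landed change, "auto" or "manual".
    min_auto_rate: hypothetical target; pick your own from your data.
    """
    counts = Counter(merge_methods)
    total = counts["auto"] + counts["manual"]
    auto_rate = counts["auto"] / total if total else 0.0
    return {
        "auto": counts["auto"],
        "manual": counts["manual"],
        "auto_rate": auto_rate,
        # Alert when a human is landing more work than the target allows.
        "alert": total > 0 and auto_rate < min_auto_rate,
    }


# Illustrative labels, not measured data.
report = automerge_report(["auto", "manual", "manual", "auto", "manual"])
print(report)
```

The point of the alert is to turn each manual merge from a quiet rescue into a visible signal that the automation did not fire.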

Two moves. The other thirteen tasks weren’t wrong. They were solving for a system we’d imagined instead of the one the data described. Most of them quietly stopped mattering. That’s the dividend of measuring first: not just a faster pipeline, a much shorter list.

First Thing Tomorrow

You have a process that feels slow. Don’t build the fix yet. Measure the slow.

Count the re-work, not the work. Pull a few days of history from whatever you already have: logs, run records, version control. Count the actions that produced nothing. Retries, re-runs, re-dos. That number is your real target.
Find your treadmill. Look for the loop where something finishes, the ground shifts, and it has to start again. That interaction between steps is where the cost hides, not in any single step.
Check whether a human is a step or a symptom. For every manual intervention, ask: was this designed, or is a person silently covering for something meant to be automatic? If the latter, the human is your alarm.
Instrument the automation you already trust. Put one metric on every automated step: did it do its job, or did something quietly catch the work it dropped?
Re-cost your roadmap against the data. Hold your queued plan up to what you just measured. Cross off everything that solves a problem the data didn’t show. You’ll cross off more than you expect.
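The last step can be sketched as a simple filter. Everything here is an assumption for illustration: the task names, the cause labels, the 10% cut-off, and every friction share except the roughly 70% for rebase churn reported above.

```python
def recost(tasks, friction_share, min_share=0.10):
    """Split queued tasks by whether the measured data supports them.

    tasks: (task name, cause it addresses) pairs.
    friction_share: measured share of friction per cause, 0.0 to 1.0.
    min_share: hypothetical cut-off below which a cause is not worth a task.
    """
    keep, drop = [], []
    for name, cause in tasks:
        bucket = keep if friction_share.get(cause, 0.0) >= min_share else drop
        bucket.append(name)
    return keep, drop


# Hypothetical roadmap and shares; only the 0.70 comes from the article.
friction_share = {"rebase_churn": 0.70, "flaky_tests": 0.05, "slow_builds": 0.05}
tasks = [
    ("stop rebase churn at its source", "rebase_churn"),
    ("quarantine flaky tests", "flaky_tests"),
    ("add a build cache", "slow_builds"),
]
keep, drop = recost(tasks, friction_share)
print("keep:", keep)
print("drop:", drop)
```

A task whose cause never appears in the measurements defaults to a share of zero, so it lands in the drop list until the data says otherwise.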

The Bottom Line

We were going to build fifteen things to fix a pipeline. Three days of our own data said the bottleneck was one mechanical cause wearing a costume, and that the human we’d mistaken for a safeguard was an alarm we’d learned to ignore.

The feeling pointed one way. The measurement pointed another. The measurement was right. It usually is.

The bottleneck is rarely where you point. And the automation you can’t observe is the automation quietly failing into someone who hasn’t told you yet. (Building the pipeline itself? See our guide to platform engineering.)
