# We Were the Bottleneck: What 3 Days of Data Taught Us About Our Own Automation — Pilot to Production

> We built an autonomous engineering pipeline. It felt slow, so we measured three days of our own usage instead of building more. The bottleneck surprised us.

Canonical: https://thegrowthproject.com/podcast/we-were-the-bottleneck/

*Pilot to Production*, the Growth Project podcast — hosted by Sam and Maya.

- Listen: https://thegrowthproject.com/podcast/we-were-the-bottleneck/
- Read the article: https://thegrowthproject.com/blog/we-were-the-bottleneck/
- Audio: https://thegrowthproject.com/audio/podcast/we-were-the-bottleneck.m4a?v=a73d5951

## Transcript

**Sam:** One change ran the full test suite nine times before it landed. Nine. Eight of those runs produced a green tick on a state that no longer existed by the time it finished.

**Maya:** And the human everyone thought was the safety check? He wasn't a safeguard. He was an alarm. Nobody had told us it was going off.

**Sam:** Welcome to Pilot to Production, from the Growth Project. I'm Sam.

**Maya:** And I'm Maya. Today: we built an autonomous engineering pipeline, it felt slow, and three days of our own data told us we were the bottleneck.

**Sam:** Okay. Start with the feeling. The pipeline was automated end to end, and it still felt like wading. What do you do with that?

**Maya:** The dangerous thing is what most people do with it. The feeling is what gets acted on. Something feels slow, so you queue work to speed it up. We had the queue ready. Fifteen tasks of hardening.

**Sam:** Fifteen. That's a responsible-looking list.

**Maya:** Responsible and busy at the same time. But feelings about your own throughput are unreliable. METR ran a controlled study where experienced developers used AI tools on real tasks.

**Sam:** And?

**Maya:** They felt about twenty percent faster. They were actually about nineteen percent slower. The gap between the feeling and the measurement was nearly forty points.

**Sam:** Forty points wrong about your own speed. So you didn't trust the queue.

**Maya:** We measured first. No new instrumentation, no dashboards. Three days of session transcripts and the full git history for the same window. Every CI run, every retry, every rebase, timestamped.

**Sam:** Three days is a small window.

**Maya:** That's the point. You don't need a quarter of telemetry to see a structural problem. You need one honest slice and the willingness to read it. And we weren't counting what we did. We were counting what we redid.

**Sam:** The re-work.

**Maya:** The re-work. Effort that produced nothing. Motion mistaken for progress.

**Sam:** So what came back? Where did the story and the data split?

**Maya:** We believed CI was thorough. The data said sixty percent of CI runs were wasted re-runs. Not new work. The same change going through the same suite again because something upstream had shifted underneath it.

**Sam:** Sixty percent. And the nine-times change from the top.

**Maya:** One change, the full suite, nine times before it landed. Eight of those were green ticks on a state that was already gone.

**Sam:** That's the part that gets me. It passed. It was green. And it was meaningless.

**Maya:** Because the ground had moved. And about seventy percent of the total friction traced back to one mechanical cause. Rebase churn and the disk-and-worktree shuffle around it.

**Sam:** Not flaky tests. Not slow builds. Not the model.

**Maya:** Mechanical thrash, compounding. A change would be ready, the base it sat on had moved, so it rebased, replayed its edits on a new starting point. Rebasing meant re-running CI, because the change was now technically different. While that ran, the base moved again.

**Sam:** So it's a treadmill. The change is sprinting and the ground keeps sliding back.

**Maya:** Every loop individually reasonable. Together, a treadmill. And here's the trap with pipelines. Each step looks correct in isolation. The cost lives in the interaction between steps.

**Sam:** Which you only see when you measure the whole thing running, not the parts in theory.

**Maya:** We'd been pointing at the work. The data pointed at the gaps between it.

**Sam:** Now the human. You said he was an alarm, not a step. Unpack that, because most people would call a founder checking work before it ships a sensible approval gate.

**Maya:** We half-rationalised it the same way. Prudent. A human checking things before they ship. But it wasn't designed. We'd already built automatic merging, meant to land finished work with nobody touching it.

**Sam:** And it wasn't firing.

**Maya:** Far less than it should have, because the churn kept invalidating the conditions it needed to fire. The treadmill moved the ground so often the automation could never find stable footing. So a human kept catching the work and pushing it through by hand.

**Sam:** Not a safeguard. A fallback a hidden failure kept triggering.

**Maya:** That's the reframe. The human wasn't the bottleneck. The human was the alarm for the real bottleneck, and we'd been reading the alarm as a feature. Every manual merge was the automation silently failing and a person quietly covering for it.

**Sam:** That's the quiet danger. It doesn't fail loudly.

**Maya:** It fails into a human who absorbs it, keeps the lights on, and makes the system look like it's working. The fallback masked the failure. The masking masked the cause.

**Sam:** So fifteen tasks becomes how many?

**Maya:** Two. Move one, stop the churn at its source. Kill the treadmill. If the ground stops sliding while a change is in flight, the re-runs collapse and most of that sixty percent of wasted CI goes with them. You don't optimise the loop, you remove the reason it exists.

**Sam:** And move two.

**Maya:** Make the automation observable, then let it do its job. The auto-merge didn't need replacing. It needed the churn removed so it could find footing, and one metric on it so we'd know whether it was firing. The moment you can see the automation succeed or fail, the human stops being the silent fallback.

**Sam:** What about the other thirteen tasks?

**Maya:** Not wrong. They were solving for a system we'd imagined instead of the one the data described. Most of them quietly stopped mattering. That's the dividend of measuring first. Not just a faster pipeline, a much shorter list.

**Sam:** Okay. First thing tomorrow. Someone's got a process that feels slow. What do they do before they build the fix?

**Maya:** Count the re-work, not the work. Pull a few days of history from whatever you already have. Logs, run records, version control. Count the actions that produced nothing. Retries, re-runs, re-dos. That number is your real target.

**Sam:** Then find the treadmill.

**Maya:** The loop where something finishes, the ground shifts, and it has to start again. Then, for every manual intervention, ask one question. Was this designed, or is a person silently covering for something meant to be automatic? If it's the latter, the human is your alarm.

**Sam:** And put a metric on the automation you already trust.

**Maya:** One metric on every automated step. Did it do its job, or did something quietly catch the work it dropped? Then re-cost your roadmap against what you measured. You'll cross off more than you expect.

**Sam:** The feeling pointed one way. The measurement pointed another.

**Maya:** The measurement was right. It usually is.

**Sam:** This has been Pilot to Production, from the Growth Project. If you're about to build fifteen things to fix a pipeline, measure three days first. We do that work at thegrowthproject.com.

**Maya:** Thanks for listening. See you next time.