# From 17% to 97.8%: Making a Laptop-Sized AI Actually Reliable — Pilot to Production

> A small on-device model started at 17% on a real task and ended at 97.8% with full speed intact. The biggest lever wasn't the model. It was the prompt.

*Pilot to Production*, the Growth Project podcast — hosted by Sam and Maya.

- Listen: https://thegrowthproject.com/podcast/small-models-that-work/
- Read the article: https://thegrowthproject.com/blog/small-models-that-work/
- Audio: https://thegrowthproject.com/audio/podcast/small-models-that-work.m4a?v=81b155c8

## Transcript

**Sam:** A small AI model, the kind that runs on a laptop, was pointed at a real job. First score: seventeen percent.

**Maya:** Most teams stop right there. They say small models can't do real work, and they reach for a bigger one, or an API.

**Sam:** Seventeen percent. That's a fail. Game over, surely.

**Maya:** That number is a trap. The same base model, on the same hardware, ended at ninety-seven point eight percent.

**Sam:** Welcome to Pilot to Production, from the Growth Project. I'm Sam.

**Maya:** And I'm Maya. Today: how a laptop-sized model went from seventeen percent to ninety-seven point eight, and why the model was never the problem.

**Sam:** Okay. If you didn't change the model, what did you change?

**Maya:** The prompt. The original was two thousand two hundred and seventy-eight tokens. Bloated. Stuffed with examples, caveats, edge cases, weeks of "just add one more line."

**Sam:** And you cut it down.

**Maya:** We rewrote it lean. Same task, asked clearly. The score went from seventeen to roughly eighty-seven percent.

**Sam:** Wait. Seventy points. From rewriting a prompt.

**Maya:** Without touching the model. Same quantisation, same hardware. We just stopped drowning it.

**Sam:** Why does that hit a small model so hard? Big models eat bloated prompts all day.

**Maya:** Because on a small model, context is scarce. Every token of instruction is a token it has to hold and reconcile. A two-thousand-token prompt isn't thorough. It's noise. The model spends its attention parsing your edge cases instead of doing the job.

**Sam:** So the constraint is actually a gift.

**Maya:** It forces the clarity you should have had anyway. And there was a second finding. On the lean prompt, temperature didn't matter. We swept it, the output stayed stable.

**Sam:** Meaning it wasn't guessing and getting lucky.

**Maya:** It was effectively deterministic. Same behaviour Tuesday as Monday. That's what you want from software.
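A temperature sweep like the one Maya describes can be sketched in a few lines. This is a minimal illustration, not the team's harness: `stability_by_temperature`, `stub_generate`, and the example prompts are hypothetical names, and the stub stands in for a real model call.

```python
from typing import Callable, Dict, Sequence


def stability_by_temperature(
    generate: Callable[[str, float], str],
    prompts: Sequence[str],
    temperatures: Sequence[float],
    runs: int = 3,
) -> Dict[float, float]:
    """Fraction of prompts whose output matches the temperature-0 answer on every run."""
    baseline = [generate(p, 0.0) for p in prompts]
    report: Dict[float, float] = {}
    for t in temperatures:
        stable = 0
        for prompt, expected in zip(prompts, baseline):
            # A prompt counts as stable only if every repeated run agrees.
            if all(generate(prompt, t) == expected for _ in range(runs)):
                stable += 1
        report[t] = stable / len(prompts)
    return report


def stub_generate(prompt: str, temperature: float) -> str:
    # Hypothetical stand-in for a model call; it ignores temperature entirely.
    return "search" if "find" in prompt else "calendar"


report = stability_by_temperature(
    stub_generate,
    ["find the invoice", "book a meeting"],
    [0.0, 0.5, 1.0],
)
print(report)
```

A flat report across temperatures is the signal described here: the answer does not depend on sampling luck.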

**Sam:** But eighty-seven isn't customer-ready.

**Maya:** No. The misses were too costly per incident. So we fine-tuned. A small one, using QLoRA, a public technique that adapts the model cheaply. We distilled behaviour from a larger teacher model into the small one.

**Sam:** And that got you where?

**Maya:** Roughly ninety-seven point four percent. But watch the order. The prompt did seventy points. The fine-tune added ten.
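The ordering argument is plain arithmetic on the scores quoted so far. The sketch below uses only those figures (two of them described as rough); the stage labels are ours.

```python
# Scores quoted in the episode, in the order the changes were made.
stages = [
    ("bloated prompt", 17.0),
    ("lean prompt", 87.0),       # "roughly eighty-seven"
    ("QLoRA fine-tune", 97.4),   # "roughly ninety-seven point four"
]

# Gain attributed to each change: its score minus the previous stage's score.
gains = {
    name: round(score - previous_score, 1)
    for (_, previous_score), (name, score) in zip(stages, stages[1:])
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The prompt rewrite accounts for about seven times the gain of the fine-tune, which is the case for doing it first.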

**Sam:** And most teams reach for fine-tuning first.

**Maya:** Backwards. Fine-tune to close the last gap, not to fix a prompt problem.

**Sam:** So you're at ninety-seven point four. What about the last two point six percent? Is that the ceiling?

**Maya:** We thought so. It wasn't. When we read the errors, almost all were taxonomy ambiguity. Two tools that genuinely overlapped, where a human would also stop and ask "which one did you mean?"

**Sam:** So on the clean inputs?

**Maya:** About ninety-nine point six percent on inputs with one correct answer. The model was doing its job. Our categories were the blurry part.
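Reading the misses this way amounts to scoring twice: once on everything, once on the inputs that have a single correct answer. A minimal sketch, assuming each eval record carries an `ambiguous` flag set by a human reviewer; the records and tool names below are invented for illustration and are not the episode's data.

```python
from typing import List, Tuple

# (gold label, predicted label, ambiguous?) -- hypothetical toy records.
Record = Tuple[str, str, bool]

records: List[Record] = [
    ("search", "search", False),
    ("calendar", "calendar", False),
    ("email", "email", False),
    ("notes", "docs", True),   # two overlapping tools; a human would ask too
    ("docs", "docs", False),
]


def accuracy(rows: List[Record]) -> float:
    """Share of rows where the prediction equals the gold label."""
    return sum(1 for gold, predicted, _ in rows if gold == predicted) / len(rows)


overall = accuracy(records)
clean_only = accuracy([row for row in records if not row[2]])

print(f"overall: {overall:.1%}, unambiguous inputs only: {clean_only:.1%}")
```

A large gap between the two numbers points at the taxonomy, not the model.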

**Sam:** That changes the whole story. The fix isn't a smarter model.

**Maya:** It's a clearer taxonomy. Disambiguate the inputs, same as you would for a competent human.

**Sam:** Okay, but here's where on-device stories usually fall apart. A fine-tuned model is bigger and slower. To run it fast on a laptop you re-quantise it down to four-bit, and that's lossy.

**Maya:** Right. We did the naive thing first. Merged the fine-tune, re-quantised straight to four-bit. It dropped to about ninety percent.

**Sam:** So you threw away most of the fine-tune for speed.

**Maya:** Seven of our ten hard-won points, gone. That's the trade everyone assumes is mandatory: accuracy or speed, pick one.

**Sam:** And you're telling me it isn't.

**Maya:** The fix is quantisation-aware training, QAT, a public technique. The model learns while accounting for the precision loss of being four-bit, instead of getting squashed after the fact. The QAT build held ninety-seven point eight percent accuracy and about eighty-eight tokens a second, at the same time.
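To make "accounting for the precision loss" concrete, here is the rounding step at the heart of it, as a sketch under stated assumptions: symmetric, per-tensor, signed four-bit quantisation, with a hypothetical `fake_quantize` helper. It is not the build from the episode. In quantisation-aware training this quantise-then-dequantise step runs inside the forward pass, typically with gradients passed straight through the rounding, so the weights settle where rounding hurts least. Post-training quantisation applies the same rounding once, after training has finished.

```python
from typing import List


def fake_quantize(weights: List[float], bits: int = 4) -> List[float]:
    """Round weights onto a signed integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed four-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax                # one scale for the whole tensor (assumption)
    out = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))  # clamp to [-8, 7]
        out.append(q * scale)
    return out


weights = [7.0, -3.2, 0.4, 0.0]
quantised = fake_quantize(weights)
errors = [abs(w - q) for w, q in zip(weights, quantised)]

print(quantised)
print(max(errors))
```

The small weights lose the most relative precision, which is the loss a model trained with this step in the loop can learn to work around.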

**Sam:** Full accuracy and full speed. On a laptop.

**Maya:** That's the headline. No network call, no per-token cost, no data leaving the machine. And notice: we never swapped the base model for a bigger one.

**Sam:** So first thing tomorrow. Someone's got an AI feature that "isn't quite reliable enough." Where do they start?

**Maya:** Count your prompt tokens. If it's over a thousand, you have a prompt problem, not a model problem. Then build a sealed eval before you optimise, inputs the model never trained on, deterministic scoring. Sweep temperature once. And read your misses, don't just count them. Ambiguous inputs aren't model failures, they're taxonomy failures.
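The first two items on that checklist can be sketched as follows. Everything named here is hypothetical: `rough_token_count` uses whitespace words as a crude proxy (a real count needs the model's own tokeniser, which often yields more tokens than words), `stub_predict` stands in for the model, and the eval cases are invented.

```python
from typing import Callable, Sequence, Set, Tuple

PROMPT_TOKEN_BUDGET = 1000  # threshold quoted in the episode

Case = Tuple[str, str]  # (input text, expected tool name)


def rough_token_count(prompt: str) -> int:
    # Crude proxy only: counts whitespace-separated words, not real tokens.
    return len(prompt.split())


def leaked_inputs(eval_cases: Sequence[Case], training_inputs: Set[str]) -> Set[str]:
    """Eval inputs that also appear in training data; should be empty for a sealed eval."""
    return {text for text, _ in eval_cases if text in training_inputs}


def exact_match_accuracy(predict: Callable[[str], str], eval_cases: Sequence[Case]) -> float:
    """Deterministic scoring: a case passes only if the output equals the expected label."""
    hits = sum(1 for text, expected in eval_cases if predict(text).strip() == expected)
    return hits / len(eval_cases)


def stub_predict(text: str) -> str:
    # Hypothetical stand-in for the model under test.
    return "search" if "find" in text else "calendar"


prompt = "Pick exactly one tool for the request and answer with its name only."
eval_cases = [
    ("find the invoice", "search"),
    ("book a meeting", "calendar"),
    ("email the team", "email"),
]
training_inputs = {"find my notes", "schedule lunch"}

token_count = rough_token_count(prompt)
leaks = leaked_inputs(eval_cases, training_inputs)
score = exact_match_accuracy(stub_predict, eval_cases)

print(token_count, token_count <= PROMPT_TOKEN_BUDGET, leaks, round(score, 3))
```

Exact-match scoring is chosen here because it gives the same number on every run, so a change in the score reflects a change in the system under test.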

**Sam:** And don't accept the speed-for-accuracy trade.

**Maya:** If someone says the on-device model has to be dumber to be fast, ask whether they tried quantisation-aware training.

**Sam:** Seventeen to ninety-seven point eight, on a laptop, without trading away speed. The capability was there the whole time.

**Sam:** This has been Pilot to Production, from the Growth Project. If your small model "can't do real work," it probably can, and we'll show you how, at thegrowthproject.com.

**Maya:** Thanks for listening. See you next time.
