From 17% to 97.8%: Making a Laptop-Sized AI Actually Reliable

We pointed a small 4-bit model, the kind that runs on a laptop, at a real tool-calling task, where the model has to read an instruction and pick the right action to take.

It scored 17%.

Most teams stop there. They conclude small models can’t do real work and reach for a bigger one, or an API. That’s the wrong conclusion. Here’s what we did next, and the numbers we measured.

The Setup

The task was tool-calling. Give the model an input, have it pick the right tool and fill the arguments correctly. Boring, narrow, and exactly the kind of thing real software does thousands of times a day.

We measured it properly. A sealed evaluation set the model never trained on. Cross-domain inputs. Deterministic scoring, so a pass is a pass and a fail is a fail. No vibes, no cherry-picked demo. This is the same discipline we argue for in why you should stop measuring AI by test-pass rate: an honest measurement is worth more than a flattering one.
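For readers who want to see what “deterministic scoring” can look like, here is a minimal sketch. It assumes the model emits a JSON object with "tool" and "args" keys; that output format, and the names Case, score_call and accuracy, are illustrative assumptions rather than our actual harness.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One sealed eval item: the input text and the single expected tool call."""
    prompt: str
    expected_tool: str
    expected_args: dict


def score_call(raw_output: str, case: Case) -> bool:
    """Pass only if the output parses as JSON and matches tool and arguments exactly."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed output is a fail, with no partial credit
    if not isinstance(call, dict):
        return False
    return (
        call.get("tool") == case.expected_tool
        and call.get("args") == case.expected_args
    )


def accuracy(outputs: list[str], cases: list[Case]) -> float:
    """Fraction of cases passed; the same outputs always produce the same score."""
    if len(outputs) != len(cases):
        raise ValueError("one output per case is required")
    passed = sum(score_call(o, c) for o, c in zip(outputs, cases))
    return passed / len(cases)
```

Exact matching is deliberately strict. A looser scorer, such as substring matching or a second model acting as judge, would reintroduce the judgment calls the sealed set is meant to remove.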

The first run scored 17%.

That number is a trap. It looks like a capability ceiling. It isn’t. It’s a measure of how we were asking, not what the model could do.

The Prompt Was the Problem

The original prompt was 2,278 tokens. Bloated. Stuffed with examples, caveats, edge-case handling, and instructions we’d accreted over weeks of “just add one more line.”

We rewrote it. Lean. Direct. The same task, asked clearly.

The score went to roughly 87%.

Read that again. We took the score from 17% to 87% without touching the model. Same weights. Same quantisation. Same hardware. We just stopped drowning it.

This is the part nobody wants to hear, because rewriting a prompt isn’t glamorous. No GPU bill, no demo. But on a small model, context is scarce. Every token of instruction is a token the model has to hold and reconcile. A 2,278-token prompt isn’t thorough. It’s noise. The model spends its limited attention parsing your edge cases instead of doing the job.

Big models forgive bloated prompts. Small models don’t. That constraint is a gift: it forces the clarity you should have had anyway.

There was a second finding here. On the lean prompt, temperature didn’t matter. We swept it. The output was stable. The model wasn’t guessing and getting lucky. Its answers were effectively deterministic, which is exactly what you want from software that has to behave the same way on Tuesday as it did on Monday. If you’ve read why most AI projects fail, you know reliability beats raw capability every time.
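A temperature sweep of this kind can be sketched in a few lines. The generate callable below is a hypothetical stand-in for whatever inference call you use, and the temperature values are illustrative, not the ones from our run.

```python
from typing import Callable


def sweep_temperature(
    generate: Callable[[str, float], str],
    prompts: list[str],
    temperatures: tuple[float, ...] = (0.0, 0.3, 0.7, 1.0),
) -> float:
    """Return the fraction of prompts whose output is identical at every temperature."""
    if not prompts:
        raise ValueError("need at least one prompt")
    stable = 0
    for prompt in prompts:
        # Collect the distinct outputs seen across the sweep for this prompt.
        outputs = {generate(prompt, t) for t in temperatures}
        if len(outputs) == 1:
            stable += 1
    return stable / len(prompts)
```

A result close to 1.0 suggests the prompt is pinning the answer down. A low result suggests the model is sampling between several plausible readings of the instruction.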

The Fine-Tune Was the Polish

87% is good. It isn’t good enough to put in front of customers unsupervised. The misses were too costly per incident.

So we fine-tuned. A small one, using QLoRA, a public technique that adapts a model cheaply without retraining the whole thing. We trained on the task, distilling behaviour from a larger teacher model into the small one.

The score went to roughly 97.4%.

Note the order of operations. The prompt rewrite did the heavy lifting, 17 to 87, seventy points. The fine-tune added ten. People reach for fine-tuning first because it feels like the “real” engineering. It’s backwards. Fine-tune to close the last gap, not to fix a prompt problem.

And then we looked at the remaining misses.

The Misses Weren’t Failures

We expected the last 2.6% to be the model hitting its ceiling. It wasn’t.

When we broke down the errors, almost all of them were taxonomy ambiguity. Cases where two tools genuinely overlapped, where a human would also have to stop and ask “well, which one did you mean?” On unambiguous inputs, the model scored about 99.6%.

That changes the story completely.

What 97.4% Looks Like	What It Actually Means
“2.6% of the time it’s wrong”	Almost all misses are genuinely ambiguous cases
“The model has a ceiling”	99.6% on inputs with one correct answer
“We need a bigger model”	We need a clearer taxonomy

The fix for the last fraction isn’t a smarter model. It’s disambiguating the inputs, the same way you’d resolve an ambiguous instruction given to a competent human. The model was doing its job. Our categories were the blurry part.
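Reading the misses rather than counting them amounts to one extra label per eval item. The sketch below assumes each scored item carries a human-assigned ambiguous flag; the Result type and the breakdown function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One scored eval item, with a human-assigned ambiguity label."""
    passed: bool
    ambiguous: bool  # True when two tools could reasonably both apply


def breakdown(results: list[Result]) -> dict[str, float]:
    """Overall accuracy, accuracy on unambiguous inputs, and ambiguous share of misses."""
    if not results:
        raise ValueError("no results to break down")
    clear = [r for r in results if not r.ambiguous]
    misses = [r for r in results if not r.passed]
    return {
        "overall": sum(r.passed for r in results) / len(results),
        "unambiguous": (
            sum(r.passed for r in clear) / len(clear) if clear else float("nan")
        ),
        "ambiguous_share_of_misses": (
            sum(r.ambiguous for r in misses) / len(misses) if misses else 0.0
        ),
    }
```

If the ambiguous share of misses is high and the unambiguous accuracy is near the top, the next fix belongs in the taxonomy, not the model.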

It also generalised. The eval was cross-domain, full of inputs the model never saw in training. It held up. This wasn’t a model that memorised a test. It learned the shape of the task.

Then We Had to Make It Fast

Here’s where most on-device stories quietly fall apart.

The merged fine-tuned model is bigger and slower than the lean 4-bit one we started with, because merging the adapter typically leaves the weights at higher precision. To run it on a laptop at usable speed, you have to shrink it back down by re-quantising it to 4-bit. And quantisation is lossy. It throws away precision.

We did the naive thing first: take the merged fine-tuned model and re-quantise it straight to 4-bit.

It dropped to about 90%.

We’d just thrown away most of the fine-tune. Roughly seven of our ten hard-won points, gone, in exchange for speed. That’s the trade everyone assumes is mandatory on-device: you can have accuracy or you can have speed, pick one.

It isn’t mandatory.

The fix is quantisation-aware training (QAT), a public technique where the model learns while accounting for the precision loss of being 4-bit, rather than getting squashed after the fact. The QAT build held 97.8% accuracy and ~88 tokens/sec at the same time. Full accuracy. Full speed. On a laptop.

Quantisation Method	What You Get
Naive re-quant to 4-bit	~90%, with most of the fine-tune gain lost
Quantisation-aware training	97.8% and ~88 tokens/sec

This is the headline. With QAT, you do not have to trade speed for accuracy on-device. A runtime matched to the hardware mattered too. But the technique broke the trade-off.
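To make “precision loss” concrete, here is a minimal sketch of the rounding step that 4-bit quantisation applies to a group of weights. It assumes a simple symmetric scheme with one scale per group, which is an illustrative choice and not necessarily the scheme our build used. In quantisation-aware training, a quantise-then-dequantise step like this one is typically applied in the forward pass, so the model learns weights that survive the rounding; the training loop itself is not shown.

```python
def fake_quantize_int4(weights: list[float]) -> list[float]:
    """Round weights to a symmetric signed 4-bit grid and map them back to floats."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    # Signed 4-bit integers span -8..7; dividing by 7 keeps +max_abs representable.
    scale = max_abs / 7
    quantized = []
    for w in weights:
        q = round(w / scale)
        q = max(-8, min(7, q))  # clamp to the 4-bit range
        quantized.append(q * scale)
    return quantized


def max_rounding_error(weights: list[float]) -> float:
    """Largest absolute difference between a weight and its 4-bit version."""
    pairs = zip(weights, fake_quantize_int4(weights))
    return max((abs(w - q) for w, q in pairs), default=0.0)
```

Naive re-quantisation applies this rounding once, after training, and hopes the errors are harmless. Training with the rounding in the loop gives the model a chance to compensate for them.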

Why This Matters For Your Stack

The instinct in every AI project right now is to go bigger. Bigger model, bigger context, bigger API bill, bigger dependency on someone else’s uptime. The assumption is that capability lives in size.

This task says otherwise. A model small enough to run on the device in your hand, no network call, no per-token cost, no data leaving the machine, did a real job at 97.8%. That’s what production-ready AI looks like in practice: reliable, measured, and running where the work happens.

The levers, in the order that mattered:

The prompt moved it 70 points. Free. Reversible. Most teams never try it properly.
The fine-tune moved it 10 points. Cheap, with QLoRA and a teacher model.
The quantisation method decided whether you keep those gains at speed. QAT keeps them. Naive re-quant doesn’t.
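The arithmetic behind those levers is simple enough to check by hand, but a short ledger keeps it honest. The scores are the ones reported in this post, several of them approximate; stage_deltas is a hypothetical helper name.

```python
def stage_deltas(stages: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Point change contributed by each stage relative to the one before it."""
    return [
        (name, round(score - prev_score, 1))
        for (_, prev_score), (name, score) in zip(stages, stages[1:])
    ]


# Scores as reported in this post (several are approximate).
pipeline = [
    ("bloated prompt", 17.0),
    ("lean prompt", 87.0),
    ("QLoRA fine-tune", 97.4),
]
deltas = stage_deltas(pipeline)

# How much of the fine-tune gain the naive 4-bit re-quant gave back.
fine_tune_gain = 97.4 - 87.0
lost_to_naive_requant = 97.4 - 90.0
share_lost = lost_to_naive_requant / fine_tune_gain
```

On these figures the prompt contributes 70 points, the fine-tune about 10.4, and the naive re-quant gives back roughly 7.4 of those 10.4, a little over 70% of the fine-tune gain.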

Notice the model itself isn’t on that list. We never swapped it. The capability was always there. We were just asking badly, then polishing carelessly, then nearly throwing it away in the name of speed.

First Thing Tomorrow

If you’ve got an AI feature that “isn’t quite reliable enough”:

Count your prompt tokens. If the count is over a thousand, you have a prompt problem, not a model problem. Rewrite it lean before you touch anything else.
Build a sealed eval before you optimise. Inputs the model never trained on, deterministic scoring. Without it, every “improvement” is a guess.
Sweep temperature once. If output changes with temperature, your prompt isn’t doing its job. Stable output is the goal.
Read your misses, don’t just count them. Most teams see a failure rate. The signal is in which cases fail. Ambiguous inputs aren’t model failures, they’re taxonomy failures.
Don’t accept the speed-for-accuracy trade. If someone tells you the on-device model has to be dumber to be fast, ask whether they tried quantisation-aware training.
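The first item on that list can be turned into a check that runs in CI. The sketch below uses a crude four-characters-per-token heuristic as a default, which is an assumption and only a rough guide for English text; pass your model’s real tokenizer as count_tokens for an accurate number. The function names are hypothetical.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4) if text else 0


def over_budget(
    prompt: str,
    budget: int = 1000,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> bool:
    """True when the prompt exceeds the token budget."""
    return count_tokens(prompt) > budget
```

Failing the build when a prompt grows past the budget is one way to stop the “just add one more line” drift before it costs you accuracy.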

The Bottom Line

Small models don’t fail because they’re small.

They fail because we ask them badly, fine-tune them in the wrong order, and quantise away the gains to make them fast.

17% to 97.8%, on a laptop, without trading away speed.

The capability was there the whole time.
