# Agentic AI: How to Build Agents That Run in Production

> What agentic AI is, how agents differ from chatbots and copilots, and how to build AI agents that run in production, with guardrails and shadow mode.

Canonical: https://thegrowthproject.com/guides/building-ai-agents/

Most "AI agents" you've seen are demos. They work once, in a meeting, on a happy path, with the person who built them hovering nearby. Then they meet a real Tuesday and fall over.

This is a guide to the other kind: agents that run in production, handle real workloads, and don't need babysitting. It covers what agentic AI actually is, how agents differ from the chatbots and copilots you already know, and the engineering discipline that decides whether an agent ships or stalls.

**TL;DR:** An agent is given a goal, then plans, uses tools, and acts to reach it, rather than answering one prompt at a time. The hard part isn't the model; it's the engineering around it: capture repeatable work as skills, give the agent a library and a loop, wrap it in guardrails, prove it in shadow mode, and measure it honestly. Capability is rarely the constraint. Discipline is.

## What agentic AI actually is

Three things get sold under the word "AI," and only one of them is an agent.

| | What it does | Who's driving |
| --- | --- | --- |
| **Chatbot** | Answers questions | The user, every turn |
| **Copilot** | Assists in real time (e.g. GitHub Copilot) | The human, with suggestions |
| **Agent** | Plans and acts toward a goal, using tools | The model, within boundaries |

A chatbot responds. A copilot assists. An **agent** receives a goal, decides the steps, calls tools, and executes a multi-step job without a human driving each action: reading an invoice, validating the amounts, updating the accounting system, and flagging the exceptions for review. Anthropic draws the line cleanly in [Building Effective Agents](https://www.anthropic.com/research/building-effective-agents): a _workflow_ runs through predefined code paths, while an _agent_ dynamically directs its own process and tool use. The autonomy is the point, and the risk.

## The real unit of leverage

Here's what most teams miss. The leverage in agentic AI isn't a cleverer prompt. A prompt answers one turn and forgets. The durable unit is a **skill**: a captured, named procedure the agent can run the same way every time, plus a **loop** that composes skills toward a goal. We unpack this in [build skills, then loop them into a super-agent](/blog/skills-to-super-agent/): a drawer of tools doesn't build a house; the agent has to choose the right ones, chain them, and know when the job is done.

That's also why agents compound. Every problem you solve becomes a skill the agent keeps. It's the same [compounding-engineering loop](/blog/compounding-engineering/), where each run makes the next one cheaper. The super-agent isn't a smarter model. It's an ordinary one standing on a library it can call and a loop that won't quit.
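
A minimal sketch of the skill-plus-loop shape, assuming a rule-based `choose_next` standing in for the model's planning step. `Skill`, `run_loop`, `LIBRARY`, and the three invoice skills are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = dict


@dataclass(frozen=True)
class Skill:
    """A named, repeatable procedure: takes the current state, returns a new one."""
    name: str
    run: Callable[[State], State]


def run_loop(
    skills: Dict[str, Skill],
    choose_next: Callable[[State], Optional[str]],
    state: State,
    max_steps: int = 10,
) -> Tuple[State, List[str]]:
    """Compose skills toward a goal; stop when choose_next returns None."""
    trace: List[str] = []
    for _ in range(max_steps):
        name = choose_next(state)
        if name is None:  # the chooser reports the job is done
            return state, trace
        state = skills[name].run(state)
        trace.append(name)
    raise RuntimeError("step budget exhausted before the goal was reached")


# Hypothetical invoice skills; real ones would call OCR, rules, and an accounting API.
LIBRARY = {
    s.name: s
    for s in (
        Skill("extract", lambda st: {**st, "amount": 120.0}),
        Skill("validate", lambda st: {**st, "valid": st["amount"] > 0}),
        Skill("post", lambda st: {**st, "posted": True}),
    )
}


def choose_next(state: State) -> Optional[str]:
    """Stand-in for the model's planning: pick the next skill, or None when finished."""
    if "amount" not in state:
        return "extract"
    if "valid" not in state:
        return "validate"
    if state["valid"] and not state.get("posted"):
        return "post"
    return None


final_state, trace = run_loop(LIBRARY, choose_next, {"invoice_id": "INV-1"})
print(trace)
```

The `max_steps` budget is a deliberate choice in this sketch: a loop that won't quit still needs a ceiling, so a confused chooser fails loudly instead of spinning.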

## How to build an agent that ships

Production-ready agents follow a sequence, not a model choice. This is the four-step sequence we use on every [agentic AI](/services/agentic-ai/) build.

**1. Identify the repeatable decision.** Not "where can we use AI?" but "where do people make the same decision over and over?" Score each candidate on three things: decision frequency, rule consistency, and the cost of an error. High volume, clear rules, recoverable mistakes: that's where agents earn their keep. Low-volume, high-ambiguity, irreversible calls stay with humans.
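
One way to make that scoring concrete is a small screening function. This is a sketch under assumptions: the `Candidate` fields, the function name, and both thresholds (`min_volume`, `min_consistency`) are illustrative placeholders to tune against your own workload, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    decisions_per_week: int   # decision frequency
    rule_consistency: float   # 0.0 to 1.0: how often the same rule gives the same answer
    error_recoverable: bool   # can a wrong call be undone cheaply?


def screen(c: Candidate, min_volume: int = 100, min_consistency: float = 0.8) -> str:
    """Return 'agent candidate' or 'keep with humans' for one repeatable decision."""
    if not c.error_recoverable:
        return "keep with humans"  # irreversible calls stay with people
    if c.decisions_per_week >= min_volume and c.rule_consistency >= min_consistency:
        return "agent candidate"
    return "keep with humans"


for candidate in (
    Candidate("invoice coding", 900, 0.92, True),
    Candidate("contract termination", 3, 0.40, False),
):
    print(candidate.name, "->", screen(candidate))
```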

**2. Design the architecture and the boundaries.** Before any code: what tools can the agent use, what data can it reach, when must it escalate, and what's the cost ceiling per action? Tools deserve as much design attention as the prompt. The [Model Context Protocol](https://modelcontextprotocol.io) has become a common way to give agents typed, governed access to your systems instead of bespoke glue.
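
Those four questions can be written down as data before any agent code exists. The sketch below is illustrative only: `Boundaries`, `check_action`, the tool names, and the limits are all assumptions, and it does not use or represent the Model Context Protocol API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Boundaries:
    allowed_tools: FrozenSet[str]     # what tools can the agent use
    readable_data: FrozenSet[str]     # what data can it reach
    escalate_above_amount: float      # when must it escalate
    cost_ceiling_per_action: float    # cost ceiling per action


def check_action(b: Boundaries, tool: str, data_scope: str,
                 amount: float, estimated_cost: float) -> str:
    """Return 'allow', 'escalate', or 'deny' for one proposed action."""
    if tool not in b.allowed_tools or data_scope not in b.readable_data:
        return "deny"       # outside the grant: as far as the agent knows, it doesn't exist
    if estimated_cost > b.cost_ceiling_per_action:
        return "deny"
    if amount > b.escalate_above_amount:
        return "escalate"   # in scope, but too large to act on alone
    return "allow"


# Hypothetical limits for an invoice-processing agent.
limits = Boundaries(
    allowed_tools=frozenset({"read_invoice", "post_ledger_entry"}),
    readable_data=frozenset({"invoices", "vendors"}),
    escalate_above_amount=5_000.0,
    cost_ceiling_per_action=0.25,
)

print(check_action(limits, "post_ledger_entry", "invoices", 120.0, 0.02))
```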

**3. Build the guardrails in, not on.** Structured logging, approval gates, confidence thresholds, and circuit breakers are part of the architecture, not a later add-on. And one rule we never break: an agent never clears its own work. The thing that approves an action must be independent of the thing that produced it. That's the [no-self-clearing rule](/blog/automating-code-review/) we learned the hard way.
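
The independence requirement is easy to enforce in code rather than by convention. A minimal sketch, assuming each proposal records who produced it; `Proposal`, `approve`, and the identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    action: str
    produced_by: str  # identifier of the agent (or person) that produced the work


def approve(proposal: Proposal, reviewer_id: str, verdict: bool) -> bool:
    """Record a reviewer's verdict, refusing any reviewer who produced the work."""
    if reviewer_id == proposal.produced_by:
        raise PermissionError("an agent may not clear its own work")
    return verdict


proposal = Proposal(action="post_ledger_entry INV-1", produced_by="agent-a")
print(approve(proposal, reviewer_id="reviewer-b", verdict=True))
```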

**4. Earn autonomy in shadow mode.** The agent runs _beside_ the humans first, predicting but not acting, for two to four weeks. You compare its decisions to theirs, tune, and grant autonomy only on the decisions where it consistently matches or beats human performance; everything else keeps escalating. Autonomy is granted, not assumed. That's the same discipline that gets any AI system to [production-ready](/guides/production-ready-ai/).
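
A shadow-mode comparison can be as simple as an agreement rate per decision type. The sketch below assumes the human decision is the reference and that each record is a `(decision_type, agent_choice, human_choice)` tuple; the function name and both thresholds are illustrative assumptions, and agreement alone cannot show where the agent beats the humans.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (decision_type, agent_choice, human_choice)


def autonomy_report(records: Iterable[Record],
                    min_agreement: float = 0.95,
                    min_samples: int = 50) -> Dict[str, dict]:
    """Per decision type: sample count, agreement rate, and whether autonomy is earned."""
    totals: Dict[str, int] = defaultdict(int)
    matches: Dict[str, int] = defaultdict(int)
    for decision_type, agent_choice, human_choice in records:
        totals[decision_type] += 1
        if agent_choice == human_choice:
            matches[decision_type] += 1

    report = {}
    for decision_type, n in totals.items():
        rate = matches[decision_type] / n
        report[decision_type] = {
            "samples": n,
            "agreement": rate,
            # Too few samples or too little agreement: keep escalating to a human.
            "autonomous": n >= min_samples and rate >= min_agreement,
        }
    return report


# Hypothetical shadow-mode log.
shadow_log = (
    [("refund_under_50", "approve", "approve")] * 98
    + [("refund_under_50", "approve", "deny")] * 2
    + [("chargeback", "deny", "deny")] * 30
    + [("chargeback", "deny", "approve")] * 10
)
report = autonomy_report(shadow_log)
for decision_type, row in report.items():
    print(decision_type, row)
```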

## Guardrails and safety

Autonomy without guardrails isn't ambition; it's negligence. A production agent has layers:

- **Boundaries.** It can only touch the tools and data you grant it. Nothing else exists, as far as it's concerned.
- **Escalation.** Confidence thresholds, high-stakes flags, and out-of-distribution detection route the hard calls to a human, with full context attached.
- **Circuit breakers.** If error rate or cost crosses a threshold, execution halts. No runaway loops, no surprise bills.
- **Traceability.** Every action is logged and replayable. When something goes wrong, you can see exactly what the agent did and why.

Human-in-the-loop isn't a fallback you bolt on after a scare. It's one click away from the start, and you remove it only where the agent has earned the trust.
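
As one example of these layers, a circuit breaker on error rate and spend fits in a few lines. This is a sketch, not a production implementation: the `CircuitBreaker` class and its default limits are assumptions, and a real system would usually track a rolling window rather than lifetime totals.

```python
class CircuitBreaker:
    """Halts execution once error rate or cumulative cost crosses a threshold."""

    def __init__(self, max_error_rate: float = 0.2, max_cost: float = 5.0,
                 min_actions: int = 5) -> None:
        self.max_error_rate = max_error_rate
        self.max_cost = max_cost
        self.min_actions = min_actions  # don't judge the error rate on a tiny sample
        self.actions = 0
        self.errors = 0
        self.cost = 0.0
        self.tripped = False

    def record(self, ok: bool, cost: float) -> None:
        """Log the outcome of one action and trip the breaker if a limit is crossed."""
        self.actions += 1
        self.errors += 0 if ok else 1
        self.cost += cost
        if self.cost > self.max_cost:
            self.tripped = True
        if (self.actions >= self.min_actions
                and self.errors / self.actions > self.max_error_rate):
            self.tripped = True

    def allow(self) -> bool:
        """The agent checks this before every action; once tripped, it stays tripped."""
        return not self.tripped


breaker = CircuitBreaker(max_error_rate=0.2, max_cost=1.0, min_actions=5)
for ok in (True, True, True, True, False, False):
    if not breaker.allow():
        break
    breaker.record(ok=ok, cost=0.1)
print("still allowed:", breaker.allow())
```

Resetting the breaker is deliberately left out of the sketch; in keeping with the rest of this section, that decision belongs to a human.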

## Measure honestly, or you're guessing

The fastest way to ship a bad agent is to grade it on a demo. A demo passes because someone chose the inputs. Production is full of inputs nobody chose.

Before you trust anything, measure with a sealed evaluation: real cases the agent has never seen, scored deterministically. And for judgement-heavy work, score the thing that actually matters: would a competent human accept this output as-is? We make that case in [stop measuring AI by test-pass rate](/blog/would-a-human-merge/), and the reviewer that decides has to be held out, never the agent grading its own homework. An honest number that stings beats a flattering one that lies.
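
A sealed, deterministic evaluation can be sketched as a fixed case set, a fingerprint that proves it hasn't been edited, and an exact-match score. Everything named here (`fingerprint`, `evaluate`, `toy_agent`, the cases) is hypothetical, and exact match is an assumption that suits classification-style decisions, not free-text output.

```python
import hashlib
import json
from typing import Callable, List


def fingerprint(cases: List[dict]) -> str:
    """Stable hash of the case set, recorded when the evaluation is sealed."""
    payload = json.dumps(cases, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def evaluate(agent: Callable[[dict], str], cases: List[dict], sealed_hash: str) -> float:
    """Exact-match accuracy on the sealed cases; refuses to run if they changed."""
    if not cases:
        raise ValueError("no cases to evaluate")
    if fingerprint(cases) != sealed_hash:
        raise ValueError("case set differs from the sealed version")
    correct = sum(1 for case in cases if agent(case["input"]) == case["expected"])
    return correct / len(cases)


# Hypothetical held-out cases and a toy agent standing in for the real one.
SEALED_CASES = [
    {"input": {"amount": 50}, "expected": "approve"},
    {"input": {"amount": 5000}, "expected": "escalate"},
    {"input": {"amount": 1000}, "expected": "escalate"},
    {"input": {"amount": 200}, "expected": "approve"},
]
SEALED_HASH = fingerprint(SEALED_CASES)


def toy_agent(invoice: dict) -> str:
    return "escalate" if invoice["amount"] > 1000 else "approve"


score = evaluate(toy_agent, SEALED_CASES, SEALED_HASH)
print(f"accuracy on sealed cases: {score:.2f}")
```

The toy agent misses the boundary case at exactly 1000, which is the kind of input a hand-picked demo tends to leave out.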

## You don't need a big model

The instinct in every agent project is to reach for the biggest model and the largest context. Usually that's the wrong lever. On one real tool-calling task we took a laptop-sized model [from 17% to 97.8%](/blog/small-models-that-work/), and the biggest jump came from rewriting the prompt, not changing the model. Small, on-device models now run genuine agentic work: Google's [Gemma 4 brings agentic skills to the edge](https://developers.googleblog.com/bring-state-of-the-art-agentic-skills-to-the-edge-with-gemma-4/) on the same class of hardware. Running locally also keeps the data on your own machine, which quietly solves a lot of privacy and compliance questions before they're asked.

Capability is usually a prompt, tools, and measurement problem. Reach for size last, not first.

## What agents do well today

Real workloads, not theoretical ones. The pattern is always the same: high-volume, rule-consistent, recoverable.

- **Support triage.** Classify tickets, pull order data, draft replies, resolve the common cases, escalate the rest with context.
- **Document processing.** Extract, validate, and route data from invoices, contracts, and forms into your systems.
- **Operational workflows.** Order routing, inventory alerts, supplier comms, and compliance checks, running 24/7 with human escalation for exceptions.
- **Data-quality monitoring.** Watch pipelines for anomalies and flag bad data before it reaches production.

The unglamorous truth: the 80% of decisions that follow a consistent pattern are exactly the ones agents handle, freeing your team for the 20% that needs judgement. It's delegation, not replacement.

## Who builds production agents

A pilot agent is easy. An agent that runs your operations for years is an engineering commitment: guardrails, observability, evaluation, and a clean handover so your team owns it. That's the bar we hold ourselves to on every [agentic AI](/services/agentic-ai/) build, and it's worth knowing [how to tell a real delivery partner from a demo](/guides/ai-implementation-partner/) before you hand anyone the keys to your systems.

Bring the repeatable decision that's eating your team's week. [Tell us what's stuck](/contact/) and we'll find the agent worth building, and the ones that aren't.
