Two Hosts Who Don't Exist: How We Generate Our Podcast With Gemini 3.1

Press play on any episode of our podcast and two people start talking. Sam and Maya. They have opinions, they interrupt each other, they land the occasional joke.

Neither of them exists.

Every field note we publish has a companion episode, and every episode is generated — two synthetic hosts, no microphones, no studio. This post is how we build them. It is also an honest account of getting there, because the first versions were bad, and the road from bad to good ran straight through Google.

Two hosts, no microphones

The job was simple to state. We write a field note. We want a version you can listen to on a drive, two voices turning the argument over instead of one voice reading it out. No studio, no scheduling, no second take.

It is also part of how we get found. More people ask an AI assistant or a podcast app for an answer than scroll a blog, and audio reaches the ones who would never read the post. The podcast is a deliberate piece of that reach — the same bet we describe in our guide to getting cited by AI. And it is dogfooding in the most literal sense: we tell clients to ship AI into their real workflow, so we ship it into ours and use it in anger.

Simple to state. Then you try to make two synthetic people sound like they are in the same room.

Where we started: one voice at a time

The narration came first, and it was easy. A single voice reading a single post. We ran it on OpenAI’s ash voice and it was clean.

The podcast broke that assumption, because a conversation is not a monologue with a second name attached. Our first real attempt synthesised each line on its own — Sam’s line, then Maya’s line, then Sam’s — and stitched the clips together. On paper, a transcript turned into audio.

In your ears, it fell apart. Every line started cold, so the energy reset on each turn. The timing between speakers was mechanical, because nothing knew what the previous line had sounded like. A one-word reaction — a flat “Right.” — sounded spliced in from another room, because it had been. We were assembling a conversation from parts, and it sounded assembled.

The nudge

By then we had three things happening inside one audio stack: OpenAI for narration, Google for the podcast voices, and glue holding the two together. It worked, more or less, which is exactly the kind of arrangement you stop questioning.

Last week, at the Google Cloud Summit, we talked cloud strategy, and where AI genuinely earns its place, with Zara Craig, a Senior Account Manager at Google Cloud, and the team at Aviato Consulting. The conversation landed on the question we had been avoiding. Aviato’s whole reason for existing is turning Google Cloud and AI into an actual advantage rather than a pile of services, and the honest read on our setup was that we had never committed to it. We were running a fragmented stack for a single job and calling it pragmatism.

That nudged the move. It is the same discipline we preach about systems: stop adding pieces and decompose the problem instead. One job, one platform. We consolidated the audio stack onto Google.

Gemini 3.1’s native multi-speaker changed the problem

Gemini 3.1 added native multi-speaker generation: one model voicing both hosts in a single pass. You do not stitch a conversation together. You ask for the whole thing at once.

That reframed the work. The choppiness was never a volume problem we could mix our way out of — it was structural, the cost of building a dialogue out of disconnected parts. Generate the dialogue as one thing and the structure problem disappears. The model knows Maya is replying to Sam, so the timing lands, the interruptions feel intended, and the reactions sit in context instead of on top of it. We stopped assembling and started generating, and the single worst thing about the audio went away.
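To make "ask for the whole thing at once" concrete, here is a minimal sketch of the input side: every turn goes into one transcript, so a single request carries the full exchange. The `Turn` type, the `build_dialogue_prompt` helper, and the "Speaker: text" layout are illustrative assumptions rather than our production code, and the model call itself is left out because the request shape depends on the SDK you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # hypothetical host label, e.g. "Sam" or "Maya"
    text: str


def build_dialogue_prompt(turns: list[Turn], direction: str = "") -> str:
    """Format all turns as one transcript so one request covers the whole dialogue."""
    # Skip empty turns; keep the speaker label on every remaining line.
    lines = [f"{t.speaker}: {t.text.strip()}" for t in turns if t.text.strip()]
    header = direction.strip()
    # An optional style direction goes first, separated by a blank line.
    return "\n".join(([header, ""] if header else []) + lines)


turns = [
    Turn("Sam", "So the hosts are generated."),
    Turn("Maya", "Right."),
    Turn("Sam", "And nobody noticed?"),
]
prompt = build_dialogue_prompt(turns, "A relaxed conversation between Sam and Maya.")
print(prompt)
```

The point of the shape is that Maya's one-word "Right." is sent together with the line it answers, instead of being synthesised as a clip of its own.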

What still broke (the honest part)

A better model is not a finished system. Here is what Gemini 3.1 did not fix on its own, and what we built around it.

Per-turn, where we started          | Native multi-speaker, where we landed
------------------------------------+----------------------------------------
Each line voiced alone, stitched    | Both hosts voiced in one pass
Energy resets every turn            | Timing and interruptions land in context
A one-word reaction sounds spliced  | Reactions sit inside the conversation
More vendors, more glue             | One model, one call

Long renders drift. Ask for several minutes of audio in one shot and the model wanders — pacing slips, a voice subtly changes. So we split each script at its natural topic breaks and render in segments short enough that nothing has room to drift.
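A rough sketch of that segmentation, under assumptions of our own choosing for the example: topic breaks are marked in the script by a line containing only `---`, and "short enough" is approximated by a word budget. The marker, the `max_words` value, and the function names are all hypothetical.

```python
def split_topics(script: str, marker: str = "---") -> list[str]:
    """Split a script into topics at lines that contain only the marker."""
    topics, current = [], []
    for line in script.splitlines():
        stripped = line.strip()
        if stripped == marker:
            if current:
                topics.append("\n".join(current))
                current = []
        elif stripped:
            current.append(stripped)
    if current:
        topics.append("\n".join(current))
    return topics


def pack_segments(topics: list[str], max_words: int = 250) -> list[str]:
    """Group whole topics into segments without exceeding the word budget.

    A topic is never split; a single topic larger than the budget
    becomes a segment on its own.
    """
    segments, current, count = [], [], 0
    for topic in topics:
        words = len(topic.split())
        if current and count + words > max_words:
            segments.append("\n".join(current))
            current, count = [], 0
        current.append(topic)
        count += words
    if current:
        segments.append("\n".join(current))
    return segments


script = "Sam: one two three\n---\nMaya: four five six\n---\nSam: seven eight nine"
segments = pack_segments(split_topics(script), max_words=8)
print(len(segments))
```

Each resulting segment is rendered on its own, so every seam falls on a topic break rather than mid-thought.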

Words get clipped at the seams. The last word of a segment kept getting swallowed — “anymore” came out as “any-”. The fix is almost silly: we append a throwaway sentinel word the model can clip instead, then cut the sentinel back off. The real last word survives because something expendable stood behind it.
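One way the sentinel trick could look in code. This is a sketch under stated assumptions: the sentinel word itself, 16-bit mono PCM audio, and word start times supplied by a speech-to-text pass are all choices made for the example, and both function names are hypothetical.

```python
SENTINEL = "Okay."  # hypothetical throwaway word; any expendable word works


def _bare(word: str) -> str:
    """Lowercase a word and drop surrounding punctuation for comparison."""
    return word.strip(" .,!?-").lower()


def add_sentinel(segment_text: str, sentinel: str = SENTINEL) -> str:
    """Append the sentinel so it, not the real last word, sits at the seam."""
    return f"{segment_text.rstrip()} {sentinel}"


def trim_sentinel(
    pcm: bytes,
    sample_rate: int,
    words: list[tuple[str, float]],  # (word, start time in seconds), assumed from STT
    sentinel: str = SENTINEL,
    sample_width: int = 2,  # bytes per sample; 2 means 16-bit mono
) -> bytes:
    """Cut the audio where the trailing sentinel starts.

    If the final recognised word is not the sentinel, the model already
    clipped it away entirely, so the audio is returned unchanged.
    """
    if words and _bare(words[-1][0]) == _bare(sentinel):
        cut = round(words[-1][1] * sample_rate) * sample_width
        return pcm[:cut]
    return pcm


text = add_sentinel("Maya: We do not stitch anymore.")
one_second = bytes(2 * 1000)  # 1 s of silence at a toy 1000 Hz sample rate
trimmed = trim_sentinel(one_second, 1000, [("anymore", 0.1), ("Okay.", 0.5)])
print(text, len(trimmed))
```

Only the final word is checked on purpose: if the sentinel word also occurs earlier in the segment, cutting at that earlier occurrence would discard real dialogue.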

Models are confident and wrong. Now and then a segment quietly dropped or mangled a line and reported nothing. So every segment goes back through speech-to-text and gets compared to the script. If the audio does not match the words, we re-roll it automatically. It is the same instinct that runs through all of our work: stay suspicious of what the model hands back and verify before you trust, the way production-ready AI earns autonomy in shadow mode first.
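The check-and-re-roll loop can be sketched like this. The renderer and the transcriber are passed in as plain callables because the real services are outside the scope of the example; the 0.95 threshold, the retry limit, and every name here are assumptions, not measured values from our pipeline.

```python
import re
from difflib import SequenceMatcher
from typing import Callable


def normalize(text: str) -> list[str]:
    """Lowercase and tokenise so punctuation and casing do not count as errors."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches_script(script: str, transcript: str, threshold: float = 0.95) -> bool:
    """Compare word sequences; a dropped or mangled line lowers the ratio."""
    matcher = SequenceMatcher(None, normalize(script), normalize(transcript), autojunk=False)
    return matcher.ratio() >= threshold


def render_verified(
    script: str,
    render: Callable[[str], bytes],      # hypothetical: text in, audio bytes out
    transcribe: Callable[[bytes], str],  # hypothetical: audio bytes in, text out
    max_attempts: int = 3,
) -> bytes:
    """Render, transcribe, compare, and re-roll until the audio matches the script."""
    for _ in range(max_attempts):
        audio = render(script)
        if matches_script(script, transcribe(audio)):
            return audio
    raise RuntimeError(f"no matching render after {max_attempts} attempts")


# Stand-in services: the first take drops the last word, the second is clean.
takes: list[bytes] = []


def fake_render(script: str) -> bytes:
    takes.append(f"take{len(takes) + 1}".encode())
    return takes[-1]


def fake_transcribe(audio: bytes) -> str:
    return "we do not stitch" if audio == b"take1" else "We do not stitch anymore."


audio = render_verified("We do not stitch anymore.", fake_render, fake_transcribe)
print(audio, len(takes))
```

Raising after the last attempt matters as much as the retry: a segment that never matches should stop the build rather than ship quietly.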

Stale audio is its own bug. We would fix an episode and still hear the old version, served from a cache that had no idea anything changed. So we version every file by the hash of its actual bytes. The URL changes whenever the audio does, and only then, so a corrected episode reaches listeners instead of sitting behind a stale cache entry. A fix nobody can hear is its own kind of not-fixed.
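Content-hash versioning is small enough to show whole. A minimal sketch, assuming SHA-256 and a shortened digest embedded in the file name; the naming scheme and the `versioned_name` helper are hypothetical.

```python
import hashlib


def versioned_name(audio: bytes, stem: str, suffix: str = ".mp3", digest_len: int = 12) -> str:
    """Build a file name that changes if and only if the audio bytes change."""
    digest = hashlib.sha256(audio).hexdigest()[:digest_len]
    return f"{stem}.{digest}{suffix}"


old = versioned_name(b"first render", "episode-12")
same = versioned_name(b"first render", "episode-12")
fixed = versioned_name(b"corrected render", "episode-12")
print(old == same, old == fixed)
```

Because the name is derived from the bytes, re-uploading an unchanged file keeps its URL and its cache entry, while any real change produces a new URL that no cache has seen before.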

The pattern, if you have read our other field notes, will be familiar: the biggest lever was never the model. It was the verification wrapped around it.

First Thing Tomorrow

If you want to generate a real two-host podcast rather than a stitched-together one:

  1. Do not stitch a conversation. Generate it. Use a native multi-speaker model (Gemini 3.1) and ask for the whole dialogue in one pass. Assembling separate lines is where the choppiness comes from, and no amount of mixing removes it.
  2. Segment long scripts at topic breaks. Keep each chunk short enough that the model has no room to drift on pacing or voice. Stitch at the seams you chose, not the ones it imposes.
  3. Verify every segment automatically. Run the audio back through speech-to-text, compare it to the script, and re-roll any mismatch. Do not trust a render because it sounded fine the one time you listened.
  4. Protect the seams. The final word of a chunk is where clipping hides. Append a sentinel word for the model to clip instead, then trim it off, and the word that matters survives.
  5. Cache-bust on content, not filename. Version each file by the hash of its bytes. Otherwise you will ship a fix that nobody can hear.

The Bottom Line

The first versions were bad for a reason worth remembering: we were assembling a conversation instead of generating one. The model that fixed it was Gemini. The thing that made it reliable was not the model — it was the verification we built around it.

Consolidating onto Google Cloud was the unglamorous half of the win, and we had been putting it off until a good conversation made us commit. Thanks to Zara Craig and the Aviato Consulting team for the nudge.

And the two hosts you can hear on this very post still do not exist. The pipeline above made them. Hit play.


Consolidating a fragmented AI stack, or trying to get something real shipping on Google Cloud? That is the gap we work in. Let’s talk.