There's a moment every developer hits when using AI to code. You're in the zone, Claude is shipping features faster than ever, and then you look at the diff and realise it quietly rewrote an API contract you never asked it to touch. Not because it was dumb — because it was too helpful. It saw an opportunity to "improve" something, and it took it. No one told it not to.
I hit that wall enough times to start questioning the whole approach. One agent, full access, god mode. And it works, until it doesn't. Until the AI makes a product decision disguised as a code change, and you don't catch it until two features later, when something breaks, and you're spending an hour tracing back to a commit that shouldn't exist.
The problem was never intelligence. It was scope. An AI with no boundaries will expand into every gap it finds.
So I did something that felt almost silly at the time. I took one powerful AI and broke it into eight weaker ones. Each has a narrow job description. Each has an explicit list of things it is not allowed to do.
- A PM agent that can interrogate a PRD and write a phased plan, but literally cannot suggest a tech stack.
- An Architect that can design systems, APIs, and data models, but cannot write a single line of production code.
- A Coder that implements exactly what the Architect specified — and if something doesn't fit, it raises a flag and waits instead of quietly improvising.
- A Reviewer that reads everything and flags bugs, but cannot fix them itself.
- A QA agent that writes tests and issues a pass-or-fail verdict, but cannot touch production code.
- A Shipper that handles commits, PRs, and merges, but refuses to ship unless QA has passed.
- An APK Builder that sits outside the whole pipeline, building Android binaries for testing, unable to modify a single source file.
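In practice, each of those job descriptions boils down to an allow-list and a deny-list. Here's a minimal sketch of how that might look as data; the role names follow the post, but the structure and action names are mine, not the author's actual prompts:

```python
# Hypothetical role definitions: each agent gets a narrow job plus an
# explicit list of things it must never do. The agent names mirror the
# post; the config format itself is illustrative, not the real setup.
ROLES = {
    "pm":          {"can": ["read_prd", "write_plan"],       "cannot": ["choose_tech_stack"]},
    "architect":   {"can": ["design_system", "design_api"],  "cannot": ["write_production_code"]},
    "coder":       {"can": ["implement_spec", "raise_flag"], "cannot": ["improvise_design"]},
    "reviewer":    {"can": ["read_all", "flag_bugs"],        "cannot": ["fix_bugs"]},
    "qa":          {"can": ["write_tests", "issue_verdict"], "cannot": ["touch_production_code"]},
    "shipper":     {"can": ["commit", "open_pr", "merge"],   "cannot": ["ship_without_qa_pass"]},
    "apk_builder": {"can": ["build_apk"],                    "cannot": ["modify_source"]},
}

def allowed(agent: str, action: str) -> bool:
    """An action is permitted only if the role explicitly grants it."""
    role = ROLES[agent]
    return action in role["can"] and action not in role["cannot"]
```

The point of writing it down this way is that the default is "no": `allowed("reviewer", "fix_bugs")` is false not because fixing bugs was forbidden, but because it was never granted.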
They talk to each other through the most boring system imaginable — a folder of markdown files and one JSON handoff. Every agent reads what the previous one wrote and adds their own piece. No shared memory. No magic. Just files.
docs/<task-name>/
├── SESSION.md ← bootstrapper sets up the workspace
├── pm-plan.md ← PM writes the phased plan
├── handoff.json ← PM hands structured context to architect
├── architecture.md ← architect designs the system
├── implementation-log.md ← coder logs every change
├── review-report.md ← reviewer flags what's broken
├── qa-report.md ← QA issues pass or fail
└── ship-log.md ← shipper records what shipped
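The handoff file is just structured context for whoever comes next. Something in this shape; the field names are my guess at what a PM-to-Architect handoff could carry, not the author's actual schema:

```json
{
  "task": "add-offline-sync",
  "from": "pm",
  "to": "architect",
  "phase": 1,
  "requirements": ["queue writes while offline", "replay on reconnect"],
  "out_of_scope": ["changing the public API contract"],
  "artifacts": ["docs/add-offline-sync/pm-plan.md"]
}
```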
When something breaks, I can trace the entire chain. Which requirement was misunderstood. Which design decision got ignored. Which edge case the coder missed. It's not glamorous, but it works the way good engineering has always worked — through paper trails, not intuition.
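Because the pipeline's state is just files, "tracing the chain" is literally reading them in order. A throwaway sketch of that, assuming the filenames from the tree above; the script itself is mine, not part of the system:

```python
from pathlib import Path

# Pipeline order matches the handoff chain; each file is one agent's output.
CHAIN = [
    "SESSION.md", "pm-plan.md", "handoff.json", "architecture.md",
    "implementation-log.md", "review-report.md", "qa-report.md", "ship-log.md",
]

def trace(task_dir: str) -> list[str]:
    """List the chain's artifacts in order, noting which are missing.

    The first gap shows exactly where the pipeline stopped.
    """
    root = Path(task_dir)
    return [f"{name}: {'present' if (root / name).exists() else 'MISSING'}"
            for name in CHAIN]
```

Run it against `docs/<task-name>/` and the first `MISSING` entry tells you which agent never got its turn.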
The idea came from something I already knew but hadn't applied. Every good team I've worked with in a decade of building products operated on the same principle — people were effective not because of what they could do, but because of what they didn't do. The PM didn't push code. The engineer didn't redefine product scope mid-sprint. QA didn't deploy to production on a whim. Those weren't limitations on the team. Those were the reasons the team worked.
I just applied the same logic to Claude.
What surprised me is how much the constraints changed the output quality. The same Sonnet model that would hallucinate architecture decisions as a general-purpose assistant becomes a remarkably disciplined coder when you tell it: "Your job is to implement what the Architect designed, flag anything that doesn't fit, and never freelance." It's the same model. The only difference is what it's not allowed to do.
There's a broader thing here that I keep coming back to. The instinct with AI is always to give it more — more access, more capability, fewer guardrails. But the best results I've gotten have come from going the other way. Not dumber models, sharper roles. Not less capability, less scope. We figured this out with human teams a long time ago. Constraints aren't the enemy of good work. They're what make good work possible.
Turns out, that's just as true for the agents we build.