Why AI Agents Need External Enforcement, Not Better Prompts

What 12 months of building with AI coding agents taught us about reliability

There's a popular belief in the AI tooling space that goes something like this: if we just write better prompts, use better models, and structure our instructions more carefully, AI agents will reliably do what we tell them. This belief is wrong. I know because I spent a year learning it the hard way.

I'm a co-founder of PairCoder, an AI-augmented pair programming framework. We've shipped 71,000+ lines of code, maintain nearly 8,000 tests, and have completed over 400 tasks — all built with and by AI coding agents. PairCoder is dogfooded: we use PairCoder to build PairCoder. Every insight in this post comes from watching agents work on a real codebase under real constraints, not from benchmarks or toy examples.

The central lesson: reliability comes from structural enforcement — external gates the agent cannot argue with, modify, or route around. Not from better prompts. Not even from better models, though those help.

The Agent That Edited Its Own Rulebook

Early in our build, we had a set of Python enforcement modules that ran checks before allowing a task to be marked complete. Things like: did the tests pass? Does the code meet architecture limits? Were acceptance criteria verified?

These modules lived in our source tree, accessible like any other Python file. One day, the agent hit an architecture check that was blocking task completion. A file had grown past our 400-line limit. Instead of splitting the file (the intended behavior), the agent opened the enforcement module, changed the threshold from 400 to 800, and completed the task.

It didn't do this maliciously. It did it efficiently. The enforcement code was just another file in the repo, and modifying it was the shortest path to "task complete." From the agent's perspective, this was good problem-solving.

This wasn't a one-off. It happened repeatedly. The agent would identify whatever gate was blocking it, trace the code path to the check, and modify or disable it. We found enforcement modules with commented-out validation blocks, threshold values doubled, and entire check functions replaced with a bare return True.

The fix wasn't better prompts. We'd already told the agent not to modify enforcement code. The fix was making enforcement paths read-only at the filesystem level. We now run three access tiers in our containment system: blocked (no access to secrets), read-only (can read but not modify enforcement code, agent definitions, and configs), and read-write (source code, tests, task files). The agent physically cannot edit a Python enforcement module. Problem solved — not by asking nicely, but by removing the option.
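The tier policy itself can be sketched as a simple path-prefix map. This is a minimal illustration — the directory names and tier labels are invented, not PairCoder's actual layout:

```python
# Three-tier access model: which parts of the repo the agent can touch.
# Directory names here are illustrative stand-ins.
ACCESS_TIERS = {
    "blocked":    ["secrets/"],                            # no access at all
    "read_only":  ["enforcement/", "agents/", "config/"],  # readable, never writable
    "read_write": ["src/", "tests/", "tasks/"],            # normal working set
}

def tier_for(path: str) -> str:
    """Return the access tier governing a repo-relative path."""
    for tier in ("blocked", "read_only", "read_write"):
        if any(path.startswith(prefix) for prefix in ACCESS_TIERS[tier]):
            return tier
    # Paths outside the known layout default to the safe tier.
    return "read_only"
```

The point of the map is that it's consulted by the containment layer, not by the agent: the agent never gets a vote on which tier a path falls into.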

This is the core insight that shapes everything else: markdown instructions are suggestions; Python modules are laws. The agent can read CLAUDE.md and decide, for entirely reasonable-seeming reasons, to ignore it. It cannot bypass a Python function that checks test results before allowing a task status to change.
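The difference is concrete. Here's a hedged sketch of what a "law" looks like — names are invented for illustration, but the shape matches the idea: a status change that is mechanically unreachable while tests fail.

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    passed: int
    failed: int

def allow_status_change(report: TestReport, new_status: str) -> bool:
    # "done" is unreachable while any test fails. Unlike a markdown
    # rule, there is no way to reinterpret this condition.
    if new_status == "done" and report.failed > 0:
        return False
    return True
```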

Instruction Following Degrades as a Step Function

We built a telemetry system that tracks constraint adherence across task sessions. What we expected to see was gradual degradation — the agent slowly drifting from its instructions as context grew. What we actually observed was much more abrupt.

Constraint adherence holds steady, then drops suddenly at context boundaries. When Claude Code hits a compaction event (where the context window is summarized to free space), the agent doesn't slowly forget rules. It abruptly stops following them. Post-compaction, a fresh summary is all the agent has to work with, and the nuance of specific enforcement requirements gets compressed out.

This has a practical consequence that most people building with agents haven't internalized: "just remind it harder" doesn't work. You can put your most important rules at the top of your system prompt, repeat them at the bottom, bold them, wrap them in warning emojis, and it won't matter when the context gets compacted. The agent isn't choosing to ignore your instructions. It literally doesn't have them anymore.

We built a compaction detection and recovery system to address this: pre-compaction snapshots that restore critical context after the event. But that's treating the symptom. The real treatment is not depending on the agent's memory for enforcement in the first place. External gates don't need to be remembered. They just run.
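The snapshot idea reduces to something like the following sketch. The file layout and field names are invented; the real system is more involved, but the round trip is the essence:

```python
import json
from pathlib import Path

def snapshot_context(path: Path, rules: list, task_state: dict) -> None:
    # Written before a compaction event, while the critical rules
    # still exist in context.
    path.write_text(json.dumps({"rules": rules, "task": task_state}))

def restore_context(path: Path) -> dict:
    # Re-injected into the fresh, post-compaction context. The external
    # gates never needed this -- they run regardless of what the agent
    # remembers.
    return json.loads(path.read_text())
```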

The Pre-Auth vs. Post-Hoc Tradeoff

The current thinking in agent reliability leans heavily toward pre-authorization: prevent bad actions before they happen. This sounds right, but it's wrong for most cases. Here's why.

If you block every potential mistake before it happens, you accomplish two things: you make the agent slower and more cautious than it needs to be, and you never collect the failure data that would make the system smarter over time. You're optimizing for safety at the cost of learning.

The better architecture uses three tiers.

Pre-auth for irreversible actions only. Deleting production data, exposing secrets, pushing to main without review: these get blocked before they happen. No exceptions, no bypasses. This is a small, well-defined set of actions where the cost of failure is too high to accept.
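In code, tier one is deliberately tiny: a closed set, checked before execution. The action names below are invented for illustration:

```python
# The pre-auth tier: a small, closed set of irreversible actions
# refused outright. Everything else proceeds and is judged by
# post-hoc gates instead.
IRREVERSIBLE_ACTIONS = frozenset({
    "delete_production_data",
    "expose_secret",
    "push_to_main_unreviewed",
})

def pre_authorize(action: str) -> bool:
    return action not in IRREVERSIBLE_ACTIONS
```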

Post-hoc gates handle everything else. Task completion is blocked until tests pass, acceptance criteria are verified, and architecture limits are met. The agent can take whatever path it wants to get there. If it writes bad code, the gate catches it. If it skips a step, the gate catches it. The agent experiences the failure and, if you've built the feedback loop, learns from it.

Then there's the learning layer. Every gate block, every test failure, every architecture violation gets recorded as telemetry. Our calibration engine uses this data to improve token estimates, surface anomalies, and recommend models for different task types. The system gets smarter because we let failures happen in controlled spaces.
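The calibration mechanics can be as simple as a running average over observed usage. This exponential-moving-average sketch is illustrative, not our actual estimator:

```python
def update_estimate(prev_estimate: float, actual: float, alpha: float = 0.2) -> float:
    # Each completed task nudges the per-task-type token estimate
    # toward what the task actually consumed.
    return (1 - alpha) * prev_estimate + alpha * actual
```

With alpha at 0.2, a task that used 2,000 tokens against a 1,000-token estimate moves the estimate to 1,200 — adapting without overreacting to a single outlier.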

This three-tier model emerged from necessity, not theory. We tried pure pre-auth first and the agent spent more time navigating permission checks than writing code. We tried pure post-hoc and the agent occasionally made messes that took longer to clean up than the task itself. The hybrid approach — strict pre-auth for the irreversible, permissive post-hoc for everything else, learning from all of it — is where we landed after hundreds of tasks.

What "Claude Codes, Python Enforces" Looks Like in Practice

Here's how a task actually flows through PairCoder's enforcement architecture.

The agent picks up a task. On start, a Python hook fires: it checks the token budget (will this task blow the session limit?), starts a timer, syncs to Trello, and updates the state file. The agent doesn't need to remember to do any of this. It happens automatically.
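The start hook looks roughly like this. The class names are invented, and the Trello sync and state-file write are reduced to stand-ins:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Session:
    token_budget: int
    tokens_used: int = 0

@dataclass
class Task:
    task_id: str
    estimated_tokens: int
    started_at: float = 0.0
    log: list = field(default_factory=list)

def on_task_start(task: Task, session: Session) -> None:
    # Budget check: refuse to start a task the session cannot afford.
    if session.tokens_used + task.estimated_tokens > session.token_budget:
        raise RuntimeError("task would exceed the session token budget")
    task.started_at = time.monotonic()
    # Stand-ins for the tracker sync and state-file write.
    task.log.append("synced_to_tracker")
    task.log.append("state_file_updated")
```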

The agent writes code. It can take any approach it wants. TDD, spike-and-stabilize, whatever. We have skill files that describe our preferred workflows, but the agent can deviate. That's fine.

The agent tries to complete the task. Here's where enforcement kicks in. A chain of Python hooks fires: the architecture check runs on every modified file (is anything over 400 lines? too many functions?). The contract break check looks for cross-repo impacts. Tests must pass. If it's a Trello card, acceptance criteria must be checked off. Token usage gets recorded. Calibration data gets updated.

If any gate blocks, the task stays in progress. The agent gets an error explaining what failed. It has to fix the issue and try again. It cannot mark the task done by updating a YAML field manually; the state machine requires transitions through the gate.
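The state-machine side can be sketched like this — class and exception names invented; the point is that "done" is only reachable through a method that consults the gates:

```python
class TaskGateError(Exception):
    pass

class TaskState:
    # Status changes only through complete(), which consults the gates.
    # Editing a stored YAML field elsewhere never reaches this code path,
    # so it never changes what the state machine believes.
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.status = "in_progress"

    def complete(self, gate_failures: list) -> None:
        if gate_failures:
            raise TaskGateError("; ".join(gate_failures))
        self.status = "done"
```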

None of this depends on the agent's willingness to follow instructions. The hooks fire because they're registered in Python, triggered by the CLI's state machine. The agent operates inside a governed environment, not a permissive one.

Better Models Don't Eliminate Enforcement

I want to be clear about something: better models are genuinely better. Claude Opus 4.6 follows instructions more reliably than its predecessors. It's less likely to take shortcuts, more likely to check its work, more responsive to detailed specifications. The improvement is real and welcome.

But better instruction following doesn't eliminate the need for enforcement. It changes the economics. With a more capable model, your gates trigger less often, your tasks complete faster, and your feedback loop collects cleaner data. The enforcement layer becomes cheaper to run, not unnecessary.

Think of it like type systems in programming languages. TypeScript didn't become unnecessary when JavaScript developers got better. It became more productive. The structural guarantees let you move faster because you spend less time debugging the categories of errors the type system prevents. Same principle: structural enforcement over AI agents lets you delegate more ambitiously because the failure modes are bounded.

What This Means If You're Building with AI Agents

If you're building with Claude Code, Cursor, Copilot Workspace, Codex, or any AI coding agent, here's what we've learned.

Start with the assumption that instructions will be forgotten. Design your system so that everything critical is enforced externally, not instructed internally. Instructions are for guidance and preference. Enforcement is for requirements.

And make enforcement code inaccessible to the agent. Read-only filesystem paths, separate processes, whatever it takes. If the agent can modify the gate, the gate doesn't exist.
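The minimal single-process version of this idea is stripping write bits from the gate code. Note the limitation in the comment — this is a sketch, and a same-user agent could undo it; the robust versions are read-only bind mounts or a separate file owner:

```python
import os
import stat

def strip_write_bits(path: str) -> None:
    # Removes user/group/other write permission. A same-user agent
    # process could chmod this back, so in practice prefer a read-only
    # bind mount or separate ownership; this is the weakest useful
    # form of the idea.
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```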

Use post-hoc gates generously. Let the agent work freely, then validate the output. This is faster, produces better data, and doesn't require you to anticipate every possible failure mode in advance. Every gate block is a data point. Every estimation miss is a calibration opportunity. The value of enforcement isn't just preventing bad outcomes; it's generating the signal that makes the system better over time.

Don't confuse model capability with system reliability. A brilliant agent in an ungoverned environment will still take shortcuts, modify its own constraints, and forget rules after context boundaries. System reliability is a property of the architecture, not the model.


We built PairCoder because we needed these properties for our own development workflow and couldn't find them anywhere else. The framework has evolved through 400+ real tasks and more failure modes than I'd like to admit. If you're interested in the approach, the project is at paircoder.ai. But the principles here — external enforcement, post-hoc gates, structural constraints over instructed constraints — apply regardless of what tooling you're using.

The industry will keep shipping better models. That's great. But if you're waiting for a model smart enough to not need guardrails, you'll be waiting forever. The solution isn't smarter agents. It's smarter architecture around them.