We Used PairCoder to Build PairCoder: 12 Months of Eating Our Own Dog Food
What happens when the tool you're building is the tool you're building with
There's a special kind of hell reserved for people who build enforcement systems using the thing being enforced. Imagine writing a speed limit into law while driving a car that's governed by the law you haven't finished writing yet. That's what the last twelve months have felt like.
PairCoder is a framework for AI-augmented pair programming. It enforces structured workflows on AI coding agents: things like "write tests before code," "don't generate 900-line files," and "update the project state when you finish a task." Simple rules. The kind of rules that seem unnecessary until you watch an AI agent cheerfully ignore all of them at 2 AM while you're trying to ship a feature.
We've been building PairCoder using PairCoder since day one. Not as a marketing exercise, but because we had no choice: it was the only way to find out where the product actually breaks.
Here's what we learned.
The Recursive Problem
The first thing you discover when you dogfood an AI enforcement tool is that the feedback loops are disorienting.
You're building a system that constrains AI behavior. You're using an AI agent to write the code. That agent is being constrained by the system you're actively modifying. When the enforcement layer has a bug, the agent you're using to fix that bug is subject to it.
Early on, we had a strict mode that blocked file edits unless the agent had an active task. Reasonable rule. Except we introduced a regression where task IDs in a certain format weren't recognized by the enforcement layer. Claude Code couldn't edit the file that contained the fix for the bug preventing it from editing files. We had to manually bypass our own system to patch it.
This isn't a one-time anecdote. It's a recurring dynamic. Every time we tighten enforcement, there's a window where the tool we're using to tighten enforcement is affected by the tightening. You learn to stage changes carefully. You learn to keep bypass flags available. And you learn, viscerally, that enforcement must default to restrictive with explicit opt-out, not permissive with opt-in, because agents will find every gap you leave open.
We codified this as a core principle: Claude codes, Python enforces. Don't rely on markdown instructions and hope the AI follows them. Use deterministic Python gates that physically block the wrong behavior. The AI can't ignore a function that returns False.
Features Born from Pain
Every major feature in PairCoder has a "we got burned" origin story. Here are a few.
Compaction detection came first. Claude Code has a context window. When it fills up, the system compacts — summarizing earlier conversation to make room. In theory, this is smooth. In practice, Claude lost context mid-task and rebuilt a module we'd already finished. We didn't notice for an hour because the output looked plausible. The compaction detection system, which saves snapshots before compaction and offers recovery afterward, exists because we wasted an afternoon on phantom work.
Architecture enforcement gates came next. We preach small files. Focused modules. Clean separation of concerns. Claude, left unsupervised, generated a 992-line Trello client, an 816-line estimation module, and a 783-line skill suggestion engine. All in a single sprint. Our own tool was violating the standards we built the tool to enforce. Now bpsai-pair arch check runs as a gate hook on every task completion. If a modified file exceeds 400 lines, the task can't be marked done. We've decomposed over a dozen files since adding this, and the codebase is measurably easier to navigate.
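The gate itself is simple enough to sketch. Function names here are illustrative, not the actual bpsai-pair internals:

```python
from pathlib import Path

MAX_LINES = 400  # the limit discussed above


def arch_check(modified_files: list[str]) -> list[str]:
    """Return the files exceeding the line limit; an empty list means the gate passes."""
    violations = []
    for name in modified_files:
        text = Path(name).read_text(encoding="utf-8", errors="replace")
        if len(text.splitlines()) > MAX_LINES:
            violations.append(name)
    return violations


def gate_task_completion(modified_files: list[str]) -> None:
    """Gate hook: block task completion while any modified file is oversized."""
    oversized = arch_check(modified_files)
    if oversized:
        raise SystemExit(
            f"arch check failed: {', '.join(oversized)} exceed {MAX_LINES} lines"
        )
```

Raising instead of warning is the whole point: a warning is markdown-instruction territory, while a nonzero exit physically stops the workflow.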
Telemetry and token tracking grew out of ignorance. For the first several months, we had no idea how many tokens we were consuming. AI-assisted development feels free in the moment; you're just having a conversation. Then the bill arrives and you realize a single sprint burned through a surprising amount of API credits. The entire telemetry pipeline — session parsing, token counting, cost estimation, budget warnings — was born from a billing surprise we'd rather not have repeated.
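The arithmetic at the bottom of that pipeline is tiny, which is part of why it's so easy to skip. A minimal sketch, with provider rates passed in rather than hard-coded, since pricing changes:

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Turn raw token counts into a dollar figure.

    Rates are per million tokens, supplied by the caller from whatever
    the provider currently charges.
    """
    return (input_tokens / 1_000_000) * usd_per_m_input + (
        output_tokens / 1_000_000
    ) * usd_per_m_output
```

Multiply this across every session in a sprint and the "feels free" illusion disappears quickly.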
The task state machine solved a different kind of mess. Originally, task status was a free-text field. in_progress, done, blocked — whatever you wanted to write. This was fine at 10 tasks. At 50, we started finding tasks marked "done" that had never been started, tasks stuck in "in_progress" from three sprints ago, and tasks whose status didn't match their Trello card. The formal state machine — NOT_STARTED → BUDGET_CHECKED → IN_PROGRESS → AC_VERIFIED → COMPLETED — with enforced transitions and audit logging, eliminated an entire category of confusion.
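A state machine with enforced transitions is one of those patterns that's cheap to build and expensive to skip. A minimal sketch of the shape described above (the transition table mirrors the states named in the text; the function names are illustrative):

```python
from enum import Enum


class TaskState(Enum):
    NOT_STARTED = "not_started"
    BUDGET_CHECKED = "budget_checked"
    IN_PROGRESS = "in_progress"
    AC_VERIFIED = "ac_verified"
    COMPLETED = "completed"


# The only legal transitions; anything else raises.
TRANSITIONS = {
    TaskState.NOT_STARTED: {TaskState.BUDGET_CHECKED},
    TaskState.BUDGET_CHECKED: {TaskState.IN_PROGRESS},
    TaskState.IN_PROGRESS: {TaskState.AC_VERIFIED},
    TaskState.AC_VERIFIED: {TaskState.COMPLETED},
    TaskState.COMPLETED: set(),
}


def transition(current: TaskState, target: TaskState, audit: list[str]) -> TaskState:
    """Enforce a transition and record it for the audit log."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    audit.append(f"{current.name} -> {target.name}")
    return target
```

With this in place, a task marked "done" that was never started isn't a data-quality mystery; it's an exception with a stack trace.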
Acceptance criteria verification was the last piece. We added Trello checklist support early. Nice feature. But Claude would mark tasks "done" without checking a single acceptance criterion. The card would move to the Done column with zero items checked. Now task done verifies that all AC items are checked before allowing completion. It's the kind of gate that seems pedantic until you realize it's caught dozens of incomplete tasks.
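The check itself is trivial, which is exactly why it belongs in code rather than in an instruction the agent might lose. A sketch, assuming checklist items shaped roughly like Trello's (name plus checked flag; the structure here is hypothetical):

```python
def verify_acceptance_criteria(checklist: list[dict]) -> None:
    """Block task completion unless every acceptance-criterion item is checked."""
    unchecked = [item["name"] for item in checklist if not item.get("checked")]
    if unchecked:
        raise RuntimeError(
            f"{len(unchecked)} acceptance criteria unchecked: {', '.join(unchecked)}"
        )
```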
Each of these features started the same way: something went wrong, we lost time, and we built a gate so it couldn't happen again. The product isn't a vision document that survived contact with reality. It's a scar tissue collection that grew into a platform.
The Constraint Drift Discovery
This one changed how we think about AI agents fundamentally.
We noticed enforcement adherence dropping over long sessions. Tasks would start clean: tests first, architecture checks passing, state.md updated promptly. Then, gradually, discipline would erode. Tests would get skipped. State updates would be forgotten. File sizes would creep up.
Our initial assumption was that this was gradual. Like a person getting tired at the end of a long day. A slow fade.
Telemetry told a different story. It wasn't gradual at all. It was a step function.
Agents don't slowly forget rules. They abruptly stop following them at context boundaries. When compaction happens, when a new session starts, when the context window reloads: that's when adherence drops. Not by 10%. By 80%. The agent goes from near-perfect compliance to near-zero compliance in a single turn.
This makes sense if you think about it mechanically. The rules exist in the context window. When context is compacted or a session restarts, the rules get summarized or dropped. The agent isn't "forgetting"; it literally doesn't have the instructions anymore.
This discovery reshaped our entire architecture. Instead of relying on instructions that live in the context window — which is volatile — we moved enforcement into Python code that runs regardless of what the agent remembers. Hook functions that fire on task completion. Gate checks that block invalid state transitions. Budget validation that runs before task start. The agent can forget every rule we've ever written, and the Python enforcement layer still catches the violation.
We also built session restart detection and compaction recovery. When a new session starts or compaction is detected, the system automatically re-injects critical context. Not because the agent asked for it, but because the system knows it's needed.
Scale Tells the Real Story
At 10 tasks, everything works. Your ad-hoc system, your text files, your manual tracking — all fine. This is the dangerous zone, because it gives you false confidence.
At 50 tasks, cracks appear. You start losing track of what's done and what isn't. Task files conflict with Trello cards. You forget which sprint a task belongs to. The first "nice to have" features become necessary.
At 400+ tasks — which is where we are now — you find out what actually holds up.
The Trello integration went from "convenient" to "can't function without." Having a single source of truth for task status, with automated card movement and custom field sync, is the difference between knowing the state of your project and guessing. We sync effort levels, stack tags, acceptance criteria, and status across both systems. When they drift, hooks catch it.
The budget tracking system went from "interesting" to "essential." When you're planning a sprint with 15 tasks and each task has a token estimate based on historical calibration data, you can actually predict whether the sprint fits in a budget. Before this, we'd routinely plan 2x more work than we could afford to run.
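The planning math is straightforward once per-task estimates exist. A sketch of the fit check, with a safety margin parameter that's our own illustrative addition rather than a documented PairCoder setting:

```python
def sprint_fits_budget(
    estimates: dict[str, int], budget_tokens: int, margin: float = 0.2
) -> tuple[bool, int]:
    """Sum per-task token estimates, pad by a safety margin, compare to budget.

    Returns (fits, padded_total) so the caller can report how close the call was.
    """
    padded = int(sum(estimates.values()) * (1 + margin))
    return padded <= budget_tokens, padded
```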
The calibration engine — which adjusts token and duration estimates based on actual performance — is the feature we're most proud of and the one that took the longest to get right. Early estimates were off by 5-10x. After several hundred tasks of calibration data, estimates are within 20-30% for well-categorized task types. The system gets better the more you use it, which is exactly what a learning platform should do.
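The simplest version of that feedback loop is a correction factor derived from past (estimated, actual) pairs for a task type. This is a deliberately naive sketch of the idea, not the actual calibration engine:

```python
def calibrated_estimate(raw_estimate: int, history: list[tuple[int, int]]) -> int:
    """Scale a raw token estimate by the historical actual/estimate ratio.

    `history` holds (estimated, actual) pairs for the same task type.
    With no history, the raw estimate passes through unchanged.
    """
    if not history:
        return raw_estimate
    ratios = [actual / est for est, actual in history if est > 0]
    factor = sum(ratios) / len(ratios)  # mean observed correction factor
    return int(raw_estimate * factor)
```

A production version would want outlier handling and per-category weighting, but even this naive mean is enough to pull a 5-10x error down dramatically once a few dozen data points exist.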
The Current State
Here's where we are after twelve months.
v2.15.7. 193 CLI commands. 7,866 tests. 88% coverage. 9 skills. 5 agents. A telemetry pipeline, a calibration engine, cross-repo orchestration, license management, and a setup wizard. What started as a bash script with opinions has grown into a full enforcement platform with roughly 75,000 lines of source code.
We went from "Claude, please follow these rules" to "Claude writes code, Python blocks the wrong code from shipping." That's the trajectory of the whole project: replacing hope with mechanism.
Is it done? Not remotely. We have 41 files that still exceed our own 400-line limit. 55 modules below our 80% coverage target. A PM abstraction layer that needs building before we can support anything beyond Trello. An entire remote access architecture for enterprise that's still on the roadmap.
But the product works. We know it works because we use it every day. Every task in the current sprint was planned with PairCoder, estimated with PairCoder's calibration engine, tracked through PairCoder's Trello integration, enforced by PairCoder's gate hooks, and measured by PairCoder's telemetry pipeline. When something breaks, we feel it immediately — and we build the fix.
The Uncomfortable Truth About Dogfooding
Here's the thing nobody tells you about eating your own dog food: you will hate your product at least once a week. You'll hit a bug that blocks your own work and feel the particular frustration of being stuck on something you built. You'll discover that a feature you were proud of is actually annoying to use at scale. You'll find yourself wanting to skip your own workflow because you're in a hurry, and you'll realize that if you want to skip it, your users definitely will.
That frustration is the signal. Every moment of friction is a product improvement waiting to happen. Every workaround you invent is a feature you need to build. Every time you mutter "this is stupid" at your own tool, you've found something worth fixing.
Twelve months of dogfooding hasn't made PairCoder perfect. It's made it honest. The features that exist are the features that survived contact with real work at real scale. The enforcement that remains is the enforcement that actually matters when you're under pressure and moving fast.
We didn't build the product we imagined. We built the product we needed. Those turned out to be very different things.