Anthropic's Own Teams Prove the Point: Context Engineering Is the Job Now
Anthropic published how ten internal teams use Claude Code. Their number one recommendation is the same thing I've been building around for a year.
Anthropic published how their own teams use Claude Code internally. Ten departments. Engineers, designers, lawyers, marketers. The number one tip across almost every team?
Write detailed CLAUDE.md files.
Their Data Infrastructure team. Their Security Engineering team. Their RL team. Their Product Design team. All saying the same thing: the quality of your context files determines how well the agent performs. Not the model. Not the prompt. The persistent context you give it before it starts working.
This matches my experience exactly. The single biggest factor in whether Claude produces usable output isn't how I phrase the request. It's what the agent already knows when it starts: project conventions, architecture constraints, file scope, the standards that apply. Get that context layer right and the agent builds what you meant. Leave it thin and you spend your time correcting drift.
The research is worth reading in full. Three patterns stood out.
Their RL Engineering team reports that Claude Code works on the first attempt about one-third of the time. The other two-thirds require guidance or manual intervention. One-third. At Anthropic. With their own model, their own infrastructure, their own engineers who presumably understand the tool better than anyone.
If Anthropic's RL team is getting 33% first-attempt success, what's the realistic number for everyone else?
Their Data Science team describes what they call a "slot machine" workflow. Commit state. Let Claude run for 30 minutes. Look at what came back. Accept or roll back. That's not engineering. That's gambling with version control as the safety net.
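The loop is simple enough to sketch. Here's a minimal, self-contained version of it in Python, with the agent step stubbed out and everything else plain git. The file names, commit messages, and the hardcoded verdict are all placeholders, not anything from Anthropic's actual workflow:

```python
# A sketch of the "slot machine" loop: checkpoint, let the agent mutate files,
# inspect the diff, then keep or roll back. Runs against a throwaway git repo.
import pathlib
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(["git", "-C", str(repo), *args],
                          check=True, capture_output=True, text=True).stdout

repo = pathlib.Path(tempfile.mkdtemp())
git(repo, "init")
git(repo, "config", "user.email", "dev@example.com")  # placeholder identity
git(repo, "config", "user.name", "dev")

src = repo / "app.py"
src.write_text("print('known-good state')\n")
git(repo, "add", "-A")
git(repo, "commit", "-m", "checkpoint before agent run")  # pull the lever

# --- agent runs here (stubbed): it rewrites the file, unreviewed ---
src.write_text("print('agent output, unreviewed')\n")

print(git(repo, "diff", "--stat"))  # inspect what came back

accept = False  # the reviewer's verdict; hardcoded for the sketch
if accept:
    git(repo, "add", "-A")
    git(repo, "commit", "-m", "accept agent changes")
else:
    git(repo, "checkout", "--", ".")  # roll back: version control as safety net

print(src.read_text().strip())  # back to the checkpointed content
```

The whole safety story lives in that final `checkout`: the workflow only works because every run starts from a committed state it can fall back to.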
Their Security Engineering team built custom slash commands so heavily used that they account for 50% of all slash command usage across Anthropic's entire monorepo. One team, half of all command usage. They didn't just use Claude Code out of the box. They built structured workflows on top of it because the default interface wasn't enough.
Every one of these is a workflow problem, not a model problem. Better models won't fix a 33% first-attempt success rate if the context is thin. A faster agent running on bad architecture decisions still produces bad architecture faster. The slot machine workflow doesn't improve when the slot machine gets slightly luckier.
What fixes these problems is structure around the agent, not improvements inside it.
That's a distinction most of the discourse misses. The conversation about AI coding tools is almost entirely about model capability. Can it handle larger codebases? Can it reason about more complex problems? Can it write better tests? Those improvements matter. But Anthropic's own data shows that their best engineers, working with the best model, still needed to build layers of context management and structured workflows before the tool became reliable. The model was never the bottleneck. The workflow was.
The parallels to what we've built with PairCoder are specific enough to be worth walking through.
Every Anthropic team that reported strong results emphasized persistent context files. Their recommendations read like a checklist for what should be in a CLAUDE.md: project structure, coding conventions, testing standards, architectural constraints. PairCoder manages this automatically. When I start a task, the orchestration layer has already injected the sprint context, the files in scope, and the architecture limits that apply. I don't write a context prompt for each task. The system assembles it from the project state.
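To make that checklist concrete, here's what a file covering those four areas might look like. This is a hypothetical sketch for an invented project; the section names, paths, and tool choices are mine, not a prescribed schema:

```markdown
# CLAUDE.md (illustrative sketch)

## Project structure
- `api/` — FastAPI service. `web/` — React frontend. `infra/` — Terraform.

## Coding conventions
- Python 3.12, type hints required, ruff for lint and format.
- No new dependencies without updating `pyproject.toml` and saying why.

## Testing standards
- Every change under `api/` needs a pytest test under `tests/`.
- Run `make test` before declaring a task done.

## Architectural constraints
- `web/` never talks to the database directly; all access goes through `api/`.
- Never edit generated files under `api/migrations/`.
```

Note what this buys you: every session starts with the constraints already loaded, instead of the agent rediscovering (or violating) them one correction at a time.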
The checkpoint discipline that Anthropic's Data Science team does manually — commit, run, inspect, roll back — we built as compaction detection and recovery. Pre-compaction snapshots preserve state. Session restart detection re-injects critical context. The ceremony of remembering to commit before every experiment is handled by the system, not by the developer's discipline at 11 PM.
The custom slash commands Anthropic's Security team invented? We formalized those as skills and agent roles. Repeatable workflows encoded as structured definitions that any developer can use, not tribal knowledge that lives in one team's muscle memory.
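For reference, Claude Code's custom slash commands are themselves just Markdown files under `.claude/commands/`, where the filename becomes the command name. Here's a hypothetical definition — invented content, not the Security team's actual command:

```markdown
<!-- .claude/commands/security-review.md (hypothetical) -->
Review the current diff against main for security issues:

1. Check any new query or shell construction for injection risks.
2. Flag secrets, tokens, or credentials in the changed files.
3. Summarize findings, separated into blocking vs. advisory.
```

A developer then runs it as `/security-review`. The point of formalizing these as skills and roles is the same as the file above: the workflow becomes a shared, versioned artifact instead of one team's muscle memory.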
The difference isn't that we're smarter than Anthropic's engineers. It's that they're solving these problems one team at a time, locally, and we're solving them as architecture that ships to every project.
The 33% first-attempt number is the one I keep coming back to. Two out of three times, even at Anthropic, the agent needs correction. That's not a failure of the model. That's the reality of working with an agent that infers intent from context and gets it wrong more often than most people assume.
The teams that reported higher success rates all had the same thing in common: heavier investment in upfront context. Detailed CLAUDE.md files. Structured conventions. Clear boundaries. They spent more time telling the agent what it was working on and less time correcting what it produced.
That trade-off is the whole game. You can spend your time on the back end, reviewing and fixing output, or you can spend it on the front end, building the context layer so the output is right more often. The second approach compounds. Better context files make every future task better. Fixing output is the same work every time.
Context engineering isn't prompt tricks. It's deciding what the agent needs to know before it touches a file, and encoding that into something persistent so you're not re-explaining your codebase every session.
Anthropic's best engineers arrived at the same conclusion independently across ten teams: the agent needs more structure, not more freedom. Detailed context. Structured workflows. Checkpoint discipline. Formalized commands.
The model is extraordinary. The default workflow around it isn't enough. Ten teams figured that out on their own. The interesting question is what you build once you accept it.