A 4-Hour Task Takes 10 Minutes: What 400 Tasks Taught Us About AI Estimation

Why traditional software estimation breaks down with AI agents

Everything you think you know about estimating software work breaks down when an AI agent is doing the coding.


We've been building PairCoder, an AI pair programming framework, for about twelve months. During that time, we've tracked over 400 tasks end-to-end: estimated complexity, predicted tokens, actual tokens consumed, Claude Code execution time, and total wall-clock time including all the human parts. We built a calibration engine that feeds completion data back into future estimates. We watched the numbers stabilize, diverge, and occasionally make no sense at all.

Here's the uncomfortable truth the data forced us to confront: traditional software estimation isn't just inaccurate for AI-augmented development. It's measuring the wrong thing entirely.

The 25-50x Speed Illusion

Our earliest surprise was the raw execution speed. A task we'd estimate at four hours of developer time — say, decomposing a large module into smaller files with proper test coverage — would take Claude Code roughly ten minutes of active generation. Sometimes less.

That sounds incredible. It is, on paper. But here's the number that doesn't make it into the hype cycle: the total wall-clock time for that same task was typically two to three hours. Not ten minutes. Not four hours. Somewhere in between, and almost entirely driven by what the human was doing.

We'd watch the agent finish its work, then spend the next two hours reviewing the output, catching edge cases it missed, re-prompting when the architecture didn't quite match what we intended, fixing subtle bugs that passed the tests but didn't pass the smell test, and validating that the changes played nicely with parts of the codebase the agent didn't have full context on.

We started calling this the "human overhead gap," and it turned out to be the single most important metric in our entire system. Almost nobody in the industry is tracking it.

The Human Overhead Gap

Here's what the breakdown actually looks like across our dataset. Agent execution runs about 5-15 minutes — that's the part everyone talks about. Human review and validation takes 30-90 minutes — the part nobody talks about. Re-prompting and correction cycles add another 15-60 minutes — the part people are embarrassed about. And context rebuilding after compaction eats 10-30 minutes — quietly brutal every time.

The agent's time is almost a rounding error in the total cost. The human is the bottleneck, and they're bottlenecked on trust: the cognitive work of verifying that code they didn't write actually does what it's supposed to do.

This has massive implications for how you plan work. If you're estimating based on "how long will the AI take," you're optimizing the smallest part of the pipeline. The real question is: how long will it take a human to gain confidence that the AI's output is correct? That depends on the task type, the complexity of the surrounding code, how good your test coverage is, and whether the agent had enough context to make sound decisions.

We built our feedback loop system specifically to capture this gap. Every task records estimated duration, agent execution time, and total wall-clock time. The ratio between them tells us where our process is leaking time — and it's different for every task type.
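The per-task record can be sketched roughly like this. The field names and the example numbers are illustrative, not PairCoder's actual schema — only the three quantities tracked (estimate, agent time, wall-clock time) and the 10:1 attention threshold come from the text:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """One completed task, as the feedback loop records it.
    Field names are illustrative, not PairCoder's real schema."""
    estimated_minutes: float   # upfront human estimate
    agent_minutes: float       # active Claude Code execution time
    wall_clock_minutes: float  # total elapsed time, human work included

    @property
    def human_overhead_ratio(self) -> float:
        """Minutes of human time per minute of agent time.
        Above ~10:1, the workflow needs attention."""
        human_minutes = self.wall_clock_minutes - self.agent_minutes
        return human_minutes / self.agent_minutes

# The four-hour task from earlier: ~10 min of agent time,
# ~2.5 hours of total wall-clock time.
task = TaskRecord(estimated_minutes=240, agent_minutes=10,
                  wall_clock_minutes=150)
print(round(task.human_overhead_ratio, 1))  # 14.0
```

The ratio, not either raw number, is the process-health signal: it stays comparable across tasks of very different sizes.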

Token Usage Is the New Story Points

Early on, we used complexity scores — a 0-100 scale based on file count, scope, and estimated difficulty. Classic story points with a different name. They were useless.

A task scored at complexity 40 might burn 8,000 tokens and finish cleanly. Another task scored at 40 might burn 80,000 tokens because the agent kept going in circles, hitting context limits, or producing code that failed tests in non-obvious ways. The complexity score couldn't distinguish between the two. Token consumption could.

We shifted our calibration engine to track token usage as the primary difficulty signal, broken down by task type. After about 30 completions per type, the predictions stabilized meaningfully. Our system now estimates that a typical bugfix consumes roughly 0.8x the baseline token budget, a feature consumes about 1.2x, and a refactor burns around 1.5x. Those multipliers aren't arbitrary; they fell out of the data.
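As a minimal sketch, the estimate reduces to a baseline budget scaled by a per-type multiplier. The 10K baseline here is invented for illustration; the multipliers are the ones described above:

```python
# Baseline token budget (illustrative value) and the per-task-type
# multipliers that fell out of the completion data.
BASELINE_TOKENS = 10_000
TYPE_MULTIPLIERS = {
    "bugfix": 0.8,
    "feature": 1.2,
    "refactor": 1.5,
}

def predicted_tokens(task_type: str) -> int:
    """Predicted token budget for a task of the given type."""
    return int(BASELINE_TOKENS * TYPE_MULTIPLIERS[task_type])

print(predicted_tokens("bugfix"))    # 8000
print(predicted_tokens("refactor"))  # 15000
```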

The calibration engine adjusts these coefficients continuously. Every completed task feeds back into the model. When a "simple bugfix" burns 100K tokens, that's not a bad estimate — that's a signal. The bug wasn't simple. The agent struggled, and the token consumption is a more honest assessment of difficulty than any human estimate would have been.
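One plausible update rule for that feedback step is an exponential moving average that nudges the multiplier toward each observed token ratio — a sketch of the idea, not the calibration engine's actual formula:

```python
def update_multiplier(current: float, actual_tokens: int,
                      baseline: int, alpha: float = 0.1) -> float:
    """Nudge a task-type multiplier toward the observed token ratio.
    An exponential moving average is one plausible rule; the real
    calibration engine's update may differ."""
    observed_ratio = actual_tokens / baseline
    return (1 - alpha) * current + alpha * observed_ratio

# A "simple bugfix" that burned 100K tokens against a 10K baseline
# pulls the bugfix multiplier up instead of being written off
# as a bad estimate.
m = update_multiplier(current=0.8, actual_tokens=100_000, baseline=10_000)
print(round(m, 2))  # 1.72
```

The learning rate controls how fast one outlier moves the estimate: small enough to smooth noise, large enough that a genuinely mis-calibrated task type corrects within a handful of completions.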

We also started tracking per-model calibration. Different models have different token profiles for the same task type. The system now recommends which model to route a task to based on its predicted token budget and the historical success rate of each model for that task type. A straightforward CLI command gets routed to Sonnet. An architecture-level refactor gets routed to Opus. The routing isn't based on vibes; it's based on completion data.
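The routing decision can be sketched as a lookup over per-model history: prefer the cheapest model whose success rate for the task type clears a bar. The table values and threshold below are hypothetical, not PairCoder's real data:

```python
# Hypothetical history: success rate and median token cost per
# (model, task type). The numbers are invented for illustration.
HISTORY = {
    ("sonnet", "cli_command"): {"success_rate": 0.95, "median_tokens": 6_000},
    ("opus",   "cli_command"): {"success_rate": 0.96, "median_tokens": 14_000},
    ("sonnet", "refactor"):    {"success_rate": 0.70, "median_tokens": 40_000},
    ("opus",   "refactor"):    {"success_rate": 0.92, "median_tokens": 55_000},
}

def route(task_type: str, min_success: float = 0.9) -> str:
    """Cheapest model whose historical success rate clears the bar;
    fall back to the most reliable model for the type otherwise."""
    candidates = [
        (stats["median_tokens"], model)
        for (model, t), stats in HISTORY.items()
        if t == task_type and stats["success_rate"] >= min_success
    ]
    if candidates:
        return min(candidates)[1]
    return max(
        (stats["success_rate"], model)
        for (model, t), stats in HISTORY.items() if t == task_type
    )[1]

print(route("cli_command"))  # sonnet
print(route("refactor"))     # opus
```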

Sprint Planning Gets Aggressive (and Weird)

The most practical consequence of all this data is that sprint planning looks completely different.

In traditional planning, a two-week sprint might contain 8-12 tasks. That felt right; each task represented a few hours to a day of work. With AI-augmented development, we routinely plan sprints with 25-40 tasks. By conventional standards, this looks insane. We finish early anyway.

The trick is that "finishing early" is measured in agent time, not human time. If you have 30 tasks and each one takes ten minutes of agent execution, that's five hours of Claude time. Spread across two weeks with a single developer reviewing and steering, it's tight but doable. The constraint isn't the coding; it's the human's capacity to review, validate, and course-correct.

We've found that planning at 80% of theoretical agent capacity works well. This leaves room for the tasks that blow up — the ones where the agent struggles, where context gets compacted mid-task, or where the initial approach turns out to be wrong and you need to re-plan. About 15-20% of tasks in any sprint will take significantly longer than estimated, and that's fine as long as you've built slack into the plan.
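Since the human is the bottleneck, sprint size falls out of review capacity rather than coding capacity. A back-of-the-envelope version, with illustrative parameter values (only the 80% buffer comes from the text):

```python
def sprint_capacity(review_hours_per_day: float, days: int,
                    avg_review_minutes: float, buffer: float = 0.8) -> int:
    """Tasks per sprint, sized by human review capacity (the real
    bottleneck), with a buffer for the 15-20% of tasks that blow up.
    Parameter values below are illustrative."""
    total_review_minutes = review_hours_per_day * 60 * days
    return int(buffer * total_review_minutes / avg_review_minutes)

# One developer, ~4 focused review hours per day, 10 working days,
# ~60 minutes of review and steering per task on average.
print(sprint_capacity(4, 10, 60))  # 32
```

Note that agent execution time never appears in the formula: 32 tasks lands in the 25-40 range above, and it is entirely a function of how fast a human can gain confidence in the output.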

The Anomaly Signals

The most fascinating finding from our data is what the anomalies tell you. We track five anomaly types: token spikes, duration spikes, repeated failures, compaction-heavy sessions, and overhead spikes. Each one means something specific.

Token spikes — where the agent burns 3x or more of the predicted budget — almost always mean the task was mis-scoped. The agent is exploring, backtracking, and trying alternative approaches. It's essentially doing the design work you thought was already done. When we see a token spike, we stop the task and re-plan rather than letting the agent continue to burn budget.

Compaction-heavy sessions correlate with tasks that touch too many files or cross too many abstraction boundaries. If the context window fills and gets compressed multiple times during a single task, that's a signal to decompose further — not to give the agent more context.

Repeated failures are the most expensive anomaly. Tests fail, the agent fixes them, and they fail again in a different way. These typically indicate that the acceptance criteria were ambiguous or that the agent doesn't have sufficient context about invariants in the surrounding code. The fix is better task specifications, not more agent retries.

Overhead spikes — where human time balloons far beyond the agent's execution time — tend to cluster around integration-heavy tasks. Anything that touches APIs, configuration, or deployment pipelines generates disproportionate review overhead because the failure modes are harder to catch in tests.
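Four of the five signals above can be sketched as simple threshold checks on a completed task's telemetry. Only the 3x token-spike rule and the 10:1 overhead ratio come from this article; the other thresholds are placeholders:

```python
def classify_anomalies(actual_tokens: int, predicted_tokens: int,
                       compactions: int, test_failure_rounds: int,
                       human_minutes: float, agent_minutes: float) -> list[str]:
    """Flag anomaly types for a completed task. The 3x token rule and
    10:1 overhead ratio are from the text; other thresholds are
    illustrative placeholders."""
    flags = []
    if actual_tokens >= 3 * predicted_tokens:
        flags.append("token_spike")        # mis-scoped: stop and re-plan
    if compactions >= 2:
        flags.append("compaction_heavy")   # decompose the task further
    if test_failure_rounds >= 3:
        flags.append("repeated_failures")  # sharpen acceptance criteria
    if human_minutes >= 10 * agent_minutes:
        flags.append("overhead_spike")     # review burden ballooning
    return flags

print(classify_anomalies(actual_tokens=90_000, predicted_tokens=12_000,
                         compactions=3, test_failure_rounds=1,
                         human_minutes=120, agent_minutes=10))
# ['token_spike', 'compaction_heavy', 'overhead_spike']
```

Each flag maps to a different intervention, which is the point: the anomaly type tells you whether to re-plan, decompose, or rewrite the spec.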

These patterns are more valuable than any upfront estimate. They're real-time signals about what's actually happening, and they feed directly into the calibration engine so future estimates account for the characteristics that make tasks genuinely harder.

What We Changed

After twelve months and 400+ tasks, here's what our estimation practice looks like now.

We estimate in tokens, not hours. The calibration engine provides a predicted token budget per task based on type, historical data, and the files involved. We review the estimate, sanity-check it, and use it as the primary planning unit.

We track human overhead explicitly. Every task records total wall-clock time alongside agent time. The ratio between them is a process health metric. When it drifts above 10:1 (ten minutes of human time per minute of agent time), something in our workflow needs attention.

We let the system learn. The feedback loop adjusts estimates after every completion. We don't manually tune coefficients. The data tells us what a bugfix costs, and we trust it more than our intuition — because our intuition was trained on a world where humans wrote all the code.

We plan aggressively and expect anomalies. High task counts per sprint, with explicit budget for the 15-20% of tasks that will blow up. We'd rather have the system flag a struggling task early than discover it at the end of the sprint.

And we stopped pretending that faster agent execution means faster delivery. It doesn't. It means different delivery — one where the human's job shifts from writing code to verifying code, and where the estimation challenge shifts from "how long to build this" to "how long to trust this."

That's the conversation the industry hasn't had yet. We've got the data that starts it.


We built PairCoder to bring structure to AI-augmented development: planning, estimation, enforcement, and feedback loops that actually learn from your work. If broken estimation is slowing your team down, check out PairCoder.