It Feels Faster Than It Is
The most consistent result in the AI productivity literature isn't a speedup. It's the gap between how fast work feels and how fast it measures.
Three years into the deployment of generative AI, a simple question still has no settled answer: does it actually make people more productive? Not whether people use it — adoption is not in dispute — but whether the using shows up as measured output. You would expect that to be the easy part by now. It isn’t. The best evidence we have points in several directions at once, and the disagreement is sharp enough that it can’t be filed away as noise. I think the disagreement is the most useful thing in the literature, because once you line the studies up properly, they stop contradicting each other and start describing a single phenomenon nobody has named cleanly.
Start with the two artifacts that frame the whole problem, both produced by the same organization.
METR, a nonprofit that evaluates frontier models, is best known for one chart: the length of task an AI can complete on its own, plotted over time, doubling roughly every seven months and lately faster. It is the most extrapolated graph in AI, because it turns a vague sense of acceleration into a clean exponential. In early 2025 it said the capability frontier is racing.
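To get a feel for what a seven-month doubling implies, here is a minimal sketch of the arithmetic. It assumes the doubling time stays fixed at seven months, which the chart itself suggests is only approximate, and the function name is illustrative rather than anything METR publishes.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# How much longer a completable task gets over a few horizons,
# relative to whatever the starting length was.
for months in (7, 12, 24, 36):
    print(f"{months:>2} months -> x{growth_factor(months):.1f}")
```

Under that assumption, three years of compounding multiplies the task length by roughly thirty-five, which is why the graph invites extrapolation.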
A few months later the same lab published a randomized controlled trial on whether that capability translates into productivity for the people most likely to benefit. Sixteen experienced open-source developers, 246 real tasks in repositories they had maintained for years, each task randomly assigned to allow or forbid AI. The developers expected a 24% speedup. Afterward, they believed they had gotten a 20% speedup. The screen recordings showed a 19% slowdown. The researchers had expected a positive result too.
One lab, one year, two graphs running in opposite directions. The temptation is to decide which one is “real.” That is the wrong move. They measure different things, and the distance between them is where the story actually lives.
The evidence doesn’t agree, and that disagreement is the data
Collect the serious studies and you get a field that looks incoherent until you sort it.
Brynjolfsson, Li and Raymond put a generative assistant in front of 5,179 customer-support agents and measured a 14% average lift in issues resolved per hour — but the gain was about 34% for novices and roughly zero for the most experienced agents. The tool worked mainly by spreading the habits of the best workers to the worst.
METR’s developers — all experts, all working in code they knew better than almost anyone — went the other way and got slower.
Faros AI, looking at telemetry from more than ten thousand developers, found adoption skewing toward newer engineers using AI to navigate unfamiliar code, and high-adoption teams shipping meaningfully more pull requests per day.
Step up to the aggregate and the signal thins out. Humlum and Vestergaard, using Danish administrative data across eleven of the most exposed occupations, found essentially no effect on wages or hours through 2024 despite heavy adoption. Acemoglu’s macro estimate puts AI’s likely contribution at well under one percent of total factor productivity over a decade.
Large positive effects, null effects, and negative effects — all measured carefully, often in the same year. The honest response is not to pick a favorite. It is to ask what these studies have in common and where exactly they diverge, because the pattern in the agreements and the pattern in the disagreements are both legible.
What the studies share
Three things recur across almost all of this work, regardless of which way the headline number points.
The first is an expertise gradient, and it runs opposite to intuition. AI helps most where the human baseline is lowest. Novices in the support study gained the most; experts gained nothing. Faros found the pull toward newer engineers in unfamiliar code. METR’s experts, in the most familiar code imaginable, were the ones it slowed down. The consistent finding underneath the inconsistent headlines is that AI compresses the distribution — it lifts the floor and barely moves the ceiling, and in the highest-context cases the ceiling can come down, because the expert’s own knowledge of the system already beats whatever the model can infer about it.
The second is that self-report overshoots measurement, and not by a little. METR’s developers felt 20% faster while running 19% slower — a sign error, not a rounding error. Survey after survey reports large majorities who feel more productive with AI. The feeling is real and the satisfaction is real; they are simply not evidence of measured output. This is the most portable result in the literature, because it survives every disagreement about magnitude. Whatever AI is doing, it makes work feel faster more reliably than it makes work be faster.
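The size of that gap is easier to see once all three METR figures sit on one scale. The sketch below converts each to a multiplier on baseline task time, assuming that an "X% speedup" means the task takes X% less time and an "X% slowdown" means it takes X% more. That convention is an assumption for illustration, and the study's own estimator may be defined differently.

```python
def time_multiplier(percent_change: float) -> float:
    """Convert a signed percent change in completion time to a multiplier.

    Negative values mean faster (less time), positive values mean slower.
    Assumed convention: a "20% speedup" is a -20% change in task time.
    """
    return 1.0 + percent_change / 100.0


# Figures quoted in the text, expressed as signed changes in task time.
forecast = time_multiplier(-24)  # what developers expected beforehand
believed = time_multiplier(-20)  # what they believed afterward
measured = time_multiplier(+19)  # what the measurement showed

# Distance between belief and measurement, in points of baseline time.
perception_gap_points = (measured - believed) * 100

print(f"forecast x{forecast:.2f}, believed x{believed:.2f}, measured x{measured:.2f}")
print(f"perception gap: {perception_gap_points:.0f} points of baseline time")
```

On this reading, belief and measurement sit on opposite sides of 1.0 and roughly 39 points apart, which is what makes it a sign error rather than an overestimate.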
The third is that the aggregate is quiet. Whatever is happening to individuals — and clearly something is — it is not yet visible in wages, hours, or productivity statistics at the level of an economy. The micro is loud and contradictory; the macro is nearly silent.
Hold those three together and a question forms. If gains concentrate among novices, if perception runs ahead of measurement, and if nothing aggregates, what would we actually expect the productivity statistics to show in year three? Probably something close to what we see: a great deal of motion, not much movement.
Where they part company
The contradictions dissolve once you notice that the studies measure different layers, and different people, and call all of it “productivity.”
The time-horizon graph measures capability — what a model can do alone, on a task scoped tightly enough to score. The productivity studies measure realized outcome — what happens when a person folds the tool into messy work with standards, reviewers, and tacit requirements that live in nobody’s prompt. Benchmarks tend to flatter, because they include only tasks clean enough to grade, and real work is mostly the ungradeable part. Capability and outcome are not the same axis; the seven-month doubling tells you almost nothing about where the productivity line goes, because it is not the same graph.
Then there is who. Novice versus expert flips the sign on its own. And there is what you count. Faros measured throughput — pull requests per day — and found it up. METR measured completion time per task and found it up too, meaning slower. Both can hold at once: you can start more things and finish each of them more slowly. Concurrency and latency are different quantities, and AI seems to trade one for the other in a way that reads as a gain or a loss depending on which you were already watching.
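One way to see how both readings can hold at once is Little's law from queueing theory, which says that in a steady state the average work in progress equals throughput times latency. The sketch below uses hypothetical numbers: the 19% latency increase echoes the METR figure, but the work-in-progress values are invented for illustration and come from neither study.

```python
def throughput(wip: float, latency_days: float) -> float:
    """Little's law rearranged: throughput = WIP / latency (steady state)."""
    return wip / latency_days


# Hypothetical baseline: 2 tasks in flight, each taking 4 days end to end.
base = throughput(wip=2.0, latency_days=4.0)

# Hypothetical assisted case: each task takes 19% longer,
# but the developer keeps 3 tasks in flight instead of 2.
assisted = throughput(wip=3.0, latency_days=4.0 * 1.19)

print(f"baseline: {base:.3f} tasks/day")
print(f"assisted: {assisted:.3f} tasks/day")
print(f"throughput change: {assisted / base - 1:+.1%}")
```

With these made-up inputs, every task finishes later while more tasks finish per day. A team watching pull requests sees a gain. A team watching time per task sees a loss.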
The last axis is context. The support agent handles a fairly bounded problem with low private context, which is exactly where a model’s generic competence has the most to add. The senior developer works in a system whose constraints he carries in his head and the model has never seen, which is exactly where generic competence has the least to add and the most to get wrong. Same technology, opposite result, for a reason that has nothing to do with the model’s quality and everything to do with the gap between what the user knows and what the model knows.
So how do we reconcile a field this scattered? Not by averaging it. By recognizing that “does AI make you productive” is four or five different questions wearing one coat.
We have seen this lag before
The shape of this — instant capability, absent or negative measured productivity — is not new. It is one of the more reliable patterns in the history of general-purpose technology, and it has a literature of its own.
In 1987 Robert Solow made the remark that became the name for the phenomenon: the computer age was visible everywhere except in the productivity statistics. For years that looked like a paradox. The resolution, worked out largely by Brynjolfsson and collaborators, is what they later called the productivity J-curve. When a general-purpose technology arrives, firms have to build a large stock of complementary intangibles around it — new processes, new skills, redesigned workflows — and while they are building that stock, measured productivity actually dips, because the effort is real but the output isn’t counted yet. The gains arrive later, after the reorganization. That is why the curve is J-shaped rather than a straight climb.
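The mechanism can be sketched as a toy model. This is not the model from the J-curve literature, only an illustration of the accounting: while a firm diverts part of its labor to building intangibles, that labor produces nothing the statistics count, and the accumulated stock raises counted output later. Every parameter value below is hypothetical.

```python
def measured_productivity(periods: int = 10,
                          build_periods: int = 5,
                          diverted_share: float = 0.2,
                          payoff_per_unit: float = 0.003,
                          labor: float = 100.0) -> list[float]:
    """Counted output per unit of labor in a toy J-curve model.

    While the firm is building intangibles, `diverted_share` of labor
    produces uncounted assets; the accumulated stock later raises output.
    All parameter values are hypothetical.
    """
    stock = 0.0
    series = []
    for t in range(periods):
        building = t < build_periods
        share = diverted_share if building else 0.0
        # Only labor aimed at final output is counted, scaled by the stock so far.
        counted_output = (1.0 - share) * labor * (1.0 + payoff_per_unit * stock)
        series.append(counted_output / labor)
        # Diverted labor adds to the intangible stock for later periods.
        stock += share * labor
    return series


series = measured_productivity()
for t, value in enumerate(series):
    print(f"t={t}: {value:.3f} (pre-adoption baseline = 1.000)")
```

In this sketch measured productivity sits below the pre-adoption baseline for the whole build phase and rises above it only once the diverted labor returns to counted work.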
The cleaner version of the story is electrification. Factories ran electric motors for something like forty years before the productivity gains showed up, and the reason is almost too on-the-nose. The early gains were small because factories were still laid out around the logic of a central steam engine: one power source, a system of shafts and belts feeding the whole floor. The motor only paid off once someone rebuilt the factory around the fact that each machine could now have its own power. The technology was available immediately. The reorganization took a generation, and until it happened, the statistics were unimpressed.
Read against that pattern, METR’s slowdown stops looking like a verdict on AI and starts looking like the bottom of the J-curve, observed at the resolution of one developer’s afternoon instead of a national accounts table. The typing got cheap. The workflow around the typing — the review, the standards, the question of who is accountable for a change nobody fully wrote — has not been rebuilt yet. During that interval, in the highest-context settings, the un-rebuilt workflow makes the tool net-negative. That is not a surprise. That is the shape the pattern predicts.
What might be new this time
If that were the whole story, the ending would be tidy: we are early, the J-curve will turn, be patient. I am not sure it is the whole story, because of one result with no clean historical precedent.
In early 2026 METR tried to run its developer study again — more participants, newer tools, settle the magnitude. It couldn’t get a clean reading, and the reason is the interesting part. A large share of developers would no longer agree to work without AI, even when paid to, and many quietly declined to submit the tasks they didn’t want to do unassisted. The control condition — the person doing the work the old way — was becoming impossible to staff. One participant compared going back to walking across a city after getting used to a car.
This is the genuinely novel wrinkle. In every prior general-purpose technology, you could at least measure the lag, because you could compare adopters to non-adopters, or the new way to the old. The dynamo could be tested against steam. What the METR re-run suggests is that AI adoption is starting to outrun our ability to evaluate it, not because the measurement is technically hard, but because the unassisted baseline is becoming a place people refuse to stay long enough to be measured. We may be entering the reorganization without the instrument that would tell us whether it is working.
That sets up a trap worth naming. When the counterfactual disappears, “a real gain we can’t measure” and “no gain we won’t test” look identical from the outside. Both produce confident, satisfied users and flat statistics. The patient J-curve reading — the gains are coming, give it time — becomes unfalsifiable at precisely the moment it becomes most convenient to believe.
The part that doesn’t get cheaper
Underneath all of it is a distinction the studies keep circling without quite naming. Capability is production: what can be generated, how fast, how cheaply. Production capability is the thing that has genuinely collapsed in cost; that is the seven-month graph, and it is real. But none of the realized-outcome studies are gated on production. They are gated on the part that converts production into something worth having: the judgment about whether the output is right, the standards it has to clear, the reorganization of the work around the new tool, the accountability for what ships. That layer is exactly what the support agent’s experience supplied, what the senior developer’s context supplied, and what the model does not supply. It is also the layer that does not get cheaper when generation does. If anything it gets more expensive, because there is now more plausible output to judge, and plausible-but-wrong is the costliest kind.
My own experience building with these tools rhymes with this more than I would like. The sensation of speed is intense and immediate — something runs, something ships, the demo works. The part that was always hard — deciding whether the thing is any good, and whether anyone wants it — sits there untouched. Most of what I built that way went nowhere, and not because the production failed. The production was the easy part. It was never the part standing between the work and the outcome.
So the two graphs were never in conflict. One measures how cheap production has become. The other measures whether cheap production has been converted into value, and keeps finding the conversion slow, uneven, concentrated among the inexperienced, and, where human context is richest, sometimes negative. The distance between the graphs is the cost of that conversion, and the cost is mostly human, mostly organizational, and mostly unmeasured.
Which leaves two questions I keep returning to in my own research, and keep searching the literature for answers to. Is generative AI the first general-purpose technology that compounds at the level of the individual who feels faster, without compounding at the level of the firm or the economy that would show it? And if the unassisted baseline is already disappearing, if we are losing the ability to run the experiment at all, how would we ever tell the difference between a revolution and a very satisfying way to feel productive?

