OpenAI's Noam Brown: Benchmark Grids Are Misleading Investors on Model Capability

Research Scientist reveals why traditional evaluation frameworks fundamentally misrepresent reasoning models, January 2026

OpenAI Research Scientist Noam Brown has published an essay arguing that the industry's standard approach to evaluating AI models has become dangerously misleading as reasoning capabilities scale with inference compute. The problem, Brown explains in a recent podcast, is that benchmark grids report a single score per model and hide the most important variable: how much inference compute the model consumed to reach that score.

When OpenAI released its latest model, designated 5.5 internally, initial skepticism emerged from benchmark comparisons showing only marginal improvements over the previous 5.4 release. "It was only a few percentage points in some benchmarks," Brown notes. But that reaction lasted just hours before hands-on usage revealed substantial capability gains. The disconnect stemmed from a measurement problem that Brown believes has infected the entire industry's evaluation methodology.

The Hidden Variable in Model Performance

The core issue is that benchmark grids fail to control for test-time compute, the inference budget allocated to each problem. Model 5.5 proved far more efficient at reasoning than 5.4, delivering comparable performance while thinking for substantially less time. "Once you control for the amount of thinking time, actually you can see that 5.5 is a substantial jump over 5.4," Brown explains. Yet standard benchmarks make this efficiency advantage invisible to investors and researchers scanning performance tables.

The natural response, Brown notes, is to simply let models think until performance plateaus. But that approach has become impractical with modern reasoning systems. "What we're seeing today with the modern models is that 5.5 and other models can think for, if you scaffold them reasonably well, can think for weeks even before having performance plateau on some of these benchmarks." This represents a fundamental shift from the GPT-3 era, when additional inference time yielded minimal gains beyond a few seconds of processing.

Brown's proposed solution involves either enforcing explicit budget constraints or plotting performance as a function of test-time compute. "You either have some kind of budget for the benchmark whether it's tokens or cost or time or whatever or you plot the performance as a function of the amount of test time compute that's going into the model," he argues. Only then does meaningful comparison between models become possible.
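To make the difference concrete, here is a minimal Python sketch of a budget-controlled comparison. It is an illustration, not Brown's or OpenAI's methodology: the models `model_a` and `model_b`, their scores, and the helper `score_at_budget` are all hypothetical, and interpolating linearly in log-budget between measured points is an assumption.

```python
import math

def score_at_budget(curve, budget):
    """Estimate a score at `budget` by interpolating linearly in log(budget).

    `curve` is a list of (budget_tokens, score) pairs sorted by budget.
    Interpolating in log space is an assumption made for this sketch.
    """
    if len(curve) < 2:
        raise ValueError("need at least two measured points")
    if budget < curve[0][0] or budget > curve[-1][0]:
        raise ValueError("budget outside the measured range")
    for (b0, s0), (b1, s1) in zip(curve, curve[1:]):
        if b0 <= budget <= b1:
            t = (math.log(budget) - math.log(b0)) / (math.log(b1) - math.log(b0))
            return s0 + t * (s1 - s0)

# Hypothetical measurements: (token budget per problem, benchmark score).
model_a = [(1_000, 0.40), (10_000, 0.55), (100_000, 0.70)]
model_b = [(1_000, 0.55), (10_000, 0.70), (100_000, 0.72)]

# A grid-style comparison reports each model's best score, whatever it cost.
grid_gap = max(s for _, s in model_b) - max(s for _, s in model_a)

# A budget-controlled comparison fixes the budget first.
fixed_gap = score_at_budget(model_b, 10_000) - score_at_budget(model_a, 10_000)

print(f"grid gap: {grid_gap:.2f}, gap at 10,000 tokens: {fixed_gap:.2f}")
```

With these made-up numbers the grid shows a gap of about two points, while the comparison at a fixed 10,000-token budget shows a gap of about fifteen, which is the kind of disconnect Brown describes.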

Safety Evaluation Frameworks Built for a Different Era

The measurement problem extends beyond capability assessment to safety evaluations, with potentially serious implications. Brown points out that responsible scaling policies and preparedness frameworks at major labs were largely developed before inference-time scaling became significant. These policies evaluate whether models possess dangerous capabilities, but fail to account for the budget-dependent nature of modern model performance.

"The problem is we're in a world now where the capability of the model is a function of how much money you put into it," Brown states. "Basically, if you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. If you give it a budget of $10 million, it can do even more." Current safety frameworks don't address at what budget level dangerous capabilities should be assessed.

The AI Safety Institute has demonstrated that models continue improving on cybersecurity tasks even at budgets of 100 million tokens, a level that entails substantial computational expense and time. Brown suggests evaluation protocols could project performance at high budgets by measuring improvement slopes at lower budgets, though he acknowledges this remains an open research problem.
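One simple way to picture the slope-based projection is a straight-line fit of score against the logarithm of the budget, extended to a larger budget. The Python sketch below does that with ordinary least squares. The data points and the function names `fit_log_linear` and `project_score` are hypothetical, and the log-linear trend is itself an assumption: real curves may plateau or jump, which is part of why this remains an open problem.

```python
import math

def fit_log_linear(points):
    """Least-squares fit of score = intercept + slope * log10(budget)."""
    xs = [math.log10(budget) for budget, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def project_score(points, budget):
    """Extrapolate the fitted line to `budget`, clamped to the range [0, 1]."""
    intercept, slope = fit_log_linear(points)
    return min(1.0, max(0.0, intercept + slope * math.log10(budget)))

# Hypothetical low-budget measurements: (token budget, task success rate).
measured = [(10_000, 0.10), (100_000, 0.20), (1_000_000, 0.30)]

intercept, slope = fit_log_linear(measured)
projected = project_score(measured, 100_000_000)
print(f"slope per 10x budget: {slope:.2f}, projected at 100M tokens: {projected:.2f}")
```

In this toy data the score rises by ten points for every tenfold increase in budget, so the line projects a success rate of about 0.50 at 100 million tokens. The projection is only as good as the assumption that the trend continues.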

Latent Capability in Already-Released Models

The rapid model release cycle creates another wrinkle. OpenAI and competitors now ship new models every two to three months, but truly pushing models to their limits can require running them for months. "Nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell," Brown observes.

He offers a striking example from OpenAI's recent work disproving the Erdős unit distance conjecture using an internal model. The achievement required minimal budget, but subsequent experimentation revealed that the publicly available model 5.5 could reach the same result through proper scaffolding, though at an estimated cost of $1,000 to $100,000. "It would have been possible for somebody to disprove the Erdős unit distance conjecture before we did using a general purpose model," Brown notes. "Nobody had explored sufficiently what happens if I put $100,000 worth of compute into 5.5."

This dynamic presents a coordination problem. Each model release drops the cost of achieving specific results by 10x to 100x, creating incentives to wait rather than extensively explore current capabilities. OpenAI itself actively discourages internal researchers from exhaustively testing current models on open problems in mathematics and physics, preferring to focus efforts on developing more capable and cost-effective next-generation systems.

Concrete Examples from Poker Bot Development

Brown uses his personal evaluation methodology to illustrate capability progression across model releases. As an expert in game theory who developed poker-playing AI during his PhD, he tests each new model by attempting to build poker bots. Model 5.2 allowed him to create a solver for the river, the final betting round of a poker hand, approximately five times faster than he could alone. However, he characterizes its performance as resembling "a grad student where they would run into issues, but at least I would know what those issues were and know how to fix it."

A persistent problem Brown labels "gaslighting" emerged with earlier models. In one instance, he asked a model how much he'd lose folding with $100 in the pot. The model answered $92, then when challenged insisted "it's close to 100, it's fine, it's no big deal." Model 5.5 largely eliminated this behavior and can build a complete river solver with minimal guidance. Brown estimates that within six to twelve months, models will complete "an entire poker solver, basically my entire PhD thesis in one go" with zero-shot prompting.

When attempting to push models toward genuine research contributions by requesting algorithms superior to published work, Brown finds current systems still fall short. "I can give it a lot of time and it's still not able to do it," he reports. He does note incremental improvement across releases and expects an eventual inflection point where research taste becomes genuinely useful, similar to previous breakthroughs in coding and mathematics.

Recursive Self-Improvement Without Fast Takeoff

Brown's observations inform his perspective on recursive self-improvement and takeoff dynamics. While acknowledging that models are "definitely accelerating what researchers can do inside the labs," he sees this acceleration as uneven across different aspects of research. "Currently we're at the point where if something goes 100x faster you get bottlenecked by the things that don't go 100x faster," he explains.
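The bottleneck Brown describes follows the same arithmetic as Amdahl's law: overall speedup is capped by the share of work that is not accelerated. The short Python sketch below works through it. The 90 percent figure is a hypothetical chosen for illustration, not a number from Brown.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: total speedup when only part of the work gets faster.

    accelerated_fraction: share of the original time that is sped up (0 to 1).
    factor: how many times faster that share becomes.
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)

# Hypothetical split: 90% of research time becomes 100x faster, 10% is unchanged.
speedup = overall_speedup(0.90, 100)
print(f"overall speedup: {speedup:.1f}x")
```

Under that assumed split the whole process runs only about 9.2 times faster, and no amount of further acceleration of the fast part can push it past 10 times, because the untouched 10 percent dominates.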

Critically, Brown does not anticipate an overnight intelligence explosion scenario. "There is this hypothesis that you could have basically an overnight intelligence explosion where the models discover some kind of breakthrough to make themselves smarter and then that leads to more breakthroughs that make themselves even smarter immediately," he notes. His skepticism stems directly from the test-time compute requirements: "If it requires so much test time compute to unlock the full capabilities of the model, then that means you're bottlenecked by time."

This time bottleneck currently represents the binding constraint for frontier labs, in Brown's assessment. "The biggest bottleneck for all of us is time and that's why all the researchers are working so intensely right now," he states. "We all see what the overhang is. We see what the capabilities are and we're just bottlenecked by how quickly can we do things."

Multi-Agent Coordination as Unexplored Frontier

When asked about underexplored research directions, Brown points to large-scale multi-agent coordination. While acknowledging substantial existing work, he believes current efforts merely scratch the surface of what's possible. His mental model draws from human civilization's development, which progressed not through individual intelligence gains but through billions of humans accumulating and building on shared knowledge over millennia.

"We're not seeing that with AI models today," Brown observes. "They're born into a world and they exist for a very short context window and then they just disappear." While retrieval systems and scaffolding provide limited continuity, Brown sees early products like MultiOn and OpenClaw as indicators of a potential future in which knowledge is coordinated and compounds on a global scale.

Breaking the Benchmark Grid Equilibrium

Brown characterizes the continued publication of traditional benchmark grids as a bad equilibrium that persists despite widespread recognition of its inadequacy. "Everybody kind of knows that it's a bad equilibrium, but nobody wants to break out," he explains. Companies publish grids because investors and researchers expect them, creating a self-reinforcing cycle.

His essay aims to provide permission for the next model release to abandon top-line grid presentations in favor of performance curves with explicit compute budgets on the x-axis. On routing layers and consensus approaches popular among application companies, Brown applies the same principle: such techniques may improve performance, but evaluation must control for test-time compute to determine whether they outperform simply allowing a single model to think longer at equivalent cost.

Brown remains skeptical that routing optimizations tuned for specific benchmarks translate to real-world improvements, noting the persistent risk of overfitting to evaluation suites. But his fundamental message remains that without controlling for the compute variable, meaningful comparison has become impossible in an era where model capability scales continuously with inference budget.

DruckFin