Pathway Debate: Transformer Co-Inventor Lukasz Kaiser Concedes a Post-Transformer Architecture Could Win — If It Shows a 10x Improvement

Pathway hosted a live debate in San Francisco on May 5, 2026, pitting the inventors of the Transformer against the pioneers of what comes next.

The most striking moment in Pathway's live architecture debate was not a challenger landing a punch; it was the reigning champion offering terms of surrender. Lukasz Kaiser, a co-inventor of the Transformer and a researcher behind GPT-4, GPT-5, and the o1/o3 reasoning models, told the audience that if a Post-Transformer architecture can demonstrate a better scaling curve, even at 50 times the wall-clock cost on current hardware, he would have no choice but to concede. "If you show me a model that's just constant fifty times slower, but on a better slope, you win. I have to give up. The hardware will follow when you show that." That is a more open door than most investors following the AI infrastructure buildout have probably assumed.
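To see why a steeper slope eventually beats a constant-factor slowdown, here is a minimal sketch of the arithmetic. It is an illustration, not something presented at the debate. It assumes loss follows a simple power law in compute, and that a model running 50 times slower gets one fiftieth of the effective compute for the same budget. The exponents 0.05 and 0.06 and all function names are hypothetical; only the 50x factor comes from Kaiser's remark.

```python
import math


def loss(compute, exponent, slowdown=1.0, scale=1.0):
    """Assumed power-law loss: scale * (compute / slowdown) ** -exponent."""
    return scale * (compute / slowdown) ** (-exponent)


def crossover_compute(exp_incumbent, exp_challenger, slowdown):
    """Budget at which the slower but steeper curve matches the incumbent.

    Setting C ** -a == (C / s) ** -b and solving for C gives
    C = s ** (b / (b - a)), which only exists when b > a.
    """
    if exp_challenger <= exp_incumbent:
        raise ValueError("challenger needs a steeper exponent to catch up")
    return slowdown ** (exp_challenger / (exp_challenger - exp_incumbent))


# Hypothetical exponents, chosen only to illustrate the shape of the argument.
ALPHA_INCUMBENT = 0.05
ALPHA_CHALLENGER = 0.06
SLOWDOWN = 50.0  # the "constant fifty times slower" from the quote

c_star = crossover_compute(ALPHA_INCUMBENT, ALPHA_CHALLENGER, SLOWDOWN)

for budget in (c_star / 10, c_star, c_star * 10):
    incumbent = loss(budget, ALPHA_INCUMBENT)
    challenger = loss(budget, ALPHA_CHALLENGER, slowdown=SLOWDOWN)
    print(f"budget={budget:.3e} incumbent={incumbent:.4f} challenger={challenger:.4f}")
```

Under these invented numbers the two curves meet at 50 to the sixth power units of compute, and the challenger stays ahead at every budget beyond that point. The crossover is very sensitive to the gap between the exponents, so seeing the better slope early matters more than the constant factor.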

The 10x Bar: Why Hardware Is Not the Excuse It Once Was

The hardware lottery argument, the idea that the Transformer won partly because its architecture happened to map almost perfectly onto GPU matrix multiplication, was front and center throughout the evening. Llion Jones, a co-inventor of the Transformer and co-founder of Sakana AI who now argues for the Post-Transformer side, said bluntly that "the Transformer breakthrough is deeply misunderstood." In his framing, researchers who keep shuffling attention layers and residual connections in search of the next thing are wasting their time. The real breakthrough was hardware parallelism, and that gain has already been claimed; it cannot be discovered a second time.

Kaiser pushed back with a historical footnote that carries real weight. The first generation of TPUs was built to serve RNNs, not Transformers. When attention models first ran on them, the softmax had to be offloaded to the CPU because the exponential function was not implemented in hardware. "They were slow as hell," Kaiser said. "It had to prove itself to be good enough for the hardware company to change course, and now eight years later, they can serve it very fast." His point was that a sufficiently superior architecture will earn its own hardware, but the bar is not 2x better. It is 10x. He added a practical observation that shifts the calculus for researchers today: AI agents can now write CUDA. "A lot of things that are just painfully slow on the GPU you can overcome with a good kernel, which you don't need to write anymore." The implication for anyone building or funding Post-Transformer research is that the implementation moat around the Transformer is narrowing faster than the benchmark numbers suggest.

The BDH Architecture and the PageRank Analogy

Adrian Kosowski, Pathway's Chief Scientific Officer and inventor of the Dragon Hatchling (BDH) architecture, made arguably the most conceptually ambitious argument of the evening. His claim was not that the Transformer is wrong, but that neither the Transformer nor any other current architecture has yet discovered what he called the "leitmotif" of intelligence: the underlying process, analogous to PageRank for information retrieval, that unifies all forms of intelligent behavior. "Back in the nineties, there was a problem which is just a tiny subset of intelligence, which is indexing information. And then there was a company that came with one big theme, one mathematical equation, and one way to implement it." Google's PageRank and MapReduce did not merely build a better AltaVista. They reframed the problem entirely. Kosowski's argument is that we have not yet had that moment for intelligence itself.

His architectural answer, the BDH approach being developed at Pathway, centers on latent reasoning in high-dimensional spaces — the ability to think without externalizing thought into language tokens. "Transformers think in language. They do not think in latent thought. They memorize their thoughts, but they think in language." This is not merely a philosophical distinction. It has direct implications for reasoning efficiency and hardware utilization during inference, which Kosowski identified as the next frontier. "As we move into a world where more and more time is being spent on inference and on reasoning, it is a perfectly honest question whether the Transformer is also the ultimate architecture in terms of use of hardware while reasoning."

Liquid AI's Hedge: Transformers and Post-Transformers, Not Versus

Mathias Lechner, co-founder and CTO of Liquid AI and a research affiliate at MIT CSAIL, was the most pragmatic voice on stage, and his framing is probably the most commercially honest. Liquid AI does not pick a side. It builds what works for the deployment constraint in front of it. Lechner described running a language model with GPT-3-level capability on a Raspberry Pi at approximately forty tokens per second, achieved not by allegiance to any single architecture but by selecting from Transformer components, SSMs, gated linear attention, and convolutional layers depending on the requirements. "Whenever there's a new attention mechanism being introduced by DeepSeek, I'm happy. And every time there's a new Post-Transformer model released, I'm also happy because it allows me to draw from a wider set of architectures."

Lechner also raised the most provocative long-term prediction of the evening almost as an aside: that AI agents, themselves built on Transformers, may be the ones that ultimately discover the Transformer's replacement. "I believe that they will find their own replacement. I'm convinced that the Transformer will find its own replacement." It was said without drama, but the implication — that the next architectural breakthrough may be an emergent output of the current paradigm rather than a deliberate human research program — deserves more attention than it received in the room.

Continual Learning: The Inconvenient Weakness

One of the sharpest exchanges of the night concerned continual learning, which Jones described with visible frustration as the central structural weakness of the Transformer paradigm. "We've taken something that's fundamentally built to have static weights, and we're like, 'now how can we add something on top of it so that we have dynamic weights?' I would much rather see someone develop something that was designed to have dynamic weights from the ground up." Kaiser, in a moment of genuine intellectual honesty, acknowledged that the Transformer's in-context learning mechanism does something that looks like dynamic weight updating — but added the caveat that "the thing that actually pains me is that you need to say maybe." There is, as he noted, no serious benchmark that measures the quality of in-context learning as opposed to simple retrieval. Needle-in-a-haystack tests are retrieval problems, not learning problems, and the field has not yet built the tool to distinguish between them.

Perplexity as the Benchmark That Should Be Running Everything

One of the most actionable insights from the debate was Kaiser's argument for perplexity on a held-out dataset as the superior benchmark that the industry should already be using more systematically. He described how, during the original Transformer work, dropping the BLEU score in favor of perplexity turned out to be the correct call — it correlated when it needed to and remained useful long after BLEU scores became saturated. "The way OpenAI really benchmarks its models is perplexity on the internal code base, and I think a lot of labs do this." He went further, floating the idea of a small company that maintains a private, never-released holdout set of text and code, charges a fee per evaluation, and publishes scaling curves across architectures. Jones agreed immediately. "I would like to see people going back to trying to push perplexity." For researchers and investors trying to evaluate which architectural bets are genuinely compounding and which are benchmark-tuned artifacts, this framing matters.
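For readers who have not worked with the metric, here is a minimal sketch of how perplexity on a held-out sequence is computed from the per-token probabilities a model assigns. It uses the standard definition, the exponential of the average negative log-probability per token. The two probability lists are invented toy values, not measurements from any real model, and the function name is hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical probabilities that two models assign to the same four
# held-out tokens. Lower perplexity means the model was less surprised.
model_a_probs = [0.25, 0.25, 0.25, 0.25]
model_b_probs = [0.5, 0.5, 0.5, 0.5]

ppl_a = perplexity([math.log(p) for p in model_a_probs])
ppl_b = perplexity([math.log(p) for p in model_b_probs])

print(f"model A perplexity: {ppl_a:.2f}")
print(f"model B perplexity: {ppl_b:.2f}")
```

A model that spreads its probability evenly over four choices at every step scores a perplexity of four, which is why the number is often read as an effective branching factor. Because it depends only on held-out text the model has not seen, it is hard to tune toward, which is the property Kaiser's proposed private evaluation set would rely on.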

The Local Minimum Problem and the Case for Radical Departure

Jones returned repeatedly to what he called the field's most underappreciated problem: that the success of the Transformer is itself preventing the discovery of its successor. "I actually think that the success of the Transformer is stopping us from finding the next thing. People are concentrating way too much on this architecture, and it's so successful and so good at what it does that we're really stuck in a local minimum right now." His most candid admission was about the economics of that trap. A company like OpenAI is rationally correct to double down on Transformers — it is where their moat sits. But startups, he argued, should be doing the opposite. "It makes more sense to put some money behind the long bets, actually taking time to find what's coming next. OpenAI was in that position at some point. They found that Transformers scaled better before other people, and they've done very well of it."

The most speculative disclosure of the evening came from Jones in passing: that some of the architectures his team at Sakana AI is exploring may not be trainable by backpropagation even in principle. He offered no further detail, but the comment signals that at least one well-resourced lab is genuinely operating outside the current paradigm rather than decorating its edges.

The Safety Dimension Nobody Is Taking Seriously Enough

Kaiser raised a safety point near the end of the evening that cuts against the conventional wisdom about chain-of-thought transparency providing interpretability guarantees. "You have these tokens, and the tokens are like a few bytes each. And then you have the activations above them, and it's like dozens and dozens of layers of thousands of floating points, and we have absolutely no clue what's happening in them." His warning was direct: the current faithfulness of chain-of-thought reasoning to underlying model behavior is a product of pre-training incentives, not an architectural guarantee. "One day you may see the same words said there, and the thoughts will be totally different, and I'm not sure you're gonna know." Jones added a counterintuitive corollary — that a Post-Transformer architecture designed to more closely mirror how biological neural systems actually work might, paradoxically, turn out to be more interpretable and safer than the Transformer it replaces.

The crowd voted Post-Transformers the winner on the night's clapometer, though the margin was described as close. The more durable takeaway is that one of the Transformer's own architects has now publicly set out the terms under which he would abandon it — and those terms are more achievable than the current benchmarking culture would suggest.

DruckFin