Why This Essay, and Why Now

In February 2026, Sam Altman told students at Stanford’s TreeHacks hackathon: “I bet there is another new architecture to find that is gonna be like as big of a gain as transformers were over LSTMs.” He’s not alone. Yann LeCun has called autoregressive LLMs “doomed”, arguing they are a dead end for anything resembling general intelligence. Ilya Sutskever, at NeurIPS 2024, declared that “pre-training as we know it will end” and later told Dwarkesh Patel that the systems needed for AGI are “systems that don’t exist, that we don’t know how to build.” Even Geoffrey Hinton, while more measured, expects “more scientific breakthroughs” on the scale of the Transformer.

Meanwhile, in 2026, the field is expanding in every direction at once. Agentic systems are booming. Embodied AI and world models are advancing rapidly. Reasoning models are scaling test-time compute. But none of these frontiers have converged on the next paradigm: the architecture or training method that will generalize, scale, and be efficient enough to define the next era the way the Transformer defined this one.

If the leaders of the field are telling us the next big architecture hasn’t been found yet, it’s worth asking: what did it look like the last time a big architecture was found? What were the patterns? And what can we learn from the research that came within striking distance but didn’t quite get there?

That’s the purpose of this essay: to look back at thirteen major AI breakthroughs and their immediate precursors, and extract a pattern that might help us recognize the next one when it appears.

A Note on “Precursors” and “Breakthroughs”

Every distinction I make here between a “precursor” and a “breakthrough” is subjective. I want to be explicit about that.

The precursor papers discussed in this essay are not lesser work. Many of them are as highly cited, as theoretically important, and as respected in academia as the “breakthrough” papers that followed them. Highway Networks is a deeply influential paper. Bahdanau attention is foundational. Score-based generative modeling is brilliant. ELMo changed NLP. These are not “failed” papers. They are the shoulders on which breakthroughs stood.

What I mean by “breakthrough” here is something specific and narrow: the formulation that triggered widespread adoption across both research and engineering, and shaped all subsequent work in that area. ResNet’s residual connections are in every deep network. The Transformer is the backbone of all frontier models. DDPM’s loss function is the starting point for all diffusion work. That kind of adoption is what I’m pointing at, not quality, not originality, not importance.

It’s also worth noting that in several cases, the same researchers appear on both sides. Jakob Uszkoreit is an author on both Decomposable Attention and the Transformer. Kenton Lee is on both ELMo and BERT. Maarten Bosma is on both Scratchpads and Chain-of-Thought. Most strikingly, Ricky T. Q. Chen is a core author on Neural ODEs, FFJORD, and Flow Matching. He helped build the complex version, then helped simplify it. These researchers had the deepest understanding of the precursor’s strengths and limitations, and rather than doubling down on what they’d already built, they were willing to step outside their own framework and strip it down to something more essential. That intellectual flexibility, the willingness to simplify your own prior work rather than defend it, may be the most underrated quality in the researchers behind these breakthroughs.

Reasonable people can and do disagree about which paper was the real breakthrough in any given area. I fully respect those disagreements. If your list would look different from mine, I’d love to hear why. The pattern might still hold, or it might break in interesting ways.


The Pattern

For every paper that changed deep learning, there was a precursor, published anywhere from 2 months to a couple of years earlier, that had 90% of the core idea. The precursor was theoretically sound. It often introduced the key mechanism. And yet it didn’t cross the tipping point into ubiquitous adoption.

Why not?

Looking at thirteen major AI milestones, from AlexNet to test-time compute scaling, a single recurring pattern emerges: the breakthrough paper found the minimal sufficient mechanism. In every case, the precursor introduced a powerful idea but buried it under unnecessary complexity: learned gates where a constant would do, local approximations where a brutal quantization would suffice, fine-tuning pipelines where a prompt would work, ODE solvers where plain regression was enough. The breakthrough stripped away everything except the essential insight, and what remained mapped cleanly onto existing infrastructure: standard matrix multiplications, existing optimizers, pretrained components, or just the model’s own in-context learning.

Simplification was the breakthrough.


Part I: The Foundations (2011 to 2020)

1. Before AlexNet: Ciresan’s GPU-CNNs

Precursor: High-Performance Neural Networks for Visual Object Classification (Ciresan et al., IDSIA/Schmidhuber lab, 2011)

Breakthrough: ImageNet Classification with Deep Convolutional Neural Networks (AlexNet) (Krizhevsky, Sutskever & Hinton, 2012)

Gap: ~12 months

Ciresan et al. were arguably the first to train deep CNNs on GPUs and demonstrate competitive results. Between 2011 and 2012, their GPU-trained models won multiple image recognition competitions, achieving state-of-the-art on MNIST, NORB, CIFAR-10, and several specialized vision tasks. The core technique (deep convolutional networks trained end-to-end on GPUs) was identical to what AlexNet would use.

Why it didn’t break through:

The demonstrations were on small benchmarks. MNIST (28×28 grayscale digits), NORB (normalized objects), CIFAR-10 (32×32 tiny images). These were the standard benchmarks of the day, but they weren’t taken seriously by the broader computer vision community as evidence that neural networks could compete with hand-engineered feature pipelines (SIFT + SVM) on real-world images.

What AlexNet did: Competed on ImageNet, 1.2 million high-resolution images across 1,000 categories, and won by a landslide. Top-5 error of 15.3% versus the runner-up’s 26.2%, a gap so large it was undeniable. The technical innovations (ReLU, dropout, data augmentation, multi-GPU training) mattered, but the decisive move was choosing the right benchmark. ImageNet was large enough and difficult enough that the community couldn’t dismiss the result.

The same capability demonstrated on a toy benchmark is “interesting.” Demonstrated on the benchmark the field actually cares about, it’s a revolution.


2. Before ResNet: Highway Networks

Precursor: Highway Networks (Srivastava, Greff & Schmidhuber, May 2015)

Breakthrough: Deep Residual Learning for Image Recognition (He et al., December 2015)

Gap: 7 months

This is the cleanest example of the pattern. Highway Networks and ResNet solve the exact same problem (training networks with 100+ layers) using the exact same insight: let information skip layers. ResNet is, mathematically, a special case of Highway Networks where all gates are permanently set to 1.

Highway Networks borrowed LSTM-style gating: a learned transform gate controls how much signal passes through the layer, and a carry gate controls how much bypasses it:

y = H(x, W_H) · T(x, W_T) + x · C(x, W_C), with C = 1 − T in the simplified form

The network had to learn to keep the highway open.

Why it didn’t break through:

The learned gates added parameters, complicated the optimization landscape, and, most critically, could partially close during training, impeding gradient flow in exactly the deep networks they were supposed to enable. Training 100-layer Highway Networks required careful initialization of the gate biases. The complexity that was supposed to give the model flexibility became the very thing that made it harder to train.

What ResNet did: Deleted the gates. The residual connection is just addition:

y = F(x) + x

No parameters. No learning. Because ∂y/∂x = ∂F/∂x + 1, the gradient of the loss with respect to any layer’s input always contains a term of exactly 1. Gradients cannot vanish through the skip connection. He et al. replaced a learned mechanism with a hardwired structural prior, the identity mapping, and the optimization problem that Highway Networks struggled with simply disappeared.

The ResNet paper explicitly cites Highway Networks as motivation, then notes that its “shortcut connections are neither gated nor data-dependent,” and that this simpler approach works better. Schmidhuber has argued that ResNet should be recognized as a special case of Highway Networks, and he has a point, but the special case turned out to be the one that scaled.
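The contrast can be seen in a toy one-dimensional sketch. The `highway` and `residual` functions below are hypothetical illustrations, with the gate reduced to a single bias, not the actual layers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway(x, h, gate_bias):
    """Highway unit: output mixes the transformed signal h and the input x
    through a learned gate t; the carry path is scaled by (1 - t) < 1."""
    t = sigmoid(gate_bias)
    return h * t + x * (1.0 - t)

def residual(x, h):
    """Residual unit: the carry path is always exactly x, so dy/dx always
    contains the constant term 1, regardless of what the layer learns."""
    return h + x

# A badly initialized gate (large negative bias) nearly closes the highway:
x, h = 2.0, 5.0
assert abs(highway(x, h, -10.0) - x) < 1e-3   # the layer barely transforms
assert residual(x, h) == 7.0                  # nothing to initialize or learn
```

Stacking many gated units multiplies those (1 − t) factors together, which is exactly the vanishing-signal failure mode that ResNet’s parameter-free addition removes.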


3. Before the Transformer: Decomposable Attention

Precursor: A Decomposable Attention Model for Natural Language Inference (Parikh et al., June 2016)

Breakthrough: Attention Is All You Need (Vaswani et al., June 2017)

Gap: 12 months

Parikh et al. proved something remarkable: you don’t need recurrence. Their model used only attention and feed-forward networks (no RNNs, no LSTMs) and beat recurrent models on natural language inference. The paper demonstrated that attention alone was sufficient for strong NLP performance, presaging the Transformer’s core thesis by a full year.

Why it didn’t break through:

The model was shallow and task-specific. It used inter-sequence attention (attending between two input sequences for comparison tasks) but not self-attention (attending within a single sequence). It couldn’t be stacked into a deep, general-purpose encoder-decoder. And it was benchmarked only on NLI, a classification task, not on generation or translation, where sequence-to-sequence models were the gold standard.

Meanwhile, Cheng et al., 2016 introduced “intra-attention” (what we now call self-attention) but embedded it within an LSTM, adding the right mechanism to the wrong architecture.

What the Transformer added: Self-attention, multi-head attention, and a stackable encoder-decoder architecture. The key wasn’t inventing a new mechanism. It was composing the attention primitive into a general-purpose architecture and demonstrating it on the task the field cared most about (machine translation). The Transformer took the “attention is enough” insight and made it architecturally complete.
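To make the distinction concrete, here is a minimal single-head self-attention over one sequence. This is an illustrative sketch with the learned Q/K/V projections omitted, not the paper’s formulation:

```python
import math

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections: each token
    attends to every token in the SAME sequence via scaled dot products."""
    d = len(X[0])
    scores = [[sum(q * k for q, k in zip(X[i], X[j])) / math.sqrt(d)
               for j in range(len(X))] for i in range(len(X))]
    out = []
    for row in scores:
        m = max(row)                          # shift for numerical stability
        w = [math.exp(s - m) for s in row]
        z = sum(w)
        out.append([sum(wi * X[j][c] for j, wi in enumerate(w)) / z
                    for c in range(d)])
    return out

# Three 2-d tokens; every output is a softmax-weighted mixture of ALL inputs:
Y = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
assert len(Y) == 3 and all(len(y) == 2 for y in Y)
```

Decomposable Attention computed scores like these between two different sequences; the Transformer’s move was to apply them within one sequence and stack the result into deep encoder-decoder layers.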

The deeper origin of attention is Bahdanau et al., 2014, but at 33 months prior and still using RNNs, it’s better understood as a foundational ancestor than an “almost-breakthrough.” A closer temporal precursor is ConvS2S (Gehring et al., May 2017), published just 5 weeks before the Transformer, but it chose convolutions over pure attention, the road not taken.


4. Before BERT: ELMo

Precursor: Deep Contextualized Word Representations (ELMo) (Peters et al., February 2018)

Breakthrough: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., October 2018)

Gap: 8 months

ELMo showed that pre-trained contextualized representations dramatically improve downstream NLP tasks. Before ELMo, word embeddings (Word2Vec, GloVe) were static. “Bank” had the same vector whether it meant a financial institution or a river bank. ELMo gave each word a representation that depended on its full sentential context, derived from a pre-trained bidirectional LSTM language model. The “pre-train on language modeling, then transfer” paradigm was born.

Why it didn’t break through:

Two compounding limitations:

  1. Architecture: ELMo used bidirectional LSTMs, not Transformers. LSTMs process sequences step by step (O(n) sequential operations), so pre-training on large corpora was slow and didn’t scale.

  2. Integration pattern: In practice, ELMo’s representations were predominantly used as frozen features. You’d run text through the pre-trained ELMo, extract the contextualized embeddings, and feed them as input features to a separate task-specific model. While the original paper noted that fine-tuning the biLM weights could sometimes help, the dominant usage pattern was feature extraction, and the pre-trained weights were largely kept fixed. This paradigm limited how deeply the pre-trained knowledge could adapt to each downstream task.

What BERT did: Combined two shifts:

  • Switched from LSTMs to Transformers (parallelizable, scalable)
  • Switched from feature extraction to full fine-tuning, where the entire pre-trained model’s weights are updated on each downstream task

BERT pre-trained faster, scaled to larger corpora, and adapted more deeply to each task. It achieved massive improvements on 11 NLP benchmarks simultaneously, making “pre-train then fine-tune” the default paradigm for all of NLP.

See also: ULMFiT (Howard & Ruder, January 2018), which introduced transfer learning for NLP with discriminative fine-tuning. Published a month before ELMo, but using LSTMs and benchmarked mainly on text classification.


5. Before ViT: Stand-Alone Self-Attention

Precursor: Stand-Alone Self-Attention in Vision Models (Ramachandran et al., NeurIPS 2019)

Breakthrough: An Image is Worth 16×16 Words (Dosovitskiy et al., October 2020)

Gap: ~16 months

Ramachandran et al. showed that self-attention layers can fully replace convolutions in vision models while matching or exceeding performance. They proved the theoretical point: you don’t need convolutions for image recognition.

Why it didn’t break through:

Self-attention scales quadratically with sequence length. For a 224×224 image, treating each pixel as a token gives n = 224 × 224 = 50,176 tokens, a 50K × 50K attention matrix per head. Completely infeasible.

The precursor’s solution was local self-attention: restrict attention to small spatial neighborhoods (e.g., 7×7 windows around each pixel). This made it computationally tractable but required custom CUDA kernels to efficiently handle the irregular memory access patterns. It also defeated the purpose. The whole point of attention is global receptive fields, and local attention is just a more complicated convolution.

What ViT did: The patch projection. Instead of attending to 50,176 pixels, chop the image into 16×16 patches and treat each patch as a token. Now n = (224/16)² = 196. The “projection” is just a linear layer applied to each flattened patch, mathematically equivalent to a single convolution with kernel size and stride of 16.

This is a brutal quantization of spatial resolution at the input layer. But it meant ViT could use a completely standard NLP Transformer with no custom kernels, no local attention windows, no vision-specific architectural modifications. The same codebase, the same hardware optimizations, the same scaling laws.
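The patch step itself is a few lines. The `patchify` helper below is a hypothetical sketch, with the learned linear projection that follows it omitted:

```python
def patchify(image, p):
    """Split an H×W image (list of rows) into non-overlapping p×p patches,
    each flattened into a single token vector."""
    H, W = len(image), len(image[0])
    tokens = []
    for i in range(0, H, p):
        for j in range(0, W, p):
            patch = [image[i + di][j + dj] for di in range(p) for dj in range(p)]
            tokens.append(patch)
    return tokens

# A 224×224 grayscale image with 16×16 patches yields 196 tokens of length 256,
# versus 50,176 tokens if every pixel were its own token:
image = [[0] * 224 for _ in range(224)]
tokens = patchify(image, 16)
assert len(tokens) == (224 // 16) ** 2 == 196
assert len(tokens[0]) == 16 * 16 == 256
```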

See also: Image Transformer (Parmar et al., 2018), which applied self-attention to image generation but only in restricted 1D local blocks.


6. Before DDPM: Score-Based Generative Modeling

Precursor: Generative Modeling by Estimating Gradients of the Data Distribution (Song & Ermon, NeurIPS 2019)

Breakthrough: Denoising Diffusion Probabilistic Models (Ho et al., June 2020)

Gap: ~12 months

Song & Ermon introduced score-based generative modeling: train a neural network to estimate the score function (gradient of the log-density) of the data distribution, then generate samples using Langevin dynamics. The approach was principled, theoretically elegant, and produced reasonable results on CIFAR-10.

Why it didn’t break through:

The score-matching objective required careful multi-scale noise scheduling, specifically “annealed Langevin dynamics” across multiple noise levels. Getting the noise schedule right was finicky, and sample quality didn’t yet rival GANs. The generation pipeline was complex: you needed to choose how many noise levels, what the noise magnitudes should be, how many Langevin steps at each level, and the step sizes. Many hyperparameters, fragile training.

What DDPM did: Ho et al. drew deep connections between diffusion models and denoising score matching (the formal unification of the two frameworks into a single SDE framework came later, in Song et al., 2021). But their key contribution was a dramatically simpler training objective. Instead of estimating the score function with complex weighting across noise scales, the network just predicts the noise that was added to a clean image:

L_simple = E_{t, x₀, ε} ‖ε − ε_θ(x_t, t)‖²

Plain MSE. One noise level sampled uniformly per training step. No annealed scheduling, no multi-scale objectives. This simplified loss also acted as an implicit weighting that focused model capacity on perceptually important image features, and the resulting sample quality exceeded GANs for the first time.
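A minimal sketch of one training step under the simplified objective. The precomputed `alpha_bar` schedule and the `eps_model` callable are assumptions for illustration:

```python
import math, random

def ddpm_training_step(x0, eps_model, alpha_bar):
    """One step of DDPM's simplified objective: pick one noise level uniformly,
    noise the clean sample, and regress the added noise with plain MSE."""
    t = random.randrange(len(alpha_bar))                   # uniform noise level
    a = alpha_bar[t]
    eps = [random.gauss(0.0, 1.0) for _ in x0]             # target noise
    x_t = [math.sqrt(a) * x + math.sqrt(1 - a) * e         # noised input
           for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)

# Even a model that always guesses zero noise yields a well-defined MSE loss:
loss = ddpm_training_step([0.5, -0.5], lambda x_t, t: [0.0, 0.0], [0.9, 0.5, 0.1])
assert loss >= 0.0
```

Contrast this with annealed Langevin dynamics: no per-level step sizes, no Langevin iteration counts, no multi-scale weighting. One sampled level, one MSE.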

The deeper origin is Sohl-Dickstein et al., 2015, which laid the mathematical foundation using nonequilibrium thermodynamics, but the 5-year gap puts it in the “foundational ancestor” category rather than “almost-breakthrough.”


7. Before Chain-of-Thought: Scratchpads

Precursor: Show Your Work: Scratchpads for Intermediate Computation with Language Models (Nye et al., November 2021)

Breakthrough: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., January 2022)

Gap: ~2 months

Nye et al. demonstrated the core insight: if you force a language model to output intermediate computation steps before giving a final answer, it can solve problems (polynomial evaluation, addition of large numbers) that it otherwise fails at. The idea of “using more serial compute via sequential token generation” was fully present.

Why it didn’t break through:

Scratchpads required fine-tuning. You had to generate supervised datasets of execution traces (step-by-step arithmetic, variable tracking through program execution) and train the model on these traces. This was:

  1. Expensive: gradient updates on trace-annotated datasets
  2. Brittle: each new reasoning domain needed its own curated trace dataset
  3. Narrow: the traces were formatted like program execution logs, not natural language reasoning

The even deeper origin is Ling et al., 2017, who created a 100K-sample dataset of math problems with step-by-step natural language rationales. The Wei et al. paper itself cites them as pioneering this idea.

What Chain-of-Thought did: Discovered that at sufficient scale (>100B parameters), you don’t need fine-tuning at all. Just show the model a few examples of step-by-step reasoning in the prompt (few-shot), or even just append “Let’s think step by step” (zero-shot), and the reasoning emerges from in-context learning. Wei et al. shifted reasoning from a data engineering problem (curate traces → fine-tune) to a prompting pattern (demonstrate → generate).

The key enabling factor was scale. Nye et al. worked with smaller models where the reasoning capability had to be explicitly trained in. At 100B+ parameters, the capability was already latent. It just needed to be elicited.
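With prompting, the entire "pipeline" reduces to string concatenation. The demonstration below paraphrases the paper’s well-known tennis-ball example; the second question is invented for illustration:

```python
# Few-shot chain-of-thought: one worked demonstration, then the new question.
demonstration = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)
question = "Q: A juggler has 3 boxes of 4 clubs each. How many clubs?\nA:"
prompt = demonstration + question
# No trace dataset, no gradient updates: the model completes the prompt
# with its own step-by-step reasoning, elicited purely in context.
assert prompt.endswith("A:")
```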


Part II: The Modern Era (2020 to 2025)

The same pattern continues through the current wave of AI development.

8. Before Chinchilla: Kaplan’s Scaling Laws

Precursor: Scaling Laws for Neural Language Models (Kaplan et al., 2020)

Breakthrough: Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., NeurIPS 2022)

Gap: ~26 months

This is a case where the precursor did break through. Kaplan et al.’s scaling laws were hugely influential, but got the key number wrong, leading the entire field astray.

Kaplan et al. showed clean power-law relationships between loss and compute/data/parameters across 7 orders of magnitude. This was itself a breakthrough: it turned LLM training from art into engineering. The precursor to Kaplan is Hestness et al., 2017, who observed power-law scaling across 4 domains but treated it as a descriptive observation rather than a prescriptive tool.

Why Kaplan’s version needed correction:

Kaplan’s scaling exponents implied that for a fixed compute budget, making models larger was more efficient than adding more data. In practice, this meant a roughly 3:1 ratio of parameter scaling to token scaling, which produced GPT-3 (175B parameters, 300B tokens), Gopher (280B parameters, 300B tokens), and other models that were, in retrospect, massively undertrained.

What Chinchilla did: Corrected the key exponent. Hoffmann et al. showed that parameters and tokens should scale equally, a 1:1 ratio, not 3:1. Their 70B-parameter Chinchilla, trained on 1.4 trillion tokens, used the same compute budget as the 280B-parameter Gopher but allocated it differently (4x fewer parameters, 4x more data) and significantly outperformed it.

This single correction reshaped the entire industry. Llama (65B, 1.4T tokens), Mistral, and every efficient open-weight model is a “Chinchilla-optimal” model. The lesson: even within a correct framework (scaling laws), getting the ratio wrong can waste billions of dollars of compute.
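The reallocation is easy to check with the standard C ≈ 6ND rule of thumb for training FLOPs. This is a back-of-the-envelope sketch, not the paper’s exact accounting:

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute via the C ≈ 6·N·D rule of thumb."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1400e9)   # 70B params, 1.4T tokens

# Roughly the same budget, spent on 4x fewer parameters and ~4.7x more data:
assert abs(gopher - chinchilla) / gopher < 0.2
```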


9. Before Flow Matching: Neural ODEs

Precursor: Neural Ordinary Differential Equations (Chen et al., NeurIPS 2018) and FFJORD (Grathwohl et al., ICLR 2019)

Breakthrough: Flow Matching for Generative Modeling (Lipman et al., ICLR 2023) and Flow Straight and Fast (Rectified Flow) (Liu et al., ICLR 2023)

Gap: ~4 years from FFJORD; but the direct simplification target is DDPM-style diffusion (2020), making the effective gap ~2 years

Neural ODEs and FFJORD established that you can model data distributions with continuous normalizing flows: parameterize a vector field, solve an ODE to map noise to data. The idea was elegant: generation as a continuous flow through time.

Why they didn’t break through:

Training required expensive ODE simulation during both forward and backward passes. FFJORD used Hutchinson’s trace estimator for log-density computation, which was stochastic, noisy, and slow to converge. Training was memory-intensive and numerically unstable at scale. The continuous flow idea was sound, but the training procedure was impractical.

Meanwhile, DDPM (2020) and score-based SDEs (Song et al., 2021) achieved much better generation quality, but with their own complexity: stochastic forward processes producing curved probability paths that required 50 to 1000 denoising steps for high-quality sampling.

What Flow Matching did: Eliminated simulation during training entirely. Instead of solving an ODE to compute loss, you directly regress a vector field against analytically known conditional paths, typically straight-line interpolations between noise and data.

Rectified Flow further simplified by showing that iteratively straightening these paths yields nearly straight trajectories that can be traversed in a handful of Euler steps, enabling 4 to 8 step generation instead of 50 to 1000.
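The training loop is ordinary regression. A sketch with straight-line conditional paths, where `x0` stands for a noise sample and `x1` for a data sample:

```python
import random

def flow_matching_loss(x0, x1, v_model):
    """Simulation-free flow matching with straight-line (rectified) paths:
    x_t = (1-t)·x0 + t·x1, whose exact velocity is x1 - x0. No ODE solver
    appears anywhere in training -- this is plain MSE regression."""
    t = random.random()
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]      # constant along the path
    pred = v_model(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x0)

# A model that already outputs the true velocity has exactly zero loss:
x0, x1 = [0.0, 0.0], [1.0, 2.0]
assert flow_matching_loss(x0, x1, lambda x_t, t: [1.0, 2.0]) == 0.0
```

Compare FFJORD, where evaluating the loss meant integrating an ODE with a stochastic trace estimator; here the target is known in closed form at every t.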

This is now the dominant generative modeling paradigm. Stable Diffusion 3 uses rectified flow matching. Meta, Adobe, and Runway have adopted it in production. Flow matching unified diffusion, score matching, and optimal transport as special cases under different choices of probability paths, and the simplest choice (straight lines) turned out to work best.


10. Before DPO: RLHF

Precursor: Training Language Models to Follow Instructions with Human Feedback (InstructGPT/RLHF) (Ouyang et al., 2022)

Breakthrough: Direct Preference Optimization (Rafailov et al., NeurIPS 2023)

Gap: ~14 months

RLHF solved the alignment problem: make language models actually do what users want. The recipe was: (1) collect human comparisons of model outputs, (2) train a reward model on those comparisons, (3) use PPO (a reinforcement learning algorithm) to fine-tune the language model against the reward model while staying close to the original policy via KL regularization.

InstructGPT showed that a 1.3B-parameter RLHF model could be preferred by humans over a 175B base GPT-3. The approach powered ChatGPT and kicked off the current AI wave.

Why it became the bottleneck:

RLHF is a complex multi-stage pipeline. You need to train and maintain a separate reward model (which can overfit, go stale, or be exploited). You need PPO, which is notoriously finicky: sensitive to hyperparameters, reward scaling, KL penalty coefficients, and value function estimation. The pipeline keeps at least three models in memory simultaneously (policy, reference policy, reward model), making it GPU-hungry. Many teams reported that getting RLHF to work reliably required months of engineering effort.

The deeper precursor is Learning to Summarize with Human Feedback (Stiennon et al., 2020), which applied RLHF to a single task (summarization). InstructGPT’s breakthrough was generalizing it to all tasks.

What DPO did: Proved mathematically that the reward model and PPO optimization can be collapsed into a single supervised learning objective. Instead of the three-stage pipeline (collect preferences → train reward model → run RL), DPO directly optimizes the policy on preference pairs using a classification-style loss:

L_DPO = −E[ log σ( β log (π_θ(y_w|x) / π_ref(y_w|x)) − β log (π_θ(y_l|x) / π_ref(y_l|x)) ) ]

No reward model. No RL. No PPO. Just supervised fine-tuning on preference pairs. The key insight was that the optimal policy under the RLHF objective has a closed-form relationship with the reward function, so you can implicitly learn the reward while directly learning the policy.

DPO is to RLHF what ResNet is to Highway Networks: the same destination, reached by deleting the learned intermediary. Most open-source alignment efforts now use DPO or a variant (IPO, KTO, ORPO) as a core component, often combined with other techniques. Llama 3, for instance, uses DPO alongside rejection sampling and PPO in a multi-stage pipeline.
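The whole objective fits in a few lines. A sketch assuming you already have per-sequence log-probabilities from the policy and the frozen reference model:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO on one preference pair: a logistic loss on the gap between how much
    the policy (vs. the reference) favors the chosen answer y_w over y_l."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy favors the chosen answer more than the reference does,
# the loss drops below the log(2) it starts at:
assert dpo_loss(-1.0, -5.0, -2.0, -2.0) < math.log(2.0)
```

The margin plays the role of an implicit reward difference; minimizing this with any standard optimizer replaces the reward-model-plus-PPO stage entirely.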


11. Before Flash Attention: Memory-Efficient Attention

Precursor: Self-attention Does Not Need Memory (Rabe & Staats, 2021)

Breakthrough: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., NeurIPS 2022)

Gap: ~6 months

The efficient attention literature from 2020 to 2021 was enormous. Dozens of papers proposed ways to reduce attention’s complexity: low-rank projections (Linformer), kernel approximations (Performer), sparse patterns (Longformer), and more. But the most direct precursor is Rabe & Staats, who proved the algorithmic insight: attention doesn’t need O(n²) memory. You can compute exact attention using online softmax and gradient checkpointing in O(√n) memory.

Why it didn’t break through:

The broader efficient attention literature traded quality for speed, and every approximation degraded performance. But even Rabe & Staats, who kept exact attention, didn’t deliver the key thing practitioners needed: wall-clock speedup. Their paper was purely algorithmic and mathematical. It reduced memory without reducing time, because it didn’t account for the GPU memory hierarchy. The paper had no optimized implementation.

The fundamental error across the entire efficient attention literature was treating attention’s bottleneck as computational (too many FLOPs). It’s actually memory bandwidth: the bottleneck is reading and writing the attention matrix to and from GPU high-bandwidth memory (HBM).

What Flash Attention did: Kept exact attention, no approximation, and solved the problem from a completely different angle, an IO-aware implementation. Instead of materializing the full attention matrix in slow HBM, Flash Attention tiles the computation in fast SRAM, computing attention in blocks and never writing the full matrix to slow memory. The algorithm is mathematically identical to standard attention; only the memory access pattern changed.

This is the inverse of the usual paradigm: instead of simplifying the model, Flash Attention simplified the implementation’s relationship with hardware. But the effect is the same: it removed the obstacle (memory bottleneck) that was preventing standard attention from scaling to long sequences, without requiring anyone to change their model architecture.
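The trick that makes tiling possible is the online softmax from the precursor: score blocks can be folded in one at a time, keeping only a running max, normalizer, and weighted sum, never the full row. A numerically safe sketch for a single attention row:

```python
import math

def online_softmax_weighted_sum(scores, values, block=2):
    """Fold in (score, value) blocks one at a time, keeping only a running
    max m, normalizer z, and weighted sum acc -- never the full score row."""
    m, z, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), block):
        s_blk, v_blk = scores[i:i + block], values[i:i + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new) if z else 0.0   # rescale old partial sums
        z = z * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / z

# Agrees with the naive version that materializes the whole row:
scores, values = [0.1, 2.0, -1.0, 0.5], [1.0, 2.0, 3.0, 4.0]
w = [math.exp(s) for s in scores]
naive = sum(wi * v for wi, v in zip(w, values)) / sum(w)
assert abs(online_softmax_weighted_sum(scores, values) - naive) < 1e-9
```

Flash Attention’s contribution was to map exactly this blockwise recurrence onto the GPU memory hierarchy, so each block lives in fast SRAM and the O(n²) matrix is never written to HBM.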

Flash Attention is now the default attention implementation in virtually every major LLM framework. It enabled the jump from 2K to 128K+ context lengths.


12. Before LoRA: Adapter Layers

Precursor: Parameter-Efficient Transfer Learning for NLP (Adapter Layers) (Houlsby et al., ICML 2019)

Breakthrough: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., ICLR 2022)

Gap: ~28 months

Houlsby et al. showed that you don’t need to fine-tune all parameters of a large pre-trained model. Instead, insert small “adapter” bottleneck modules between Transformer layers, freeze the original weights, and train only the adapters. This reduced trainable parameters dramatically while preserving most of the quality of full fine-tuning.

Why it didn’t break through:

Adapter layers had two structural problems:

  1. Inference latency: Adapters are additional sequential modules in the forward pass. Every inference step must evaluate the adapter layers, adding latency proportional to the number of adapter parameters.

  2. Architectural modification: The adapted model has a different architecture than the original, requiring a modified forward pass, which complicates deployment, serving, and model merging.

What LoRA did: Instead of adding new modules, decompose the weight update itself. Weight updates during fine-tuning have low intrinsic rank. So represent the update as ΔW = BA, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} with rank r ≪ min(d, k).

At inference time, merge: W = W₀ + BA. The adapted model has zero additional latency: same architecture, same forward pass, just different weight values. No architectural modification. No serving complexity.
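The merge can be sketched with plain nested lists; `matmul` and `lora_merge` are hypothetical helpers, not a library API:

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_merge(W0, B, A):
    """Fold the low-rank update into the frozen weights: W' = W0 + B·A.
    After merging, inference is the original forward pass with new values."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W0, BA)]

d, k, r = 4, 4, 1                      # rank r much smaller than min(d, k)
W0 = [[0.0] * k for _ in range(d)]     # frozen pretrained weight
B = [[1.0] for _ in range(d)]          # d × r, trainable
A = [[0.5] * k]                        # r × k, trainable
assert d * r + r * k < d * k           # 8 trainable values instead of 16
assert lora_merge(W0, B, A)[0][0] == 0.5
```

At real scale the gap is far larger: for a 4096×4096 weight with r = 8, the factors hold ~65K trainable values against ~16.8M frozen ones.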

LoRA reduced trainable parameters by 10,000× while matching full fine-tuning quality. More importantly, it democratized fine-tuning: anyone with a consumer GPU could adapt a 7B+ parameter model. The entire open-source LLM ecosystem (every Hugging Face model variant, every domain adaptation) runs on LoRA or its derivatives (QLoRA, DoRA).

The pattern is the same as the others: the precursor added a mechanism (adapter modules); the breakthrough removed it by folding the adaptation into existing weights.


13. Before o1 / R1: Reasoning Scaffolds

Precursor: Chain-of-Thought Prompting (Wei et al., 2022), Self-Consistency (Wang et al., 2022), Tree of Thoughts (Yao et al., 2023), Let’s Verify Step by Step (Lightman et al., 2023)

Breakthrough: Test-time compute scaling / o1 (OpenAI, September 2024) and DeepSeek-R1 (DeepSeek, January 2025)

Gap: ~18 months from process reward models to o1

The intellectual lineage here is a chain of precursors, each adding a piece:

  • Chain-of-Thought (2022): models reason better when they show their work
  • Self-Consistency (2022): sample multiple reasoning chains and majority-vote
  • Tree of Thoughts (2023): explore multiple reasoning paths with search
  • Process Reward Models (Lightman et al., 2023): reward each step of reasoning, not just the final answer
  • Scaling Test-Time Compute (Snell et al., 2024): proved that spending more inference compute can be more effective than scaling model size

Each piece demonstrated that spending more compute at inference time improves reasoning. But each was a prompting trick or a supervised training procedure, external scaffolding bolted onto a base model.

Why they didn’t break through individually:

CoT is a prompting pattern, powerful but uncontrolled. Self-Consistency is brute-force sampling. Tree of Thoughts requires hand-designed search procedures. Process reward models require expensive human annotations of step-level correctness (Lightman et al. released 800K labels). None of these approaches trained the model itself to internalize search and verification as part of its generation process.

What o1 and R1 did: Used reinforcement learning to train the model to autonomously allocate variable amounts of inference-time compute. The model learns when to think longer, when to backtrack, when to verify its own reasoning, without external scaffolding, hand-designed search trees, or human-annotated step labels. The “chain of thought” is no longer a prompting trick; it’s a learned behavior reinforced by outcome rewards.

DeepSeek-R1-Zero showed this could be done with pure RL (GRPO) without any supervised fine-tuning on reasoning traces, and that reasoning behaviors (self-verification, backtracking, exploration) emerge spontaneously from the RL training process, reminiscent of how in-context learning emerged from scale in GPT-3. (The final DeepSeek-R1 model adds cold-start data and multi-stage training on top of this foundation.)

The simplification: replace external search scaffolding with learned internal search. This opened an entirely new scaling axis. Before o1, “scaling” meant bigger models or more training data. Now you can also scale at inference time, trading compute for capability on a per-query basis.


The Pattern

Every case follows the same arc:

Precursor → Breakthrough (what was deleted):

  • Ciresan’s GPU-CNNs → AlexNet: small benchmarks → ImageNet scale
  • Highway Networks → ResNet: learned gates → identity shortcut
  • Decomposable Attention → Transformer: task-specific model → general architecture
  • ELMo → BERT: LSTMs + frozen features → Transformers + fine-tuning
  • Stand-Alone Self-Attention → ViT: per-pixel attention → patch tokenization
  • Score-based models → DDPM: complex noise schedule → simple MSE loss
  • Scratchpads → Chain-of-Thought: fine-tuning on traces → prompting
  • Kaplan’s scaling laws → Chinchilla: wrong ratio (3:1) → correct ratio (1:1)
  • Neural ODEs / FFJORD → Flow Matching: ODE simulation → direct regression
  • RLHF pipeline → DPO: reward model + PPO → single loss
  • Rabe & Staats → Flash Attention: algorithm only → IO-aware kernel
  • Adapter layers → LoRA: added modules → merged weight updates
  • CoT + scaffolding → o1 / R1: external search → learned reasoning

The precursors proved the idea works. The breakthroughs made the idea easy to use.

The pattern extends beyond these 13 cases. In diffusion models, classifier guidance (Dhariwal & Nichol, 2021) required training a separate classifier on noisy images; classifier-free guidance (Ho & Salimans, 2022) deleted the classifier entirely. In Mixture of Experts, Shazeer et al., 2017 used complex top-k routing with load-balancing losses; Switch Transformer (Fedus et al., 2021) simplified to k=1 routing. In state space models, S4 (Gu et al., 2021) required complex structured matrices with HiPPO initialization; Mamba (Gu & Dao, 2023) replaced them with simple input-dependent diagonal SSMs. Every time: delete the complex part, keep the essential mechanism.


Three Sub-Patterns

Within the overarching theme of simplification, three distinct strategies recur:

1. Delete the learned intermediary. Highway Networks learned gates → ResNet deleted them. RLHF learned a reward model → DPO deleted it. Scratchpads required learned execution traces → CoT just prompted. Classifier guidance trained a separate classifier → classifier-free guidance deleted it. When a component exists only to route or mediate, try hardwiring its behavior or removing it entirely.
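DPO makes this concrete: the deleted reward model survives only implicitly, inside a single per-pair loss on log-probabilities. A sketch following Rafailov et al.’s formulation, with toy numbers of my choosing:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are log-probabilities of the chosen/rejected responses under
    # the policy (pi_*) and the frozen reference model (ref_*).
    # The policy's implicit reward is beta * (log pi - log pi_ref); the
    # loss is -log sigmoid of the chosen-minus-rejected reward margin.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy already prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2):
loss = dpo_loss(pi_chosen=-1.0, pi_rejected=-3.0,
                ref_chosen=-2.0, ref_rejected=-2.0)
print(round(loss, 3))  # ≈ 0.598
```

No reward model, no rollouts, no PPO loop: one differentiable function of four log-probs.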

2. Brutally quantize the input representation. ViT threw away 99.6% of spatial resolution at the input layer (50,176 pixels → 196 tokens) and gained access to the entire Transformer ecosystem. Flow matching replaced smooth stochastic paths with straight lines. Chinchilla showed you need far fewer parameters and far more data than assumed. When perfect fidelity at every stage isn’t necessary, find the coarsest representation that preserves the essential signal.
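ViT’s quantization step is small enough to write down in full: a reshape from pixels to patch tokens. A sketch in NumPy, assuming the standard 224×224 RGB input and 16×16 patches:

```python
import numpy as np

def patchify(img, patch=16):
    # Cut an H x W x C image into non-overlapping patch x patch squares
    # and flatten each square into a single token vector.
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    grid = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 50,176 pixels collapsed to 196 tokens
```

Everything after this line is a completely standard Transformer.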

3. Move the complexity to where infrastructure already exists. Flash Attention didn’t simplify the math. It simplified the memory access pattern to fit GPU SRAM. BERT didn’t invent pre-training. It moved ELMo’s idea onto Transformers, where scaling was already proven. AlexNet didn’t invent GPU-CNNs. It moved them onto ImageNet, where the community was already paying attention. LoRA didn’t invent parameter-efficient fine-tuning. It reformulated adapters so they merge into existing weights at zero cost. The breakthrough often isn’t a new idea. Sometimes it’s putting an existing idea where existing infrastructure can amplify it.
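LoRA’s zero-cost merge is plain linear algebra: Wx + BAx = (W + BA)x, so the low-rank update folds into the frozen weight before deployment. A minimal sketch, with random matrices standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and low rank, r << d
W = rng.normal(size=(d, d))         # frozen pretrained weight
B = rng.normal(size=(d, r)) * 0.01  # stand-ins for trained LoRA factors
A = rng.normal(size=(r, d))         # (the real B starts at zero before training)

x = rng.normal(size=(d,))
y_adapter = x @ W.T + x @ (B @ A).T  # training-time: adapter kept separate

W_merged = W + B @ A                 # deployment: fold the update into W once
y_merged = x @ W_merged.T            # same shape, zero added inference cost

print(bool(np.allclose(y_adapter, y_merged)))  # True
```

Adapters added modules to the serving path; LoRA’s reformulation makes the extra modules disappear at deployment.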


What This Means for Current Research

If this pattern holds, the next major breakthroughs may already exist in papers published in the last 12 to 18 months, but in forms that are too complex, too specialized, or demonstrated on the wrong benchmarks.

The question to ask of any promising-but-underappreciated paper is:

What would happen if you deleted half the mechanism and scaled up what remains?

The Transformer deleted recurrence. ResNet deleted the gates. ViT deleted per-pixel attention. DDPM deleted the complex loss. Flow Matching deleted the curved paths. DPO deleted the reward model. LoRA deleted the adapter modules. In every case, the removal was the contribution.

Some areas where “almost-breakthroughs” may be waiting for their simplification moment:

  • State space models (Mamba, S4) have demonstrated impressive efficiency for long sequences, but adoption remains limited by architectural complexity and the need for custom kernels. A ViT-style brutal simplification, making SSMs work with completely standard infrastructure, could change this.
  • Mixture of Experts scaling remains gated by complex routing mechanisms and load-balancing losses. The lesson of Highway Networks suggests: what if the routing were simpler, or even static?
  • Multimodal alignment currently requires elaborate multi-stage training pipelines (separate vision encoders, projection layers, multi-phase training schedules). A DPO-like collapse of the pipeline into a single training objective could unlock much wider adoption.

The next paradigm shift might not come from adding something new. It might come from someone who looks at a complicated 2025 paper and asks: what if we just… didn’t do that part?