The Scaling Laws are already broken: smaller models win out on reasoning long term

The SoTA LLMs that score highest on standard benchmarks all have >100B parameters, but those benchmarks consist mainly of “flat” tasks: single-prompt problems with short, self-contained answers. The current scaling curves, which plot test loss against parameter count, show smooth power-law gains and suggest that more weights yield monotonic progress. However, these curves are misleading: they measure token-level accuracy, not whole-task reliability across longer, chained sequences of actions.

Once models need to maintain that correctness through hundreds or thousands of dependent steps (writing, compiling, running, reading, revising, etc.), they break down. Below is my argument for why parameter growth or increased test-time compute alone cannot overcome that failure mode, and why smaller, modular, hierarchy-aware systems will likely dominate in the long run.

Concept breadth versus chain length

Define \(L\) as the number of dependent steps (tokens or actions) a task requires, \(\varepsilon\) as the per-step error rate, \(P\) as the parameter count, and \(S\) as the probability that the whole chain succeeds, with \(\varepsilon\propto P^{-\alpha}\) for some scaling exponent \(\alpha>0\).
With an autoregressive decoder, token errors accumulate multiplicatively:

$$ S=(1-\varepsilon)^{L}\approx e^{-\varepsilon L}. $$

Increasing \(P\) reduces \(\varepsilon\) only polynomially (\(\varepsilon\propto P^{-\alpha}\)); to hold \(S\) fixed while doubling the horizon, \(\varepsilon\) must halve, which multiplies the required parameter count by \(2^{1/\alpha}\).

Because \(\varepsilon\) falls only polynomially while the chain penalty is exponential, success collapses once \(L\gg P^{\alpha}\). Long-horizon coding loops (\(L\!\sim\!10^{3\!-\!4}\)) cross this frontier even for 100B-parameter models.

Our brains avoid the collapse by chunking: macro-actions compress thousands of micro moves into tens of high-level decisions (write a function, run tests, skim a diff), so effective chain length is logarithmic in problem size (\(L_{\text{eff}}\!\ll\!L\)), and keeps \(S\) reasonable.
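
To make the collapse concrete, here is a minimal numeric sketch of the failure law, assuming an illustrative \(\varepsilon\approx P^{-\alpha}\) with \(\alpha=0.3\) (the constants are made up; only the shape matters):

```python
# Toy model of the failure law above; c and alpha are assumptions chosen only
# to illustrate the shape, not fitted values.
def eps_from_params(P: float, c: float = 1.0, alpha: float = 0.3) -> float:
    """Per-step error rate under an assumed eps ~ c * P^(-alpha) scaling."""
    return c * P ** (-alpha)

def chain_success(eps: float, L: int) -> float:
    """Probability that L dependent steps all succeed: (1 - eps)^L."""
    return (1.0 - eps) ** L

for P in (1e9, 1e11, 1e13):
    eps = eps_from_params(P)
    row = "  ".join(f"L={L}: S={chain_success(eps, L):.3f}" for L in (10, 1_000, 10_000))
    print(f"P={P:.0e}  {row}")

# Chunking: the same eps, but 10_000 micro-steps compressed into 50 macro-actions.
eps = eps_from_params(1e11)
print("micro-chain:", chain_success(eps, 10_000), "macro-chain:", chain_success(eps, 50))
```

Even the largest model degrades badly on the 10,000-step chain, while the chunked 50-step chain stays above 0.9 at the same per-step error rate.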

Information-theoretic capacity bend

Dense networks reuse the same weights for every concept. Interference noise between gradient updates grows like \(\sqrt{C/P}\) for \(C\) stored concepts.

Holding accuracy fixed therefore demands

\[ P \;\gtrsim\; \frac{C}{\varepsilon^{2}}, \]

so after a certain breadth the marginal parameter cost per additional concept rises super-linearly. Empirically we are still on the pre-bend “Chinchilla” slope, but real-world benchmarks with longer action sequences (coding agents, multi-hop tool use, etc.) are already beginning to hint at saturation.
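
A quick worked reading of this bound, with assumed concept counts and target error rates (illustrative only):

```python
# Worked example of the capacity bend P >~ C / eps^2; all numbers are
# illustrative assumptions, not measurements.
def params_needed(C: float, eps: float) -> float:
    return C / eps ** 2

for C in (1e6, 1e8, 1e10):
    print(f"C={C:.0e}, eps=1e-3  ->  P >~ {params_needed(C, 1e-3):.1e}")

# Halving the tolerated per-step error quadruples the bound, so concept breadth
# and chain length (which forces a smaller eps) compound against dense models.
print(params_needed(1e8, 1e-3), params_needed(1e8, 5e-4))
```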

Salience collapse: capacity-driven memorization versus reasoning

Large dense models can drive their training loss down by allocating surplus parameters to rote lookup rather than deeper abstraction. The extra capacity expands the surface area of memorised fragments faster than it expands the manifold of compositional reasoning.

Let \(d_m(P)\propto P^{\beta}\) denote the effective capacity a model of size \(P\) devotes to memorised lookup, and \(d_g(P)\propto P^{\gamma}\) the capacity it devotes to compositional reasoning.
For a task requiring effective reasoning load \(D\), success needs:

\[ d_g(P)\gtrsim D \]

Yet gradient noise from memorization grows with \(d_m(P)\), so the signal-to-noise ratio scales like \(P^{\gamma-\beta}\) and degrades whenever \(\beta>\gamma\). Beyond the crossover \(P^{\star}\) at which memorization growth overtakes reasoning growth, further scale reduces reasoning fidelity even as perplexity improves.

Policy-gradient alignment (RLHF, DPO, “policy refinement”) attempts to re-inject salience by adding an auxiliary loss \(L_{\text{policy}}\) that penalizes obviously spurious continuations. Each new domain or safety constraint requires a fresh reward model and multiple PPO passes, so alignment compute grows roughly

\[ \text{FLOPs}_{\text{align}}\;\approx\;k\,P\,T, \]

with \(T\) human preference samples and \(k\!>\!1\) PPO epochs. Alignment absorbs a larger share of the total training budget as \(P\) rises, pushing the economic wall forward only marginally.
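
A back-of-the-envelope reading of that expression, with the sample count, epoch count, and per-sample cost all assumed purely for illustration:

```python
# FLOPs_align ~ k * P * T; T preference samples, k PPO epochs, per-sample cost ~ P.
def align_flops(P: float, T: float, k: float) -> float:
    return k * P * T

T, k = 1e6, 4          # assumed values
for P in (1e9, 1e11, 1e12):
    print(f"P={P:.0e}  ->  ~{align_flops(P, T, k):.1e} FLOPs per aligned domain")
# Every new domain or safety constraint repeats this bill, and it scales linearly in P.
```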

In summary, larger models over-memorise, and we must then pay extra compute to prune that memorisation back down to human-relevant abstractions. Small or sparsely activated models can skip this step entirely.

Online / continual learning sharpens the wall

Replay-bound economics


To avoid catastrophic forgetting when new data arrives, dense models must replay old samples or regularise all weights:

\[ \text{marginal FLOPs} \;\approx\; P\;(1+R), \]

where \(R\) is the replay ratio. For \(P\!\sim\!10^{11}\), even \(R\!=\!0.1\) eventually becomes financially prohibitive. A 1B-parameter model can be kept current daily; a 1T-parameter one cannot.
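
A rough daily-refresh comparison under that cost model; the fresh-token volume and the ~6 FLOPs-per-parameter-per-token training rule of thumb are assumptions for illustration:

```python
# Marginal update cost ~ P * (1 + R) per token, scaled by the usual ~6 FLOPs
# per parameter per training token; all counts here are illustrative.
def daily_update_flops(P: float, new_tokens: float, R: float) -> float:
    return 6.0 * P * new_tokens * (1.0 + R)

new_tokens, R = 1e9, 0.1       # assumed fresh tokens per day, replay ratio
for P in (1e9, 1e11, 1e12):
    print(f"P={P:.0e}  ->  {daily_update_flops(P, new_tokens, R):.1e} FLOPs/day")
# The 1T-parameter model's daily bill is a thousand times the 1B model's.
```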

Drift scaling

Let \(\delta=|\Delta\theta|/|\theta|\) be the relative weight drift per task.

With sparse or adapter-based updates, \(P_{\text{active}}\!\ll\!P\): only the routed slice moves, so

\[ \delta\propto \sqrt{C/P_{\text{active}}} \]

applies to that slice alone while the frozen weights, and the concepts they store, stay untouched, keeping interference controllable. Dense updates keep \(P_{\text{active}}\!=\!P\), spread the drift across every stored concept, and hit the \(C/\varepsilon^{2}\) wall quickly.
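
A toy reading of this confinement argument, with made-up counts; the point is where the drift lands, not the exact values:

```python
import math

C_total  = 1e8      # assumed concepts already stored in the base model
C_task   = 1e4      # assumed concepts introduced by the new task
P_dense  = 1e11     # dense update: every weight moves
P_active = 1e7      # adapter/sparse update: only this slice moves

# Dense update: gradients touch all weights, so every stored concept sees
# interference noise on the order of sqrt(C_total / P) compounding per task.
print("dense drift felt by old knowledge:", math.sqrt(C_total / P_dense))

# Adapter update: the frozen base does not move (old knowledge sees ~zero drift);
# drift ~ sqrt(C_task / P_active) stays inside the slice serving the new task.
print("adapter drift, confined to slice :", math.sqrt(C_task / P_active))
```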

Test-time compute is a brittle workaround

“Chain-of-thought” (CoT) prompting, self-consistency sampling, and tree-of-thought search attempt to average out the exponential failure \(S=(1-\varepsilon)^{L}\) by generating many reasoning traces per query and selecting the first or majority-voted success.

Let \(K\) be the number of independent reasoning chains sampled per query, \(S\) the per-chain success probability from above, and \(C_{\text{tok}}\) the compute cost of generating a single token.
The probability that at least one chain succeeds is

\[ P_{\text{any}} \;=\; 1-(1-S)^{K}\;\approx\;1-e^{-KS}. \]

To reach \(P_{\text{any}}\ge 0.63\) one needs

\[ K\gtrsim 1/S \]

Given \(S\approx e^{-\varepsilon L}\), this implies

\[ K \;\approx\; e^{\,\varepsilon L}. \]

Since each of the \(K\) chains emits roughly \(L\) tokens, inference FLOPs therefore scale as

\[ \text{FLOPs} \;=\; K\,L\,C_{\text{tok}} \;\propto\; L\,e^{\varepsilon L}, \]

exponential in the very horizon the method tries to tame. At long \(L\) the compute wall arrives almost as rapidly as for parameter scaling.
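
A short sketch of how fast the sampling bill grows, with an assumed per-token error rate (\(\varepsilon=10^{-3}\)) chosen only for illustration:

```python
import math

# Best-of-K cost under S = exp(-eps * L); K = ceil(1/S) gives P_any >= 1 - 1/e.
eps = 1e-3

for L in (10, 100, 1_000, 5_000, 10_000):
    S = math.exp(-eps * L)
    K = math.ceil(1.0 / S)
    print(f"L={L:>6}  S={S:.4f}  K={K:>6}  relative FLOPs (K*L)={K * L:.2e}")
```

The sample count is flat for short chains and then explodes: at \(L=10^4\) this toy setting already needs tens of thousands of chains per query.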

Moreover, CoT inherits the same salience gaps: the model still evaluates each token in isolation, so spurious low-level correlations leak into the trace. Voting or heuristic scoring only filters outcomes; it does not raise the underlying \(S\).

In essence, test-time brute force buys reliability by paying an exponential tax that sooner or later dwarfs both training and deployment budgets, so it's another band-aid over the \(e^{-\varepsilon L}\) issue rather than a fundamental solution.

Latent scratchpad suffers from the same issue

The class of “latent-scratchpad” systems, like OpenAI’s o-series, DeepSeek-R1, etc., does the speculative exploration during training, then distills the surviving path into a single forward pass at test time. This approach introduces two extra cost terms.

Hidden-loop factor

During inference each output token is no longer a single transformer pass; it is the result of an unrolled micro-loop of depth \(r\) embedded in the block (e.g. iterative attention updates, look-ahead planners, or a value-head scoring step), so the per-token cost is

\[ \text{FLOPs}_{\text{token}} \;=\; r\,C_{\text{tok}},\qquad r>1 . \]

If \(r\) is fixed, total runtime still grows only linearly in \(L\). In practice, models raise \(r\) with prompt length to maintain context-wide coherence (DeepSeek’s planner re-invokes itself on each tool call), so \(r=r_0+\kappa L\) and the per-sequence cost becomes \(O(L^2)\).
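
Making the quadratic term explicit, assuming the linear growth of \(r\) just described and summing the per-token cost over an \(L\)-token sequence:

\[ \text{FLOPs}_{\text{seq}} \;=\; \sum_{t=1}^{L}\bigl(r_0+\kappa t\bigr)\,C_{\text{tok}} \;\approx\; \Bigl(r_0\,L+\tfrac{\kappa}{2}\,L^{2}\Bigr)\,C_{\text{tok}} \;=\; O(L^{2}). \]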

Trace-distillation cost

Let \(K_{\text{train}}\) be the number of sampled reasoning traces stored for imitation. Distillation loss reduces token error as

\[ \varepsilon_{\text{distilled}} \approx \varepsilon_{\text{base}}/\sqrt{K_{\text{train}}} \]

but the gradient noise from contradicting traces adds a variance term \(\sigma^2\propto K_{\text{train}}\). The required number of optimisation steps therefore scales like \(\sigma^2/\varepsilon^2\sim K_{\text{train}}^2\), so with \(M\) tokens processed per optimisation step:

\[ \text{FLOPs} \;\approx\; K_{\text{train}}^2\,M\,C_{\text{tok}} . \]

The approach trades a one-shot exponential inference wall for a quadratic training wall plus a super-linear runtime factor. For long-horizon tasks (\(L\!\sim\!10^3\)) even moderate \(K_{\text{train}}\) (10–30) rivals the raw cost of running many CoT samples at test time.
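
A rough side-by-side of the two bills under the formulas above, in units of \(C_{\text{tok}}\); every constant here (\(\varepsilon\), \(L\), \(M\), the query count) is an assumption for illustration only:

```python
import math

eps, L = 1e-3, 5_000
M = 1e6                                  # assumed tokens per optimisation step
Q = 1e4                                  # assumed deployed queries to amortise over

K_test = math.ceil(math.exp(eps * L))    # best-of-K samples per query, from above
cot_total = Q * K_test * L               # brute-force test-time bill

for K_train in (10, 30, 100):
    distill = K_train ** 2 * M           # trace-distillation training bill
    print(f"K_train={K_train:>3}  distill={distill:.1e}  CoT total={cot_total:.1e}")
```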

Residual fragility

The distilled single path is still a chain of length \(L\); only the selection occurred offline. Any unmodelled context shift resurrects the original error law \(S=(1-\varepsilon)^{L}\). The hidden loop cannot retroactively explore alternatives, so test-time flexibility is gone.

Net effect

Latent-scratchpad models compress exploration into training but pay:

  1. a quadratic training wall from trace distillation (\(\text{FLOPs}\approx K_{\text{train}}^2\,M\,C_{\text{tok}}\)),
  2. a super-linear runtime factor from the hidden loop (\(r=r_0+\kappa L\), hence \(O(L^{2})\) per sequence), and
  3. residual fragility, because the distilled path is still a length-\(L\) chain under any unmodelled context shift.

They remain a band-aid for the \(e^{-\varepsilon L}\) failure mode.

Empirical fault line: long-range coding tasks

HumanEval, GSM8K, MMLU (the benchmarks that motivated current scaling laws) fit within roughly 15 reasoning steps. They primarily measure local reasoning and say nothing about a model's capacity for long-range credit assignment.

Ask the same models to:

  1. implement a feature that spans several files in an existing codebase,
  2. compile it, run the test suite, and read the resulting failures,
  3. revise and repeat until the build is green,
and error chains will quickly explode.

The perplexity gains observed on those flat benchmarks disappear or become meaningless here, implying that the unit of action, not the language loss, is the limiting factor.

How smaller/sparser models can cope better

The recipe that emerges from the above analysis is to shift optimization effort away from brute-force token-level accuracy and toward explicit structuring of computation. We can achieve this through two orthogonal efforts:

  1. Horizon compression – collapse long micro-token chains into a handful of macro actions that can be supervised or rolled back individually.
  2. Selective activation – ensure that only a small, context-relevant slice of the parameter matrix participates in each forward/backward pass so new knowledge only perturbs a bounded region.



These techniques convert the exponential failure surface \(e^{-\varepsilon L}\) into a tractable polynomial and bound the \(C/\varepsilon^{2}\) interference bleed. They do so through parameter reuse rather than parameter accumulation, keeping training and inference budgets within a single-GPU or single-node envelope.
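
A minimal sketch of what horizon compression with a verifier in the loop could look like; the function names and signatures here are hypothetical placeholders, not an existing agent framework:

```python
from typing import Callable, List

def solve(task: str,
          plan: Callable[[str], List[str]],      # hypothetical planner
          execute: Callable[[str], str],         # hypothetical executor
          verify: Callable[[str, str], bool],    # hypothetical verifier, e.g. run the tests
          max_retries: int = 3) -> List[str]:
    """Run a task as tens of verified macro-actions instead of thousands of raw tokens."""
    results: List[str] = []
    for macro_action in plan(task):              # L_eff ~ tens, not thousands
        for _ in range(max_retries):
            output = execute(macro_action)       # micro-steps stay inside this call
            if verify(macro_action, output):     # gate every macro step
                results.append(output)
                break                            # an error cannot propagate past here
        else:
            raise RuntimeError(f"macro step failed after retries: {macro_action}")
    return results
```

Because every macro step is verified and retried independently, the success probability multiplies over \(L_{\text{eff}}\) verified steps rather than \(L\) raw tokens.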

Reliability ↔ Continual-Learning Trade-offs of Modular Techniques

| Design move | Chain reliability | Continual-learning cost |
| --- | --- | --- |
| Hierarchical planner + executor | Cuts \(L_{\text{eff}}\) by 10–100× | Only the planner/executor slice is retrained |
| Verifier-in-the-loop | Detects and rolls back errors at every macro step | No weight update needed |
| Sparse MoE routing | Active params per token ≈ 10–50B | Replay on the affected experts only |
| Retrieval memory | Offloads rare concepts, reduces interference | New info is logged, not re-trained |
| LoRA / adapters | Drift confined to small adapter matrices | Update cost ∝ adapter size |

Each technique lowers either \(L\) or interference, breaking the exponential failure mode without enlarging \(P\).
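
As a concrete instance of the last row, a minimal LoRA-style sketch (shapes and rank chosen for illustration) showing how the active, drift-prone parameter count stays a tiny fraction of the total:

```python
import numpy as np

# The base weight W is frozen and reused; only the low-rank pair (A, B) is
# trained, so updates perturb r*(d_in + d_out) numbers instead of d_in*d_out.
d_in, d_out, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen base weights
A = rng.standard_normal((r, d_in)) * 0.01                # trainable down-projection
B = np.zeros((d_out, r))                                 # trainable up-projection, zero-init

def forward(x: np.ndarray) -> np.ndarray:
    return W @ x + B @ (A @ x)                           # base path + adapter path

x = rng.standard_normal(d_in)
print("total params :", W.size)           # 1_048_576
print("active params:", A.size + B.size)  # 16_384 -- weight drift is confined here
print("output shape :", forward(x).shape)
```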

Conclusion

The mainstream scaling narrative extrapolates single-step accuracy and misses the exponential fragility of long chains. Past a short horizon, dense parameter growth has severely diminishing returns and essentially hits a wall: every new concept conflicts with all previous ones, while each additional reasoning step multiplies the odds of failure.

Hierarchical control, external verification, sparse activation and retrieval shift the curve: they compress the effective chain length and cut interference without bloating the active model. The systems that leverage small, modular, tool-using principles will likely achieve human-level macro accuracy first, even as gigantic models keep hitting the \(e^{-\varepsilon L}\) limit.

Matthew Di Ferrante