The scaling laws are already broken: smaller models win out on reasoning in the long run
The SoTA LLMs that score highest on standard benchmarks all have parameter counts above 100B, but those benchmarks consist mainly of “flat” tasks: single-prompt problems with short, self-contained answers. The familiar scaling curves, which plot test loss against parameter count, show smooth power-law gains and suggest that more weights yield monotonic progress. These curves are misleading, however: they measure token-level accuracy, not whole-task reliability across long, chained sequences of actions.
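For reference, the parameter-count scaling law in question is usually written as a power law of the form below (the constants are empirical fits from the literature, e.g. Kaplan et al., 2020, not something I am deriving here):

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, $$

where L is test loss and N is the non-embedding parameter count. Note what this does and does not promise: a smooth, slow reduction in per-token loss, with no statement at all about success rates on multi-step tasks.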
Once models need to maintain that per-step correctness through hundreds or thousands of dependent steps (writing, compiling, running, reading, revising, and so on), they break down, because errors compound across the chain. Below is my argument for why parameter growth and extra test-time compute alone cannot close that gap, and why smaller, modular, hierarchy-aware systems are likely to dominate in the end.
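To make the compounding point concrete, here is a minimal sketch. It assumes, purely for illustration, that each dependent step succeeds independently with probability p and that there is no self-correction; real agent loops violate both assumptions, but the qualitative collapse is the same:

```python
# Back-of-the-envelope: if each of n dependent steps succeeds with probability p
# (independence assumed for illustration), the whole task succeeds with p**n.
# Small per-token gains barely move this number until p is extremely close to 1.

def whole_task_success(p: float, n: int) -> float:
    """Probability of completing all n dependent steps without an unrecovered error."""
    return p ** n

for p in (0.99, 0.999, 0.9999):
    for n in (100, 500, 1000):
        print(f"per-step accuracy {p:.4f}, {n:4d} steps -> "
              f"whole-task success {whole_task_success(p, n):.3f}")
```

A model that is 99% reliable per step finishes a 500-step task less than 1% of the time; even 99.9% per-step reliability only gets you to roughly 60%. That is the shape of the gap the rest of this argument is about.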