I read a paper with a pretty unexpected idea: language models may have been trained with massive efficiency losses all these years.
The researchers argue that modern neural networks can be trained several times faster. In a controlled experiment the gap hit 16x — with no new data, no new hardware. The problem is baked into the architecture itself.
How a model thinks vs. how it speaks
Internally the model thinks in "meanings." It compresses any concept into a very compact code — call it a set of a few thousand numbers. That's its native language, its working memory.
But it communicates with us in words. For that it has a vocabulary — tens of thousands of tokens (words and fragments), varying by model.
When the network writes you a response, the process goes inside-out. It takes its compact thought and expands it to pick one suitable word from the vocabulary. Works great.
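That inside-out step can be sketched in a few lines of NumPy. The sizes and the name `W_unembed` are illustrative assumptions, not the paper's notation (real models have an internal vector in the thousands of numbers and vocabularies in the tens of thousands; toy sizes are used here to keep it small):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512       # size of the internal "thought" vector (toy value)
vocab_size = 8_000  # vocabulary size (toy value)

# Illustrative stand-in for a learned parameter: the output projection
# maps the compact internal state onto one score per vocabulary token.
W_unembed = rng.standard_normal((vocab_size, d_model)) * 0.01

h = rng.standard_normal(d_model)  # the model's compact "thought"
logits = W_unembed @ h            # expand: one score per token in the vocabulary

# Softmax turns the scores into a probability distribution over tokens,
# from which one word is picked.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
next_token_id = int(probs.argmax())

print(h.shape, logits.shape)  # the 512-number thought fans out into 8,000 scores
```

The only point of the sketch is the shapes: a compact internal vector fans out into a score for every word in the vocabulary.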
Where the bottleneck appears
During training the process runs in reverse. Say the model outputs the wrong word. The training algorithm then produces a detailed report: a score for how appropriate every single word in the vocabulary would have been. You get a massive, ultra-detailed map of what the correct answer should have looked like.
For the model to actually learn, this enormous error map has to be compressed and shoved back into its internal format — those same few thousand numbers.
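In cross-entropy terms, the "error map" and the squeeze back into the internal format look roughly like the standard backprop step below. The sizes are toy assumptions and the target token id is made up for illustration; this shows the generic mechanism, not the paper's own code:

```python
import numpy as np

rng = np.random.default_rng(1)

d_model = 512       # internal representation size (toy value)
vocab_size = 8_000  # vocabulary size (toy value)

W_unembed = rng.standard_normal((vocab_size, d_model)) * 0.01
h = rng.standard_normal(d_model)

# Forward: expand the internal state into per-token scores, as before.
logits = W_unembed @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The "detailed error map": for cross-entropy loss, the gradient at the
# output is (probs - one_hot_target), one correction value per vocabulary token.
target = np.zeros(vocab_size)
target[123] = 1.0                # hypothetical correct token id
grad_logits = probs - target     # shape (8000,): the full-vocabulary report

# Backprop then squeezes that entire map through the transposed projection
# into the model's much smaller internal format.
grad_h = W_unembed.T @ grad_logits  # shape (512,): what actually flows back inside

print(grad_logits.shape, grad_h.shape)
```

An 8,000-value report is forced through a fixed linear map into 512 numbers; that dimensional squeeze is the bottleneck the article is describing.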
It's like trying to transmit a detailed 4K image over a dial-up modem by squishing it down to thumbnail size. Something gets through, sure, but most of the nuance gets destroyed.
The researchers measured it: 95 to 99% of the useful signal is lost at this bottleneck. The model still learns, just 20x slower than it could. That slowdown, they argue, is exactly what companies have been compensating for by throwing compute at the problem.
What this means
If this holds up, the industry has spent years buying GPUs by the tens of thousands when the actual bottleneck was the weight update mechanism itself.
There's no solution in the paper yet. But the authors seem to be the first to map the physics of this so precisely: networks underfit not because of data scarcity, but because the architecture simply doesn't let new knowledge through in full.