Your 70-Billion-Parameter Model Might Be 40% Wasted
Author: LLMs Research
Uploaded: 2026-02-11
Views: 5
Description:
Three papers from February 1–6, 2026 converge on a question the field has been avoiding since 2016: what if most transformer layers aren't doing compositional reasoning at all, but just averaging noise?
This video traces a decade of evidence, from Veit et al.'s original ensemble observation in ResNets through ShortGPT's layer-pruning results and October 2025's formal proof, to three new papers that quantify the consequences. Inverse depth scaling shows loss improving only as D^(-0.30) with depth D, worse than one-over-D. TinyLoRA unlocks 91% GSM8K accuracy by training just 13 parameters with RL. And the attention sink turns out to be a native Mixture-of-Experts router hiding in plain sight.
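To make that exponent concrete, here is a quick back-of-the-envelope check (our own arithmetic, not a figure from the paper) of what D^(-0.30) scaling implies when you double a model's depth, compared with ideal one-over-depth scaling:

```python
# Back-of-the-envelope check (our arithmetic, not a result from the paper):
# how much does doubling depth D reduce loss if L(D) scales as D**alpha?
for alpha in (-0.30, -1.0):          # reported fit vs. ideal one-over-depth
    ratio = 2 ** alpha               # L(2D) / L(D)
    print(f"alpha={alpha:+.2f}: doubling depth keeps {ratio:.2f} of the loss "
          f"({(1 - ratio) * 100:.0f}% reduction)")
```

Under the reported fit, doubling depth removes only about 19% of the loss, versus the 50% that one-over-depth scaling would give.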
The picture that emerges: modern LLMs are simultaneously too deep (layers averaging rather than composing) and too wide (attention heads collapsing into dormancy). Architecturally large, functionally much smaller.
This is a video adaptation of our LLMs Research newsletter issue covering the same papers.
Papers referenced (in order of appearance):
Residual Networks Behave Like Ensembles of Relatively Shallow Networks (Veit, Wilber, Belongie, 2016) https://arxiv.org/abs/1605.06431
Deep Networks with Stochastic Depth (Huang et al., 2016) https://arxiv.org/abs/1603.09382
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., 2020) https://arxiv.org/abs/1909.11942
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (Men et al., 2024) https://arxiv.org/abs/2403.03853
Your Transformer is Secretly Linear (Razzhigaev et al., 2024) https://arxiv.org/abs/2405.12250
On Residual Network Depth (Dherin, Munn, 2025) https://arxiv.org/abs/2510.03470
Inverse Depth Scaling From Most Layers Being Similar (Liu, Kangaslahti, Liu, Gore, 2026) https://arxiv.org/abs/2602.05970
Learning to Reason in 13 Parameters / TinyLoRA (Morris, Mireshghallah, Ibrahim, Mahloujifar, 2026) https://arxiv.org/abs/2602.04118
Attention Sink Forges Native MoE in Attention Layers (Fu, Zeng, Wang, Li, 2026) https://arxiv.org/abs/2602.01203
Timestamps:
0:00 Why this should bother you
0:41 Veit 2016: ResNets as ensembles
2:14 Stochastic depth, ALBERT, and the quiet accumulation
3:08 ShortGPT, secretly linear transformers, and the formal proof
4:22 February 2026: this week's answer
4:38 Inverse depth scaling: D to the negative 0.30
5:57 Where does capability actually live?
6:23 TinyLoRA: 13 parameters, 91% accuracy
8:35 Width: attention sinks as native MoE
10:58 What this means for architecture, fine-tuning, and inference
11:49 The decade-long arc
Newsletter: https://llmsresearch.substack.com
GitHub: https://github.com/llmsresearch
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit llmsresearch.substack.com