Fixing Reasoning from Three Directions at Once
Author: LLMs Research
Uploaded: 2026-02-07
Views: 10
Description:
Fixing Reasoning from Three Directions at Once
LLMs Research Podcast | Episode: Feb 1–6, 2026
DeepSeek-R1 made RL the default approach for reasoning. This episode covers the 30 papers from the first week of February that are debugging what that approach got wrong, across training geometry, pipeline design, reward signals, memory architecture, and inference efficiency.
Timestamps
[00:00] Opening and framing — The post-DeepSeek hangover: why the field has shifted from scaling RL to fixing it. Three camps emerge: geometricians, pipeline engineers, and architects.
[01:11] MRPO and the bias manifold — Standard RL may not create new reasoning at all, just surface existing capabilities within a constrained subspace. Spectral Orthogonal Exploration forces the model into orthogonal directions (sketch below). A 4B model beating a 32B on AIME'24.
[04:15] ReMiT and the training flywheel — Breaking the linear pretraining-to-RL pipeline. The RL-tuned model feeds signal backward into pretraining. The finding that logical connectors ("therefore," "because") carry the reasoning weight (sketch below).
[06:03] DISPO and training stability - The four-regime framework for gradient control. When to unclip updates, when to clamp hard. Surgical control over the confidence-correctness matrix to prevent catastrophic collapse (sketch below).
[07:20] TrajFusion and learning from mistakes - Interleaving wrong reasoning paths with reflection prompts and correct paths. Turning discarded data into structured supervision (sketch below).
[08:05] Lightning round: Grad2Reward and CPMöbius - Dense rewards from a single backward pass through a judge model. Self-play for math training without external data.
[08:46] InfMem and active memory — The shift from passive context windows to active evidence management. PreThink-Retrieve-Write protocol: the model pauses, checks if it has enough information, retrieves if not, and stops early when it does (sketch below). 4x inference speedup.
[09:56] ROSA-Tuning - CPU-based suffix automaton for context retrieval, freeing the GPU for reasoning (sketch below). 1980s data structures solving 2026 problems.
[10:36] OVQ Attention - Online vector quantization for linear-time attention (sketch below). Removes the quadratic memory ceiling for long context.
[11:05] Closing debate: which paper survives six months? — One host bets on MRPO (the geometric view challenges scale-is-all-you-need). The other bets on ReMiT (the flywheel efficiency is too obvious to ignore once you see it).
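
Code sketches for the segments marked above follow. First, MRPO: the episode only gives the high-level picture of Spectral Orthogonal Exploration, so this is a minimal numpy sketch of that picture, assuming the "bias manifold" is the low-rank subspace spanned by recent policy updates and that exploration means redirecting part of each new update into its orthogonal complement. The function name, SVD-based subspace estimate, and mixing coefficient are placeholders, not the paper's method.

```python
# Illustrative sketch of "spectral orthogonal exploration" as described in the
# MRPO segment. Names, shapes, and the blending rule are assumptions.
import numpy as np

def orthogonal_exploration_step(update, past_updates, k=8, mix=0.5):
    """Push part of a policy update out of the low-rank subspace spanned by
    recent updates (the 'bias manifold'), so training explores new directions.

    update:        flat parameter update for this step, shape (d,)
    past_updates:  matrix of recent flat updates, shape (n, d)
    k:             number of top singular directions treated as the manifold
    mix:           fraction of the step redirected into the orthogonal complement
    """
    # Top-k right singular vectors of the update history approximate the
    # subspace where standard RL keeps moving the policy.
    _, _, vt = np.linalg.svd(past_updates, full_matrices=False)
    basis = vt[:k]                             # (k, d), rows are orthonormal

    in_manifold = basis.T @ (basis @ update)   # projection onto the manifold
    orthogonal = update - in_manifold          # component leaving the manifold

    # Blend: keep some exploitation inside the manifold, but guarantee a
    # minimum amount of movement in orthogonal directions.
    return (1 - mix) * update + mix * orthogonal

# Toy usage: 3 past updates in a 10-dimensional parameter space.
rng = np.random.default_rng(0)
history = rng.normal(size=(3, 10))
step = orthogonal_exploration_step(rng.normal(size=10), history, k=3)
print(step.shape)
```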
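
For the ReMiT segment, a sketch of one way the connector finding could feed back into training: upweight the loss on logical-connector tokens during mid-training. The connector list, boost factor, and loss shape are assumptions for illustration, not the paper's recipe.

```python
# Sketch of a connector-weighted mid-training loss inspired by the ReMiT
# finding that logical connectors carry disproportionate reasoning weight.
import torch
import torch.nn.functional as F

CONNECTORS = {"therefore", "because", "thus", "hence", "so"}

def connector_weighted_loss(logits, targets, target_words, connector_boost=2.0):
    """Token-level cross-entropy where connector tokens get a larger weight.

    logits:       (seq_len, vocab_size) next-token logits
    targets:      (seq_len,) target token ids
    target_words: list of seq_len decoded target words (for the connector check)
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.tensor(
        [connector_boost if w.lower() in CONNECTORS else 1.0 for w in target_words]
    )
    return (weights * per_token).sum() / weights.sum()

# Toy usage with random logits over a 100-word vocabulary.
seq, vocab = 4, 100
loss = connector_weighted_loss(
    torch.randn(seq, vocab),
    torch.randint(0, vocab, (seq,)),
    ["the", "because", "answer", "therefore"],
)
print(loss.item())
```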
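
For the DISPO segment, a toy version of four-regime gradient control, assuming the regimes are the four cells of a confidence-correctness matrix and each cell gets its own clip range on the PPO-style importance ratio. The thresholds and the regime-to-range mapping are invented for illustration; the paper's actual rule may differ.

```python
# Sketch of a DISPO-style "four-regime" clipping rule over the
# confidence x correctness matrix. Thresholds and ranges are assumptions.

def regime_clip_range(confidence, correct, eps=0.2):
    """Pick a clip range for the importance ratio based on which cell of the
    confidence x correctness matrix the token falls in."""
    confident = confidence > 0.7
    if confident and correct:          # reinforce: standard clipping
        return 1 - eps, 1 + eps
    if confident and not correct:      # overconfident mistake: clamp hard
        return 1 - eps / 2, 1 + eps / 2
    if not confident and correct:      # promising but timid: unclip the upside
        return 1 - eps, float("inf")
    return 1 - eps, 1 + eps            # unconfident and wrong: standard

def clipped_objective(ratio, advantage, confidence, correct):
    """PPO-style clipped surrogate with a regime-dependent clip range."""
    lo, hi = regime_clip_range(confidence, correct)
    clipped_ratio = min(max(ratio, lo), hi)
    return min(ratio * advantage, clipped_ratio * advantage)

# A low-confidence correct token keeps its full upside; a confident correct
# one is clipped at the standard bound.
print(clipped_objective(1.8, 1.0, confidence=0.4, correct=True))   # 1.8
print(clipped_objective(1.8, 1.0, confidence=0.9, correct=True))   # 1.2
```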
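
For the TrajFusion segment, a sketch of the data construction described: a failed rollout, a reflection cue, and the correct path fused into one supervised example. The template wording and field layout are assumptions.

```python
# Sketch of TrajFusion-style data construction: instead of discarding failed
# rollouts, interleave a wrong reasoning path, a reflection prompt, and the
# correct path into a single training sequence.

REFLECTION_PROMPT = (
    "Wait, let me check that reasoning for mistakes before continuing."
)

def build_fused_example(question, wrong_path, correct_path):
    """Return one training sequence: question -> flawed attempt ->
    reflection cue -> corrected solution."""
    return "\n".join([
        f"Question: {question}",
        f"Attempt: {wrong_path}",
        REFLECTION_PROMPT,
        f"Corrected solution: {correct_path}",
    ])

example = build_fused_example(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 5 = 340 + 85 = 425",   # slips on the decomposition
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
)
print(example)
```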
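
For the InfMem segment, the PreThink-Retrieve-Write loop as described in the summary, written as a plain control loop. The helper callables (has_enough_evidence, retrieve, answer) are hypothetical stand-ins for model and retriever calls, not InfMem's API.

```python
# Sketch of the PreThink-Retrieve-Write protocol: judge whether the evidence
# buffer is sufficient, retrieve only if it is not, and stop early once it is.

def prethink_retrieve_write(question, has_enough_evidence, retrieve, answer,
                            max_hops=4):
    memory = []                              # active evidence buffer
    for hop in range(max_hops):
        if has_enough_evidence(question, memory):
            break                            # early stop: skip useless retrieval
        evidence = retrieve(question, memory)
        if not evidence:
            break
        memory.append(evidence)              # "write": keep only what was needed
    return answer(question, memory)

# Toy usage with canned components for a 2-hop question.
facts = ["Marie Curie was born in Warsaw.", "Warsaw is the capital of Poland."]
result = prethink_retrieve_write(
    "In which country was Marie Curie born?",
    has_enough_evidence=lambda q, mem: len(mem) >= 2,
    retrieve=lambda q, mem: facts[len(mem)] if len(mem) < len(facts) else None,
    answer=lambda q, mem: " ".join(mem) + " -> Poland",
)
print(result)
```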
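
For the ROSA-Tuning segment, a standard suffix automaton (the 1980s data structure the hosts allude to), built on the CPU over the context tokens, with a query that reports how much of the current generation's suffix already appears in the context. How ROSA actually uses the match is not described in the summary, so only the index itself is sketched here.

```python
# Minimal suffix automaton over a token sequence: recognizes every substring
# of the context with O(1)-per-token transitions, entirely on the CPU.

class SuffixAutomaton:
    def __init__(self, tokens):
        self.next = [{}]      # outgoing transitions per state
        self.link = [-1]      # suffix links
        self.length = [0]     # longest substring length ending in each state
        last = 0
        for t in tokens:
            cur = self._add_state(self.length[last] + 1)
            p = last
            while p != -1 and t not in self.next[p]:
                self.next[p][t] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][t]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:                      # split state q with a clone
                    clone = self._add_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(t) == q:
                        self.next[p][t] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def _add_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.length) - 1

    def longest_suffix_match(self, query):
        """Length of the longest suffix of `query` that occurs in the context."""
        state, matched = 0, 0
        for t in query:
            while state != 0 and t not in self.next[state]:
                state = self.link[state]       # fall back to a shorter match
                matched = self.length[state]
            if t in self.next[state]:
                state = self.next[state][t]
                matched += 1
            else:
                matched = 0
        return matched

context = "the cat sat on the mat while the dog slept".split()
sa = SuffixAutomaton(context)
print(sa.longest_suffix_match("we think the dog slept".split()))   # 3 ("the dog slept")
```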
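
For the OVQ Attention segment, a sketch of the general idea of online vector quantization for attention: quantize each incoming key into a small codebook and keep per-code value summaries, so a read costs O(num_codes) rather than O(sequence length). Codebook size, the nearest-neighbor update rule, and the softmax over codes are illustrative assumptions, not the paper's algorithm.

```python
# Sketch of linear-time attention via an online-quantized key-value summary.
import numpy as np

class OnlineVQAttention:
    def __init__(self, dim, num_codes=16, lr=0.1, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.codes = rng.normal(size=(num_codes, dim))   # key codebook
        self.value_sum = np.zeros((num_codes, dim))      # per-code value totals
        self.counts = np.zeros(num_codes)                # tokens per code
        self.lr = lr

    def write(self, key, value):
        """Assign the key to its nearest code, nudge the code toward it, and
        fold the value into that code's running summary (O(num_codes))."""
        c = np.argmin(np.linalg.norm(self.codes - key, axis=1))
        self.codes[c] += self.lr * (key - self.codes[c])
        self.value_sum[c] += value
        self.counts[c] += 1

    def read(self, query):
        """Attend over occupied codes: softmax(query . codes) times the mean
        value stored under each code."""
        used = self.counts > 0
        scores = self.codes[used] @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        mean_values = self.value_sum[used] / self.counts[used][:, None]
        return weights @ mean_values

# Toy usage: stream 1000 tokens, then answer one query in constant time.
rng = np.random.default_rng(1)
attn = OnlineVQAttention(dim=8, rng=rng)
for _ in range(1000):
    attn.write(rng.normal(size=8), rng.normal(size=8))
print(attn.read(rng.normal(size=8)).shape)   # (8,)
```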
Papers Discussed:
MRPO: Manifold Reshaping Policy Optimization (https://arxiv.org/abs/2602.02545)
ReMiT: RL-Guided Mid-Training (https://arxiv.org/abs/2602.03075)
DISPO (https://arxiv.org/abs/2602.00983)
TrajFusion (https://arxiv.org/abs/2602.04391)
Grad2Reward (https://arxiv.org/abs/2602.01791)
CPMöbius (https://arxiv.org/abs/2602.02979)
InfMem (https://arxiv.org/abs/2602.02704)
ROSA-Tuning (https://arxiv.org/abs/2602.02499)
OVQ Attention (https://arxiv.org/abs/2602.03922)
Key Takeaways
Standard RL might be trapping models in a low-rank subspace rather than expanding their reasoning capacity. The bias manifold concept from MRPO reframes the entire alignment-vs-capability debate as a geometric problem.
The strict separation between pretraining and post-training is looking increasingly artificial. ReMiT's finding that logical connectors carry disproportionate reasoning weight suggests the base model's curriculum should be informed by what the tuned model struggles with.
Passive context windows fail at multi-hop reasoning because they treat all tokens equally. Active memory management (InfMem) and hardware-level retrieval offloading (ROSA-Tuning) are converging on the same insight: models need to manage their own cognitive load.
Most RL training fixes this week assume GRPO-style optimization. If that changes, these contributions become fragile. The architectural work on memory and attention solves structural problems that persist regardless of the training recipe.
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit llmsresearch.substack.com (https://llmsresearch.substack.com?utm...)