
Fixing Reasoning from Three Directions at Once

Author: LLMs Research

Uploaded: 2026-02-07

Views: 10

Description: Fixing Reasoning from Three Directions at Once. LLMs Research Podcast | Episode: Feb 1–6, 2026

DeepSeek-R1 made RL the default approach for reasoning. This episode covers the 30 papers from the first week of February that are debugging what that approach got wrong, across training geometry, pipeline design, reward signals, memory architecture, and inference efficiency.

Timestamps
[00:00] Opening and framing — The post-DeepSeek hangover: why the field has shifted from scaling RL to fixing it. Three camps emerge: geometricians, pipeline engineers, and architects.

[01:11] MRPO and the bias manifold — Standard RL may not create new reasoning at all, just surface existing capabilities within a constrained subspace. Spectral Orthogonal Exploration forces the model into orthogonal directions. A 4B model beating a 32B on AIME'24.
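
The paper isn't excerpted here, but the core move is easy to sketch: estimate the subspace spanned by recent policy updates and push part of each new update into its orthogonal complement. A minimal sketch under that reading; the function name, top-k cutoff, and mixing weight are illustrative assumptions, not MRPO's actual procedure.

```python
# Sketch of spectral orthogonal exploration as described in the episode.
# All names and hyperparameters here are hypothetical.
import torch

def orthogonal_exploration_step(update: torch.Tensor,
                                history: torch.Tensor,
                                k: int = 8,
                                mix: float = 0.5) -> torch.Tensor:
    """update: flattened parameter update, shape (d,).
    history: past updates stacked as rows, shape (n, d)."""
    # Top-k right singular vectors span the "bias manifold" of past updates.
    _, _, vh = torch.linalg.svd(history, full_matrices=False)
    basis = vh[:k]                               # (k, d)
    # Remove the component of the new update that lies inside that subspace.
    inside = basis.T @ (basis @ update)          # projection onto the manifold
    orthogonal = update - inside
    # Blend: keep some exploitation, force some orthogonal exploration.
    return (1 - mix) * update + mix * orthogonal
```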

[04:15] ReMiT and the training flywheel — Breaking the linear pretraining-to-RL pipeline. The RL-tuned model feeds signal backward into pretraining. The finding that logical connectors ("therefore," "because") carry the reasoning weight.
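
One concrete way to read the connector finding: reweight the pretraining loss so tokens like "therefore" and "because" contribute more. A toy sketch under that assumption; the connector list, the weight, and the interface are invented for illustration and are not the paper's recipe.

```python
# Toy token-weighted LM loss that upweights logical connectors.
# CONNECTORS and connector_weight are assumptions, not ReMiT's values.
import torch
import torch.nn.functional as F

CONNECTORS = {"therefore", "because", "hence", "thus", "so"}

def weighted_lm_loss(logits, targets, target_words, connector_weight=2.0):
    """logits: (T, V); targets: (T,) token ids; target_words: list of T strings."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    weights = torch.tensor(
        [connector_weight if w.lower() in CONNECTORS else 1.0
         for w in target_words],
        device=per_token.device)
    return (weights * per_token).sum() / weights.sum()
```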

[06:03] DISPO and training stability - The four-regime framework for gradient control. When to unclip updates, when to clamp hard. Surgical control over the confidence-correctness matrix to prevent catastrophic collapse.
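
The four-regime idea maps naturally onto a lookup over the confidence-correctness matrix: each quadrant gets its own clip range for the importance ratio. A minimal sketch; the thresholds and ranges below are placeholders, not DISPO's published values.

```python
# Quadrant-dependent clipping of a PPO-style importance ratio.
# tau and the (lo, hi) ranges are invented for illustration.
import torch

def regime_clip(ratio: torch.Tensor, confidence: float, correct: bool,
                tau: float = 0.7) -> torch.Tensor:
    confident = confidence > tau
    if correct and confident:        # reinforce gently, don't over-sharpen
        lo, hi = 0.8, 1.2
    elif correct and not confident:  # let the update grow: "unclip"
        lo, hi = 0.5, 2.0
    elif not correct and confident:  # confidently wrong: clamp hard
        lo, hi = 0.95, 1.05
    else:                            # unconfident and wrong: standard clip
        lo, hi = 0.8, 1.2
    return torch.clamp(ratio, lo, hi)
```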

[07:20] TrajFusion and learning from mistakes - Interleaving wrong reasoning paths with reflection prompts and correct paths. Turning discarded data into structured supervision.
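
The data construction is simple enough to show directly: stitch a discarded wrong trajectory to a reflection prompt and the correct trajectory, yielding one supervised example. The template text below is an assumption, not the paper's exact prompt.

```python
# Minimal sketch of interleaved trajectory fusion; templates are hypothetical.
def fuse_trajectories(question, wrong_path, correct_path,
                      reflection="Wait, that reasoning has a flaw. Let me reconsider."):
    return (
        f"Question: {question}\n"
        f"{wrong_path}\n"
        f"{reflection}\n"
        f"{correct_path}"
    )

example = fuse_trajectories(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 78 = 418",   # wrong: 17 * 4 is 68
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
)
```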

[08:05] Lightning round: Grad2Reward and CPMöbius - Dense rewards from a single backward pass through a judge model. Self-play for math training without external data.
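
For Grad2Reward, the one-backward-pass trick plausibly looks like this: backprop the judge's scalar score to the response token embeddings and read per-token gradient norms as dense credit. The judge interface is a stand-in and the normalization is a guess; the paper's actual construction may differ.

```python
# Sketch: dense per-token rewards from a single backward pass through a judge.
# `judge` is a hypothetical callable mapping (T, d) embeddings to a scalar.
import torch

def dense_rewards(judge, token_embeddings: torch.Tensor) -> torch.Tensor:
    """token_embeddings: (T, d) embeddings of the response tokens."""
    emb = token_embeddings.detach().requires_grad_(True)
    score = judge(emb)               # one forward pass
    score.backward()                 # one backward pass
    # Token-level saliency: how much each token moved the judge's verdict.
    per_token = emb.grad.norm(dim=-1)             # (T,)
    return per_token / (per_token.sum() + 1e-8)   # normalize to a credit dist.
```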

[08:46] InfMem and active memory - The shift from passive context windows to active evidence management. PreThink-Retrieve-Write protocol: the model pauses, checks if it has enough information, retrieves if not, and stops early when it does. 4x inference speedup.
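
As a control loop, PreThink-Retrieve-Write reads roughly as below. The model.* and retriever.* methods are hypothetical interfaces standing in for whatever InfMem actually calls; only the loop structure follows the episode's description.

```python
# Control-loop sketch of PreThink-Retrieve-Write; all interfaces hypothetical.
def prethink_retrieve_write(model, retriever, question, max_rounds=4):
    memory = []                                 # active evidence store
    for _ in range(max_rounds):
        # PreThink: pause and check whether current evidence is enough.
        if model.is_sufficient(question, memory):
            break                               # early stop saves inference
        # Retrieve: pull only what the gap analysis asks for.
        query = model.missing_info_query(question, memory)
        memory.extend(retriever.search(query))
    # Write: answer from the curated evidence, not the raw context window.
    return model.answer(question, memory)
```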

[09:56] ROSA-Tuning - CPU-based suffix automaton for context retrieval, freeing GPU for reasoning. 1980s data structures solving 2026 problems.
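
The structure in question is the suffix automaton, which indexes every substring of a corpus in linear space and answers membership queries in time proportional to the query length. The construction below is the textbook algorithm and runs entirely on CPU; how ROSA-Tuning wires it into the serving stack is not shown here.

```python
# Textbook suffix automaton: linear-time construction, O(|pattern|) lookup.
class SuffixAutomaton:
    def __init__(self, text: str):
        self.next = [{}]           # per-state transitions
        self.link = [-1]           # suffix links
        self.length = [0]          # longest substring ending at each state
        last = 0
        for ch in text:
            cur = len(self.next)
            self.next.append({}); self.link.append(-1)
            self.length.append(self.length[last] + 1)
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:              # clone q to keep substring lengths consistent
                    clone = len(self.next)
                    self.next.append(dict(self.next[q]))
                    self.link.append(self.link[q])
                    self.length.append(self.length[p] + 1)
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def contains(self, pattern: str) -> bool:
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return False
            state = self.next[state][ch]
        return True

sam = SuffixAutomaton("the quick brown fox")
assert sam.contains("quick bro") and not sam.contains("lazy")
```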

[10:36] OVQ Attention - Online vector quantization for linear-time attention. Removes the quadratic memory ceiling for long context.
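
The quadratic-to-linear move generalizes like this: snap keys to a small codebook, then attend over the codebook's aggregated values with a log-count correction so the softmax over m codes approximates the softmax over T keys. The codebook size and assignment rule here are illustrative; OVQ's online quantization procedure is richer than this static sketch.

```python
# Sketch of vector-quantized attention: O(T*m) instead of O(T^2), m << T.
# Codebook handling is a simplification of whatever OVQ does online.
import torch
import torch.nn.functional as F

def vq_attention(q, k, v, codebook):
    """q: (d,), k: (T, d), v: (T, d), codebook: (m, d)."""
    # Assign each key to its nearest code: linear in sequence length.
    assign = torch.cdist(k, codebook).argmin(dim=1)            # (T,)
    m, d = codebook.shape
    # Aggregate values and key counts per code.
    v_sum = torch.zeros(m, d).index_add_(0, assign, v)
    counts = torch.zeros(m).index_add_(0, assign, torch.ones(len(k)))
    # log(count) correction: a code standing for n keys gets n times the mass.
    scores = (codebook @ q) / d**0.5 + torch.log(counts.clamp(min=1e-9))
    probs = F.softmax(scores, dim=0)                           # (m,)
    return probs @ (v_sum / counts.clamp(min=1).unsqueeze(1))  # (d,)
```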

[11:05] Closing debate: which paper survives six months? — One host bets on MRPO (the geometric view challenges scale-is-all-you-need). The other bets on ReMiT (the flywheel efficiency is too obvious to ignore once you see it).

Papers Discussed:
MRPO: Manifold Reshaping Policy Optimization (https://arxiv.org/abs/2602.02545)
ReMiT: RL-Guided Mid-Training (https://arxiv.org/abs/2602.03075)
DISPO (https://arxiv.org/abs/2602.00983)
TrajFusion (https://arxiv.org/abs/2602.04391)
Grad2Reward (https://arxiv.org/abs/2602.01791)
CPMöbius (https://arxiv.org/abs/2602.02979)
InfMem (https://arxiv.org/abs/2602.02704)
ROSA-Tuning (https://arxiv.org/abs/2602.02499)
OVQ Attention (https://arxiv.org/abs/2602.03922)

Key Takeaways

Standard RL might be trapping models in a low-rank subspace rather than expanding their reasoning capacity. The bias manifold concept from MRPO reframes the entire alignment-vs-capability debate as a geometric problem.

The strict separation between pretraining and post-training is looking increasingly artificial. ReMiT's finding that logical connectors carry disproportionate reasoning weight suggests the base model's curriculum should be informed by what the tuned model struggles with.

Passive context windows fail at multi-hop reasoning because they treat all tokens equally. Active memory management (InfMem) and hardware-level retrieval offloading (ROSA-Tuning) are converging on the same insight: models need to manage their own cognitive load.

Most of this week's RL training fixes assume GRPO-style optimization; if that assumption changes, those contributions become fragile. The architectural work on memory and attention solves structural problems that persist regardless of the training recipe.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit llmsresearch.substack.com (https://llmsresearch.substack.com?utm...)

