Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

Author: Mayuresh Shilotri

Uploaded: 2026-01-12

Views: 9

Description: Paper: https://arxiv.org/abs/2509.08721v1

Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright

Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper, we show that SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
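
The rollout-sharing idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the `Rollout` and `Node` classes, the `build_batch` method, and the `n_local` / `n_external` parameters are hypothetical names chosen for illustration, and the uniform random sampling is an assumption. The sketch only shows one way a node might mix its own rollouts with rollouts shared by other nodes before a policy update.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    node_id: str      # node that generated this rollout (hypothetical field)
    prompt: str
    completion: str
    reward: float


@dataclass
class Node:
    node_id: str
    local: list = field(default_factory=list)  # rollouts generated by this node

    def build_batch(self, shared_pool, n_local, n_external, rng):
        """Mix this node's own rollouts with rollouts shared by other nodes.

        Uniform sampling is an assumption of this sketch, not a claim
        about how SAPO selects rollouts.
        """
        own = rng.sample(self.local, min(n_local, len(self.local)))
        others = [r for r in shared_pool if r.node_id != self.node_id]
        external = rng.sample(others, min(n_external, len(others)))
        return own + external


# Toy swarm: three nodes, each with four locally generated rollouts.
rng = random.Random(0)
nodes = [Node(f"node{i}") for i in range(3)]
for node in nodes:
    node.local = [
        Rollout(node.node_id, "q", f"answer-{k}", rng.random()) for k in range(4)
    ]

# Every node "shares" its rollouts into a common pool.
shared_pool = [r for node in nodes for r in node.local]

# node0 trains on two of its own rollouts plus two from the rest of the swarm.
batch = nodes[0].build_batch(shared_pool, n_local=2, n_external=2, rng=rng)

# Setting n_external=0 corresponds to a node operating on its own rollouts only.
solo_batch = nodes[0].build_batch(shared_pool, n_local=4, n_external=0, rng=rng)
```

In this sketch, each node keeps its own policy and only exchanges rollouts, so nodes with different models or hardware can participate without synchronizing weights.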

Welcome to Mayuresh Shilotri's YouTube channel, maintained by Mayuresh Shilotri.

You can follow me at
Blog - https://shilotri.com/
LinkedIn - / mayureshshilotri
Twitter - / mshilotri

Note: I only claim to have read the research paper and created a video using an AI tool. I am not the author. All intellectual heavy lifting was performed by the respective authors. 🙏
