An Unexpected Reinforcement Learning Renaissance

Author: Interconnects AI

Uploaded: 2025-02-13

Views: 14,757

Description: The era we are living through in language modeling research is one pervaded by complete faith that reasoning and new reinforcement learning (RL) training methods will work. This faith is well founded. A day cannot go by without a new reasoning model, RL training result, or dataset distilled from DeepSeek R1.

More information: https://www.interconnects.ai/p/an-une...
Slides: https://docs.google.com/presentation/...

00:00 The ingredients of an RL paradigm shift
16:04 RL with verifiable rewards
27:38 What DeepSeek R1 taught us
29:30 RL as the focus of language modeling

The difference from the last time RL was at the forefront of the AI world, when reinforcement learning from human feedback (RLHF) was needed to create ChatGPT, is that we have far better infrastructure than we did our first time through this. People are already successfully using TRL, OpenRLHF, veRL, and of course Open Instruct (our tools for Tülu 3/OLMo) to train models like this.
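
To make concrete what training with these libraries can look like, here is a minimal sketch of RL with a verifiable reward using TRL's GRPOTrainer: a binary reward that simply checks the model's final answer against a reference, rather than scoring it with a learned preference model. The dataset choice (GSM8K), the small Qwen model, and the last_number answer-extraction helper are illustrative assumptions, not details from the talk.

```python
# Minimal sketch of "RL with verifiable rewards" using TRL's GRPOTrainer.
# Assumptions (not from the talk): GSM8K-style data with "prompt" and
# "answer" columns, a small instruct model, and a last-number heuristic
# for extracting the final answer from a completion.
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def last_number(text):
    """Pull the final number out of a string, ignoring thousands separators."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None


def correctness_reward(completions, answer, **kwargs):
    """Binary verifiable reward: 1.0 if the completion's final number
    matches the final number in the reference answer, else 0.0."""
    return [
        1.0 if last_number(c) is not None and last_number(c) == last_number(a) else 0.0
        for c, a in zip(completions, answer)
    ]


# GSM8K as an example of a dataset with checkable answers; GRPOTrainer
# expects a "prompt" column and forwards the remaining columns (here
# "answer") to the reward function as keyword arguments.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="grpo-rlvr-sketch", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```

The same shape carries over to OpenRLHF, veRL, or Open Instruct: prompts with checkable answers, a reward that is a ground-truth check rather than a learned preference model, and an RL optimizer on top.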

When models such as Alpaca, Vicuña, Dolly, etc. were coming out, they were all built on basic instruction tuning. Even though RLHF was the motivation for those experiments, immature tooling and a lack of datasets made complete, substantive replications rare. On top of that, every organization was trying to recalibrate its AI strategy for the second time in six months. The reaction to and excitement around Stable Diffusion were all but overwritten by ChatGPT.

This time is different. With reasoning models, everyone has already raised money for their AI companies, open-source tooling for RLHF exists and is stable, and everyone is already feeling the AGI.

The goal of this talk is to try and make sense of the story that is unfolding today:

Given that it is becoming obvious that RL with verifiable rewards works on older models, why did the AI community sleep on the potential of these reasoning models?

How should we contextualize the development of RLHF techniques alongside the new types of RL training?

What is the future of post-training? How far can we scale RL?

How does today’s RL compare to historical successes of Deep RL?

And other topics. This is a longer recording of a talk I gave this week at a local Seattle research meetup.

Some of the key points I arrived at:

RLHF was necessary, but not sufficient, for ChatGPT. RL training, such as that used for reasoning, could become the primary driving force of future LM development. There’s a path for “post-training” to just be called “training” in the future.

While this will feel like the Alpaca moment from 2 years ago, it will produce much deeper results and impact.

Self-play, inference-time compute, and other popular terms related to this movement are more “side quests” than core to the developments.

There is just so much low-hanging fruit for improving models with RL. It’s wonderfully exciting.

For the rest, you’ll have to watch the talk.

Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: /@interconnects
... on Twitter: https://x.com/interconnectsai
... on LinkedIn: /interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7w...
... on Apple Podcasts: https://podcasts.apple.com/us/podcast...

Related videos

How We Built a Leading Reasoning Model (Olmo 3)

Richard Sutton – Father of RL thinks LLMs are a dead end

How language model post-training is done today

Experimenting with Reinforcement Learning with Verifiable Rewards (RLVR)

How does DeepSeek learn? GRPO explained with Triangle Creatures

Reinforcement Learning for Agents – Will Brown, Machine Learning Researcher at Morgan Stanley

Rich Sutton, The OaK Architecture: A Vision of SuperIntelligence from Experience - RLC 2025

AI can't cross this line and we don't know why.

The art of training a good (reasoning) language model

Visualizing transformers and attention | Talk for TNG Big Tech Day '24

LLM and GPT - how do large language models work? A visual introduction to transformers

Physics Simulation Just Crossed A Line

Early stages of the reinforcement learning era of language models

Why NVIDIA builds their own open models | Nemotron w/ Bryan Catanzaro

Is human data enough? | David Silver

Andrej Karpathy: Software Is Changing (Again)

How to approach post-training in AI applications

[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Reinforcement learning, by the book

How AI learned to think
