Self Play for Safety - Online Multi-Agent Adversarial Training for Provably Robust LLMs
Автор: Natasha Jaques
Загружено: 2025-06-12
Просмотров: 3340
Описание:
A talk I gave on May 9th about our recent paper, Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models https://arxiv.org/abs/2506.07468.
Currently, reinforcement learning from human feedback (RLHF) is the predominant method for ensuring LLMs are safe and aligned. And yet it provides no guarantees that they won’t say something harmful, copyrighted, or inappropriate.
We create an online, zero-sum adversarial training game where Attacker & Defender co-evolve to repeatedly discover and then patch new exploits. Theoretically, we show that if the game converges it will create an LLM that responds safely no matter the input. Empirically, our method improves safety by up to 72% vs. RLHF, without sacrificing capabilities.
Talk is an edited version of the full talk posted on the UW Industry Academia Partnership workshop website: https://www.industry-academia.org/uw-...
Повторяем попытку...
Доступные форматы для скачивания:
Скачать видео
-
Информация по загрузке: