Introducing RewardBench: The First Benchmark for Reward Models (of the LLM Variety)
Author: Nathan Lambert
Uploaded: 2024-03-20
Views: 1340
Description:
Get to know my latest major project -- we're building the science of LLM alignment one step at a time.
Sorry about the glitchy noise! I didn't think it was so bad that I needed to kill it.
00:00 Brief Intro
02:34 Why Reward Models
05:35 RewardBench Paper
07:01 Dataset & Code Intro
14:20 Leaderboard Results
Abstract
Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.
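The core evaluation is simple: for each prompt-win-lose trio, a reward model scores both responses, and the trio counts as correct when the chosen (winning) response gets the higher score. Below is a minimal Python sketch of that check for a classifier-style RM; the model name is an illustrative assumption, not the exact RewardBench pipeline.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative classifier-style RM (assumed example, not from the paper);
# any sequence-classification reward model with a scalar head fits the pattern.
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def score(prompt: str, response: str) -> float:
    # A classifier RM maps a (prompt, response) pair to a single scalar reward.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

prompt = "Write a function that returns the square of a number."
chosen = "def square(x):\n    return x * x"
rejected = "def square(x):\n    return x + x"  # subtle bug: adds instead of multiplying

# The trio is scored correct when the winning response outranks the losing one.
print(score(prompt, chosen) > score(prompt, rejected))

DPO-trained models have no explicit reward head; their implicit reward is r(x, y) = beta * [log pi(y|x) - log pi_ref(y|x)], so the same comparison reduces to checking which response gains more log-probability under the policy relative to the reference model (the positive beta cancels).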
Links!
RewardBench paper (arxiv soon): https://github.com/allenai/reward-ben...
RewardBench Code: https://github.com/allenai/reward-bench
RewardBench Leaderboard: https://huggingface.co/spaces/allenai...
Interconnects post on Costs vs. Rewards vs. Preferences: https://www.interconnects.ai/p/costs-...
Interconnects post on why we need reward models: https://www.interconnects.ai/p/open-r...
Interconnects post on why we need reward models (p2): https://www.interconnects.ai/p/why-re...
Paper on history and risks of RLHF: https://arxiv.org/abs/2310.13595
Talk on history of RLHF: 15min History of Reinforcement Learning an...
RewardBench dataset: https://huggingface.co/datasets/allen...
Other preference data test sets: https://huggingface.co/datasets/allen...
Reward bench results repo: https://huggingface.co/datasets/allen...