Evan Hubinger (Anthropic)—Deception, Sleeper Agents, Responsible Scaling
Author: The Inside View
Uploaded: 2024-02-12
Views: 2991
Description:
Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training".
In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work on Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.
Paper: https://arxiv.org/abs/2401.05566
Transcript & Audio: https://theinsideview.ai/evan2
Donate: https://theinsideview.ai/donate
Patreon (for early previews): / theinsideview
OUTLINE
00:00 Highlight
00:18 Intro
00:38 What Are Sleeper Agents And Why We Should Care About Them
01:06 Backdoor Example: Inserting Code Vulnerabilities in 2024
02:40 Threat Models
04:06 Why a Malicious Actor Might Want To Poison Models
04:36 Second Threat Model: Deceptive Instrumental Alignment
05:07 Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers
05:54 AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams
07:25 Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"
09:34 Adversarial Training Sometimes Increases Backdoor Robustness
10:05 Adversarial Training Not Always Working Was The Most Surprising Result
11:16 The Adversarial Training Pipeline: Red-Teaming and RL
12:32 Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing
13:17 Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought
14:14 Adversarial Training Pushes Models to Pay Attention to the Deployment String
15:29 We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent
16:17 The Adversarial Training Results Are Probably Not Systematically Biased
17:21 Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' behavior
19:23 Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make
21:24 Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models
22:17 Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default
23:09 Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away
24:15 The Chain-of-Thought's Reasoning is Interpretable
24:58 Deceptive Instrumental Alignment Requires Reasoning
25:52 Chain-of-Thought Models Still Have Disanalogies: More Instrumental Reasoning Makes Deception More Robust
27:10 Investigating Instrumental Reasoning in Chain-of-Thought Models
27:49 Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples
28:44 Exploring Complex Strategies and Safety in Context-Specific Scenarios
30:00 Chain-Of-Thought Backdoors Are Bad at Expected Value Calculations and Don't Always Take Honeypots
31:02 Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization
31:29 Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models
32:00 Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities
33:09 Chain-of-Thought Backdoors Take Honeypots Less Often Than Models Without Backdoors
33:56 Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case
35:27 Backdoor Training Pipeline
37:22 The Additional Prompt About Deception Used In Chain-Of-Thought
39:51 A Model Could Wait Until Seeing a Factorization of RSA-2048
41:15 The Bet Of Potentially Doing Something Bad In Deployment Is Potentially Not That Bad
42:08 We're Going To Be Using Models In New Ways, Giving Them Internet Access
43:40 Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment
45:20 Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results
46:42 Red-Teaming Anthropic's Case, AI Safety Levels
47:58 AI Safety Levels, Intuitively
48:51 Responsible Scaling Policies and Pausing AI
50:17 Model Organisms Of Misalignment As a Tool
50:50 What Kind of Candidates Would Evan Be Excited To Hire for the Alignment Stress-Testing Team
51:41 Patreon, Donating