AI Sleeper Agents: How Anthropic Trains and Catches Them

Author: Rational Animations

Uploaded: 2025-08-30

Views: 250859

Description: In this video, we explain how Anthropic trained "sleeper agent" AIs to study deception. A "sleeper agent" is an AI model that behaves normally until it encounters a specific trigger in the prompt, at which point it awakens and executes a harmful behavior. Anthropic found that they couldn't undo the sleeper agent training using standard safety training, but they could detect sleeper agents through a simple interpretability technique.
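
For readers who want a concrete picture of what a "simple probe" could look like, here is a minimal sketch. It is an illustration under stated assumptions, not Anthropic's actual method or code. The activations are synthetic stand-ins generated with Python's random module, the difference-of-means direction is one common way to build a linear probe, and every function name here (fit_probe, flags_trigger, synthetic_activations) is hypothetical.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fit_probe(benign, triggered):
    """Difference-of-means linear probe: a direction plus a midpoint threshold."""
    mu_b, mu_t = mean_vector(benign), mean_vector(triggered)
    direction = [t - b for t, b in zip(mu_t, mu_b)]
    threshold = (dot(mu_b, direction) + dot(mu_t, direction)) / 2
    return direction, threshold


def flags_trigger(activation, direction, threshold):
    """True if the activation projects past the threshold along the probe direction."""
    return dot(activation, direction) > threshold


def synthetic_activations(rng, n, dim, shift):
    """Assumption: stand-in for real model activations. Gaussian noise in every
    coordinate, with the first coordinate shifted when the trigger is present."""
    return [
        [rng.gauss(shift if i == 0 else 0.0, 0.5) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)

# Fit the probe on labelled synthetic examples.
train_benign = synthetic_activations(rng, 50, 8, 0.0)
train_triggered = synthetic_activations(rng, 50, 8, 4.0)
direction, threshold = fit_probe(train_benign, train_triggered)

# Evaluate on fresh synthetic examples.
test_benign = synthetic_activations(rng, 50, 8, 0.0)
test_triggered = synthetic_activations(rng, 50, 8, 4.0)
correct = sum(not flags_trigger(a, direction, threshold) for a in test_benign)
correct += sum(flags_trigger(a, direction, threshold) for a in test_triggered)
accuracy = correct / 100
print(f"probe accuracy on held-out synthetic activations: {accuracy:.2f}")
```

In a real setting the vectors would be read from a model's internal activations rather than generated, and how cleanly the two groups separate is an empirical question that this toy data does not answer.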

▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Sleeper agents: training deceptive LLMs that persist through safety training:
https://www.anthropic.com/research/sl...
https://www.alignmentforum.org/posts/...

Simple probes can catch sleeper agents: https://www.anthropic.com/research/pr...

Alignment Faking in Large Language Models (mentioned in passing as a more natural demonstration of deceptive alignment): https://www.anthropic.com/research/al...

▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

🟠 Patreon:   / rationalanimations

🔵 Channel membership:    / @rationalanimations

🟢 Merch: https://rational-animations-shop.four...

🟤 Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations

▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Rational Animations Discord:   / discord

Reddit:   / rationalanimations

X/Twitter:   / rationalanimat1

Instagram:   / rationalanimations

▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
A
Alcher Black
Alex Hall
Amir Saboury
Apuis Retsam
blasted0glass
Bleys
BlueNotesBlues
bparro
Chad M Jones
Chris Painter
Christian Loomis
Colin Ricardo
Craig Falls
Danealor
Danilo Stefani - Alessandra Erba
David Piepgrass
Dawson
Ducky
Edward Yu
Ellis Jones
Felix Akkermans
Forodriac Origamius
Fraser Cain
Gabriel Ledung
Glenn Tarigan
Honyopenyoko
Ingvi Gautsson
Ivan Bachcin
Jackson Emanuel
James Babcock
Jana
JanJan
Jasper L
Jeroen De Dauw
joe39504589
John
John Everett-Slape
Joshua Adrian Cahyono
Juan Benet
Klemen Slavic
Kristin Lindquist
loopuleasa
Luke Freeman
Martin Skalstad Steen
Matthew Shinkle
Michael Andregg
Michael Hewitt
Nathan Fish
Nathan Metzger
Neal Strobl
NMS
noggieB
Odet Abadia
rictic
Robert Paul Schwin
Scott Alexander
SQRT42Pi
steven michaels
Stuart Alldritt
Superslowmojoe
Terberlo.dog
Tomas Campos
Tor Barstad
ttw
Vladimir Silyaev
Fede Mathieu
ronvil
Michael Suazo
rx
Laissez Scholar
BestProGaming
7ic7ac
Devin King
RED
Rinthean
Thomas Grip
Boris Bend
J H
Richard Stambaugh
Teo Val
Ken Mc
Alcher Black
AWyattLife
Torstein Haldorsen
Michał Zieliński

▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀

Directed by:
Hannah Levingstone | @hannah_luloo

Writers:
John Burden

Producer:
Emanuele Ascani

Art Director:
Hané Harnett | @Peony_Vibes / @peonyvibes (insta)

Line Producer:
Kristy Steffens | https://linktr.ee/kstearb

Production Managers:
Jay McMichen | @Jay_TheJester
Kristy Steffens | https://linktr.ee/kstearb
Grey Colson | https://linktr.ee/earl.gravy

Quality Assurance Lead:
Lara Robinowitz | @CelestialShibe

Storyboard Artists:
Emmalaine Wright | @emmalainearts (insta)
Hannah Levingstone | @hannah_luloo
Ira Klages | @dux

Lead Animators & Q/A:
Ethan DeBoer | https://linktr.ee/deboer_art
Lara Robinowitz | @CelestialShibe
Owen Peurois | @owenpeurois

Animators:
Colors Giraldo | @colorsofdoom
Ethan DeBoer | https://linktr.ee/deboer_art
Ira Klages | @dux
Jay McMichen | @Jay_TheJester
Jodi Kuchenbecker | @viral_genesis (insta)
Jordan Gilbert | @Twin_Knight (twitter) Twin Knight Studios (YT)
Keith Kavanagh | @johnnycigarettex
Lara Robinowitz | @CelestialShibe
Michela Biancini
Owen Peurois | @owenpeurois
Patrick O' Callaghan | @patrick.h264
Patrick Sholar | @Sholarscribbles
Renan Kogut | @kogut_r
Skylar O'Brien | @mutodaes
Vaughn Oeth | @gravy_navy
Zack Gilbert | @Twin_Knight (twitter) Twin Knight Studios (YT)

Background Lead:
Pierre Broissand | @pierrebrsnd (insta) / artstation.com/brsnd

Asset/Background Artists:
Emmalaine Wright | @emmalainearts (insta)
Hané Harnett | @peonyvibes (insta) @peony_vibes (twitter)
Olivia Wang | @whalesharkollie
Pierre Broissand | @pierrebrsnd (insta) / artstation.com/brsnd
Zoe Martin-Parkinson | @zoemar_son

Compositing Lead:
Renan Kogut | @kogut_r

Compositing:
Grey Colson | https://linktr.ee/earl.gravy
Ira Klages | @dux
Patrick O' Callaghan | @patrick.h264
Renan Kogut | @kogut_r

Narrator:
Rob Miles |    / robertmilesai

VO Editor:
Tony Dipiazza

Original Soundtrack & Sound Design:
Epic Mountain
