Auditing Language Models for Hidden Objectives with Sam Marks
Author: NDIF Team
Uploaded: 2026-02-10
Views: 48
Description:
Sam Marks leads Anthropic's Cognitive Oversight team, a subteam of Alignment Science. Sam's research focuses on settings where understanding something about a model's internal computations could be useful for overseeing it or assessing its safety-relevant properties.
Here, he discusses his team's work, "Auditing language models for hidden objectives," which explores the efficacy of white-box and black-box research tools during alignment audits in a red-team/blue-team exercise.
Paper: https://arxiv.org/abs/2503.10965