The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals

Author: Latent Space

Uploaded: 2026-02-23

Views: 1091

Description: Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-lo...) arguing that SWE-Bench Verified, long treated as a key “North Star” coding benchmark, has become saturated and highly contaminated, making it less useful for measuring real coding progress.

SWE-Bench Verified originated as a major OpenAI-led cleanup of the original Princeton SWE-Bench benchmark, including a large human review effort with nearly 100 software engineers and multiple independent reviews to curate ~500 higher-quality tasks. But recent findings show that many of the remaining failures can reflect unfair or overly narrow tests (e.g., tests that require specific naming or unspecified implementation details) rather than true model inability. The speakers also cite examples suggesting contamination, such as models recalling repository-specific implementation details or task identifiers.

From now on, OpenAI plans to stop reporting SWE-Bench Verified and instead focus on SWE-Bench Pro (from Scale), which is harder, more diverse (more repos and languages), includes longer tasks (1–4 hours and 4+ hours), and shows substantially less evidence of contamination under OpenAI's “contamination auditor agent” analysis.

We also discuss what future coding/agent benchmarks should measure beyond pass/fail tests—longer-horizon tasks, open-ended design decisions, code quality/maintainability, and real-world product-building—along with the tradeoffs between fast automated grading and human-intensive evaluation.

00:00 Meet the Frontier Evals Team
00:56 Why SWE Bench Stalled
01:47 How Verified Was Built
04:32 Contamination In The Wild
06:16 Unfair Tests And Narrow Specs
08:40 When Benchmarks Saturate
10:28 Switching To SWE Bench Pro
12:31 What Great Coding Evals Measure
18:17 Beyond Tests Dollars And Autonomy
21:49 Preparedness And Future Directions
