The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals
Author: Latent Space
Uploaded: 2026-02-23
Views: 1091
Description:
Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-lo...) arguing that SWE-Bench Verified—long treated as a key “North Star” coding benchmark—has become saturated and highly contaminated, making it less useful for measuring real coding progress.
SWE-Bench Verified originated as a major OpenAI-led cleanup of the original Princeton SWE-Bench benchmark, including a large human review effort with nearly 100 software engineers and multiple independent reviews to curate ~500 higher-quality tasks. But recent findings show that many of the remaining failures can reflect unfair or overly narrow tests (e.g., tests requiring specific naming or unspecified implementation details) rather than true model inability. The post also cites examples suggesting contamination, such as models recalling repository-specific implementation details or task identifiers.
From now on, OpenAI plans to stop reporting SWE-Bench Verified and instead focus on SWE-Bench Pro (from Scale), which is harder, more diverse (more repos and languages), includes longer tasks (1–4 hours and 4+ hours), and shows substantially less evidence of contamination under their “contamination auditor agent” analysis.
We also discuss what future coding/agent benchmarks should measure beyond pass/fail tests—longer-horizon tasks, open-ended design decisions, code quality/maintainability, and real-world product-building—along with the tradeoffs between fast automated grading and human-intensive evaluation.
00:00 Meet the Frontier Evals Team
00:56 Why SWE Bench Stalled
01:47 How Verified Was Built
04:32 Contamination In The Wild
06:16 Unfair Tests And Narrow Specs
08:40 When Benchmarks Saturate
10:28 Switching To SWE Bench Pro
12:31 What Great Coding Evals Measure
18:17 Beyond Tests Dollars And Autonomy
21:49 Preparedness And Future Directions