GPT-5.4 Got the Best Score I've Ever Seen — Then I Found Something Stranger
Author: Matt Maher
Uploaded: 2026-03-10
Views: 2165
Description:
GPT-5.4 scored 95% on my planning benchmark — the highest I've ever recorded. But while I was testing it across every tool I use, a pattern showed up in the data that I genuinely did not expect. And it changes what I'd recommend.
I ran GPT-5.4, Opus 4.6, Sonnet 4.6, and Gemini 3.1 Pro through Codex CLI, Claude Code, Gemini CLI, and Cursor — all on the same planning benchmark. This benchmark measures whether a model can take a real product requirements document and build a plan that doesn't drop features. It's not a coding test. It's a planning attention test.
GPT-5.4 Extra High crushed it. But the bigger finding was what happened when I compared the same models across different tools — and what happened when I changed a single configuration in Claude Code.
If you're evaluating AI coding tools or trying to decide between Cursor, Claude Code, Codex CLI, or Gemini CLI, this video shows real benchmark data across all of them. If you use Claude Code and rely on planning mode, there's a specific finding here that could change how you work. Whether you're an engineer optimizing your AI workflow or just trying to pick the right tool, this covers model performance, tool performance, and the surprising gap between them.
The Benchmark if you want to try it:
https://github.com/bladnman/planning_...
#GPT54 #AICoding #Cursor #ClaudeCode #AIBenchmark
00:00 - Intro
00:31 - Marker 3
01:54 - GPT-5.4 results
06:53 - Things got interesting
07:06 - Cursor vs. CLI
09:12 - The Auto-Eval?
10:30 - Hot Take
12:03 - Closing