GPT-5.4 Got the Best Score I've Ever Seen — Then I Found Something Stranger
Author: Matt Maher
Uploaded: 2026-03-10
Views: 2165
Description:
GPT-5.4 scored 95% on my planning benchmark — the highest I've ever recorded. But while I was testing it across every tool I use, a pattern showed up in the data that I genuinely did not expect. And it changes what I'd recommend.
I ran GPT-5.4, Opus 4.6, Sonnet 4.6, and Gemini 3.1 Pro through Codex CLI, Claude Code, Gemini CLI, and Cursor — all on the same planning benchmark. This benchmark measures whether a model can take a real product requirements document and build a plan that doesn't drop features. It's not a coding test. It's a planning attention test.
GPT-5.4 Extra High crushed it. But the bigger finding was what happened when I compared the same models across different tools — and what happened when I changed a single configuration in Claude Code.
If you're evaluating AI coding tools or trying to decide between Cursor, Claude Code, Codex CLI, or Gemini CLI, this video shows real benchmark data across all of them. If you use Claude Code and rely on planning mode, there's a specific finding here that could change how you work. Whether you're an engineer optimizing your AI workflow or just trying to pick the right tool, this covers model performance, tool performance, and the surprising gap between them.
The Benchmark if you want to try it:
https://github.com/bladnman/planning_...
#GPT54 #AICoding #Cursor #ClaudeCode #AIBenchmark
00:00 - Intro
00:31 - Marker 3
01:54 - GPT-5.4 results
06:53 - Things got interesting
07:06 - Cursor vs. CLI
09:12 - The Auto-Eval?
10:30 - Hot Take
12:03 - Closing