Beyond Vibe Testing: Smarter Eval for Agentic AI

Author: Inference Time Tactics by NeuroMetric

Uploaded: 2025-09-08

Views: 78

Description: In this episode of Inference Time Tactics, Rob, Cooper, and Byron explore Salesforce’s CRMArena-Pro benchmark and what it reveals about the limits of enterprise AI agents. They share why benchmark scores often fail to carry over to production, how inference-time tactics like best-of-N can improve reliability, and what NeuroMetric is building to make eval easier, from an ITC Test Engine to a drag-and-drop interface for rapid visualization and experimentation.

We talked about:

Why Salesforce’s CRMArena-Pro benchmark highlights the gap between lab benchmarks and real-world agent reliability.
How leading models perform inconsistently across single-turn and multi-turn enterprise tasks.
Why benchmark scores are weak predictors of operational success in production.
The role of inference-time tactics in reducing variance and improving stability.
NeuroMetric’s new platform: ITC Test Engine and drag-and-drop interface for experimentation.
Challenges in building agentic systems, from database integration to managing multi-prompt complexity.
Why large language models’ stochastic nature conflicts with business demands for reliability.
Latency, cost, and rate limits as major bottlenecks in scaling agentic workflows.
The limits of “vibe testing” and why rigorous evaluation frameworks are essential.
How Google’s Stax tool speeds up evaluation with LLM-as-judge, and why it still falls short for enterprise needs.
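
As a rough illustration of the best-of-N tactic mentioned above, the sketch below samples N candidate answers and keeps the one a scorer rates highest. It is not NeuroMetric's implementation or anything shown in the episode: the toy agent, the scorer, and the 60% single-shot accuracy are all hypothetical stand-ins for a real stochastic model call and a real verifier.

```python
import random
from typing import Callable


def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Sample n candidates and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


def make_toy_agent(rng: random.Random, p_correct: float) -> Callable[[], str]:
    """Hypothetical stand-in for a stochastic LLM call (assumption, not a real API)."""
    def generate() -> str:
        return "correct" if rng.random() < p_correct else "wrong"
    return generate


def toy_score(answer: str) -> float:
    """Hypothetical perfect verifier; real scorers are noisier than this."""
    return 1.0 if answer == "correct" else 0.0


def success_rate(n: int, trials: int = 2000, p_correct: float = 0.6, seed: int = 0) -> float:
    """Fraction of trials in which best-of-n returns the correct answer."""
    rng = random.Random(seed)
    agent = make_toy_agent(rng, p_correct)
    hits = sum(best_of_n(agent, toy_score, n) == "correct" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Compare a single sample against best-of-5 under the toy assumptions.
    print("n=1:", success_rate(1))
    print("n=5:", success_rate(5))
```

With a perfect scorer, the chance that at least one of N samples is correct is 1 - (1 - p)^N, so reliability rises quickly with N. The trade-off is N times the model calls, which ties into the latency, cost, and rate-limit bottlenecks listed above.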


Resources Mentioned:
CRMArena-Pro from Salesforce:
https://www.salesforce.com/blog/crmar...

Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric....

Hosts:
Rob May
https://x.com/robmay

Calvin Cooper
https://x.com/cooper_nyc_

Guest:
Byron Galbraith
https://x.com/bgalbraith
