SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

Author: Latent Space

Uploaded: 2025-12-18

Views: 2509

Description: As with all demo-heavy and especially vision-focused AI podcasts, we encourage watching along on our YouTube (and tossing us an upvote/subscribe if you like!)
From SAM 1's 11-million-image data engine to SAM 2's memory-based video tracking, MSL's Segment Anything project has redefined what's possible in computer vision. Now SAM 3 takes the next leap: *concept segmentation*, prompting with natural language like "yellow school bus" or "tablecloth" to detect, segment, and track every instance across images and video, in real time, with human-level exhaustivity. And with the latest SAM Audio (https://x.com/aiatmeta/status/2000980...), SAM can now even segment audio!
We sat down with Nikhila Ravi (SAM lead at Meta) and Pengchuan Zhang (SAM 3 researcher), alongside Joseph Nelson (CEO, Roboflow), to unpack how SAM 3 unifies interactive segmentation, open-vocabulary detection, video tracking, and more into a single model that runs in 30ms per image and scales to real-time video on multi-GPU setups. We dig into: the data engine that automated exhaustive annotation from two minutes per image down to 25 seconds using AI verifiers fine-tuned from Llama; the new SACO (Segment Anything with Concepts) benchmark with 200,000+ unique concepts vs. the previous 1.2k; how SAM 3 separates recognition from localization with a presence token; why decoupling the detector and tracker was critical to preserving object identity in video; how SAM 3 Agents unlock complex visual reasoning by pairing SAM 3 with multimodal LLMs like Gemini; and the real-world impact: 106 million smart polygons created on Roboflow, saving an estimated 130+ years of labeling time across fields from cancer research to underwater trash cleanup to autonomous vehicle perception.
We discuss:

What SAM 3 is: a unified model for concept-prompted segmentation, detection, and tracking in images and video using atomic visual concepts like "purple umbrella" or "watering can"
How concept prompts work: short text phrases that find all instances of a category without manual clicks, plus visual exemplars (boxes, clicks) to refine and adapt on the fly; see the first sketch after this list
Real-time performance: 30ms per image with up to 100 detected objects on a single H200; for video, real-time tracking of 10 objects on 2×H200, 28 on 4×, and 64 on 8×, with parallel inference and "fast mode" tracking
The SACO benchmark: 200,000+ unique concepts vs. 1.2k in prior benchmarks, designed to capture the diversity of natural language and reach human-level exhaustivity
The data engine: from 2 minutes per image (all-human) to 45 seconds (model-in-the-loop proposals) to 25 seconds (AI verifiers for mask quality and exhaustivity checks, fine-tuned from Llama 3.2); see the data-engine sketch after this list
Why exhaustivity is central: every instance must be found, verified by AI annotators, and manually corrected only when the model misses—automating the hardest part of segmentation at scale
Architecture innovations: a presence token that separates recognition ("is it in the image?") from localization ("where is it?"), and a decoupled detector and tracker so detection stays identity-agnostic while tracking preserves identity; see the presence-token sketch after this list
Building on Meta's ecosystem: Perception Encoder, DINO v2 detector, Llama for data annotation, and SAM 2's memory-based tracking backbone
SAM 3 Agents: using SAM 3 as a visual tool for multimodal LLMs (Gemini, Llama) to solve complex visual reasoning tasks like "find the bigger character" or "what distinguishes male from female in this image"; schematized in the agent sketch after this list
Fine-tuning with as few as 10 examples: domain adaptation for specialized use cases (Waymo vehicles, medical imaging, OCR-heavy scenes) and the outsized impact of negative examples; see the few-shot batch sketch after this list
Real-world impact at Roboflow: 106M smart polygons created, saving 130+ years of labeling time across cancer research, underwater trash cleanup, autonomous drones, industrial automation, and more
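
To make the prompting model concrete, here is a minimal Python sketch of the concept-prompt workflow. The loader, the predict() signature, and the result fields are illustrative assumptions, not the published SAM 3 interface:

    # Concept-prompted segmentation, sketched. All API names are hypothetical.
    from PIL import Image

    def find_all_instances(model, image: Image.Image, phrase: str):
        """A short text phrase returns a mask and score for every instance,
        with no manual clicks."""
        results = model.predict(image=image, text=phrase)  # hypothetical call
        return [(r.mask, r.score) for r in results]

    def refine_with_exemplar(model, image, phrase, box):
        """Steer an ambiguous phrase with a visual exemplar: a box drawn
        around one true positive."""
        return model.predict(image=image, text=phrase, exemplar_boxes=[box])

    # Usage (hypothetical):
    # model = load_sam3()
    # buses = find_all_instances(model, Image.open("street.jpg"), "yellow school bus")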
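The data-engine loop compresses to a few lines. Every callable here is a hypothetical stand-in, not Meta's actual pipeline:

    def annotate(image, concept, propose, verify_quality, verify_exhaustive, human_fix):
        """One data-engine pass: the model proposes, AI verifiers check,
        humans correct only what the verifiers flag."""
        masks = [m for m in propose(image, concept)       # model-in-the-loop proposals
                 if verify_quality(image, concept, m)]    # Llama-based check: is each mask good?
        if not verify_exhaustive(image, concept, masks):  # Llama-based check: were all instances found?
            masks = human_fix(image, concept, masks)      # humans handle only the misses
        return masks

The point of the design is that the expensive human step runs only on the verifiers' failures, which is how per-image cost fell from 2 minutes to 25 seconds.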
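One way to picture the presence token is as a gate that factors detection confidence into recognition times localization. A conceptual PyTorch sketch with made-up layer sizes, not the actual SAM 3 head:

    import torch
    import torch.nn as nn

    class PresenceGatedScores(nn.Module):
        """Recognition ("is the concept in the image at all?") comes from one
        global presence token; localization ("where is it?") comes from
        per-query scores. Final confidence is their product."""
        def __init__(self, dim: int = 256):
            super().__init__()
            self.presence_head = nn.Linear(dim, 1)  # image-level recognition
            self.query_head = nn.Linear(dim, 1)     # per-query localization

        def forward(self, presence_token, query_tokens):
            # presence_token: (B, dim); query_tokens: (B, N, dim)
            p = torch.sigmoid(self.presence_head(presence_token))  # (B, 1)
            q = torch.sigmoid(self.query_head(query_tokens))       # (B, N, 1)
            return p.unsqueeze(1) * q                               # (B, N, 1)

Separating the two means a query no longer has to answer both questions at once: a confident localization can still be suppressed when the presence token says the concept is absent.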
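The agent pattern is a plain tool-use loop. All three callables below are hypothetical stand-ins for the MLLM and SAM 3 calls:

    def visual_reasoning_agent(image, question, llm_plan, sam3_segment, llm_answer):
        """An MLLM decomposes a complex query into atomic concept prompts,
        SAM 3 segments each one, and the MLLM reasons over the masks."""
        phrases = llm_plan(question)  # "find the bigger character" -> ["character"]
        masks = {p: sam3_segment(image, p) for p in phrases}
        # The MLLM can now compare instances (e.g. pick the largest mask
        # area), answering questions SAM 3 alone cannot.
        return llm_answer(question, masks)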
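Finally, a sketch of what a ten-example adaptation batch could look like. The record fields are assumptions; the substantive point from the episode is that explicit negatives, images where the concept is absent, teach the model to say "not here":

    def make_fewshot_batch(positives, negatives, concept):
        """positives: (image, masks) pairs; negatives: images without the
        concept. Field names are illustrative, not a real SAM 3 schema."""
        batch = [{"image": im, "text": concept, "masks": ms, "present": True}
                 for im, ms in positives]
        batch += [{"image": im, "text": concept, "masks": [], "present": False}
                  for im in negatives]
        return batch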

—
MSL FAIR team

Nikhila:
Pengchuan: https://pzzhang.github.io/pzzhang/

Joseph Nelson

X:
LinkedIn:


