Inside Goodfire AI: Turning Mechanistic Interpretability into a Platform — Myra Deng & Mark Bissell
Author: Latent Space
Uploaded: 2026-02-05
Views: 87
Description:
From Palantir and Two Sigma to building Goodfire into the poster-child for actionable mechanistic interpretability, Mark Bissell (Member of Technical Staff) and Myra Deng (Head of Product) are trying to turn “peeking inside the model” into a repeatable production workflow by shipping APIs, landing real enterprise deployments, and now scaling the bet with a recent $150M Series B funding round at a $1.25B valuation. (https://www.goodfire.ai/blog/our-seri...)
In this episode, we go far beyond the usual “SAEs are cool” take. We talk about Goodfire’s core bet: that the AI lifecycle is still fundamentally broken because the only reliable control we have is data, so we post-train, RLHF, and fine-tune by “slurping supervision through a straw,” hoping the model picks up the right behaviors while quietly absorbing the wrong ones. Goodfire’s answer is to build a bi-directional interface between humans and models: read what’s happening inside, edit it surgically, and eventually use interpretability during training so customization isn’t just brute-force guesswork. (https://www.goodfire.ai/blog/on-optim...)
We discuss:
• Myra + Mark’s path: Palantir (health systems, forward-deployed engineering) → Goodfire early team; Two Sigma → Head of Product, translating frontier interpretability research into a platform and real-world deployments
• What “interpretability” actually means in practice: not just post-hoc poking, but a broader “science of deep learning” approach across the full AI lifecycle (data curation → post-training → internal representations → model design)
• Why post-training is the first big wedge: “surgical edits” for unintended behaviors like reward hacking, sycophancy, and noise learned during customization, plus the dream of targeted unlearning and bias removal without wrecking capabilities
• SAEs vs probes in the real world: why SAE feature spaces sometimes underperform classifiers trained on raw activations for downstream detection tasks (hallucination, harmful intent, PII), and what that implies about “clean concept spaces” (see the probe sketch after this list)
• Rakuten in production (https://www.goodfire.ai/research/raku...): deploying interpretability-based token-level PII detection at inference time to prevent routing private data to downstream providers, plus the gnarly constraints: no training on real customer PII, synthetic→real transfer, English + Japanese, and tokenization quirks
• Real-time steering at frontier scale: a live demo of steering Kimi K2 (~1T params), finding features via SAE pipelines, auto-labeling them via LLMs, and toggling a “Gen-Z slang” feature across multiple layers without breaking tool use (see the steering sketch after this list)
• Hallucinations as an internal signal: the case that models have latent uncertainty / “user-pleasing” circuitry you can detect and potentially mitigate more directly than black-box methods
• Steering vs prompting (https://www.goodfire.ai/blog/feature-...): the emerging view that activation steering and in-context learning are more closely connected than people think, including work mapping between the two (even for jailbreak-style behaviors)
• Interpretability for science: using the same tooling across domains (genomics, medical imaging, materials) to debug spurious correlations and extract new knowledge, up to and including early biomarker discovery work with major partners
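For readers who want a concrete picture of two techniques mentioned above, here are two hedged sketches. These are not Goodfire’s code or API; model names, dimensions, and data below are placeholders chosen for illustration.

The first sketch shows the “probe on raw activations” baseline from the SAEs-vs-probes discussion: a plain logistic-regression classifier over residual-stream activations. Random arrays stand in for real activations, and the binary labels stand in for something like “contains PII” or “hallucinated.”

```python
# Minimal sketch of a linear probe on raw activations (placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 768
acts = rng.normal(size=(n_examples, d_model))     # placeholder residual-stream activations
labels = rng.integers(0, 2, size=n_examples)      # placeholder labels (e.g. PII / no PII)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# The comparison discussed in the episode: train the same classifier on SAE
# feature activations (encode the activations through a sparse autoencoder
# first) and check whether the "cleaner" concept space actually helps.
```

The second sketch is a minimal activation-steering example: during generation, add a scaled feature direction to one layer’s residual stream via a forward hook. The direction here is a random placeholder; in the workflow described in the episode it would come from an SAE feature (with an LLM-generated label like “Gen-Z slang”), and gpt2 stands in for a much larger model such as Kimi K2.

```python
# Minimal sketch of activation steering with a forward hook (placeholder direction).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                # stand-in for a frontier-scale model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, strength = 6, 8.0
direction = torch.randn(model.config.n_embd)
direction = direction / direction.norm()           # placeholder for an SAE decoder direction

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()                                 # detach so later calls are unsteered
```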
—
Goodfire AI
• Website: https://goodfire.ai
• LinkedIn: / goodfire-ai
• X: https://x.com/GoodfireAI
Myra Deng
• Website: https://myradeng.com/
• LinkedIn: / myra-deng
• X: https://x.com/myra_deng
Mark Bissell
• LinkedIn: / mark-bissell
• X: https://x.com/MarkMBissell
00:00 Introduction
00:45 Welcome + episode setup + intro to Goodfire
02:16 Fundraise news + what’s changed recently
02:44 Guest backgrounds + what they do day-to-day
05:52 “What is interpretability?” (SAEs, probing, steering and quick map of the space)
08:29 Post-training failures (sycophancy/reward hacking) + using interp to guide learning
10:26 Surgical edits: bias vectors + grokking/double descent + subliminal learning
14:04 How Goodfire decides what to work on (customers → research agenda)
16:58 SAEs vs probes: what works better for real-world detection tasks
19:04 Rakuten case study: production PII monitoring + multilingual + token-level scrubbing
22:06 Live steering demo on a 1T-parameter model (and scaling challenges)
25:29 Feature labeling + auto-interpretation + can we “turn down” hallucinations?
31:03 Steering vs prompting equivalence + jailbreak math + customization implications
38:36 Open problems + how to get started in mech interp
46:29 Applications: healthcare + scientific discovery (biomarkers, Mayo Clinic, etc.)
57:10 Induction + sci-fi intuition (Ted Chiang)