Scalable Inference Algorithms for Large Language Models | Woomin Song, KAIST | AER LABS

Author: AER Labs

Uploaded: 2026-01-08

Views: 68

Description: Scalable Inference Algorithms for LLMs: REFORM & STAND

In this presentation, Woomin Song introduces two training-free frameworks for efficient LLM inference: REFORM for long-context processing and STAND for accelerating test-time scaling.

Part 1: REFORM (NeurIPS 2025)
Learn how REFORM overcomes the quadratic computational cost of Transformer attention and KV cache memory bottlenecks. By combining Recurrent Chunking with On-Demand Cache Recomputation, REFORM achieves 75% accuracy on 1M-token Needle-In-A-Haystack benchmarks while significantly reducing latency and memory usage.
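As a rough illustration of the Compress, Gather, and Recompute stages described above, here is a minimal, self-contained Python sketch. The chunk sizes, the norm-based importance score, and all function names are assumptions for illustration only; the actual method scores tokens with attention weights inside the model, as covered at [08:12] in the timestamps below.

```python
# Minimal sketch of a REFORM-style compress/gather/recompute flow over
# toy numpy embeddings. Shapes and scoring rules are illustrative
# assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def compress(tokens: np.ndarray, chunk_size: int, keep_per_chunk: int) -> np.ndarray:
    """Process the context in fixed-size chunks, keeping only the
    highest-scoring tokens per chunk (a stand-in for attention-score
    based KV-cache eviction)."""
    kept = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Toy importance score: vector norm stands in for attention mass.
        scores = np.linalg.norm(chunk, axis=-1)
        top = np.argsort(scores)[-keep_per_chunk:]
        kept.append(chunk[np.sort(top)])  # preserve original token order
    return np.concatenate(kept)

def gather(query: np.ndarray, cache: np.ndarray, k: int) -> np.ndarray:
    """Select the k cached tokens most cosine-similar to the query."""
    sims = (cache @ query) / (
        np.linalg.norm(cache, axis=-1) * np.linalg.norm(query) + 1e-9
    )
    return cache[np.sort(np.argsort(sims)[-k:])]

# 1M-token contexts are the target; 10k keeps the toy example fast.
context = rng.normal(size=(10_000, 64)).astype(np.float32)
query = rng.normal(size=(64,)).astype(np.float32)

cache = compress(context, chunk_size=512, keep_per_chunk=64)  # Compress
relevant = gather(query, cache, k=256)                        # Gather
# Recompute: in the real system the gathered tokens are re-forwarded
# through the model to rebuild exact KV entries before generation.
print(cache.shape, relevant.shape)
```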

Part 2: STAND (EMNLP 2025)
Discover how STAND accelerates test-time scaling (chain-of-thought reasoning, majority voting, tree search) through model-free speculative decoding. By leveraging cross-trajectory n-gram overlaps and stochastic drafting, STAND achieves the same accuracy in under 40% of the decoding time.
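To make the cross-trajectory n-gram idea concrete, here is a toy Python sketch of a model-free drafter. The table layout, the trigram order, and the count-weighted sampling are simplifying assumptions; they are meant only to show why pooling n-grams across trajectories and drafting stochastically (rather than greedily) can yield drafts that a sampling-based target model will accept.

```python
# Toy cross-trajectory n-gram drafter: build an n-gram table from
# earlier sampled trajectories, then draft continuations from it
# stochastically. Layout and sampling rule are simplifying assumptions.
from collections import defaultdict
import random

random.seed(0)

def build_ngram_table(trajectories, n=3):
    """Map each (n-1)-token prefix to next-token counts, pooled across
    all previously sampled reasoning trajectories."""
    table = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for i in range(len(traj) - n + 1):
            prefix, nxt = tuple(traj[i:i + n - 1]), traj[i + n - 1]
            table[prefix][nxt] += 1
    return table

def draft(table, context, length=4):
    """Stochastically draft tokens: sample from the n-gram counts rather
    than always taking the argmax, so drafts match a sampling target."""
    out, ctx = [], list(context)
    for _ in range(length):
        dist = table.get(tuple(ctx[-2:]))  # trigram: 2-token prefix
        if not dist:
            break
        tokens, counts = zip(*dist.items())
        out.append(random.choices(tokens, weights=counts)[0])
        ctx.append(out[-1])
    return out

trajs = [["the", "answer", "is", "42", "."],
         ["so", "the", "answer", "is", "42", "."]]
table = build_ngram_table(trajs)
print(draft(table, ["the", "answer"]))  # e.g. ['is', '42', '.']
```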

Both works were conducted during the speaker's internship at Amazon.
Speaker: Woomin Song | Integrated M.S. + Ph.D. Student at KAIST
Affiliation: KAIST (Korea Advanced Institute of Science and Technology)

[Resume & Profile]
https://woominsong.github.io/
---
Timestamps:
[Part 1: REFORM - Long Context Processing]
[00:00] Introduction: Scalable Inference Algorithms for LLMs
[00:42] The Problem: Quadratic computational costs and KV cache bottlenecks
[01:52] The Challenge: Pre-trained context length limits
[02:18] Existing Solutions: Recurrent Compression (StreamingLLM, H2O)
[03:36] Existing Solutions: Random Access approaches and their limitations
[04:28] Introducing REFORM: Best of both worlds
[05:08] Key Observation: Attention heads as token selectors using cosine similarity
[05:52] Methodology Overview: Compress, Gather, and Recompute stages
[06:28] Step 1: Compress - Recurrent chunking with early exit strategy
[08:12] Handling KV Cache: Token eviction using attention scores
[08:52] Step 2: Gather - Cosine similarity search for relevant tokens
[09:16] Step 3: Recompute - Forwarding gathered inputs for generation
[09:32] Evaluation: Needle-In-A-Haystack (NIAH) benchmark results
[10:24] Synthetic Benchmarks: Comparison with InfLLM (23% vs 75% at 1M tokens)
[10:52] Realistic Benchmarks: InfiniteBench, RepoEval, and MM-NIAH results
[11:28] Efficiency Analysis: Inference time and peak GPU memory savings
[12:16] Comparison with RAG: Architecture-level advantages
[13:24] Ablation Studies: Compression strategies and head selection
[Part 2: STAND - Test-Time Scaling Acceleration]
[14:08] Introduction: Test-time scaling and the latency problem
[15:12] Background: Chain-of-thought, majority voting, and tree search
[16:32] The Research Problem: Speeding up without compromising accuracy
[17:04] Speculative Decoding: Draft-then-verify framework
[18:16] Key Observation: High n-gram overlap across reasoning trajectories
[19:08] Model-Free Drafters: Leveraging cross-trajectory information
[20:04] Stochastic vs Deterministic Drafting: Why sampling matters
[21:16] STAND Components: N-gram drafter with probability awareness
[22:08] Optimization Techniques: Gumbel top-k trick for faster sampling (sketched after these timestamps)
[22:32] Tree Drafting: Optimizing tree structure for higher acceptance
[23:16] Evaluation: AIME 2024, GPQA Diamond, and LiveCodeBench results
[24:28] Results: Same accuracy in under 40% decoding time
[25:04] Batch Decoding Scenarios: STAND remains effective in parallel inference
[25:32] Ablation Studies: Contribution of stochastic drafting and tree optimization
[26:24] Key Finding: Deeper and narrower tree structures perform better
[26:52] Summary: N-gram based speculative decoding for test-time scaling
[Q&A Session]
[27:28] Q&A: How speculative decoding ensures output correctness (see the verification sketch below)
[31:04] Q&A: Greedy decoding vs sampling scenarios
[33:28] Q&A: Tree drafting explanation and benefits
[38:24] Q&A: Batch decoding and high-throughput inference scenarios
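The Gumbel top-k trick mentioned at [22:08] can be sketched in a few lines. This is the standard identity (adding Gumbel noise to log-probabilities and taking the top k indices samples k items without replacement from the softmax distribution), not STAND's actual code; the shapes and the numpy setting are assumptions.

```python
# Sketch of the Gumbel top-k trick: one vectorized pass replaces k
# sequential categorical draws without replacement.
import numpy as np

rng = np.random.default_rng(0)

def gumbel_top_k(logits: np.ndarray, k: int) -> np.ndarray:
    """Perturb logits with Gumbel noise and take the top-k indices;
    the result is a sample of k distinct tokens from softmax(logits)."""
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return np.argsort(logits + gumbel)[-k:][::-1]

logits = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
print(gumbel_top_k(logits, k=2))  # two distinct token ids
```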
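For the correctness question at [27:28], a toy draft-then-verify loop shows the invariant: every drafted token is checked against what the target model would emit, so the final output is identical to decoding with the target alone. The greedy acceptance rule below is a simplification of the rejection-sampling scheme used when the target samples; the toy target model is hypothetical.

```python
# Toy draft-then-verify loop: accept drafted tokens only while they
# match the target model's own choice, guaranteeing target-equivalent
# output regardless of draft quality.
def verify(draft_tokens, target_next_token):
    """Accept drafted tokens one by one; at the first mismatch, emit
    the target's token instead and stop."""
    accepted, context = [], []
    for tok in draft_tokens:
        expected = target_next_token(context)
        if tok != expected:
            accepted.append(expected)  # replace the rejected token
            break
        accepted.append(tok)
        context.append(tok)
    return accepted

# Hypothetical target: always emits the next letter of "abcde".
target = lambda ctx: "abcde"[len(ctx)]
print(verify(list("abxde"), target))  # ['a', 'b', 'c'] -- 'x' rejected
```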

---
Hosted by AER Labs

#REFORM #STAND #KAIST #LLM #LongContext #SpeculativeDecoding #TestTimeScaling #DeepLearning #Transformer #Inference #AIResearch #NLP #MachineLearning #NeurIPS2025 #EMNLP2025

Related videos

Unlocking Geometry with InstaFormer | Pierre Musacchio, SNU | AER LABS

Optimizing Large-Scale RL with SGLang | Chenyang Zhao | AER Labs

Understanding a High Throughput LLM Inference System | Ayush Satyam | AER Labs

Determinism and Scalability in Post-Training RL Systems | Ethan Su | AER LABS

Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!!

RAG vs Fine-Tuning vs Prompt Engineering: Optimizing AI Models

How Attention Became So Efficient [GQA/MLA/DSA]

NVIDIA Dynamo: High performance Open Source Interface | William Arnold | AER Labs

Physics of Language Models

Steering LLM Behavior Without Fine-Tuning

OML: AI-native Cryptography for Open-Model Attribution and Control | Edoardo Contente | AER LABS

Visualizing transformers and attention | Talk for TNG Big Tech Day '24

The Most Complex Model We Actually Understand

Floating Point Non Associativity in Machine Learning | Brian Chau | AER Labs

LLM Fine-Tuning or Training a Small Model? We Tested It!

GraphRAG: Knowledge Graphs Meet RAG: Emil Eifrem

Diffusion Language Models: The Next Big Shift in GenAI

AI, Machine Learning, Deep Learning and Generative AI Explained

Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)

Decoder-Only Transformers, ChatGPTs specific Transformer, Clearly Explained!!!
