Faster LLM Output Without New Hardware: Speculative Decoding
Author: Zaharah
Uploaded: 2025-12-08
Views: 21
Description:
Why is generating text with LLMs so slow? It's not a compute problem; it's a memory bandwidth problem. In this video, we explore Speculative Decoding, the technique that bypasses the "Memory Wall" by using a Draft-Verify architecture. We cover the hardware constraints of autoregressive generation, the mathematics of Rejection Sampling, and how you can achieve 2-3x faster inference without losing quality.
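For concreteness, here is a minimal Python sketch of one draft-verify step, using toy next-token distributions rather than real models; the function name and array values are illustrative assumptions. The draft model proposes a token x, the target accepts it with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the renormalized residual max(p - q, 0), which keeps the output distributed exactly as the target model.

import numpy as np

def speculative_step(p_target, q_draft, rng):
    """One draft-propose / target-verify step via rejection sampling."""
    x = rng.choice(len(q_draft), p=q_draft)           # draft model proposes token x
    accept_prob = min(1.0, p_target[x] / q_draft[x])  # rejection-sampling acceptance test
    if rng.random() < accept_prob:
        return x, True                                # draft token accepted
    # On rejection, resample from the residual distribution max(p - q, 0),
    # renormalized; this correction preserves the target distribution exactly.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()                        # assumes p != q, so the sum is nonzero
    return rng.choice(len(residual), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model's next-token distribution (toy values)
q = np.array([0.4, 0.4, 0.2])   # draft model's next-token distribution (toy values)
print(speculative_step(p, q, rng))

The acceptance rate of this test is what drives the 2-3x speedups mentioned above: the more often the cheap draft tokens survive verification, the fewer slow target-model passes are needed per generated token.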
Inference Optimization Techniques:
DistillSpec: https://arxiv.org/abs/2310.08461
Medusa: https://arxiv.org/abs/2401.10774
Distributed architectures: https://arxiv.org/pdf/2302.01318, https://arxiv.org/pdf/2310.15141
Block verification: https://arxiv.org/pdf/2403.10444
Chapters:
0:00 – Why Speculative Decoding?
0:40 – Why Are LLMs Slow?
1:05 – The Memory Bottleneck Explained
2:00 – Draft Model vs Target Model
3:05 – What is Rejection Sampling?
5:14 – Acceptance Rate & Speed Gains
6:08 – Other Inference Optimization Techniques
6:43 – Implementation via vLLM (see the sketch after this list)
6:53 – Final Thoughts
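For the vLLM chapter, a hedged configuration sketch: the model names are placeholders, and the keyword arguments for speculative decoding have changed across vLLM releases (newer versions group them into a speculative_config dict), so treat the exact parameter names as an assumption rather than a fixed recipe.

# Sketch: enabling speculative decoding in vLLM's offline API.
# Model names are placeholders; kwargs vary by vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",              # target model (verifier)
    speculative_model="facebook/opt-125m",  # small draft model
    num_speculative_tokens=5,               # tokens drafted per verification step
)
params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Speculative decoding works by"], params)
print(outputs[0].outputs[0].text)

The only tuning knobs that matter in practice are the draft model (small enough to be cheap, close enough to the target to get a high acceptance rate) and the number of speculative tokens per step.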