
PagedAttention: Revolutionizing LLM Inference with Efficient Memory Management - DevConf.CZ 2025

Author: DevConf

Uploaded: 2025-06-26

Views: 8

Description: Speaker(s): Rahul Belokar, Sagar Jalindar Aivale

Large language models (LLMs) are pushing the boundaries of artificial intelligence, but their deployment is often hampered by memory bottlenecks arising from the ever-growing size of key-value (KV) caches. Traditional LLM serving systems struggle with inefficient memory utilization and limited scalability. Inspired by the concept of virtual memory paging in operating systems, PagedAttention offers a transformative solution. This novel technique partitions the KV cache into smaller, non-contiguous blocks, enabling dynamic allocation, efficient retrieval, and flexible reuse of memory. By decoupling the physical layout of the cache from the logical structure, PagedAttention minimizes memory fragmentation and overhead.
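The paging idea above can be sketched as a tiny block allocator. This is a toy illustration, not vLLM's actual implementation: the class name, block size, and pool size are hypothetical, and real code would also store the KV tensors in each block.

```python
BLOCK_SIZE = 16   # tokens per KV block (illustrative, not a vLLM default)
NUM_BLOCKS = 8    # physical blocks in the (toy) GPU pool

class PagedKVCache:
    """Maps each sequence's logical token positions onto non-contiguous
    physical blocks, analogous to virtual-memory paging in an OS."""

    def __init__(self):
        self.free_blocks = list(range(NUM_BLOCKS))  # pool of physical block ids
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve KV space for one new token; a physical block is
        allocated only when the previous one is full."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical token position to (physical block, offset)."""
        table = self.block_tables[seq_id]
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache()
for _ in range(20):
    cache.append_token("req-0")
# 20 tokens span two physical blocks that need not be adjacent
print(cache.block_tables["req-0"])
print(cache.physical_slot("req-0", 17))
```

Because blocks are allocated on demand and returned on completion, internal fragmentation is bounded by at most one partially filled block per sequence, which is the property the abstract refers to.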

This approach is integrated within vLLM, an open-source, high-performance LLM serving framework developed at UC Berkeley, and yields significant performance gains. Designed to address memory bottlenecks in traditional LLM serving methods, vLLM leverages PagedAttention for efficient KV cache management, optimizing batch processing and eliminating redundant computations. As a result, PagedAttention achieves up to 30× higher throughput than traditional serving systems such as Hugging Face Transformers, Orca, and NVIDIA's FasterTransformer. It also reduces KV cache waste to approximately 4%, ensuring near-optimal memory usage and enabling larger batches by minimizing memory overhead.
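A back-of-the-envelope calculation shows where the waste reduction comes from. The numbers below (max sequence length, block size, output length) are illustrative assumptions; the ~4% figure quoted above comes from measurements, not from this formula.

```python
import math

MAX_SEQ_LEN = 2048   # contiguous systems reserve this much per request up front
BLOCK_SIZE = 16      # tokens per block in a paged scheme (illustrative)

def contiguous_waste(actual_tokens):
    # Fraction reserved but never used when a full max-length
    # slot is preallocated for the request.
    return (MAX_SEQ_LEN - actual_tokens) / MAX_SEQ_LEN

def paged_waste(actual_tokens):
    # With paging, only the tail of the last block can sit unused.
    allocated = math.ceil(actual_tokens / BLOCK_SIZE) * BLOCK_SIZE
    return (allocated - actual_tokens) / allocated

print(f"contiguous: {contiguous_waste(200):.0%}")  # 90%
print(f"paged:      {paged_waste(200):.1%}")       # 3.8%
```

For a 200-token request, contiguous preallocation strands over 90% of the reserved KV memory, while paged allocation wastes only the unused tail of one 16-token block; freeing that memory is what permits the larger batch sizes mentioned above.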

Furthermore, vLLM seamlessly supports advanced sampling techniques, including beam search, without compromising latency. While challenges such as the overhead of managing lookup tables and the potential for increased latency in certain scenarios exist, ongoing research is addressing these limitations. For example, optimized data structures and prefetching strategies can mitigate lookup overhead. Despite these challenges, PagedAttention represents a major advancement in LLM inference, unlocking the potential for scalable and efficient deployment, even on resource-constrained hardware. This breakthrough paves the way for wider adoption of LLMs and empowers researchers to explore even larger and more complex models.
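The beam-search support rests on sharing physical blocks between beams that descend from the same prefix. The sketch below is a hypothetical refcounting scheme in that spirit (copy-on-write when a shared block is modified), not vLLM's actual block manager, and it omits copying the KV data itself.

```python
class SharedBlockPool:
    """Toy copy-on-write block pool: forked beams share their parent's
    physical blocks, so memory grows with divergence, not beam count."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}   # physical block id -> number of beams referencing it

    def allocate(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def fork(self, block_table):
        """A new beam shares its parent's blocks; just bump refcounts."""
        for b in block_table:
            self.refcount[b] += 1
        return list(block_table)

    def write(self, block_table, idx):
        """Copy-on-write: duplicate a block only if another beam still
        references it (real code would also copy the block's KV data)."""
        b = block_table[idx]
        if self.refcount[b] > 1:
            self.refcount[b] -= 1
            block_table[idx] = self.allocate()
        return block_table[idx]

pool = SharedBlockPool(4)
parent = [pool.allocate()]
child = pool.fork(parent)   # shares the parent's block, no copy yet
pool.write(child, 0)        # child diverges; only now is a new block taken
```

Refcount bookkeeping like this is one source of the lookup-table overhead the abstract mentions, and the motivation for the optimized data structures and prefetching strategies cited as mitigations.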
---

Full schedule, including slides and other resources:
https://pretalx.devconf.info/devconf-...
