LMCache Office Hour 2026-02-12
Author: LMCache Team
Uploaded: 2026-02-13
Views: 58
Description:
LMCache Office Hour #3 featuring @Martin Hickey (IBM), presenting on Event-driven KV-Cache-Aware Routing for Distributed LLM Inference.
Chat Transcript:
00:09:45.125,00:09:48.125
Mo McElaney: Thanks to everyone who joined so far! Going to wait until 5 after the hour to get started.
00:20:31.137,00:20:34.137
Mo McElaney: "What is distributed inference?" https://www.redhat.com/en/topics/ai/w...
00:25:06.055,00:25:09.055
Ugur Kaynar: Is KV‑cache based routing becoming the de‑facto method for large scale disagg inference?
00:29:59.485,00:30:02.485
Mo McElaney: KV Cache Events in the LMCache docs... https://docs.lmcache.ai/production/kv...
00:37:36.285,00:37:39.285
Himanshu Sekhar Nayak: So here medium means kv blocks are sitting in cpu DRAM?
00:39:04.209,00:39:07.209
Himanshu Sekhar Nayak: Is it there for NVMe too?
00:40:28.290,00:40:31.290
Himanshu Sekhar Nayak: I mean storage
00:40:54.826,00:40:57.826
Himanshu Sekhar Nayak: thanks
00:47:04.601,00:47:07.601
kosseila Hd: which event do you think will benefit latency & performance the most when KV-cache-aware routing is enabled for users?
00:48:13.525,00:48:16.525
kosseila Hd: 👍🏻
00:48:18.782,00:48:21.782
Himanshu Sekhar Nayak: I’ve been testing LMCache across versions 0.3.10 to 0.3.13 and I can clearly see overall performance improvements.
However, I noticed a behavioral difference in KV offloading:
In v0.3.10, when I send a small prompt (~20 tokens), KV blocks are offloaded to NVMe.
In v0.3.13, KV blocks are not offloaded for the same prompt. Offloading only seems to happen when (input_tokens + output_tokens) approaches max_model_len.
00:48:53.877,00:48:56.877
Himanshu Sekhar Nayak: Was there any intentional change in the offloading/store logic between 0.3.10 and 0.3.13?
00:50:31.212,00:50:34.212
Samuel Shen: save_unfull_chunk was turned off by default
00:51:07.330,00:51:10.330
Himanshu Sekhar Nayak: Is it due to bandwidth saturation for small chunks?
00:51:25.723,00:51:28.723
Samuel Shen: it helps us not have to store metadata for chunks for remote backends
00:51:28.898,00:51:31.898
Samuel Shen: since all chunks become uniform
00:52:18.667,00:52:21.667
Ugur Kaynar: Thank you
00:52:39.543,00:52:42.543
Himanshu Sekhar Nayak: thanks for answering
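
The `save_unfull_chunk` exchange above can be sketched numerically. This is a minimal illustration, not LMCache code: the function name `storable_tokens` is hypothetical, and the 256-token chunk size is an assumption based on LMCache's documented default `chunk_size`. With `save_unfull_chunk` off, only whole chunks are stored, which is why a ~20-token prompt no longer triggers NVMe offloading in v0.3.13.

```python
def storable_tokens(num_tokens: int, chunk_size: int = 256,
                    save_unfull_chunk: bool = False) -> int:
    """Hypothetical sketch: how many prompt tokens get persisted.

    chunk_size=256 assumes LMCache's default chunk size.
    With save_unfull_chunk=False (the new default), only full,
    uniform chunks are stored, so no per-chunk metadata is needed
    for remote backends.
    """
    if save_unfull_chunk:
        return num_tokens                      # partial chunks stored too
    return (num_tokens // chunk_size) * chunk_size  # full chunks only

# A ~20-token prompt: nothing reaches a full chunk, so nothing is offloaded.
print(storable_tokens(20))                            # 0
# Old behavior (partial chunks allowed): the whole prompt is stored.
print(storable_tokens(20, save_unfull_chunk=True))    # 20
# A long context crosses chunk boundaries: two full chunks are stored.
print(storable_tokens(600))                           # 512
```

Under this reading, offloading kicks in once the accumulated tokens cross full-chunk boundaries, consistent with Himanshu's observation that it only happens as (input_tokens + output_tokens) grows toward max_model_len.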