Tactical Sports Analytics at Scale with Solidigm KV Cache Offload
Author: Metrum AI
Uploaded: 2026-05-26
Views: 6
Description:
This demo shows what happens when you offload KV cache from GPU VRAM to a Solidigm NVMe SSD for large language model inference, and why it matters at scale.
For long-context workloads that exceed standard VRAM capacity, the NVMe offload path serves cached KV blocks directly from the drive, bypassing redundant GPU recomputation and dramatically reducing time to first token. Under sustained load, this also lowers power draw and improves GPU efficiency per query.
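The offload idea can be illustrated with a toy two-tier cache. This is only a sketch under stated assumptions, not the implementation used in the demo: the class `TieredKVCache`, the function `fake_prefill`, and the use of pickled files in a temporary directory as a stand-in for the NVMe tier are all hypothetical. It shows the one behavior the description relies on: a block evicted from the fast tier is read back from storage instead of being recomputed.

```python
import hashlib
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredKVCache:
    """Toy cache: a small in-memory tier (stand-in for VRAM) that spills
    evicted blocks to files on disk (stand-in for an NVMe SSD)."""

    def __init__(self, fast_capacity, spill_dir):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # key -> block, kept in LRU order
        self.spill_dir = spill_dir
        self.recomputes = 0

    def _path(self, key):
        return os.path.join(self.spill_dir, key + ".blk")

    def _put_fast(self, key, block):
        self.fast[key] = block
        self.fast.move_to_end(key)
        # Evict least recently used blocks to disk when over capacity.
        while len(self.fast) > self.fast_capacity:
            old_key, old_block = self.fast.popitem(last=False)
            with open(self._path(old_key), "wb") as f:
                pickle.dump(old_block, f)

    def get(self, tokens, compute_fn):
        """Return (block, source), where source names the tier that served it."""
        key = hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        path = self._path(key)
        if os.path.exists(path):
            # Disk hit: load the stored block instead of recomputing it.
            with open(path, "rb") as f:
                block = pickle.load(f)
            self._put_fast(key, block)
            return block, "disk"
        # Full miss: pay the (simulated) prefill cost.
        self.recomputes += 1
        block = compute_fn(tokens)
        self._put_fast(key, block)
        return block, "recomputed"


def fake_prefill(tokens):
    # Hypothetical stand-in for the expensive GPU prefill step.
    return [t * 2 for t in tokens]


def demo():
    with tempfile.TemporaryDirectory() as spill_dir:
        cache = TieredKVCache(fast_capacity=1, spill_dir=spill_dir)
        sources = [
            cache.get([1, 2, 3], fake_prefill)[1],
            cache.get([4, 5, 6], fake_prefill)[1],  # evicts the first block to disk
            cache.get([1, 2, 3], fake_prefill)[1],  # served from disk
        ]
        return sources, cache.recomputes
```

Calling `demo()` requests the same prefix twice with a different prefix in between. With a fast tier that holds only one block, the second request for the first prefix is served from the disk tier, so only two prefill computations happen for three requests.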
In a benchmark of 50 concurrent users, each sending 180,000 tokens, the NVMe offload path completed all requests while the GPU-only baseline was still processing. The result is faster response times, higher throughput, and better energy efficiency, which allows more users per accelerator at lower cost.
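To give a sense of why a workload of this shape can outgrow VRAM, the following back-of-the-envelope sketch estimates KV cache size. The description does not name the model, so the dimensions used here (32 layers, 8 KV heads, head dimension 128, 16-bit values) are purely illustrative assumptions; only the 50 users and 180,000 tokens come from the benchmark description.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # One key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_bytes_total(users, tokens_per_user, per_token_bytes):
    return users * tokens_per_user * per_token_bytes


# ASSUMED model shape (not stated in the source).
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)

# Workload figures from the benchmark description.
total = kv_bytes_total(users=50, tokens_per_user=180_000, per_token_bytes=per_token)

GIB = 2**30
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache per user:  {180_000 * per_token / GIB:.1f} GiB")
print(f"KV cache, 50 users: {total / GIB:.1f} GiB")
```

Under these assumed dimensions, each token needs 128 KiB of KV cache, each 180,000-token request needs about 22 GiB, and all 50 requests together need roughly 1,100 GiB. A different model shape or precision would change these figures proportionally.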