Tactical Sports Analytics at Scale with Solidigm KV Cache Offload

Author: Metrum AI

Uploaded: 2026-05-26

Views: 6

Description: This demo shows what happens when you offload KV cache from GPU VRAM to a Solidigm NVMe SSD for large language model inference, and why it matters at scale.

For long-context workloads that exceed standard VRAM capacity, the NVMe offload path serves cached KV blocks directly from the drive, bypassing redundant GPU recomputation and dramatically reducing time to first token. Under sustained load, this also lowers power draw and improves GPU efficiency per query.
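The reuse idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not the demo's implementation: the block size, the `DiskKVCache` class, and the `_compute_block` stand-in for GPU prefill are all hypothetical, and an ordinary temporary directory stands in for the NVMe drive. Each block is keyed by a hash of the entire token prefix up to that block, so a later request sharing the prefix loads the stored blocks instead of recomputing them.

```python
import hashlib
import tempfile
from pathlib import Path

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for the example


def block_keys(token_ids, block_tokens=BLOCK_TOKENS):
    """Return one key per full block; each key hashes the whole prefix so far."""
    keys = []
    h = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_tokens
    for start in range(0, full, block_tokens):
        for tok in token_ids[start:start + block_tokens]:
            h.update(tok.to_bytes(4, "little"))
        keys.append(h.copy().hexdigest())
    return keys


class DiskKVCache:
    """Hypothetical disk-backed store: one file per KV block."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.computed = 0  # blocks that needed (simulated) GPU prefill
        self.loaded = 0    # blocks served from disk

    def _compute_block(self, key):
        # Stand-in for GPU prefill; a real system would produce K/V tensors.
        self.computed += 1
        return bytes.fromhex(key)

    def get_blocks(self, token_ids):
        blocks = []
        for key in block_keys(token_ids):
            path = self.root / key
            if path.exists():
                blocks.append(path.read_bytes())
                self.loaded += 1
            else:
                data = self._compute_block(key)
                path.write_bytes(data)
                blocks.append(data)
        return blocks


def demo():
    with tempfile.TemporaryDirectory() as d:
        cache = DiskKVCache(d)
        prompt = list(range(12))                         # 3 full blocks
        cache.get_blocks(prompt)                         # cold: all computed
        cache.get_blocks(prompt + [99, 100, 101, 102])   # shared prefix reused
        return cache.computed, cache.loaded
```

Calling `demo()` runs a cold request followed by a longer request with the same prefix; only the new trailing block is computed on the second pass, while the three shared blocks are read back from disk.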

In a benchmark with 50 concurrent users, each sending 180,000 tokens, the NVMe offload path completed all requests while the GPU-only baseline was still processing. The result is faster response times, higher throughput, and better energy efficiency, which allows more users per accelerator at lower cost.
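To give a rough sense of why a workload of this shape can exceed VRAM, the following back-of-the-envelope sketch estimates KV cache size. Only the user count and token count come from the description; the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption chosen for illustration, since the description does not name the model, and the estimate ignores any prefix sharing between users.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


USERS = 50                 # from the benchmark description
TOKENS_PER_USER = 180_000  # from the benchmark description

# ASSUMPTION: hypothetical model shape, not stated in the source.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)

total_tokens = USERS * TOKENS_PER_USER
per_user_bytes = per_token * TOKENS_PER_USER
total_bytes = per_token * total_tokens

print(f"KV bytes per token: {per_token}")
print(f"Total tokens:       {total_tokens}")
print(f"KV per user (GB):   {per_user_bytes / 1e9:.1f}")
print(f"KV total (TB):      {total_bytes / 1e12:.2f}")
```

Under these assumed parameters the cache comes to roughly 23.6 GB per user and about 1.18 TB across all 50 users, which is the kind of footprint that motivates a storage tier beyond GPU memory.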
