llm-d NYC 2026 Meetup
Author: llm-d Project
Uploaded: 2026-03-12
Views: 58
Description:
Welcome to the recording of the first-ever llm-d Meetup, hosted on March 11, 2026, in New York City! This inaugural event brought together engineering leaders from IBM Research, AMD, and Red Hat to dive deep into the challenges of scaling LLM inference and the future of the open-source distributed stack.
In this session, we explore how llm-d (an open-source, full-stack solution) is establishing distributed inference as a first-class cloud-native workload. From managing the "prefill crunch" to state-aware scheduling on Kubernetes, our speakers break down the technical paths to production-ready AI.
📍 AGENDA & TIMESTAMPS
00:00 Welcome - Pete Cheslock (Red Hat)
01:49 Intro to llm-d for Open Source Distributed Inference - Carlos Costa (IBM)
35:40 Distributed LLM Serving on AMD with llm-d - Kenny Roche (AMD)
1:05:55 Scaling Wide-EP and Mixture-of-Experts (MoE) Models - Tyler Smith (Red Hat AI)
1:20:59 KV-Cache Wins: Prefix-Cache Scheduling & Offloading - Maroon Ayoub (IBM)
1:41:54 Closing & How to Get Involved with llm-d - Pete Cheslock
Carlos Costa (IBM Research) kicks off with an overview of the core challenges: hardware heterogeneity, varying request sizes, and the shift from monolithic to orchestrated inference.
Kenny Roche (AMD) discusses aligning llm-d with the ROCm stack and the performance potential of AITER kernels.
Tyler Smith (Red Hat AI) dives into Expert Parallelism (EP) and lessons learned scaling sparse models like DeepSeek-style architectures.
Maroon Ayoub (IBM Research) explains why KV cache hit rates are the most important metric for production and introduces North-South/East-West management paths.
💡 KEY TECHNICAL HIGHLIGHTS
State-Aware Scheduling: Learn how llm-d cuts latency by routing requests to the replicas that already hold the relevant KV cache, maximizing prefix-cache reuse across the cluster (see the sketch after this list).
Prefill-Decode (P/D) Disaggregation: A deep dive into separating compute-bound prefill from memory-bound decode for better latency.
Offloading Strategies: How to overcome GPU memory limits by offloading terabytes of KV cache to CPU memory and file-system-based storage.
Future Frontiers: A sneak peek at the llm-d roadmap, featuring reinforcement learning (RL) support and expansion to the SGLang inference engine.
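
For a rough intuition of what cache-aware scheduling means, here is a minimal Python sketch; it is not the actual llm-d scheduler, and all names, fields, and weights are illustrative. It scores candidate serving pods by how much of the prompt's prefix is already in their KV cache, balanced against current load:

# Toy sketch (illustrative only, not the llm-d scheduler): pick the pod with the
# best trade-off between prefix-cache overlap and queue depth.
from dataclasses import dataclass

@dataclass
class PodState:
    name: str
    cached_prefix_tokens: int   # longest prompt prefix already in this pod's KV cache
    queued_requests: int        # rough proxy for current load

def score(pod: PodState, prompt_tokens: int,
          cache_weight: float = 1.0, load_weight: float = 0.2) -> float:
    """Higher is better: reward prefix-cache hits, penalize queue depth."""
    hit_ratio = min(pod.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return cache_weight * hit_ratio - load_weight * pod.queued_requests

def pick_pod(pods: list[PodState], prompt_tokens: int) -> PodState:
    return max(pods, key=lambda p: score(p, prompt_tokens))

if __name__ == "__main__":
    pods = [
        PodState("decode-0", cached_prefix_tokens=4096, queued_requests=3),
        PodState("decode-1", cached_prefix_tokens=0,    queued_requests=1),
    ]
    # A long shared system prompt makes the cache-hit pod win despite its deeper queue.
    print(pick_pod(pods, prompt_tokens=5000).name)

The real scheduler weighs more signals than this, but the talks walk through why prefix-cache hit rate dominates the trade-off in production.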
🔗 JOIN THE COMMUNITY
Join the llm-d community:
🌎 https://llm-d.ai
💬 https://llm-d.ai/slack
💻 https://github.com/llm-d