Maximize LLM Inference Performance + Auto-Profile/Optimize PyTorch/CUDA Code

Author: AI Performance Engineering

Uploaded: 2025-08-18

Views: 1302

Description: Talk #1: Everything You Need to Know About Reducing Voice-Agent Latency (by Philip Kiely @ Baseten)
Rolling your own optimized voice agent introduces hard problems at each layer of the stack. In this talk, Philip will provide an overview of the runtime optimizations, infrastructure setup, and client code required to get consistently low latencies for voice at scale.

Talk #2: PyTorch Profiling That Actually Tells You What to Fix (by Emilio Andere @ Herdora)
Automate PyTorch profiler analysis by tracing bottlenecks to their root causes, including kernel memory patterns, tensor layouts, and missing fusions, and mapping them to specific code fixes.

Talk #3: Auto-Optimizing PyTorch and CUDA Code (by Chris Fregly)
Automate PyTorch and CUDA performance optimizations for all environments including GPUs.

Zoom link: https://us02web.zoom.us/j/82308186562

Related Links
Github Repo: http://github.com/cfregly/ai-performa...
O'Reilly Book: https://www.amazon.com/Systems-Perfor...
YouTube: @aiperformanceengineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm

Related videos

Why AI Generates Garbage, and How to Make It Write Decent Code

Dynamic/Adaptive RL-based Inference CUDA Kernel Optimization +Accelerated PyTorch +Modular Mojo/MAX

part 2 webinar

Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

AI Agent Inference Performance Optimizations + vLLM vs. SGLang vs. TensorRT w/ Charles Frye (Modal)

GES Cohort 3 AI Search Webinar

Lianmin Zheng on Efficient LLM Inference with SGLang

NVIDIA Dynamo + Disaggregated Prefill-Decode LLM Serving + PyTorch/CUDA Performance with Luminal

Deep Dive: LLM Inference Optimization

LLMs and GPT: How Do Large Language Models Work? A Visual Introduction to Transformers

Andrej Karpathy: Software Is Changing (Again)

Mastering LLM Inference Optimization: From Theory to Cost-Effective Deployment: Mark Moyou

AI, Machine Learning, Deep Learning and Generative AI Explained

Lightning Talk: The Fastest Path to Production: PyTorch Inference in Python, by Mark Saroufim, Meta

The Best Documentary About the Creation of AI

LLM inference optimization: Architecture, KV cache and Flash attention

Databricks' vLLM Optimization for Cost-Effective LLM Inference | Ray Summit 2024

Efficient LLM Inference on Edge Devices Using NNTrainer, by Eunju Yang and Donghak Park

AI-Powered GPU Kernel Optimization(Mako.dev) + Distributed PyTorch with nbdistributed (Hugging Face)

Claude Code with a TEAM of Agents: An Autonomous Development Machine