Maximizing GPU Efficiency w/ Quentin Anthony, Model Training Lead @ Zyphra | Beyond CUDA Summit 2025
Author: TensorWave
Uploaded: 2025-04-08
Views: 124
Description:
Quentin Anthony, Model Training Lead at Zyphra, explores how model sizing impacts GPU efficiency in transformer-based models. Discover why carefully selecting model dimensions, including hidden layers, attention heads, and tensor parallelism, can dramatically boost throughput and performance across diverse hardware like AMD GPUs, mobile devices, and cloud-based systems. Learn about practical insights, such as handling wave quantization and flash attention optimization, along with strategic considerations for balancing training and inference costs.
Connect with Quentin Anthony -
/ quentin-anthony
📢 Let us know your thoughts in the comments
--
Timestamps
0:00 - 1:20 - Introduction: Model Sizing for Efficient GPU Utilization
1:21 - 2:39 - Why Model Dimensions Matter for GPU Performance
2:40 - 4:31 - Breakdown of GPU Kernels in Transformer Models
4:32 - 6:10 - MLP Kernel Efficiency and Roofline Model
6:11 - 7:59 - Attention Kernel Efficiency Before Flash Attention
8:00 - 10:25 - Optimizing Attention Kernels with Flash Attention
10:26 - 11:51 - Wave Quantization and GPU Occupancy Explained
11:52 - 13:17 - Training vs. Inference Efficiency Trade-offs
13:18 - 13:57 - Key Takeaways and Future Research Directions
---
About TensorWave:
TensorWave is the AI and HPC cloud purpose-built for performance. Powered exclusively by AMD Instinct™ Series GPUs, we deliver high-bandwidth, memory-optimized infrastructure that scales with your most demanding models—training or inference.
--
Connect with TensorWave:
https://www.tensorwave.com
https://www.x.com/tensorwavecloud
/ tensorwave
/ tensorwave_cloud
/ @tensorwavecloud
#AICompute #GPUs #BeyondCUDA #AIInfrastructure