GPU MODE
A GPU reading group and community: https://discord.gg/gpumode
Supplementary content: https://github.com/gpu-mode
Created by Mark Saroufim and Andreas Köpf
Lecture 83: Formalized Kernel Derivation
Lecture 82: Helion: A high-level DSL for ML kernels
Lecture 81: High-performance purely functional data-parallel array programming
Lecture 80: How FlashAttention 4 Works
Lecture 79: Mirage (MPK): Compiling LLMs into Mega Kernels
Lecture 78: Iris: Multi-GPU Programming in Triton
Lecture 77: Domain-Specific Languages for GPU Kernels
Lecture 76: BackendBench: Fixing the LLM kernel correctness problem
Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens
Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention
Lecture 73: [ScaleML Series] Quantization in Large Models
Lecture 72: [ScaleML Series] Efficient and Effective Long-Context Modeling for Large...
Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use
Lecture 70: PCCL: Fault-tolerant collectives
Lecture 69: Quartet: 4-bit training
Lecture 68: Landscape of GPU-centric communication
Lecture 67: NCCL and NVSHMEM
Lecture 66: Game Arena
Lecture 65: Neighborhood Attention
Lecture 64: Multi-GPU programming
Lecture 63: Search-Based Deep Learning Compilers
Lecture 62: Exo 2: Growing a scheduling language
Lecture 61: d-Matrix Corsair
Lecture 60: Optimizing Linear Attention
Lecture 59: FastVideo
Lecture 58: Disaggregated LLM Inference
Lecture 57: CuTe
Lecture 56: Kernel Benchmarking Tales
Lecture 55: Modular’s unified device accelerator language
Lecture 54: Small RL Models at the Speed of Light with LeanRL