Writing Mixture of Experts LLMs from Scratch in PyTorch
Author: Neural Breakdown with AVB
Uploaded: 2025-03-11
Views: 4696
Description:
In this video, we discuss Mixture of Experts Transformers - the backbone behind popular LLMs like DeepSeek V3, Mixtral 8x22B, and more. You will learn concepts like Dense MoEs, Sparse MoEs, Top-K Routing, Noisy Routing, Expert Capacity, Switch Transformers, Auxiliary load balancing losses, and many more. Everything is presented visually to help conceptualize what is going on, and code snippets are provided to make it more concrete!
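To give a flavor of the sparse MoE with top-k routing covered in the video, here is a minimal PyTorch sketch (my own illustrative implementation, not the video's code; class and parameter names like `SparseMoE`, `n_experts`, and `top_k` are assumptions). A router scores each token against every expert, only the top-k experts are run per token, and their outputs are mixed with softmax-renormalized router weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal sparse Mixture-of-Experts layer with top-k routing (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router (gating network): one logit per expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a small feed-forward network, as in a standard Transformer block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        logits = self.router(x)                             # (batch, seq, n_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        # Softmax over the selected experts only, so the k weights sum to 1.
        weights = F.softmax(topk_vals, dim=-1)              # (batch, seq, top_k)

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = topk_idx == i                            # (batch, seq, top_k)
            if mask.any():
                token_mask = mask.any(dim=-1)               # tokens routed to expert i
                # Gather this expert's mixing weight for each routed token.
                w = (weights * mask).sum(dim=-1)[token_mask].unsqueeze(-1)
                out[token_mask] += w * expert(x[token_mask])
        return out

moe = SparseMoE(d_model=16, d_hidden=32, n_experts=4, top_k=2)
y = moe(torch.randn(2, 5, 16))
print(y.shape)  # same shape as the input: (2, 5, 16)
```

A dense MoE would instead run every expert on every token and mix with the full softmax over all router logits; the top-k restriction is what keeps compute roughly constant as the expert count grows. Production systems add the noisy routing, load-balancing losses, and expert-capacity limits discussed later in the video.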
Follow on Twitter: https://x.com/neural_avb
To support this channel, you can buy me a coffee at: https://ko-fi.com/neuralavb
Join the channel on Patreon to receive updates about the channel, and get access to bonus content used in all my videos. You will get the slides, notebooks, code snippets, word docs, and animations that went into producing this video. Here is the link:
/ neuralbreakdownwithavb
Visit AI Agent Store Page: https://aiagentstore.ai/?ref=avishek
#pytorch #transformers #deepseek
Videos and playlists you would like:
Attention to Transformers playlist: • Attention to Transformers from zero to her...
Guide to fine-tuning open source LLMs: • Finetune LLMs to teach them ANYTHING with ...
Generative Language Modeling from scratch: • From Attention to Generative Language Mode...
References and additional links:
Sparse Mixture of Experts paper: https://arxiv.org/abs/1701.06538
Mixtral of Experts: https://arxiv.org/abs/2401.04088
DeepSeek V2: https://arxiv.org/abs/2405.04434
DeepSeek V3: https://arxiv.org/abs/2412.19437
Switch Transformers / Expert Capacity: https://arxiv.org/abs/2101.03961
A Blog post: https://brunomaga.github.io/Mixture-o...
A visual guide: https://newsletter.maartengrootendors...
Survey paper: https://arxiv.org/pdf/2407.06204
Timestamps:
0:00 - Intro
1:52 - Mixture of Experts Intuition
4:53 - Transformers 101
9:20 - Dense MOEs
14:50 - Sparse MOEs
16:34 - Router Collapse and Top-K Routing
19:20 - Noisy TopK, Load Balancing
20:56 - Routing Analysis by Mixtral
22:30 - Auxiliary Losses & DeepSeek
24:05 - Expert Capacity
26:07 - 6 Points to Remember