Distillation of Transformer Models

Author: Trelis Research

Uploaded: 2024-09-25

Views: 5470

Description: ➡️ Get lifetime access to the complete scripts (and future improvements): https://Trelis.com/ADVANCED-fine-tuning
➡️ One-click fine-tuning and LLM templates: https://github.com/TrelisResearch/one...
➡️ Newsletter: https://blog.Trelis.com
➡️ Resources/Support/Discord: https://Trelis.com/About
➡️ Thumbnail made with this tutorial: • Fine Tune Flux Diffusion Models with Your ...

With credit to Rohan Sharma for work on these scripts on a Trelis Internship: https://trelis.com/internships/. Find Rohan on GitHub: https://github.com/rs545837/

Thanks also to Elie Bakouch of HuggingFace for guidance on using the SmolLM corpus: https://huggingface.co/eliebak

VIDEO RESOURCES:
Slides: https://docs.google.com/presentation/...
Minitron Distillation Paper: https://d1qx31qr3h6wln.cloudfront.net...
Distil-Whisper Paper: https://arxiv.org/pdf/2311.00430
SmolLM Corpus: https://huggingface.co/datasets/Huggi...
Trelis SmolLM 2% split: https://huggingface.co/datasets/Treli...
WebInstruct: https://huggingface.co/datasets/TIGER...

TIMESTAMPS:
0:00 AI model distillation (Whisper, Flux, Minitron, gpt-4o-mini?)
0:46 Video Overview - Distillation Tutorial and Code Walk-through
2:00 Distillation Examples (Diffusion - Flux Schnell / Dev, Transcription - Distil-Whisper, LLMs - Nvidia Minitron)
6:51 How distillation works
7:22 Student model initialization
8:36 Layer / depth pruning
11:52 Width pruning
15:25 Pre-training versus distillation
18:40 Cross-entropy loss vs KL-divergence
22:41 Instruction fine-tuning
23:28 Distilling SmolLM 135M to a 99M model
24:43 Code walk-through setup
26:49 Pruning Notebook
28:56 Layer Pruning
31:41 Width Pruning
35:01 Why pruning works?
36:17 Distillation Script - Multi-GPU Setup
39:36 Distillation Script Walk-through
54:05 Distillation Configuration File Walk-through
56:32 Distillation Startup and Performance Monitoring with tensorboard
1:03:01 Instruction fine-tuning and dataset selection
1:09:02 Instruction FT Startup and Performance Monitoring with tensorboard
1:12:40 Running inference to evaluate distillation performance
1:12:54 Teacher model performance (base SmolLM 135M)
1:13:53 SmolLM Instruct model performance
1:14:15 Raw pruned model performance (layer pruned) 99M
1:14:38 Width + Layer pruning performance (raw) 99M
1:15:18 Distilled model performance (before instruction tuning) 99M
1:15:57 Instruction tuning performance evaluation
1:16:21 SmolLM 135M Instruct performance
1:17:17 Instruction tuned distilled model performance (99M model)
1:18:33 Final Tips (best pruning approach, learning rate, batch size and model size effects)
1:20:21 Video Resources
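
The "Cross-entropy loss vs KL-divergence" chapter contrasts two training signals. As a rough illustration only (this is not the code from the video's scripts, and every function name and logit value below is a hypothetical chosen for the example), cross-entropy scores the student against a single correct token, while KL-divergence scores the student against the teacher's full probability distribution:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, target_index):
    """Loss against one hard label: -log p_student(target)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student): loss against the teacher's soft distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a three-token vocabulary for one position.
teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]

ce = cross_entropy(student_logits, target_index=0)   # uses only the label
kl = kl_divergence(teacher_logits, student_logits)   # uses the whole teacher distribution
print(f"cross-entropy: {ce:.4f}, KL divergence: {kl:.4f}")
```

The KL term is zero only when the student reproduces the teacher's distribution exactly, so it carries information about every token in the vocabulary rather than just the labelled one.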
