Scalable MatMul-free Language Modeling (Paper Explained)

Author: Yannic Kilcher

Uploaded: 2024-07-08

Views: 35091

Description: Matrix multiplications (MatMuls) are pervasive throughout modern machine learning architectures. However, they are also very resource-intensive and require special accelerators (GPUs). This paper explores architectures that do away with MatMuls and use quantization and recurrence to keep performance up.

OUTLINE:
0:00 - Intro
2:30 - MatMul is everywhere
5:55 - Ternary accumulation as a substitute for matrix multiplication
16:35 - Replacing attention layers with recurrent layers
32:40 - Replacing dense layers with ternary channel mixing
38:30 - Language modelling results & scaling laws
45:00 - Other experimental results
48:20 - Conclusion

Paper: https://arxiv.org/abs/2406.02528
Code: https://github.com/ridgerchu/matmulfr...

Abstract:
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.

Authors: Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
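
The "ternary accumulation" idea named in the outline can be illustrated with a small sketch. It assumes the weights have already been quantized to the values -1, 0 and +1, so each multiply in a matrix-vector product reduces to an add, a subtract, or a skip. The function names `ternary_matvec` and `dense_matvec` and the example values are hypothetical and chosen only for illustration; this is not the paper's optimized GPU or FPGA implementation.

```python
def ternary_matvec(x, w):
    """Compute y = x @ W for W with entries in {-1, 0, +1}.

    Only additions and subtractions are used; no multiplications.
    x is a list of floats, w is a list of rows (one row per input element).
    """
    n_out = len(w[0])
    y = [0.0] * n_out
    for i, xi in enumerate(x):
        row = w[i]
        for j in range(n_out):
            if row[j] == 1:
                y[j] += xi      # weight +1: accumulate the input
            elif row[j] == -1:
                y[j] -= xi      # weight -1: accumulate the negated input
            # weight 0: the input contributes nothing
    return y


def dense_matvec(x, w):
    """Reference implementation using ordinary multiply-accumulate."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


# Hypothetical example values.
x = [0.5, -2.0, 3.0]
w = [[1, 0, -1],
     [-1, 1, 0],
     [0, -1, 1]]

print(ternary_matvec(x, w))  # [2.5, -5.0, 2.5]
print(dense_matvec(x, w))    # [2.5, -5.0, 2.5]
```

Both functions return the same result here; the point of the sketch is only that the ternary version never multiplies.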

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube:    / yannickilcher
Twitter:   / ykilcher
Discord: https://ykilcher.com/discord
LinkedIn:   / ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon:   / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
