Wonmin Byeon (NVIDIA), "An Alternative Architecture for Efficient Large Language Models (LLMs)"
Author: Users & Information Lab KAIST
Uploaded: 2024-07-19
Views: 260
Description:
Paper: An Empirical Study of Mamba-based Language Models (https://arxiv.org/abs/2406.07887)
Widely used Large Language Models (LLMs) are based on Transformer architectures. While Transformer-based language models are highly parallelizable and can model massive amounts of data, they incur significant computational overhead from the quadratic cost of self-attention, especially on longer sequences, and they have large inference-time memory requirements from the key-value cache. More recently, State Space Models (SSMs) such as Mamba have been shown to offer fast, parallelizable training and inference. Studies show that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative.

In this talk, I present the strengths and weaknesses of Mamba, Mamba-2, and Transformer models at larger scales. I also introduce a hybrid architecture consisting of Mamba-2, attention, and MLP layers. While pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks that require strong copying or in-context learning abilities. In contrast, the hybrid model closely matches or exceeds the Transformer on all standard and long-context tasks and is predicted to be up to 8x faster when generating tokens at inference time.
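To make the hybrid idea concrete, below is a minimal PyTorch sketch of a layer stack that interleaves SSM-style sequence mixers, self-attention, and MLP blocks. It is not the paper's implementation: the class names (HybridStack, SSMMixerStub, etc.), the layer ordering, and the ratio of attention to SSM layers are illustrative assumptions, and SSMMixerStub is only a placeholder where a real Mamba-2 mixer would go.

```python
# Illustrative sketch only: interleaving SSM-style mixers, attention, and MLPs.
# The SSM mixer is a stand-in placeholder, not a real Mamba-2 layer, and the
# layer counts/ordering are assumptions rather than the paper's configuration.
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),
            nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x):
        return x + self.net(self.norm(x))  # pre-norm residual block


class AttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class SSMMixerStub(nn.Module):
    """Placeholder for a Mamba-2 mixer; a real SSM scans the sequence in
    linear time, whereas this toy gated projection is purely position-wise."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(h * torch.sigmoid(gate))


class HybridStack(nn.Module):
    """Mostly SSM mixers with occasional attention, each followed by an MLP;
    the specific pattern here is an illustration of the hybrid idea only."""

    def __init__(self, d_model: int = 256, n_groups: int = 4):
        super().__init__()
        layers = []
        for i in range(n_groups):
            mixer = AttentionBlock(d_model) if i % 4 == 3 else SSMMixerStub(d_model)
            layers += [mixer, MLPBlock(d_model)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)


if __name__ == "__main__":
    model = HybridStack()
    tokens = torch.randn(2, 128, 256)  # (batch, sequence length, d_model)
    print(model(tokens).shape)         # torch.Size([2, 128, 256])
```

The design point the sketch tries to convey is that only a small fraction of layers need global attention (for copying and in-context recall), while the remaining layers use constant-state SSM mixers, which is what keeps inference-time memory and generation cost well below a pure Transformer.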