MiniMax-01: Scaling Foundation Models with Lightning Attention - Briefing Doc
Source: https://arxiv.org/pdf/2501.08313
Authors: MiniMax
Main Themes:
Scaling Large Language Models (LLMs) and Vision Language Models (VLMs) to 1 million token context windows.
Introducing a novel attention mechanism, Lightning Attention, for improved efficiency and long-context capabilities.
Development of MiniMax-Text-01, a 456 billion parameter LLM, and MiniMax-VL-01, a multi-modal VLM.
Extensive benchmarking and ablation studies demonstrating the performance and scaling benefits of their approach.
Key Ideas and Facts:
Context Window Limitation: Existing LLMs and VLMs have limited context windows (32K to 256K tokens), hindering practical applications that require larger context, like processing books, code projects, or extensive in-context learning examples. MiniMax aims to address this limitation by scaling their models to a 1 million token context window.
Lightning Attention: This novel attention mechanism is designed for efficient long-context language modeling. It tackles the computational bottleneck of the cumsum operation in existing linear attention mechanisms with a tiling technique that splits the computation into intra-block and inter-block operations (sketched below).
"Lightning Attention proposes a novel tiling technique that effectively circumvents the cumsum operation."
Hybrid-Lightning Architecture: MiniMax-Text-01 uses a hybrid architecture that combines linear attention (Lightning Attention) with softmax attention, giving it superior retrieval and extrapolation capabilities compared to models that rely solely on softmax attention (see the layer schedule sketched below).
"Consequently, the hybrid-lightning model exhibits superior retrieval and extrapolation capabilities compared to models relying solely on softmax attention."
Model Scaling and Performance: Through careful hyperparameter design and a three-stage training procedure, MiniMax-Text-01 scales to 456 billion parameters and matches the performance of state-of-the-art models on benchmarks including MMLU, MMLU-Pro, C-SimpleQA, and IFEval.
Multi-modal Capabilities: MiniMax-VL-01 integrates a lightweight Vision Transformer (ViT) module with MiniMax-Text-01, creating a multi-modal VLM that handles both text and visual inputs (see the adapter sketch below).
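A minimal PyTorch sketch of the common ViT-to-LLM wiring this refers to: patch features from the vision encoder are projected into the language model's embedding space and placed alongside the text embeddings. The module name VisionLanguageAdapter, the dimensions, and the two-layer MLP projector are illustrative assumptions, not the MiniMax-VL-01 definition.

```python
import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Projects ViT patch features into the LLM embedding space and joins them with text."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image_features: torch.Tensor, text_embeddings: torch.Tensor):
        # image_features:  (batch, num_patches, vit_dim) from the ViT encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's token embedding
        image_tokens = self.projector(image_features)
        # The LLM then attends over visual and textual tokens as one sequence.
        return torch.cat([image_tokens, text_embeddings], dim=1)
```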
Varlen Ring Attention: To handle variable-length sequences efficiently in the data-packing format, MiniMax introduces Varlen Ring Attention, a redesigned algorithm that avoids the excessive padding and computational waste of traditional methods (see the packing sketch below).
"This approach avoids the excessive padding and subsequent computational waste associated with traditional methods by applying the ring attention algorithm directly to the entire sequence after data-packing."
Optimized Implementation and Training: MiniMax focuses on optimizing the implementation and training process through techniques like batched kernel fusion, separated prefill and decoding execution, multi-level padding, and StridedBatchedMatmul extension.
Extensive Evaluation: MiniMax conducts comprehensive evaluations across a diverse set of benchmarks, including long-context tasks like Needle-In-A-Haystack (NIAH) and Multi-Round Needles-In-A-Haystack (MR-NIAH), demonstrating the efficacy of their long-context capabilities.
Alignment with Human Preferences: The paper emphasizes the importance of aligning LLMs with human preferences during training, which MiniMax addresses with techniques such as Importance Sampling Weight Clipping and KL Divergence Optimization (see the loss sketch below).
"To address this issue, we implement additional clipping that abandoned this case in the loss function, which effectively regulates the importance sampling magnitude and mitigates noise propagation."
Real-World Applications: MiniMax showcases the practical application of their models in various tasks, including long-context translation, summarizing long papers with figures, and multi-modal question answering.
Conclusion:
MiniMax's research makes a significant contribution to the field of LLMs and VLMs by scaling models to a 1 million token context window and making long-context processing efficient through the Lightning Attention mechanism and hybrid architecture. This work paves the way for more powerful and efficient models capable of handling real-world applications that demand extensive context understanding and multi-modal capabilities.