Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Author: Summarize that research paper for me!
Uploaded: 2025-09-09
Views: 179
Description:
Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Source: https://arxiv.org/pdf/2507.10524
Summary:
This paper introduces Mixture-of-Recursions (MoR), a framework designed to address the heavy compute and memory demands of scaling large language models (LLMs). MoR unifies parameter sharing and adaptive computation within a single Recursive Transformer architecture, aiming to deliver large-model quality without the usual training and inference costs.
Key mechanisms of MoR include:
• Shared Layer Stack: MoR reuses a shared stack of layers across recursion steps (layer tying) for substantial parameter efficiency. The "Middle-Cycle" strategy, which keeps distinct first and last layers while tying the intermediate weights, was the most effective sharing scheme tested (a minimal sketch follows this list).
• Lightweight Routers: These routers enable adaptive token-level computation by dynamically assigning a recursion depth to each token, so compute is directed where it is most needed; semantically important tokens, for instance, typically undergo more recursion steps. The paper explores two routing strategies:
◦ Expert-choice routing: At each recursion step, the router selects a top-k subset of tokens to continue processing. This guarantees perfect load balancing but can violate causality during training, which the paper mitigates with an auxiliary loss. An expert-choice router with an auxiliary loss and a simple linear architecture performed best (sketched after this list).
◦ Token-choice routing: Each token's full compute path is fixed upfront by assigning it a recursion depth. This avoids the causality issue but can cause load imbalance, typically requiring a balancing loss.
• Efficient KV Caching Strategies: MoR introduces two methods for managing Key-Value (KV) cache memory and I/O (both sketched after this list):
◦ Recursion-wise KV caching: KV pairs are cached only for the tokens active at a given recursion step, and attention is restricted to those entries. This cuts KV memory, I/O, and attention FLOPs, and is generally the better choice for accuracy when token routing is precise.
◦ Recursive KV sharing: KV pairs computed at the first recursion step are cached and reused at every subsequent step. This yields maximal memory savings and lower prefill latency, making it attractive when memory efficiency is the priority, though it can slightly reduce performance under expert-choice routing.
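To make the "Middle-Cycle" idea concrete, here is a minimal PyTorch sketch of a recursive stack with distinct first and last layers and one tied middle block. The block type, sizes, and the use of a single shared middle layer (rather than a shared multi-layer stack) are simplifying assumptions, not the paper's exact architecture.

```python
# Minimal sketch of "Middle-Cycle" parameter sharing; sizes are illustrative.
import torch
import torch.nn as nn

class MiddleCycleRecursiveStack(nn.Module):
    """Distinct first/last layers; one shared middle block reused N_r times."""
    def __init__(self, d_model=512, n_heads=8, num_recursions=3):
        super().__init__()
        def make_block():
            return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.first = make_block()          # unique weights
        self.shared_middle = make_block()  # tied weights, reused every step
        self.last = make_block()           # unique weights
        self.num_recursions = num_recursions

    def forward(self, x):
        x = self.first(x)
        for _ in range(self.num_recursions):  # same parameters, applied repeatedly
            x = self.shared_middle(x)
        return self.last(x)

h = MiddleCycleRecursiveStack()(torch.randn(2, 16, 512))
print(h.shape)  # torch.Size([2, 16, 512])
```

The recursion loop is where the parameter savings come from: depth grows with num_recursions while the unique-parameter count stays fixed at three blocks.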
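The next sketch illustrates expert-choice routing with a linear router under simplified assumptions: for clarity it runs the shared block on all tokens and merely masks the update, whereas the actual method processes only the selected tokens to save compute. The sigmoid gating and the fixed capacity fraction are illustrative choices, not the paper's exact formulation.

```python
# Sketch of expert-choice routing: a linear router scores tokens and the
# top-k continue recursing. Capacity and gating here are assumptions.
import torch
import torch.nn as nn

class ExpertChoiceRecursion(nn.Module):
    def __init__(self, d_model=512, n_heads=8, num_recursions=3, capacity=0.5):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 1)  # one score per token per step
        self.num_recursions = num_recursions
        self.capacity = capacity             # fraction of tokens kept per step

    def forward(self, x):                    # x: (batch, seq, d_model)
        B, T, _ = x.shape
        for _ in range(self.num_recursions):
            scores = self.router(x).squeeze(-1)       # (B, T)
            k = max(1, int(T * self.capacity))
            topk = scores.topk(k, dim=-1).indices     # tokens that recurse
            mask = torch.zeros(B, T, dtype=torch.bool, device=x.device)
            mask.scatter_(1, topk, True)
            # Gate the block's output by the router score so routing stays
            # differentiable; non-selected tokens pass through unchanged.
            y = self.shared_block(x)
            g = torch.sigmoid(scores).unsqueeze(-1)
            x = torch.where(mask.unsqueeze(-1), x + g * (y - x), x)
        return x

out = ExpertChoiceRecursion()(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Because top-k is taken per step rather than per token, each step processes exactly k tokens, which is the perfect load balancing the summary refers to.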
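Finally, a rough sketch contrasting the two caching strategies, using toy single-head attention and a hand-written activity schedule in place of real router decisions; all names and shapes are illustrative.

```python
# Toy contrast of the two KV-cache strategies; single-head attention only.
import torch
import torch.nn as nn

d = 64
proj_q, proj_k, proj_v = (nn.Linear(d, d) for _ in range(3))

def attention(q, k, v):
    w = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return w @ v

def recursion_wise(x, num_recursions, active_fn):
    """Recursion-wise caching: K/V exist only for tokens active at each
    step, and attention is restricted to those entries."""
    for step in range(num_recursions):
        idx = active_fn(step, x)              # tokens routed to this step
        h = x[:, idx]
        k, v = proj_k(h), proj_v(h)           # small, step-local KV cache
        x[:, idx] = attention(proj_q(h), k, v)
    return x

def recursive_sharing(x, num_recursions):
    """Recursive KV sharing: K/V from the first step are reused by every
    later step (maximal memory savings, lower prefill cost)."""
    k, v = proj_k(x), proj_v(x)               # computed and cached once
    for _ in range(num_recursions):
        x = attention(proj_q(x), k, v)        # every step reuses the same K/V
    return x

x = torch.randn(2, 16, d)
# Toy schedule: fewer tokens stay active at deeper steps (router stands in).
toy_schedule = lambda step, seq: torch.arange(seq.shape[1] // (step + 1))
print(recursion_wise(x.clone(), 3, toy_schedule).shape)  # torch.Size([2, 16, 64])
print(recursive_sharing(x, 3).shape)                     # torch.Size([2, 16, 64])
```

The trade-off is visible in the code: recursion-wise caching stores fresh K/V per step but only for active tokens, while recursive sharing stores K/V once for all tokens and never recomputes them.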
Empirical Validation and Benefits: Across model scales from 135M to 1.7B parameters, MoR consistently establishes a new Pareto frontier: at an equal training-FLOPs budget it lowers validation perplexity and improves few-shot accuracy over vanilla and existing recursive baselines while using roughly one-third fewer unique parameters. Its smaller KV cache, combined with continuous depth-wise batching (sketched below), yields up to a 2.06× inference-throughput speedup. MoR also scales well, matching or exceeding vanilla Transformers at larger scales (≥360M parameters) despite the parameter reduction, and enables test-time scaling: generation quality can be improved by allocating more recursion steps during inference.
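The scheduling idea behind continuous depth-wise batching can be sketched in a few lines: because every recursion step reuses the same shared block, tokens at different depths can ride in one batch, and a token that finishes early frees its slot for the next request. The queue mechanics below are an illustrative toy, not the paper's implementation.

```python
# Toy scheduler showing why depth-wise batching keeps batches full: tokens
# at different recursion depths share each shared-block forward pass.
from collections import deque

def depthwise_batches(requests, max_batch=4):
    """requests: list of (token_id, assigned_depth). Yields the token batch
    that one shared-block forward pass would process at each step."""
    queue = deque(requests)
    active = []                                   # (token_id, remaining_depth)
    while queue or active:
        while queue and len(active) < max_batch:  # refill freed slots at once
            active.append(queue.popleft())
        yield [tok for tok, _ in active]          # one shared-block pass
        active = [(t, depth - 1) for t, depth in active if depth > 1]

for batch in depthwise_batches([("a", 1), ("b", 3), ("c", 2), ("d", 2), ("e", 1)]):
    print(batch)
# ['a', 'b', 'c', 'd']
# ['b', 'c', 'd', 'e']   <- "e" immediately takes the slot "a" freed
# ['b']
```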
#MixtureOfRecursions #MoRTransformer #RecursiveTransformers #LanguageModels #LLMs #AdaptiveComputation #ParameterEfficiency #NeuralNetworks #DeepLearning #AI #ModelEfficiency #ComputationalEfficiency #MemoryEfficiency #KVcaching #DynamicDepth #TokenLevelComputation #ThroughputImprovement #FLOPsReduction #ScalableAI #ModelOptimization #InferenceOptimization #TrainingEfficiency #ParetoFrontier #HighPerformanceAI #Transformers #LayerTying #WeightSharing #Routers #ExpertChoiceRouting #TokenChoiceRouting #RecursionWiseCaching #RecursiveKVSharing #ContinuousBatching #LlamaBasedArchitecture #LatentReasoning #LargeModelQuality #ReducedCostAI #FewShotLearning #PerplexityReduction #GenerativeAI #LLMDeployment #FutureOfAI #AIResearch #MachineLearning #ArtificialIntelligence #TechInnovation #ComputerScience #ResearchPaper