Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Author: Summarize that research paper for me!
Uploaded: 2025-09-09
Views: 179
Description:
Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Source: https://arxiv.org/pdf/2507.10524
Summary:
This paper introduces Mixture-of-Recursions (MoR), a framework designed to address the heavy compute and memory demands of scaling large language models (LLMs). MoR unifies parameter sharing and adaptive computation within a single Recursive Transformer architecture, aiming to deliver large-model quality without the usual training and inference costs.
Key mechanisms of MoR include:
• Shared Layer Stack: MoR reuses a shared stack of layers across recursion steps (layer tying) for substantial parameter efficiency. The "Middle-Cycle" strategy, which keeps distinct first and last layers while tying the intermediate weights, was the most effective sharing scheme tested (a minimal sketch follows this list).
• Lightweight Routers: These routers enable adaptive token-level computation by dynamically assigning a recursion depth to each token, so compute is directed where it is most needed; semantically important tokens, for instance, typically undergo more recursion steps. The paper explores two routing strategies:
◦ Expert-choice routing: At each recursion step, the router selects a top-k subset of tokens to continue processing. This guarantees perfect load balancing but can violate causality during training, which the paper mitigates with an auxiliary loss. An expert-choice router with an auxiliary loss and a simple linear architecture performed best (sketched after this list).
◦ Token-choice routing: Each token's full compute path is fixed upfront by assigning it a recursion depth. This avoids the causality issue but can cause load imbalance, typically requiring a balancing loss.
• Efficient KV Caching Strategies: MoR introduces two methods for managing Key-Value (KV) cache memory and I/O (both sketched after this list):
◦ Recursion-wise KV caching: KV pairs are cached only for the tokens active at a given recursion step, and attention is restricted to those entries. This cuts KV memory, I/O, and attention FLOPs, and is generally the better choice for accuracy when token routing is precise.
◦ Recursive KV sharing: KV pairs computed at the first recursion step are cached and reused at every subsequent step. This yields maximal memory savings and lower prefill latency, making it attractive when memory efficiency is the priority, though it can slightly reduce performance under expert-choice routing.
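To make the "Middle-Cycle" idea concrete, here is a minimal PyTorch sketch of a recursive stack with distinct first and last layers and one tied middle block. The block type, sizes, and the use of a single shared middle layer (rather than a shared multi-layer stack) are simplifying assumptions, not the paper's exact architecture.

```python
# Minimal sketch of "Middle-Cycle" parameter sharing; sizes are illustrative.
import torch
import torch.nn as nn

class MiddleCycleRecursiveStack(nn.Module):
    """Distinct first/last layers; one shared middle block reused N_r times."""
    def __init__(self, d_model=512, n_heads=8, num_recursions=3):
        super().__init__()
        def make_block():
            return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.first = make_block()          # unique weights
        self.shared_middle = make_block()  # tied weights, reused every step
        self.last = make_block()           # unique weights
        self.num_recursions = num_recursions

    def forward(self, x):
        x = self.first(x)
        for _ in range(self.num_recursions):  # same parameters, applied repeatedly
            x = self.shared_middle(x)
        return self.last(x)

h = MiddleCycleRecursiveStack()(torch.randn(2, 16, 512))
print(h.shape)  # torch.Size([2, 16, 512])
```

The recursion loop is where the parameter savings come from: depth grows with num_recursions while the unique-parameter count stays fixed at three blocks.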
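The next sketch illustrates expert-choice routing with a linear router under simplified assumptions: for clarity it runs the shared block on all tokens and merely masks the update, whereas the actual method processes only the selected tokens to save compute. The sigmoid gating and the fixed capacity fraction are illustrative choices, not the paper's exact formulation.

```python
# Sketch of expert-choice routing: a linear router scores tokens and the
# top-k continue recursing. Capacity and gating here are assumptions.
import torch
import torch.nn as nn

class ExpertChoiceRecursion(nn.Module):
    def __init__(self, d_model=512, n_heads=8, num_recursions=3, capacity=0.5):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 1)  # one score per token per step
        self.num_recursions = num_recursions
        self.capacity = capacity             # fraction of tokens kept per step

    def forward(self, x):                    # x: (batch, seq, d_model)
        B, T, _ = x.shape
        for _ in range(self.num_recursions):
            scores = self.router(x).squeeze(-1)       # (B, T)
            k = max(1, int(T * self.capacity))
            topk = scores.topk(k, dim=-1).indices     # tokens that recurse
            mask = torch.zeros(B, T, dtype=torch.bool, device=x.device)
            mask.scatter_(1, topk, True)
            # Gate the block's output by the router score so routing stays
            # differentiable; non-selected tokens pass through unchanged.
            y = self.shared_block(x)
            g = torch.sigmoid(scores).unsqueeze(-1)
            x = torch.where(mask.unsqueeze(-1), x + g * (y - x), x)
        return x

out = ExpertChoiceRecursion()(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Because top-k is taken per step rather than per token, each step processes exactly k tokens, which is the perfect load balancing the summary refers to.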
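Finally, a rough sketch contrasting the two caching strategies, using toy single-head attention and a hand-written activity schedule in place of real router decisions; all names and shapes are illustrative.

```python
# Toy contrast of the two KV-cache strategies; single-head attention only.
import torch
import torch.nn as nn

d = 64
proj_q, proj_k, proj_v = (nn.Linear(d, d) for _ in range(3))

def attention(q, k, v):
    w = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return w @ v

def recursion_wise(x, num_recursions, active_fn):
    """Recursion-wise caching: K/V exist only for tokens active at each
    step, and attention is restricted to those entries."""
    for step in range(num_recursions):
        idx = active_fn(step, x)              # tokens routed to this step
        h = x[:, idx]
        k, v = proj_k(h), proj_v(h)           # small, step-local KV cache
        x[:, idx] = attention(proj_q(h), k, v)
    return x

def recursive_sharing(x, num_recursions):
    """Recursive KV sharing: K/V from the first step are reused by every
    later step (maximal memory savings, lower prefill cost)."""
    k, v = proj_k(x), proj_v(x)               # computed and cached once
    for _ in range(num_recursions):
        x = attention(proj_q(x), k, v)        # every step reuses the same K/V
    return x

x = torch.randn(2, 16, d)
# Toy schedule: fewer tokens stay active at deeper steps (router stands in).
toy_schedule = lambda step, seq: torch.arange(seq.shape[1] // (step + 1))
print(recursion_wise(x.clone(), 3, toy_schedule).shape)  # torch.Size([2, 16, 64])
print(recursive_sharing(x, 3).shape)                     # torch.Size([2, 16, 64])
```

The trade-off is visible in the code: recursion-wise caching stores fresh K/V per step but only for active tokens, while recursive sharing stores K/V once for all tokens and never recomputes them.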
Empirical Validation and Benefits: Across model scales from 135M to 1.7B parameters, MoR consistently establishes a new Pareto frontier: at an equal training-FLOPs budget it lowers validation perplexity and improves few-shot accuracy over vanilla and existing recursive baselines while using roughly one-third fewer unique parameters. Its smaller KV cache, combined with continuous depth-wise batching (sketched below), yields up to a 2.06× inference-throughput speedup. MoR also scales well, matching or exceeding vanilla Transformers at larger scales (≥360M parameters) despite the parameter reduction, and enables test-time scaling: generation quality can be improved by allocating more recursion steps during inference.
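The scheduling idea behind continuous depth-wise batching can be sketched in a few lines: because every recursion step reuses the same shared block, tokens at different depths can ride in one batch, and a token that finishes early frees its slot for the next request. The queue mechanics below are an illustrative toy, not the paper's implementation.

```python
# Toy scheduler showing why depth-wise batching keeps batches full: tokens
# at different recursion depths share each shared-block forward pass.
from collections import deque

def depthwise_batches(requests, max_batch=4):
    """requests: list of (token_id, assigned_depth). Yields the token batch
    that one shared-block forward pass would process at each step."""
    queue = deque(requests)
    active = []                                   # (token_id, remaining_depth)
    while queue or active:
        while queue and len(active) < max_batch:  # refill freed slots at once
            active.append(queue.popleft())
        yield [tok for tok, _ in active]          # one shared-block pass
        active = [(t, depth - 1) for t, depth in active if depth > 1]

for batch in depthwise_batches([("a", 1), ("b", 3), ("c", 2), ("d", 2), ("e", 1)]):
    print(batch)
# ['a', 'b', 'c', 'd']
# ['b', 'c', 'd', 'e']   <- "e" immediately takes the slot "a" freed
# ['b']
```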
#MixtureOfRecursions #MoRTransformer #RecursiveTransformers #LanguageModels #LLMs #AdaptiveComputation #ParameterEfficiency #NeuralNetworks #DeepLearning #AI #ModelEfficiency #ComputationalEfficiency #MemoryEfficiency #KVcaching #DynamicDepth #TokenLevelComputation #ThroughputImprovement #FLOPsReduction #ScalableAI #ModelOptimization #InferenceOptimization #TrainingEfficiency #ParetoFrontier #HighPerformanceAI #Transformers #LayerTying #WeightSharing #Routers #ExpertChoiceRouting #TokenChoiceRouting #RecursionWiseCaching #RecursiveKVSharing #ContinuousBatching #LlamaBasedArchitecture #LatentReasoning #LargeModelQuality #ReducedCostAI #FewShotLearning #PerplexityReduction #GenerativeAI #LLMDeployment #FutureOfAI #AIResearch #MachineLearning #ArtificialIntelligence #TechInnovation #ComputerScience #ResearchPaper