DeepSeek Engram: Conditional Memory via Scalable Lookup: A New Sparsity Axis for LLMs

Автор: MillionScope

Загружено: 2026-01-13

Просмотров: 112

Описание: Conditional Memory via Scalable Lookup: A New Sparsity Axis for LLMs

Abstract

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack native knowledge lookup, forcing inefficient retrieval simulation. We introduce conditional memory as a complementary sparsity axis through Engram—a modernized n-gram embedding module.

We identify a U-shaped scaling law optimizing the trade-off between neural computation (MoE) and static memory (Engram). Our 27B-parameter Engram model surpasses iso-parameter/iso-FLOPs MoE baselines, with unexpected gains in reasoning (BBH, ARC-Challenge) and code/math (HumanEval, MATH) beyond knowledge tasks (MMLU, CMMLU).

Engram relieves early layers from static reconstruction, effectively deepening the network. By delegating local dependencies to lookups, it frees attention for global context, boosting long-context retrieval. Deterministic addressing enables prefetching from host memory with negligible overhead.

1. Introduction

Language modeling involves two distinct tasks:
1. Compositional reasoning: Requires deep, dynamic computation
2. Knowledge retrieval: Local, static patterns (entities, formulas)

Transformers simulate retrieval through expensive computation—reconstructing static lookup tables at runtime. We propose Engram, a conditional memory module combining classic n-gram structure with modern adaptations: tokenizer compression, multi-head hashing, contextualized gating, and multi-branch integration.

2. Architecture

Sparse Retrieval: Multi-head hashing maps contexts to embedding tables. Vocabulary projection reduces size by ~23% by collapsing semantically equivalent tokens.

Context-aware Gating: Attention-inspired mechanism uses current hidden states as queries against retrieved memory (keys/values), producing scalar gates that weight contributions.

Multi-branch Integration: Parameter-sharing strategy: single embedding table shared across branches, with branch-specific gating matrices.

System Efficiency: Deterministic addressing enables asynchronous prefetching from host DRAM during GPU computation of preceding layers. Zipfian n-gram distribution allows effective multi-level caching.

3. Scaling Laws

Sparsity Allocation Problem: How to distribute fixed parameter budgets between MoE experts and Engram memory?

Key Finding: U-shaped relationship between validation loss and allocation ratio α. Pure MoE (α=1) is suboptimal; allocating ~20-25% to Engram (α≈0.75) yields best performance.

Infinite Memory Regime: When memory budget is relaxed, scaling memory slots improves validation loss following a power law, confirming conditional memory as a scalable sparsity axis.

4. Large-Scale Pre-training (262B tokens)

MoE-27B: 26.7B params (72 experts)
Engram-27B: 26.7B params (55 experts + 5.7B memory)
Engram-40B: 18.5B memory parameters

Result: Engram-27B consistently outperforms iso-parameter MoE-27B, with massive gains in reasoning and code/math—not just knowledge retrieval.

5. Long Context (32k via YaRN)

Offloading local dependencies to lookups preserves attention capacity for global context.

6. Analysis

Effective Depth: LogitLens shows Engram achieves faster prediction convergence in early layers. CKA alignment reveals Engram Layer 5 ≈ MoE Layer 12—confirming increased effective depth.

Sensitivity Analysis:
Factual knowledge: 71% degradation when Engram suppressed
Reading comprehension: Only 7% degradation

System Efficiency: 100B-parameter Engram offloading incurs less than 1% throughput penalty via communication-computation overlap.

Gating Visualization: Strong activation on multi-token entities ("Alexander the Great") and formulaic phrases ("By the way").

7. Conclusion

We introduce conditional memory via Engram, establishing U-shaped scaling laws for Sparsity Allocation. Hybrid MoE+Engram strictly outperforms pure MoE. Engram-27B achieves superior efficiency by relieving backbones of static reconstruction and freeing attention for global reasoning. With zero-overhead offloading capability, conditional memory represents an indispensable primitive for next-generation sparse models.

Не удается загрузить Youtube-плеер. Проверьте блокировку Youtube в вашей сети.
Повторяем попытку...

DeepSeek Engram: Conditional Memory via Scalable Lookup: A New Sparsity Axis for LLMs

Доступные форматы для скачивания:

Скачать видео

Информация по загрузке:

Скачать аудио

Похожие видео

Что такое энграмма DeepSeek?

Что такое энграмма DeepSeek?

Самая сложная модель из тех, что мы реально понимаем

Самая сложная модель из тех, что мы реально понимаем

Когда газовая промышленность потерпела крах, мы выживали на солевых газах.

Когда газовая промышленность потерпела крах, мы выживали на солевых газах.

🧪🧪🧪🧪Как увидеть гиперпространство (4-е измерение)

🧪🧪🧪🧪Как увидеть гиперпространство (4-е измерение)

Elon Musk Makes Shocking Future Predictions At The World Economic Forum In Davos

Elon Musk Makes Shocking Future Predictions At The World Economic Forum In Davos

China’s Next AI Shock Is Hardware

China’s Next AI Shock Is Hardware

Разработка с помощью Gemini 3, AI Studio, Antigravity и Nano Banana | Подкаст Agent Factory

Разработка с помощью Gemini 3, AI Studio, Antigravity и Nano Banana | Подкаст Agent Factory

Understanding the Discrete Fourier Transform and the FFT

Understanding the Discrete Fourier Transform and the FFT

Насколько мы близки к созданию твердотельных батарей?

Насколько мы близки к созданию твердотельных батарей?

Как я автоматизировал NotebookLM с помощью Claude Code и Telegram

Как я автоматизировал NotebookLM с помощью Claude Code и Telegram

Function Calling with Spring AI

Function Calling with Spring AI

Как внимание стало настолько эффективным [GQA/MLA/DSA]

Как внимание стало настолько эффективным [GQA/MLA/DSA]

DeepSeek's Engram: The Memory Trick That Changes Everything

DeepSeek's Engram: The Memory Trick That Changes Everything

Экспресс-курс RAG для начинающих

Экспресс-курс RAG для начинающих

Гипотеза Пуанкаре — Алексей Савватеев на ПостНауке

Гипотеза Пуанкаре — Алексей Савватеев на ПостНауке

Excel против Power BI против SQL против Python | Сравнение на фондовом рынке

Excel против Power BI против SQL против Python | Сравнение на фондовом рынке

Conversation with Elon Musk | World Economic Forum Annual Meeting 2026

Conversation with Elon Musk | World Economic Forum Annual Meeting 2026

GPT Image 1.5 vs Nano Banana Pro — кто реально работает для бизнеса

GPT Image 1.5 vs Nano Banana Pro — кто реально работает для бизнеса

Объяснение mHC: как DeepSeek перестраивает программы магистратуры в области прикладных наук (LLM)...

Объяснение mHC: как DeepSeek перестраивает программы магистратуры в области прикладных наук (LLM)...

Алгоритмический скальпель: как Python помогает находить и использовать рыночные неэффективности

Алгоритмический скальпель: как Python помогает находить и использовать рыночные неэффективности