For the Next Generation of Attention: I Propose LLD for Latent Dynamic Forget
Author: Xiaol.x
Uploaded: 2026-01-28
Views: 106
Description:
This video is a full visual and mathematical journey through the evolution of linear attention update rules, ending with a new proposal: LLD – Latent Low‑Rank Delta, a mechanism for latent dynamic forget designed for the next generation of attention models.
We start from the basics: what the state matrix S_t actually represents, and how classic linear attention simply accumulates information over time. Then we walk through the main families of update rules (a compact code sketch of each follows the list below):
Pure Accumulation (LA): infinite memory but unstable.
Decay Mechanisms (RetNet, Mamba2, GLA, HGRN2): passive forgetting through scalar or channel‑wise decay.
Geometric Erasure / Coupled Forgetting (Longhorn, GDN, KDA): “erase what you write”, but locked to the input key.
Decoupled Erasure (Comba, RWKV‑7): learned erase vectors, powerful accumulation but still struggling with clean, targeted reset.
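As a rough reference (my own simplified sketch, not the video's notation; gating details and conventions vary across these papers), the four families can be written in a few lines of numpy, with the state S mapping key space (rows) to value space (columns):

import numpy as np

d_k, d_v = 8, 8
S = np.zeros((d_k, d_v))                           # state matrix: rows index the key dim, columns the value dim
k = np.random.randn(d_k); k /= np.linalg.norm(k)   # current key (unit-normalized)
v = np.random.randn(d_v)                           # current value
I = np.eye(d_k)

# 1) Pure accumulation (LA): write forever, never forget
S_la = S + np.outer(k, v)

# 2) Decay (RetNet: scalar; Mamba2/GLA/HGRN2: per-channel gates): passive forgetting
gamma = 0.97
alpha = np.random.uniform(0.9, 1.0, d_k)
S_retnet = gamma * S + np.outer(k, v)
S_gla = np.diag(alpha) @ S + np.outer(k, v)

# 3) Coupled forgetting / delta-style rule (Longhorn, GDN, KDA): erase along the input key
beta = 0.5
S_delta = (I - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

# 4) Decoupled erasure (Comba, RWKV-7 flavour): erase pair (a, b) is learned, not forced to be k
a = np.random.randn(d_k); a /= np.linalg.norm(a)
b = np.random.randn(d_k); b /= np.linalg.norm(b)
S_decoupled = (I - np.outer(a, b)) @ S + np.outer(k, v)

Reading from the state looks the same in all four cases (roughly o_t = S_t^T q_t); what distinguishes the families is only how the write/erase step updates S.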
In the second half, we introduce LLD as a new state update rule:
S_new = (I - λ_t * u_t * v_t^T) * S_old
where the low‑rank pair (u_t, v_t) is produced by a latent bottleneck, not tied directly to the input key (a minimal sketch of this update follows the list below). Through an animated “signal vs noise” scenario, you’ll see how LLD can:
Keep early signal strong in certain channels.
Perform cross‑channel, targeted erasure of later noise.
Combine the benefits of accumulation (learning) and precise reset (forgetting) in a single linear mechanism.
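Here is a minimal, hypothetical numpy sketch of that update; the bottleneck is a stand-in two-layer projection with illustrative names (W_down, W_up_u, W_up_v), and the video does not spell out the exact parameterization:

import numpy as np

def lld_step(S, x, W_down, W_up_u, W_up_v, lam):
    # Latent bottleneck: compress the input, then expand into a low-rank erase pair.
    z = np.tanh(W_down @ x)                # latent code (assumed nonlinearity)
    u = W_up_u @ z                         # where to subtract in the state's row space
    v = W_up_v @ z                         # what to match: v^T S reads a combination of S's rows
    u = u / (np.linalg.norm(u) + 1e-8)
    v = v / (np.linalg.norm(v) + 1e-8)
    I = np.eye(S.shape[0])
    # S_new = (I - lam * u v^T) S_old: targeted, cross-channel erasure
    # (lam stands in for the per-step gate λ_t in the full mechanism)
    return (I - lam * np.outer(u, v)) @ S

d_model, d_latent, d_k, d_v = 16, 4, 8, 8
rng = np.random.default_rng(0)
S_old = rng.standard_normal((d_k, d_v))
x_t = rng.standard_normal(d_model)
W_down = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_model)
W_up_u = rng.standard_normal((d_k, d_latent)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_k, d_latent)) / np.sqrt(d_latent)

S_new = lld_step(S_old, x_t, W_down, W_up_u, W_up_v, lam=0.8)

Because (u_t, v_t) is read out of a latent code rather than copied from the key, the erase can target a mixture of channels (for example, the ones holding later noise) while leaving the channels carrying early signal untouched.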
We conclude with a forensic heatmap analysis comparing Softmax, RWKV‑7, KDA, and LLD under the same stress test. By zooming into specific regions of the heatmap, you’ll see:
Softmax as the ideal reference (perfect diagonal, clean noise suppression).
RWKV‑7 as a strong accumulator that also hoards noise.
KDA leaving “ghost memories” and partially washing out signal.
LLD preserving rich signal while cleanly erasing noise across channels.
This video is for you if you care about how modern attention and state‑space models really manage memory, and you want a concrete, visual argument for why latent dynamic forget via LLD is a promising direction for future architectures.