<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>A New Axis of Sparsity for Large Language Models</title>
</head>
<body>
<h1>A New Axis of Sparsity for Large Language Models</h1>
<p>While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack
a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through
computation. To address this, we introduce conditional memory as a complementary sparsity
axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1)
lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law
that optimizes the trade-off between neural computation (MoE) and static memory (Engram).
Guided by this law, we scale Engram to 27B parameters, achieving superior performance
over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory
module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe
even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math
domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves
the backbone’s early layers from static reconstruction, effectively deepening the network for
complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up
attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH:
84.2 → 97.0). Finally, Engram establishes infrastructure-aware efficiency: its
deterministic addressing enables runtime prefetching from host memory, incurring negligible
overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse
models.
</p>
<h2>Introduction</h2>
<p>
Sparsity is a recurring design principle for intelligent systems, spanning from biological neural
circuits (Lennie, 2003; Olshausen and Field, 1997) to modern Large Language Models (LLMs).
Currently, this principle is primarily realized through Mixture-of-Experts (MoE) (Dai et al., 2024;
Shazeer et al., 2017), which scales capacity via conditional computation. Owing to its ability to
drastically increase model size without proportional increases in compute, MoE has become the
de facto standard for frontier models (Comanici et al., 2025; Guo et al., 2025; Team et al., 2025).
</p>
<p>
Despite the success of this conditional computation paradigm, the intrinsic heterogeneity
of linguistic signals suggests significant room for structural optimization. Specifically, language
modeling entails two qualitatively different sub-tasks: compositional reasoning and knowledge
retrieval. <!-- arXiv:2601.07372v1 [cs.CL] 12 Jan 2026 --> While the former demands deep, dynamic computation, a substantial portion
of text—such as named entities and formulaic patterns—is local, static, and highly stereotyped (Constant et al.,
2017; Erman, 2000). The effectiveness of classical N-gram models (Brants
et al., 2007; Liu et al., 2024b; Nguyen, 2024) in capturing such local dependencies implies that
these regularities are naturally represented as computationally inexpensive lookups. Since
standard Transformers (Vaswani et al., 2017) lack a native knowledge lookup primitive, current
LLMs are forced to simulate retrieval through computation. For instance, resolving a common
multi-token entity requires consuming multiple early layers of attention and feed-forward networks
(Ghandeharioun et al., 2024; Jin et al., 2025) (see Table 3). This process essentially amounts
to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential
depth on trivial operations that could otherwise be allocated to higher-level reasoning.
</p>
<p>
To align model architecture with this linguistic duality, we advocate for a complementary
axis of sparsity: conditional memory. Whereas conditional computation sparsely activates
parameters to process dynamic logic (Bengio et al., 2013; Shazeer et al., 2017), conditional
memory relies on sparse lookup operations to retrieve static embeddings for fixed knowledge.
As a preliminary exploration of this paradigm, we revisit N-gram embeddings (Bojanowski et al.,
2017) as a canonical instantiation: local context serves as a key to index a massive embedding
table via constant-time O(1) lookups (Huang et al., 2025a; Pagnoni et al., 2025; Tito Svenstrup
et al., 2017; Yu et al., 2025). Our investigation reveals that, perhaps surprisingly, this static
retrieval mechanism can serve as an ideal complement to modern MoE architecture—but
only if it is properly designed. In this paper, we propose Engram, a conditional memory
module grounded in the classic N-gram structure but equipped with modern adaptations
such as tokenizer compression, multi-head hashing, contextualized gating, and multi-branch
integration (detailed in Section 2).
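</p>
<p>To make the lookup-and-gate pattern concrete, the following minimal sketch combines hashed N-gram addressing (multi-head hashing) with context-aware blending. The hash function, table size, head count, embedding width, and scalar gate used here are illustrative assumptions, not Engram's actual design:</p>

```python
# Illustrative sketch of a conditional-memory lookup in the spirit of Engram.
# Hash function, table size, head count, width, and gating form are assumed.
import hashlib

VOCAB_ROWS = 2 ** 20   # rows in the hashed N-gram embedding table (assumed)
NUM_HEADS = 4          # multi-head hashing: several independent probes per gram
DIM = 8                # toy embedding width

def ngram_ids(tokens, n=2, num_heads=NUM_HEADS, rows=VOCAB_ROWS):
    """Map each length-n suffix of `tokens` to `num_heads` deterministic row IDs."""
    ids = []
    for i in range(n - 1, len(tokens)):
        gram = "\x1f".join(tokens[i - n + 1 : i + 1])
        ids.append([
            int(hashlib.blake2b(f"{h}:{gram}".encode(), digest_size=8).hexdigest(), 16) % rows
            for h in range(num_heads)
        ])
    return ids

def lookup_and_gate(table, head_ids, hidden, gate):
    """Average the probed rows, then blend with the hidden state via a scalar gate."""
    mem = [0.0] * DIM
    for row_id in head_ids:
        row = table(row_id)
        mem = [m + r / len(head_ids) for m, r in zip(mem, row)]
    return [gate * m + (1.0 - gate) * h for m, h in zip(mem, hidden)]

# A stand-in "table": deterministic pseudo-rows derived from the row ID, so the
# sketch runs without materializing 2**20 real embedding rows.
def toy_row(row_id):
    return [float((row_id >> k) & 1) for k in range(DIM)]

ids = ngram_ids(["the", "golden", "gate", "bridge"], n=2)
fused = lookup_and_gate(toy_row, ids[-1], hidden=[0.5] * DIM, gate=0.25)
print(len(ids), len(fused))
```

<p>
Because the row IDs are a pure function of the local token window, retrieval involves no learned routing and costs O(1) per token.
</p>
<p>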
To quantify the synergy between these two primitives, we formulate the Sparsity Allocation
problem: given a fixed total parameter budget, how should capacity be distributed between
MoE experts and Engram memory? Our experiments uncover a distinct U-shaped scaling
law, revealing that even simple lookup mechanisms, when treated as a first-class modeling
primitive, act as essential complements to neural computation. Guided by this allocation law, we
scale Engram to a 27B-parameter model. Compared to a strictly iso-parameter and iso-FLOPs
MoE baseline, Engram-27B achieves superior efficiency across diverse domains. Crucially, the
gains are not limited to knowledge-intensive tasks (e.g., MMLU: +3.4; CMMLU: +4.0; MMLU-Pro: +1.8), where memory
capacity is intuitively beneficial; we observe even more significant
improvements in general reasoning (e.g., BBH: +5.0; ARC-Challenge: +3.7; DROP: +3.3) and
code/math domains (e.g., HumanEval: +3.0; MATH: +2.4; GSM8K: +2.2).
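</p>
<p>The accounting behind the Sparsity Allocation problem can be sketched in a few lines: a fixed parameter budget is split between expert parameters and memory rows, and each candidate memory fraction fixes how many experts and rows the model can afford. The per-expert size and row width below are illustrative assumptions, not the paper's configuration:</p>

```python
# Toy accounting for the Sparsity Allocation problem: split a fixed parameter
# budget between MoE experts and an Engram-style embedding table.
TOTAL_BUDGET = 27_000_000_000   # total parameters (e.g., a 27B model)
EXPERT_PARAMS = 50_000_000      # parameters per MoE expert (assumed)
EMBED_DIM = 1024                # width of one memory row (assumed)

def allocate(memory_fraction, budget=TOTAL_BUDGET):
    """Return (num_experts, num_memory_rows) for a given memory fraction."""
    mem_params = int(budget * memory_fraction)
    moe_params = budget - mem_params
    return moe_params // EXPERT_PARAMS, mem_params // EMBED_DIM

for frac in (0.0, 0.1, 0.3, 0.5):
    experts, rows = allocate(frac)
    print(f"memory fraction {frac:.1f}: {experts} experts, {rows:,} memory rows")
```

<p>
The U-shaped law says neither endpoint of this sweep is optimal: loss is minimized at some interior memory fraction, where lookups and experts complement each other.
</p>
<p>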
Mechanistic analysis via LogitLens (nostalgebraist, 2020) and CKA (Hendrycks et al., 2021a)
reveals the source of these gains: Engram relieves the backbone from reconstructing static
knowledge in early layers, thereby increasing effective depth available for complex reasoning. Furthermore, by
delegating local dependencies to lookups, Engram frees up attention
capacity to focus on global context, enabling exceptional performance in long-context scenarios—substantially
outperforming baselines on LongPPL (Fang et al.) and RULER (Hsieh et al.)
(e.g., Multi-Query NIAH: 97.0 vs. 84.2; Variable Tracking: 89.0 vs. 77.0).
</p>
<p>
Finally, we establish infrastructure-aware efficiency as a first-class principle. Unlike MoE’s
dynamic routing, Engram employs deterministic IDs to enable runtime prefetching, overlapping
communication with computation. Empirical results show that offloading a 100B-parameter
table to host memory incurs negligible overhead (< 3%). This demonstrates that Engram effectively bypasses GPU
memory constraints, facilitating aggressive parameter expansion. </p>
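<p>The prefetching argument can be sketched as a two-stage pipeline: because row IDs are a deterministic function of the token stream, the rows for step t+1 can be fetched from a host-memory table while step t computes. The table, hashing, and thread-based overlap below are illustrative assumptions standing in for real host-to-device transfers:</p>

```python
# Sketch of infrastructure-aware prefetching with deterministic addressing.
# The table, hashing, and pipeline shape are assumptions for illustration.
import hashlib
from concurrent.futures import ThreadPoolExecutor

HOST_TABLE = {i: [float(i)] * 4 for i in range(1000)}  # stand-in host-memory table

def row_ids_for(step_tokens):
    """Deterministic addressing: IDs depend only on the tokens, not on routing."""
    return [int(hashlib.md5(tok.encode()).hexdigest(), 16) % len(HOST_TABLE)
            for tok in step_tokens]

def fetch(ids):
    return [HOST_TABLE[i] for i in ids]   # stands in for a host-to-device copy

def compute_step(rows):
    return sum(sum(r) for r in rows)      # stands in for the forward pass

def run(token_steps):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, row_ids_for(token_steps[0]))
        for nxt in token_steps[1:] + [None]:
            rows = pending.result()       # rows for the current step are ready
            if nxt is not None:           # overlap: kick off the next fetch early
                pending = pool.submit(fetch, row_ids_for(nxt))
            results.append(compute_step(rows))
    return results

out = run([["a", "b"], ["c", "d"], ["e", "f"]])
print(len(out))
```

<p>
Dynamic MoE routing cannot be scheduled this way, since the expert choice is known only after the router runs; deterministic IDs are what make the overlap possible.
</p>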
<p>
Figure 1 | The Engram Architecture. The module augments the backbone by retrieving static
N-gram memory and fusing it with dynamic hidden states via context-aware gating. This module
is applied only to specific layers to decouple memory from compute, leaving the standard input
embedding and un-embedding module intact.
</p>
</body>
</html>