<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>A New Axis of Sparsity for Large Language Models</title>
</head>
<body>
<h1>A New Axis of Sparsity for Large Language Models</h1>
<p>While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack
a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through
computation. To address this, we introduce conditional memory as a complementary sparsity
axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1)
lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law
that optimizes the trade-off between neural computation (MoE) and static memory (Engram).
Guided by this law, we scale Engram to 27B parameters, achieving superior performance
over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory
module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe
even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math
domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves
the backbone’s early layers from static reconstruction, effectively deepening the network for
complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up
attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH:
84.2 → 97.0). Finally, Engram establishes infrastructure-aware efficiency: its
deterministic addressing enables runtime prefetching from host memory, incurring negligible
overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse
models.
</p>
<h2>Introduction</h2>
<p>
Sparsity is a recurring design principle for intelligent systems, spanning from biological neural
circuits (Lennie, 2003; Olshausen and Field, 1997) to modern Large Language Models (LLMs).
Currently, this principle is primarily realized through Mixture-of-Experts (MoE) (Dai et al., 2024;
Shazeer et al., 2017), which scales capacity via conditional computation. Owing to its ability to
drastically increase model size without proportional increases in compute, MoE has become the
de facto standard for frontier models (Comanici et al., 2025; Guo et al., 2025; Team et al., 2025).
</p>
<p>
Despite the success of this conditional computation paradigm, the intrinsic heterogeneity
of linguistic signals suggests significant room for structural optimization. Specifically, language
modeling entails two qualitatively different sub-tasks: compositional reasoning and knowledge
retrieval. <!-- arXiv:2601.07372v1 [cs.CL] 12 Jan 2026 --> While the former demands deep, dynamic computation, a substantial portion
of text—such as named entities and formulaic patterns—is local, static, and highly stereotyped (Constant et al.,
2017; Erman, 2000). The effectiveness of classical N-gram models (Brants
et al., 2007; Liu et al., 2024b; Nguyen, 2024) in capturing such local dependencies implies that
these regularities are naturally represented as computationally inexpensive lookups. Since
standard Transformers (Vaswani et al., 2017) lack a native knowledge lookup primitive, current
LLMs are forced to simulate retrieval through computation. For instance, resolving a common
multi-token entity requires consuming multiple early layers of attention and feed-forward networks
(Ghandeharioun et al., 2024; Jin et al., 2025) (see Table 3). This process essentially amounts
to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential
depth on trivial operations that could otherwise be allocated to higher-level reasoning.
</p>
<p>
To align model architecture with this linguistic duality, we advocate for a complementary
axis of sparsity: conditional memory. Whereas conditional computation sparsely activates
parameters to process dynamic logic (Bengio et al., 2013; Shazeer et al., 2017), conditional
memory relies on sparse lookup operations to retrieve static embeddings for fixed knowledge.
As a preliminary exploration of this paradigm, we revisit N-gram embeddings (Bojanowski et al.,
2017) as a canonical instantiation: local context serves as a key to index a massive embedding
table via constant-time O(1) lookups (Huang et al., 2025a; Pagnoni et al., 2025; Tito Svenstrup
et al., 2017; Yu et al., 2025). Our investigation reveals that, perhaps surprisingly, this static
retrieval mechanism can serve as an ideal complement to modern MoE architecture—but
only if it is properly designed. In this paper, we propose Engram, a conditional memory
module grounded in the classic N-gram structure but equipped with modern adaptations
such as tokenizer compression, multi-head hashing, contextualized gating, and multi-branch
integration (detailed in Section 2).
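</p>
<p>To make the lookup-and-gate pattern concrete, the following minimal sketch combines hashed N-gram addressing (multi-head hashing) with context-aware blending. The hash function, table size, head count, embedding width, and scalar gate used here are illustrative assumptions, not Engram's actual design:</p>

```python
# Illustrative sketch of a conditional-memory lookup in the spirit of Engram.
# Hash function, table size, head count, width, and gating form are assumed.
import hashlib

VOCAB_ROWS = 2 ** 20   # rows in the hashed N-gram embedding table (assumed)
NUM_HEADS = 4          # multi-head hashing: several independent probes per gram
DIM = 8                # toy embedding width

def ngram_ids(tokens, n=2, num_heads=NUM_HEADS, rows=VOCAB_ROWS):
    """Map each length-n suffix of `tokens` to `num_heads` deterministic row IDs."""
    ids = []
    for i in range(n - 1, len(tokens)):
        gram = "\x1f".join(tokens[i - n + 1 : i + 1])
        ids.append([
            int(hashlib.blake2b(f"{h}:{gram}".encode(), digest_size=8).hexdigest(), 16) % rows
            for h in range(num_heads)
        ])
    return ids

def lookup_and_gate(table, head_ids, hidden, gate):
    """Average the probed rows, then blend with the hidden state via a scalar gate."""
    mem = [0.0] * DIM
    for row_id in head_ids:
        row = table(row_id)
        mem = [m + r / len(head_ids) for m, r in zip(mem, row)]
    return [gate * m + (1.0 - gate) * h for m, h in zip(mem, hidden)]

# A stand-in "table": deterministic pseudo-rows derived from the row ID, so the
# sketch runs without materializing 2**20 real embedding rows.
def toy_row(row_id):
    return [float((row_id >> k) & 1) for k in range(DIM)]

ids = ngram_ids(["the", "golden", "gate", "bridge"], n=2)
fused = lookup_and_gate(toy_row, ids[-1], hidden=[0.5] * DIM, gate=0.25)
print(len(ids), len(fused))
```

<p>
Because the row IDs are a pure function of the local token window, retrieval involves no learned routing and costs O(1) per token.
</p>
<p>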
To quantify the synergy between these two primitives, we formulate the Sparsity Allocation
problem: given a fixed total parameter budget, how should capacity be distributed between
MoE experts and Engram memory? Our experiments uncover a distinct U-shaped scaling
law, revealing that even simple lookup mechanisms, when treated as a first-class modeling
primitive, act as essential complements to neural computation. Guided by this allocation law, we
scale Engram to a 27B-parameter model. Compared to a strictly iso-parameter and iso-FLOPs
MoE baseline, Engram-27B achieves superior efficiency across diverse domains. Crucially, the
gains are not limited to knowledge-intensive tasks (e.g., MMLU: +3.4; CMMLU: +4.0; MMLU-Pro: +1.8), where memory
capacity is intuitively beneficial; we observe even more significant
improvements in general reasoning (e.g., BBH: +5.0; ARC-Challenge: +3.7; DROP: +3.3) and
code/math domains (e.g., HumanEval: +3.0; MATH: +2.4; GSM8K: +2.2).
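</p>
<p>The accounting behind the Sparsity Allocation problem can be sketched in a few lines: a fixed parameter budget is split between expert parameters and memory rows, and each candidate memory fraction fixes how many experts and rows the model can afford. The per-expert size and row width below are illustrative assumptions, not the paper's configuration:</p>

```python
# Toy accounting for the Sparsity Allocation problem: split a fixed parameter
# budget between MoE experts and an Engram-style embedding table.
TOTAL_BUDGET = 27_000_000_000   # total parameters (e.g., a 27B model)
EXPERT_PARAMS = 50_000_000      # parameters per MoE expert (assumed)
EMBED_DIM = 1024                # width of one memory row (assumed)

def allocate(memory_fraction, budget=TOTAL_BUDGET):
    """Return (num_experts, num_memory_rows) for a given memory fraction."""
    mem_params = int(budget * memory_fraction)
    moe_params = budget - mem_params
    return moe_params // EXPERT_PARAMS, mem_params // EMBED_DIM

for frac in (0.0, 0.1, 0.3, 0.5):
    experts, rows = allocate(frac)
    print(f"memory fraction {frac:.1f}: {experts} experts, {rows:,} memory rows")
```

<p>
The U-shaped law says neither endpoint of this sweep is optimal: loss is minimized at some interior memory fraction, where lookups and experts complement each other.
</p>
<p>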
Mechanistic analysis via LogitLens (nostalgebraist, 2020) and CKA (Hendrycks et al., 2021a)
reveals the source of these gains: Engram relieves the backbone from reconstructing static
knowledge in early layers, thereby increasing effective depth available for complex reasoning. Furthermore, by
delegating local dependencies to lookups, Engram frees up attention
capacity to focus on global context, enabling exceptional performance in long-context scenarios—substantially
outperforming baselines on LongPPL (Fang et al.) and RULER (Hsieh et al.)
(e.g., Multi-Query NIAH: 97.0 vs. 84.2; Variable Tracking: 89.0 vs. 77.0).
</p>
<p>
Finally, we establish infrastructure-aware efficiency as a first-class principle. Unlike MoE’s
dynamic routing, Engram employs deterministic IDs to enable runtime prefetching, overlapping
communication with computation. Empirical results show that offloading a 100B-parameter
table to host memory incurs negligible overhead (< 3%). This demonstrates that Engram effectively bypasses GPU
memory constraints, facilitating aggressive parameter expansion. </p>
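<p>The prefetching argument can be sketched as a two-stage pipeline: because row IDs are a deterministic function of the token stream, the rows for step t+1 can be fetched from a host-memory table while step t computes. The table, hashing, and thread-based overlap below are illustrative assumptions standing in for real host-to-device transfers:</p>

```python
# Sketch of infrastructure-aware prefetching with deterministic addressing.
# The table, hashing, and pipeline shape are assumptions for illustration.
import hashlib
from concurrent.futures import ThreadPoolExecutor

HOST_TABLE = {i: [float(i)] * 4 for i in range(1000)}  # stand-in host-memory table

def row_ids_for(step_tokens):
    """Deterministic addressing: IDs depend only on the tokens, not on routing."""
    return [int(hashlib.md5(tok.encode()).hexdigest(), 16) % len(HOST_TABLE)
            for tok in step_tokens]

def fetch(ids):
    return [HOST_TABLE[i] for i in ids]   # stands in for a host-to-device copy

def compute_step(rows):
    return sum(sum(r) for r in rows)      # stands in for the forward pass

def run(token_steps):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, row_ids_for(token_steps[0]))
        for nxt in token_steps[1:] + [None]:
            rows = pending.result()       # rows for the current step are ready
            if nxt is not None:           # overlap: kick off the next fetch early
                pending = pool.submit(fetch, row_ids_for(nxt))
            results.append(compute_step(rows))
    return results

out = run([["a", "b"], ["c", "d"], ["e", "f"]])
print(len(out))
```

<p>
Dynamic MoE routing cannot be scheduled this way, since the expert choice is known only after the router runs; deterministic IDs are what make the overlap possible.
</p>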
<p>
Figure 1 | The Engram Architecture. The module augments the backbone by retrieving static
N-gram memory and fusing it with dynamic hidden states via context-aware gating. This module
is applied only to specific layers to decouple memory from compute, leaving the standard input
embedding and un-embedding module intact.
</p>
</body>
</html>