1. Core Motivation: The Linguistic Duality

Language modeling consists of two distinct sub-tasks: compositional reasoning and knowledge retrieval. While Mixture-of-Experts (MoE) provides “conditional computation” for dynamic logic, standard Transformers lack a native primitive for “conditional memory”. Consequently, current models waste sequential depth and attention capacity in early layers to reconstruct static, local patterns (e.g., named entities like “Diana, Princess of Wales”) through expensive runtime computation. Engram is introduced to address this by providing a modernized $N$-gram lookup module with $O(1)$ complexity.


2. Technical Architecture

The Engram module functions through a two-phase process: retrieval and fusion.

  • Sparse Retrieval via Hashed $N$-grams:
    • Tokenizer Compression: To increase semantic density, raw tokens are mapped to canonical identifiers via a surjective function (e.g., collapsing “Apple” and “apple”), which reduced the effective vocabulary by 23% in testing.
    • Multi-Head Hashing: To manage the intractable combinatorial space of $N$-grams, the module uses $K$ distinct hash heads for each $N$-gram order ($n$). These map compressed contexts to indices in embedding tables via multiplicative-XOR hashing.
  • Context-aware Gating and Fusion:
    • Gating Mechanism: Retrieved embeddings are static and context-independent. To resolve ambiguity (polysemy or hash collisions), the module uses the current hidden state ($h_t$) as the Query and the retrieved memory ($e_t$) as the Key and Value, with $k_t$ denoting the Key derived from $e_t$.
    • Scalar Gate ($\alpha_t$): A gating signal is computed using RMSNorm and a sigmoid function: $\alpha_t = \sigma\!\left(\frac{\operatorname{RMSNorm}(h_t)^\top \operatorname{RMSNorm}(k_t)}{\sqrt{d}}\right)$. If the memory contradicts the context, $\alpha_t$ tends toward zero.
    • Post-processing: A short depthwise causal convolution (kernel size 4) and a SiLU activation are applied to expand the receptive field and add non-linearity before the output is added back via a residual connection (a minimal code sketch of the full retrieval-and-fusion path follows this list).
  • Multi-branch Integration: Engram is optimized for architectures like mHC by sharing a single sparse embedding table and Value matrix across parallel branches while maintaining branch-specific Key matrices to enable distinct gating behaviors.
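
As a consolidation of the pieces above, the following PyTorch module is a minimal sketch of the retrieval-and-fusion path. It is an illustrative reconstruction under stated assumptions: the table size, number of hash heads, N-gram orders, multiplier constants, and all names (e.g., `EngramSketch`, `hash_ngrams`) are placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Plain RMSNorm (no learned scale), used only inside the gate."""
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


class EngramSketch(nn.Module):
    def __init__(self, d_model: int, table_size: int = 1 << 20,
                 max_order: int = 3, num_heads: int = 4, kernel_size: int = 4):
        super().__init__()
        self.table_size = table_size
        self.max_order = max_order      # use 2-grams .. max_order-grams
        self.num_heads = num_heads      # K hash heads per N-gram order
        # Shared sparse embedding table, addressed by hashed N-gram ids.
        self.table = nn.Embedding(table_size, d_model)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)    # K for the gate
        self.value_proj = nn.Linear(d_model, d_model, bias=False)  # V for fusion
        # Depthwise causal convolution (kernel size 4) for post-processing.
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel_size,
                                 groups=d_model, padding=kernel_size - 1)

    def hash_ngrams(self, ids: torch.Tensor, n: int, head: int) -> torch.Tensor:
        """Multiplicative-XOR style hash of the trailing n tokens at each position."""
        mix = torch.zeros_like(ids)
        for offset in range(n):
            # Shift right by `offset` so position t sees token t - offset.
            shifted = F.pad(ids, (offset, 0))[:, : ids.size(1)]
            mult = 0x9E3779B1 + 2 * (head * n + offset)   # odd placeholder multiplier
            mix = mix ^ (shifted * mult)
        return mix % self.table_size

    def forward(self, token_ids: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) canonical ids after tokenizer compression
        # h:         (B, T, d) backbone hidden state at this layer
        mem = torch.zeros_like(h)
        for n in range(2, self.max_order + 1):
            for head in range(self.num_heads):
                mem = mem + self.table(self.hash_ngrams(token_ids, n, head))
        k, v = self.key_proj(mem), self.value_proj(mem)
        # Scalar gate: alpha_t = sigmoid(<RMSNorm(h_t), RMSNorm(k_t)> / sqrt(d)).
        alpha = torch.sigmoid((rms_norm(h) * rms_norm(k)).sum(-1, keepdim=True)
                              / h.size(-1) ** 0.5)
        out = alpha * v
        # Causal depthwise conv + SiLU, then residual back into the stream.
        out = self.dw_conv(out.transpose(1, 2))[..., : h.size(1)].transpose(1, 2)
        return h + F.silu(out)
```

In the multi-branch setting described in the last bullet, each branch would keep its own `key_proj` (branch-specific gating) while sharing `table` and `value_proj`.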

3. Scaling Laws and Sparsity Allocation

The authors define the Sparsity Allocation problem: how to divide a fixed “inactive” parameter budget ($P_{\text{sparse}}$) between MoE experts and Engram embeddings.

  • The U-Shaped Law: Experiments revealed a robust U-shaped relationship between validation loss and the allocation ratio ($\rho$).
    • $\rho \to 1$ (Pure MoE): Suboptimal because the model must inefficiently reconstruct static patterns through depth.
    • $\rho \to 0$ (Engram-dominated): Suboptimal because the model lacks the conditional computation needed for complex reasoning.
    • The Optimum: Reallocating 20%–25% of the sparse parameter budget to Engram consistently yielded the best performance (a toy budget split is sketched after this list).
  • Infinite Memory Regime: Unlike scaling compute, scaling Engram memory adds no per-token FLOPs, and its effect on loss still follows a strict power law (log-linear), providing a predictable “scaling knob” for model capacity.
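
Read concretely (assuming, as the bullets above imply, that $\rho$ is the share of the inactive budget kept in MoE experts), the split is simple arithmetic; the helper name below is hypothetical:

```python
def split_sparse_budget(p_sparse: float, rho: float) -> dict:
    """Split an inactive parameter budget between MoE experts and Engram tables.

    rho is the fraction kept in MoE experts; (1 - rho) goes to Engram embeddings.
    """
    return {"moe_params": rho * p_sparse, "engram_params": (1 - rho) * p_sparse}


# At the reported optimum, 20-25% of the budget moves to Engram, e.g. rho = 0.78:
# a 100B inactive budget becomes ~78B of MoE experts and ~22B of Engram rows.
budget = split_sparse_budget(p_sparse=100e9, rho=0.78)
```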

4. Key Insights and Mechanistic Analysis

  • Effective Depth Increase: Mechanistic analysis using LogitLens and CKA (Centered Kernel Alignment) showed that Engram variants exhibit lower KL divergence in early layers, meaning hidden states become “prediction-ready” faster (a logit-lens sketch follows this list). Functionally, the representations at layer 5 of Engram-27B align with layer 12 of a standard MoE baseline, effectively “deepening” the network.
  • Attention Capacity Offloading: By delegating local dependency modeling to lookups, Engram frees up the attention mechanism to manage global context. This resulted in significant gains in long-context retrieval, such as a jump from 84.2 to 97.0 on Multi-Query Needle-in-a-Haystack (NIAH).
  • Functional Dichotomy: Sensitivity tests showed that “factual knowledge” (e.g., TriviaQA) relies heavily on Engram, while “reading comprehension” (e.g., C3) relies primarily on the backbone’s attention mechanism.
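
For context on how such a measurement is made, here is a rough logit-lens sketch. It assumes a HuggingFace-style causal LM that returns hidden states and logits, takes the model's final norm and unembedding head as explicit arguments, and measures KL against the model's own final-layer distribution (one common logit-lens convention); it is not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def logit_lens_kl(model, final_norm, unembed, input_ids):
    """Per-layer KL(final-layer prediction || layer-l prediction).

    Lower values at early layers mean the residual stream is already
    "prediction-ready"; `final_norm` and `unembed` are the model's last
    normalization and output head, passed in explicitly.
    """
    out = model(input_ids, output_hidden_states=True)
    final_logp = F.log_softmax(out.logits.float(), dim=-1)     # (B, T, V)
    kls = []
    for h in out.hidden_states[:-1]:                           # intermediate layers
        layer_logp = F.log_softmax(unembed(final_norm(h)).float(), dim=-1)
        per_token = F.kl_div(layer_logp, final_logp,
                             log_target=True, reduction="none").sum(-1)
        kls.append(per_token.mean().item())                    # average over tokens
    return kls
```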

5. Infrastructure-Aware Efficiency

A major advantage of Engram is deterministic addressing. Because indices depend only on the input token sequence (not on runtime hidden states, as MoE routing does), the system can asynchronously prefetch embeddings from host memory (DRAM) over PCIe while the preceding layers compute (see the sketch at the end of this section).

  • Empirical Result: Offloading a 100B-parameter table to host memory incurred a negligible throughput penalty ($<3\%$), demonstrating that Engram can bypass GPU VRAM constraints for massive parameter expansion.
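
A minimal sketch of this prefetching pattern in PyTorch, using pinned host memory and a dedicated copy stream (the table size, buffer names, and the precomputed index tensor `idx` are illustrative assumptions, not the paper's serving stack):

```python
import torch

d_model, max_rows = 1024, 4096
# Engram rows kept in pinned host DRAM so host->GPU copies can run asynchronously.
# (Illustratively small here; the paper offloads a ~100B-parameter table.)
host_table = torch.randn(1 << 18, d_model).pin_memory()
staging = torch.empty(max_rows, d_model).pin_memory()
copy_stream = torch.cuda.Stream()


def prefetch(idx: torch.Tensor) -> torch.Tensor:
    """Start an async copy of the rows addressed by precomputed N-gram indices."""
    n = idx.numel()
    torch.index_select(host_table, 0, idx, out=staging[:n])    # gather on the host
    with torch.cuda.stream(copy_stream):
        return staging[:n].to("cuda", non_blocking=True)        # PCIe copy overlaps

# Usage pattern: launch the prefetch as soon as the token ids are known, run the
# earlier layers on the default stream, and synchronize only at the Engram layer:
#   rows = prefetch(idx)
#   ... earlier transformer layers ...
#   torch.cuda.current_stream().wait_stream(copy_stream)
```

Because the addresses are known from the token ids alone, the gather and the transfer can start before the backbone reaches the Engram layer, which is what lets the copy hide behind earlier-layer compute.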