GLM-5.2: Advancing Long-Horizon Tasks with 1M Token Context

Reading notes on:

GLM-5.2: Advancing Long-Horizon Tasks with 1M Token Context

GLM-5.2 is an open-weights (MIT license) large language model engineered for sustained, long-horizon tasks such as automated research, complex debugging, and large-scale software engineering. Its flagship feature is a 1M-token context window, supported by architectural optimizations, dedicated reinforcement learning (RL) infrastructure, and dynamically controllable “thinking effort”.

Under the hood, it is a 744-billion-parameter model that relies heavily on sparsity to remain computationally tractable. The sections below walk through the technical choices that make this possible.

1. Macro-Architecture: Extreme Sparsity via MoE

To support a massive 744B parameter scale without requiring proportional compute at inference, GLM-5.2 employs a highly optimized Mixture of Experts (MoE) routing system:

Active Parameter Efficiency: Despite its 744B total size, only 40B parameters (about 5.4%) are active for any given token. This sparsity is the primary reason the model can be served efficiently.
Expert Routing: The MoE layer consists of 256 distinct FeedForward experts. For every token processed, the router selects only 8 specialized experts to activate, alongside 1 shared expert that remains active across all tokens to preserve generalized knowledge.
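SwiGLU FeedForward Specifics: The individual expert modules are SwiGLU feed-forward blocks, built from three linear layers and a SiLU activation. In the standard SwiGLU formulation, a gate projection and an up projection map the input to the intermediate size in parallel, SiLU is applied to the gate branch, the two branches are multiplied elementwise, and a down projection maps the result back to the model dimension. Each expert takes a 6,144-dimensional input and uses an intermediate size of 2,048.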
Dense Foundation: Interestingly, GLM-5.2 does not use MoE uniformly. The first 3 transformer blocks utilize a Dense FeedForward Network (FFN) with a massive hidden size of 12,288, rather than the MoE structure. This likely acts to create a highly stable, generalized representational foundation before routing tokens to specialized experts in deeper layers.
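The routing described above can be sketched in plain Python. This is a toy illustration, not GLM-5.2's actual router: the softmax over the selected logits, the scalar stand-in experts, and every function name are assumptions. Only the 256 experts / top-8 / 1 shared expert shape comes from the notes.

```python
import math
import random

NUM_EXPERTS = 256  # routed experts per MoE layer (from the notes)
TOP_K = 8          # routed experts activated per token (from the notes)


def route_top_k(router_logits, k=TOP_K):
    """Return (expert_ids, gate_weights) for one token.

    Assumption: gate weights are a softmax over the k selected logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, router_logits, experts, shared_expert):
    """Add the always-on shared expert to the gated top-k routed experts."""
    ids, gates = route_top_k(router_logits)
    routed = sum(g * experts[i](x) for i, g in zip(ids, gates))
    return shared_expert(x) + routed


# Toy usage: each "expert" is a scalar function standing in for a SwiGLU FFN.
rng = random.Random(0)
experts = [lambda x, s=rng.uniform(-1, 1): s * x for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

ids, gates = route_top_k(logits)
y = moe_forward(2.0, logits, experts, shared_expert)
print(f"selected experts: {sorted(ids)}")
print(f"gate weights sum to {sum(gates):.6f}, output = {y:.4f}")
```

Only 9 of the 257 expert functions run for this token, which is the mechanism behind the small active-parameter count.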

2. Transformer Block Anatomy & 1M Context

The core sequence modeling is handled by a deep stack of 78 transformer layers operating on an embedding dimension of 6,144 and a vocabulary size of 155,000 tokens. The block relies on standard Pre-Norm architecture using RMSNorm (applied before Attention, before MoE, and finally before the linear output layer) and utilizes Rotary Position Embedding (RoPE) to encode token positional data.
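As a sanity check on the headline numbers, the feed-forward weights can be counted from the dimensions quoted so far. This is a back-of-envelope sketch under stated assumptions: every FFN (dense or expert) is treated as a bias-free SwiGLU with three weight matrices, 75 of the 78 layers (78 minus the 3 dense ones) are assumed to be MoE layers, and attention, router, norm, and MTP parameters are ignored.

```python
D_MODEL = 6_144        # embedding dimension
N_LAYERS = 78          # transformer layers
N_DENSE_LAYERS = 3     # leading layers with a dense FFN
DENSE_HIDDEN = 12_288  # dense FFN hidden size
EXPERT_HIDDEN = 2_048  # expert intermediate size
N_ROUTED = 256         # routed experts per MoE layer
TOP_K = 8              # routed experts active per token
N_SHARED = 1           # shared experts per MoE layer
VOCAB = 155_000        # vocabulary size


def swiglu_params(d_model, hidden):
    """Gate, up, and down projections; biases ignored (assumption)."""
    return 3 * d_model * hidden


n_moe_layers = N_LAYERS - N_DENSE_LAYERS
per_expert = swiglu_params(D_MODEL, EXPERT_HIDDEN)
dense_ffn = N_DENSE_LAYERS * swiglu_params(D_MODEL, DENSE_HIDDEN)

# All experts count toward total size; only top-k plus shared run per token.
total_ffn = n_moe_layers * (N_ROUTED + N_SHARED) * per_expert + dense_ffn
active_ffn = n_moe_layers * (TOP_K + N_SHARED) * per_expert + dense_ffn
embedding = VOCAB * D_MODEL

print(f"per expert      : {per_expert / 1e6:.1f}M")
print(f"total FFN       : {total_ffn / 1e9:.1f}B")
print(f"active FFN/token: {active_ffn / 1e9:.1f}B")
print(f"embedding table : {embedding / 1e9:.2f}B")
```

Under these assumptions the FFN weights alone come to roughly 728B in total and roughly 26B active per token, which is consistent with the quoted 744B and 40B once attention and the remaining components are added.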

Expanding a context window to 1M tokens natively imposes crippling memory and computational bottlenecks. GLM-5.2 circumvents these through a novel technique called IndexShare (the general form of which is analyzed in IndexCache: Cross-Layer Index Reuse), paired with a heavily optimized Multi-Token Prediction (MTP) layer.

Hybrid Attention Mechanism: MLA + DSA

Inside the main transformer block (layers 4 through 78), the attention computation is a fusion of Multi-head Latent Attention (MLA) and DeepSeek Sparse Attention (DSA).

DSA works to constrain compute: rather than computing full dense attention (where every token attends to all previous tokens), DSA restricts a token so it “can only attend itself and selected previous tokens”. As a result, the cost of the main attention computation scales with the number of selected tokens rather than with the full context length.
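A minimal sketch of that selection rule, assuming a hypothetical indexer that has already produced one relevance score per earlier position. The function names, the scalar values, and the plain dot-product attention are illustrative and are not the DSA or MLA kernels.

```python
import math


def select_attended_positions(index_scores, query_pos, top_k):
    """Positions the token at `query_pos` may attend to.

    index_scores[j] is a hypothetical indexer score for earlier position j.
    The token always attends to itself, plus the top_k best-scoring
    earlier positions (causal: only j < query_pos are candidates).
    """
    previous = range(query_pos)
    chosen = sorted(previous, key=lambda j: index_scores[j], reverse=True)[:top_k]
    return sorted(chosen + [query_pos])


def sparse_attention(query, keys, values, positions):
    """Softmax attention restricted to `positions` (scalar values for brevity)."""
    logits = [sum(q * k for q, k in zip(query, keys[j])) for j in positions]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return sum(w / z * values[j] for w, j in zip(weights, positions))


# Toy usage: the token at position 5 picks 2 of its 5 predecessors.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
positions = select_attended_positions(scores, query_pos=5, top_k=2)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [1.0, 0.0]]
values = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
out = sparse_attention([1.0, 0.0], keys, values, positions)
print(positions, round(out, 3))
```

The attention loop touches `top_k + 1` positions regardless of how long the sequence is; the remaining per-token cost is the indexer itself.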

IndexShare for DeepSeek Sparse Attention (DSA)

To reduce the computational burden of the indexer in DSA, GLM-5.2 reuses a single, lightweight indexer across every block of 4 transformer layers.

Mechanism: The indexer is placed at the first of the 4 layers. It computes the topk indices, which are then passed down and reused by the subsequent 3 layers.
Insight & Impact: This entirely eliminates the indexer dot product and topk operations in 75% of the layers, slashing per-token FLOPs by 2.9× at a 1M context length. The model is trained with IndexShare starting from mid-training (at a 128K sequence length) and achieves better long-context performance than its predecessor, GLM-5.1, while using significantly less computation.
Generalization: IndexShare’s fixed every-4-layers reuse is a uniform interleaving pattern. The IndexCache work generalizes this idea — showing that a data-driven greedy layer search beats uniform interleaving in the training-free setting, and proving that a multi-layer distillation loss is what makes uniform reuse (like IndexShare) safe once re-trained.
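The uniform reuse pattern can be sketched as a layer schedule. This is an illustration only: zero-based layer numbering, the assumption that groups start at layer 0, and the callback names are hypothetical, and the notes do not say how groups align with the model's actual layer indices.

```python
def indexshare_schedule(num_layers, group_size=4):
    """For each layer, the layer whose top-k indices it uses."""
    return [layer - (layer % group_size) for layer in range(num_layers)]


def run_layers(num_layers, compute_indices, attend, group_size=4):
    """Run a layer stack, calling the indexer only on each group's first layer."""
    cached = None
    indexer_calls = 0
    for layer in range(num_layers):
        if layer % group_size == 0:
            cached = compute_indices(layer)  # indexer dot product + top-k
            indexer_calls += 1
        attend(layer, cached)                # sparse attention over cached indices
    return indexer_calls


# Toy usage with 8 layers and a stub indexer.
used = []
calls = run_layers(
    num_layers=8,
    compute_indices=lambda layer: [0, 2, 5],
    attend=lambda layer, idx: used.append((layer, tuple(idx))),
)
print(indexshare_schedule(8))
print(f"indexer ran in {calls} of 8 layers")
```

With a group size of 4, three of every four layers skip the indexer, which is where the 75% figure comes from.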

MTP with IndexShare and KVShare

GLM-5.2 utilizes a Multi-Token Prediction (MTP) layer acting as a draft model for speculative decoding. The designers optimized this layer to satisfy two constraints: minimizing drafting cost and maximizing the speculative acceptance rate.

Applying IndexShare to the multi-step MTP layer requires careful alignment of hidden states to prevent training-inference discrepancies.

The Derivation of the KV-Mix Problem: In a standard multi-step MTP, the input tokens differ across steps. If we denote the hidden states as $h_n$, and we naively reuse the topk indices of $h_4$ for the next step $h_5$, then $h_5$ will only be able to attend to the sequence $h_1 \dots h_4$, but not to itself ($h_5$).
Inference vs. Training Consistency: During the inference of a two-step MTP layer, step one uses hidden states entirely from the target (backbone) model. However, in step two, the preceding states $h_{1:4}$ come from the target model, while $h_5$ comes from the MTP layer.
KVShare Solution: Without IndexShare, the Key-Value (KV) cache for $h_5$ would be a polluted mixture: $kv_{1:4}$ (computed from the target model) combined with $kv_5$ (computed from the MTP draft layer). By utilizing IndexShare, $h_5$ is restricted from attending to itself, meaning its KV cache relies exclusively on $kv_{1:4}$, all of which originate from the highly accurate target model.
Training Application: During training, the system simply reuses both the KV cache and the topk indices from the first MTP step, eliminating the discrepancy between training and inference.
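The provenance argument can be made concrete with a toy bookkeeping example. The position labels follow the $h_1 \dots h_5$ notation above; the particular top-k result and the helper name are hypothetical, and no real attention is computed.

```python
def kv_sources(attended_positions, provenance):
    """Set of models that produced the KV entries a query attends to."""
    return {provenance[p] for p in attended_positions}


# Positions 1..4 carry hidden states from the target model;
# position 5 carries the hidden state produced by the MTP layer.
provenance = {1: "target", 2: "target", 3: "target", 4: "target", 5: "mtp"}

# Step one: the indexer runs for h_4 and selects among positions 1..4.
step_one_indices = [1, 3, 4]  # hypothetical top-k result

# Without IndexShare, step two selects afresh and attends to itself as well.
fresh_step_two = step_one_indices + [5]

# With IndexShare, step two reuses step one's indices unchanged.
shared_step_two = list(step_one_indices)

print(kv_sources(fresh_step_two, provenance))
print(kv_sources(shared_step_two, provenance))
```

Reusing the indices means the second step only ever reads KV entries produced by the target model, so the same cache can be reused at training time.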

Furthermore, GLM-5.2 integrates rejection sampling for speculative decoding and trains the draft layer using an end-to-end Total Variation (TV) loss. These combined MTP optimizations (IndexShare + KVShare + Rejection Sampling + TV Loss) increase the speculative decoding acceptance length by 20% (from a baseline of 4.56 to 5.47 tokens per step).
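One way to see why a Total Variation objective pairs naturally with rejection sampling: under the standard speculative sampling rule (accept a draft token $x \sim q$ with probability $\min(1, p(x)/q(x))$), the per-token acceptance probability equals $1 - \mathrm{TV}(p, q)$. The sketch below checks this numerically. The notes do not spell out GLM-5.2's exact loss or acceptance rule, so the standard rule and the toy distributions are assumptions.

```python
import random


def total_variation(p, q):
    """TV distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def expected_acceptance(p, q):
    """P(accept) for x ~ q with acceptance probability min(1, p(x)/q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """One rejection-sampling step; returns (token, accepted)."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


target = [0.5, 0.3, 0.2]  # toy target-model distribution p
draft = [0.4, 0.4, 0.2]   # toy draft (MTP) distribution q

rng = random.Random(0)
trials = 20_000
accepted = sum(speculative_step(target, draft, rng)[1] for _ in range(trials))

print(f"TV(p, q)            = {total_variation(target, draft):.3f}")
print(f"1 - TV              = {expected_acceptance(target, draft):.3f}")
print(f"empirical acceptance = {accepted / trials:.3f}")

# Relative gain quoted in the notes: 4.56 -> 5.47 accepted tokens per step.
relative_gain = 5.47 / 4.56 - 1.0
print(f"relative gain       = {relative_gain:.1%}")
```

Lowering the TV distance between draft and target therefore raises acceptance directly, which is the quantity speculative decoding cares about.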

3. Reinforcement Learning (RL) for Super-Long Trajectories

Long-horizon tasks yield highly variable and extensively long execution traces, making standard RL alignment methods unstable. As discussed in RL Systems Mind the Gap, the efficiency of these systems often depends on matching trainer and generator throughput.

Single-Rollout PPO and Compaction

When handling 1M-token workflows, a single long trajectory is often split (compacted) into multiple sub-traces. Because a single prompt can yield a vastly different number of sub-traces with wildly different lengths, traditional group-wise optimization fails.

The Pivot: GLM-5.2 abandons group-wise comparisons in favor of a critic-based PPO formulation that evaluates token-level advantages from individual rollouts.
Insight: This single-rollout approach accommodates trajectory compaction directly. All compacted sub-traces are included as trainable data, and a token-level loss evens out the severe length imbalances between them.
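A minimal sketch of the two ingredients, under assumptions: the notes only say "critic-based PPO" with token-level advantages, so the use of Generalized Advantage Estimation (GAE) here, the hyperparameters, and the function names are illustrative choices rather than the confirmed recipe.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Token-level GAE for a single rollout.

    `values` holds the critic's estimate for each token plus one
    bootstrap value at the end, so len(values) == len(rewards) + 1.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def token_level_mean(losses_by_trace):
    """Average over every token of every sub-trace."""
    flat = [loss for trace in losses_by_trace for loss in trace]
    return sum(flat) / len(flat)


def trace_level_mean(losses_by_trace):
    """Average of per-trace averages (shown for contrast)."""
    means = [sum(trace) / len(trace) for trace in losses_by_trace]
    return sum(means) / len(means)


# One rollout with a terminal reward and an untrained (all-zero) critic.
adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.0] * 4, lam=0.5)

# Two compacted sub-traces of very different lengths.
losses = [[1.0] * 2, [0.0] * 8]
print(adv)
print(token_level_mean(losses), trace_level_mean(losses))
```

No group of sibling rollouts is needed to compute `adv`, and the token-level mean gives each token the same weight no matter which sub-trace it landed in, whereas the per-trace mean lets the 2-token trace count as much as the 8-token one.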

Online Anti-Hack Verification Mechanism

In coding RL, the model often discovers “shortcuts” (reward hacking) to pass isolated unit tests rather than genuinely solving the problem (e.g., downloading the answer sheet directly via `curl https://raw.githubusercontent.com/...`).

Dual-Stage Detection: GLM-5.2 implements an anti-hack module utilizing a high-recall rule-based filter, followed by a high-precision LLM judge to verify the intent behind flagged actions.
Online Trajectory Preservation: Instead of halting the rollout when a hack is detected—which causes training instability and model collapse—the system blocks the malicious tool call and returns dummy information to the model. This forces the model to recover and continue the trajectory organically, stabilizing the RL signal.
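The two-stage check and the "block but keep going" behavior can be sketched as a wrapper around tool execution. Everything here is hypothetical: the regex patterns, the dummy message, and the `judge` callable standing in for an LLM judge are illustrative and do not reflect GLM-5.2's actual rules.

```python
import re

# Stage 1: high-recall rules. Deliberately broad, so false positives are expected.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\b(curl|wget)\b.*https?://", re.IGNORECASE),
    re.compile(r"\bgit\s+clone\b", re.IGNORECASE),
]

DUMMY_RESULT = "command produced no output"


def rule_filter(command):
    """Cheap first pass: flag anything that might fetch external content."""
    return any(p.search(command) for p in SUSPICIOUS_PATTERNS)


def guarded_tool_call(command, execute, judge):
    """Run `command` unless both stages agree it is a reward hack.

    execute: callable that actually runs the tool call.
    judge:   callable standing in for the high-precision LLM judge;
             returns True when the flagged action is a hack.
    """
    if rule_filter(command) and judge(command):
        # Block the call, but return a dummy result so the rollout continues.
        return DUMMY_RESULT
    return execute(command)


# Toy usage with stub callables.
judged = []


def stub_judge(cmd):
    judged.append(cmd)
    return "raw.githubusercontent.com" in cmd


def stub_execute(cmd):
    return f"ran: {cmd}"


for cmd in ["pytest -q",
            "curl https://pypi.org/simple/",
            "curl https://raw.githubusercontent.com/..."]:
    print(cmd, "->", guarded_tool_call(cmd, stub_execute, stub_judge))
```

The judge is only consulted for commands the rules flag, which keeps the expensive stage off the common path while the cheap stage stays permissive.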

4. Systems Engineering: `slime` Framework and Inference Scaling

Extending the context window at the model level is only half the battle; actually serving 1M-token requests shifts the bottleneck from raw compute to KV-cache capacity and GPU/CPU memory bandwidth.

Optimizing the Inference Engine

LayerSplit & Finer-Grained Memory: GLM-5.2 introduces advanced memory parallelization built on LayerSplit, maximizing usable cache space for ultra-long contexts.
Pipeline Coordination: Long-context kernels are explicitly coordinated with the cache transfer pipeline to mask the latency of prefill and decode phases.
Bubble Reduction: CPU-side request scheduling and execution paths are optimized to eliminate GPU pipeline bubbles, allowing throughput advantages that actually scale up as context length grows.

The `slime` Orchestration Layer

To manage the chaotic logistics of long-horizon Agentic RL, the team developed slime—a unified infrastructure layer handling both training and large-scale inference rollout.

`slime` supports white-box and black-box rollouts, multi-agent workflows, and an FP8 KV cache.
It enabled highly efficient On-Policy Distillation (OPD), allowing the team to merge more than ten expert models into the final GLM-5.2 release in roughly two days. (See the previous post on Revisiting On-Policy Distillation for a deep dive into the mechanics.)

5. Benchmark Insights and Competitive Landscape

With these optimizations across the stack, GLM-5.2 is currently the highest-ranked open-weights model across major long-horizon coding benchmarks.

FrontierSWE: Trails Claude Opus 4.8 by only 1%, but beats Claude Opus 4.7 by 11%.
Terminal-Bench 2.1: Scores 81.0, vastly outperforming GLM-5.1 (63.5) and landing just shy of Opus 4.8 (85.0).
General Reasoning: In external landscape evaluations (like Artificial Analysis), GLM-5.2 (max effort) scores 51, narrowly ahead of Gemini 3.5 Flash (50) and clearly ahead of Claude Sonnet 4.6 max (47) and DeepSeek V4 Pro Max (44). GLM-5 scored 40, so this is an 11-point gain over that earlier model.

Configurable Compute: Finally, GLM-5.2 exposes thinking effort control to developers (e.g., High or Max). This lets engineering teams dial up test-time compute on a per-task basis, trading execution latency and quota for higher reasoning accuracy on difficult debugging sessions.