Reading the following paper:

The core philosophy driving GLM-5 is the transition from “vibe coding” (where humans manually prompt AI models to write code) to “agentic engineering” (where AI agents autonomously plan, implement, and iterate on long-horizon tasks).

GLM-5 is a 744B parameter Mixture-of-Experts (MoE) model (with 40B active parameters) trained on 28.5 trillion tokens. It has achieved state-of-the-art results across major benchmarks, including becoming the first open-weights model to hit a score of 50 on the Artificial Analysis Intelligence Index v4.0 and ranking #1 among open models on the LMArena Text and Code leaderboards.

1. Architectural Efficiency: MLA, Muon Split, and DSA

To balance high performance with computational efficiency, GLM-5 introduces several architectural innovations:

  • Multi-head Latent Attention (MLA) & “Muon Split”: GLM-5 adopts MLA to match the effectiveness of Grouped-Query Attention (GQA) while saving GPU memory during long-context processing. However, MLA initially underperformed GQA when paired with the Muon optimizer. Muon Split addresses this by splitting the up-projection matrices into per-head sub-matrices that are orthogonalized independently, allowing different attention heads to update at different scales and stabilizing attention logits.
  • DeepSeek Sparse Attention (DSA): To handle extreme context lengths (up to 200K tokens) without exploding costs, GLM-5 integrates DSA during its continued pre-training phase. Unlike Sliding Window Attention, which degrades performance, DSA uses a lightning indexer to dynamically select the most important tokens, cutting attention computation by 1.5–2x with essentially no quality loss.
  • Multi-Token Prediction (MTP) with Parameter Sharing: To improve speculative decoding without scaling memory linearly, GLM-5 shares parameters across 3 MTP layers during training. This keeps the draft model’s memory footprint consistent with a single MTP layer while yielding a higher token acceptance rate than DeepSeek-V3.2.
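To make the indexer idea behind DSA concrete, here is a minimal sketch of cheap scoring followed by top-k token selection. The helper name `lightning_indexer_select` and the shapes are assumptions for illustration; the actual kernel fuses the score, ReLU, and TopK steps and operates per attention head.

```python
import numpy as np

def lightning_indexer_select(query, keys, k):
    """Score all cached tokens with a cheap indexer and keep the top-k.

    query: (d,) indexer query for the current token
    keys:  (T, d) cheap indexer keys for the T cached tokens
    Returns the indices of the k highest-scoring tokens, so full
    attention is only computed over k tokens instead of all T.
    """
    scores = np.maximum(keys @ query, 0.0)    # score + ReLU, as in the fused kernel
    topk = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores
    return np.sort(topk)                      # restore positional order

rng = np.random.default_rng(0)
T, d, k = 1024, 32, 64
idx = lightning_indexer_select(rng.normal(size=d), rng.normal(size=(T, d)), k)
assert idx.shape == (k,)
```

Full attention over the selected k tokens then costs O(k) per query instead of O(T), which is where the 1.5–2x savings comes from at long context.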

2. SFT: The Three Pillars of “Thinking”

To support robust agentic workflows, GLM-5’s Supervised Fine-Tuning (SFT) phase introduces sophisticated thinking mechanisms:

  • Interleaved Thinking: The model explicitly “thinks” before generating every response and tool call, drastically improving instruction following.
  • Preserved Thinking: Critical for multi-turn agentic coding, the model retains its thinking blocks across conversations, reusing prior reasoning instead of starting from scratch every turn.
  • Turn-level Thinking: Thinking can be disabled per turn for lightweight requests to save latency, or enabled for complex problem-solving.
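To make the three modes concrete, here is a hypothetical message layout; field names like `thinking` and `enable_thinking` are illustrative stand-ins, not GLM-5’s actual API.

```python
# Hypothetical conversation structure illustrating the three thinking modes.
conversation = [
    {"role": "user", "content": "Fix the failing test in utils.py"},
    {
        "role": "assistant",
        # interleaved thinking: reason explicitly before the tool call
        "thinking": "The traceback points at parse_date; run its tests first.",
        "tool_call": {"name": "run_tests", "args": {"path": "tests/test_utils.py"}},
    },
    {"role": "tool", "content": "1 failed: test_parse_date"},
    {
        "role": "assistant",
        # preserved thinking: the earlier block stays in context, so this turn
        # builds on the prior plan instead of re-deriving it from scratch
        "thinking": "As planned, parse_date is the culprit; patch the format string.",
        "content": "Patched parse_date to accept ISO-8601 dates.",
    },
]

# turn-level thinking: a lightweight request can skip the thinking phase entirely
lightweight_request = {"role": "user", "content": "What is 2 + 2?", "enable_thinking": False}

assert all("thinking" in m for m in conversation if m["role"] == "assistant")
```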

3. Asynchronous Reinforcement Learning (RL)

The post-training pipeline of GLM-5 sequentially runs Reasoning RL, Agentic RL, and General RL. The standout innovation here is its Asynchronous Agent RL framework, designed to tackle the severe GPU idle times typical of long-horizon agent rollouts.

  • Decoupled Infrastructure: Using the “slime” framework, training and inference are decoupled onto different GPUs via a central Multi-Task Rollout Orchestrator.
  • Token-in-Token-out (TITO): To prevent alignment corruption from re-tokenization during asynchronous rollouts, a TITO gateway intercepts the exact token IDs produced by the inference engine and passes them directly to the trainer.
  • Direct Double-sided Importance Sampling: Asynchronous updates mean trajectories are generated by slightly stale policies. Instead of keeping a massive history of old model checkpoints, GLM-5 directly reuses the rollout log-probabilities and applies token-level clipping (masking tokens that deviate too far) to ensure stable optimization.
  • Cross-Stage Distillation: To prevent “catastrophic forgetting” of reasoning skills while learning human alignment, the final step is on-policy distillation, using earlier-stage checkpoints as teachers to recover base capabilities.
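The double-sided importance-sampling step can be sketched in a few lines. This is a minimal illustration assuming the inference engine’s per-token log-probs were recorded at rollout time; the clip thresholds and the helper name `masked_is_weights` are hypothetical, and the paper’s exact clipping rule may differ.

```python
import numpy as np

def masked_is_weights(logp_train, logp_rollout, low=0.5, high=2.0):
    """Token-level importance weights for updates on stale rollouts.

    logp_rollout are the log-probs recorded by the inference engine at
    rollout time (no replay of old checkpoints needed). Tokens whose
    probability ratio drifts outside [low, high] are masked out of the
    loss entirely -- the double-sided clip.
    """
    ratio = np.exp(logp_train - logp_rollout)  # pi_train / pi_rollout per token
    mask = (ratio >= low) & (ratio <= high)    # double-sided clip as a hard mask
    return ratio * mask, mask

logp_train = np.array([-1.0, -0.2, -3.0])
logp_rollout = np.array([-1.1, -0.3, -0.5])    # third token is badly off-policy
weights, mask = masked_is_weights(logp_train, logp_rollout)
assert mask.tolist() == [True, True, False]
```

Reusing the recorded rollout log-probs directly is what lets the trainer avoid storing a history of stale model checkpoints.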

4. Scaling Verifiable Environments for Agentic Training

To transition to “agentic engineering,” the team had to build massive, verifiable environments to provide the model with grounded feedback rather than relying on human annotations:

  • SWE and Terminal Environments: They built over 10,000 real-world Software Engineering (SWE) execution environments using the RepoLaunch framework and Harbor-formatted Docker containers. This allowed the model to test code, view error logs, and iteratively fix bugs autonomously.
  • Agent-as-a-Judge (CC-Bench-V2): For frontend evaluation, where static testing fails to capture visual bugs, they used an autonomous Judge Agent equipped with Playwright tools to click, inspect UIs, and grade the generated code—achieving a 94% agreement rate with human experts.
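At its core, the SWE-environment feedback loop is “run the candidate’s tests, return a grounded reward.” The sketch below substitutes a plain subprocess for the Harbor-formatted Docker container and uses a hypothetical `run_patch_in_sandbox` helper; it is purely illustrative of the reward signal, not of the RepoLaunch framework itself.

```python
import os
import subprocess
import sys
import tempfile

def run_patch_in_sandbox(test_code: str) -> float:
    """Execute candidate tests in a throwaway process and return a binary
    reward: 1.0 if everything passes, 0.0 otherwise.

    Stand-in for a containerized SWE environment -- here the 'sandbox'
    is just a subprocess running a temporary test file.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)

passing = "assert sorted([3, 1, 2]) == [1, 2, 3]"
failing = "assert 1 == 2"
assert run_patch_in_sandbox(passing) == 1.0
assert run_patch_in_sandbox(failing) == 0.0
```

Real environments additionally expose the error logs back to the agent, which is what enables the iterative test–fix loop described above.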

5. Pragmatic Hardware Adaptation

A fascinating detail is GLM-5’s full-stack adaptation to the Chinese GPU ecosystem (e.g., Huawei Ascend, Moore Threads, Cambricon). To fit a 744B model onto a single Atlas 800T A3 node, the team used W4A8 mixed-precision quantization (INT4 for MoE experts, INT8 for Attention/MLP blocks) alongside highly optimized custom fusion kernels (e.g., fusing score calculation, ReLU, and TopK into a single “Lightning Indexer” kernel).
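For intuition, here is a minimal sketch of symmetric per-group INT4 weight quantization in the W4A8 style. The group size and helper names are assumptions; real kernels pack two INT4 values per byte and quantize activations to INT8 on the fly.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric per-group INT4 quantization (values in [-7, 7]),
    the style of scheme W4A8 applies to MoE expert weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # one scale per group
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_int4_groupwise(w)
err = np.abs(dequantize(q, s) - w).max()
assert err <= s.max()  # reconstruction error bounded by one quantization step
```

Halving expert-weight storage relative to INT8 is what makes a 744B-parameter model fit on a single node, at the cost of a per-group scale table that the custom kernels must fold into the matmul.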

Insight

GLM-5 is not just an incremental parameter bump; it is an infrastructural blueprint for training autonomous AI. By aggressively optimizing memory (MLA + Muon Split), drastically cutting long-context compute (DSA), and fully decoupling RL training engines to learn from thousands of sandboxed Docker environments, GLM-5 successfully transforms LLMs from intelligent chatbots into reliable, multi-step software engineers.