Notes from reading the LongCat-Flash paper and related blog posts.

1. High-Level Overview

LongCat-Flash is a 560-billion-parameter MoE language model designed to optimize the trade-off between model scale and inference latency. It matches state-of-the-art non-thinking models (e.g., DeepSeek-V3.1, Kimi-K2) in performance while achieving higher throughput (>100 tokens/sec on H800 GPUs) and lower inference cost ($0.70 per 1M output tokens). The model is explicitly tuned for agentic capabilities, with a 128k-token context window and strong proficiency in tool use and complex reasoning.

2. Core Architectural Innovations

The architecture departs from standard Top-K routing by introducing dynamic compute and communication hiding.

A. Dynamic Compute: Zero-Computation Experts

Unlike traditional MoE, LongCat-Flash does not treat all tokens equally. It introduces Zero-Computation Experts (identity functions) alongside standard FFN experts to reduce redundant computation for “easy” tokens.

  • Mechanism: The router selects $K$ experts from a pool of $N$ FFN experts and $Z$ identity experts.
    • Formula: $\mathrm{MoE}(x_t) = \sum_{i=1}^{N+Z} g_i\, E_i(x_t)$, where $E_i(x_t) = x_t$ (identity) for $N < i \le N+Z$.
  • Result: The model activates a variable number of parameters per token—ranging from 18.6B to 31.3B—depending on token difficulty, with an average of ~27B.
  • PID Control: To prevent the router from collapsing (ignoring the FFN experts in favor of the free identity experts) or becoming lazy, the team adapts per-expert routing biases with a PID controller rather than a simple auxiliary loss. This drives the router toward the target expectation ($K_e$) of FFN experts selected per token: $\Delta b_i = \mu \left( \frac{K_e}{K} \cdot \frac{1}{N} - \frac{T_i}{K\, T_{all}} \right)$, where $T_i$ counts tokens routed to expert $i$ out of $T_{all}$ total. A minimal sketch of the routing and bias update follows this list.
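
As a concrete illustration, here is a minimal PyTorch sketch of top-k routing over FFN plus identity experts together with the proportional part of the bias update above. All names, dimensions, and the gating/controller details (ZeroComputeMoE, target_ffn_k, mu, plain softmax gating) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputeMoE(nn.Module):
    def __init__(self, d_model=256, n_ffn=8, n_zero=4, k=2,
                 target_ffn_k=1.5, mu=1e-3):
        super().__init__()
        self.n_ffn, self.n_zero, self.k = n_ffn, n_zero, k
        self.router = nn.Linear(d_model, n_ffn + n_zero, bias=False)
        # Per-expert selection bias adjusted by a proportional controller
        # (the P term of a PID loop), instead of an auxiliary loss.
        self.register_buffer("bias", torch.zeros(n_ffn + n_zero))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn)])
        self.target_ffn_k = target_ffn_k  # K_e: desired FFN experts per token
        self.mu = mu                      # controller step size

    def forward(self, x):                 # x: [tokens, d_model]
        scores = self.router(x)
        # The bias steers *which* experts are picked, not the mixing weights.
        idx = torch.topk(scores + self.bias, self.k, dim=-1).indices
        gate = torch.gather(F.softmax(scores, dim=-1), -1, idx)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            sel, g = idx[:, slot], gate[:, slot:slot + 1]
            for e in range(self.n_ffn):                # real FFN experts
                m = sel == e
                if m.any():
                    out[m] += g[m] * self.experts[e](x[m])
            zm = sel >= self.n_ffn                     # identity experts: E_i(x) = x
            out[zm] += g[zm] * x[zm]
        self._update_bias(idx)
        return out

    @torch.no_grad()
    def _update_bias(self, idx):
        # delta_b_i = mu * (K_e/K * 1/N - T_i / (K * T_all)), applied to FFN experts.
        counts = torch.bincount(idx.flatten(),
                                minlength=self.n_ffn + self.n_zero).float()
        t_all = idx.shape[0]
        target = self.target_ffn_k / (self.k * self.n_ffn)
        observed = counts[:self.n_ffn] / (self.k * t_all)
        self.bias[:self.n_ffn] += self.mu * (target - observed)
```

A quick shape check: `ZeroComputeMoE()(torch.randn(16, 256))` returns a `[16, 256]` tensor; tokens routed to identity experts contribute `g * x` with no FFN compute, which is what makes the activated parameter count per token variable.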

B. Shortcut-Connected MoE (ScMoE)

To resolve communication bottlenecks inherent in Expert Parallelism (EP), the architecture adds a shortcut from the first Multi-head Latent Attention (MLA) output directly to the MoE block.

  • Benefit: This enables Single Batch Overlap (SBO): the dense FFN of the preceding block executes in parallel with the dispatch/combine communication of the current MoE layer (sketched in code below).
  • Impact: The share of runtime spent on non-overlapped dispatch/combine communication drops from 25.3% to 8.4% compared to a standard architecture.
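
The reordering is easier to see in code. Below is a structural sketch only: standard nn.MultiheadAttention stands in for MLA, a single Linear stands in for the MoE block, and the residual/normalization placement and combine point are guesses; the comments mark where dispatch/combine would overlap with the dense path in the real system.

```python
import torch
import torch.nn as nn

class ScMoEBlock(nn.Module):
    """Structural sketch of a shortcut-connected block: the MoE input is taken
    from the first attention output, so expert dispatch/combine can be issued
    while the dense FFN (and second attention) are still computing."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn_1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dense_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                       nn.Linear(4 * d_model, d_model))
        self.attn_2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = nn.Linear(d_model, d_model)   # stand-in for the MoE experts

    def forward(self, x):                        # x: [batch, seq, d_model]
        a1, _ = self.attn_1(x, x, x)
        h = x + a1
        # Shortcut: the MoE input is fixed *here*. In the real system the
        # all-to-all dispatch (inter-node EP traffic) is launched at this point,
        # overlapping with the dense path below instead of waiting for it.
        moe_in = h
        d = h + self.dense_ffn(h)                # dense FFN: overlapped compute
        a2, _ = self.attn_2(d, d, d)
        d = d + a2
        return d + self.moe(moe_in)              # combine rejoins after overlap
```

The point is only the dataflow: moe_in no longer depends on the dense FFN output, which is what allows SBO to hide the expert communication behind that compute.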

3. Training Stability & Scaling Strategy

LongCat-Flash was trained on 20T tokens in 30 days using a rigorous stability framework designed to handle the scale of 560B parameters.

  • Initialization via Model Growth: Instead of random initialization, the model was initialized by stacking layers from a pre-trained 14-layer half-scale model ($r=2$ expansion). This method demonstrated faster convergence and lower loss compared to random initialization.
  • Hyperparameter Transfer: Optimal hyperparameters were not searched at 560B scale. They were predicted using “width scaling” laws from a smaller proxy model ($n_{proxy}=768 \to n_{target}=6144$, scaling factor $s=8$).
    • Rule: the embedding learning rate stays constant across widths, while hidden-layer learning rates scale by $1/s$ (a small sketch of this rule and the stability terms below follows this list).
  • Stability Interventions:
    • Adam Epsilon ($\epsilon$): Set to $10^{-16}$ (vs. the standard $10^{-8}$). Analysis showed that as model scale increases, gradient RMS norms drop; standard epsilon values interfere with the optimizer’s adaptive division, causing loss spikes.
    • Hidden z-loss: Introduced to penalize massive activation magnitudes in the final layer ($z_t$), preventing numerical instability.
    • Router Gradient Balance ($R_g$): The ratio of Load Balancing gradient norm to LM gradient norm is monitored and kept $< 0.1$ to ensure the auxiliary loss does not dominate the gradient direction.
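
To make the transfer rule and the stability terms concrete, here is a small sketch. The function names, the z-loss form (a simple squared-magnitude penalty), and its weight are assumptions rather than the report's exact formulation.

```python
import torch

def transfer_hparams(proxy_lr_embed, proxy_lr_hidden, n_proxy=768, n_target=6144):
    """Width-scaling transfer: the embedding LR is kept constant,
    the hidden-layer LR is divided by the width ratio s = n_target / n_proxy."""
    s = n_target / n_proxy                 # s = 8 for 768 -> 6144
    return proxy_lr_embed, proxy_lr_hidden / s

def hidden_z_loss(final_hidden, weight=1e-4):
    """Illustrative penalty on large final-layer activations z_t:
    here simply the weighted mean of squared magnitudes."""
    return weight * final_hidden.pow(2).mean()

emb_lr, hid_lr = transfer_hparams(4e-3, 4e-3)   # hypothetical proxy LRs
# Small Adam epsilon so the adaptive denominator still tracks tiny gradient RMS:
# torch.optim.Adam(model.parameters(), lr=hid_lr, eps=1e-16)
```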

4. Inference System Co-Design

The inference engine is co-designed with the model architecture to maximize hardware utilization.

  • SBO Pipeline: Leveraging ScMoE, the pipeline overlaps inter-node EP communication (RDMA) with intra-node Tensor Parallelism communication (NVLink).
  • Speculative Decoding with Dense MTP: The model uses a Multi-Token Prediction (MTP) head as the draft model. Uniquely, the MTP head is a single dense layer rather than an MoE layer; this lightweight design offers a better trade-off between draft latency and acceptance rate (>90%). A toy draft-and-verify sketch follows this list.
  • Custom Kernels:
    • “SwapAB” GEMM: For small batch sizes (decoding phase), weights are treated as the left-hand matrix to utilize the $n$-dimension granularity (8 elements) rather than the token dimension, maximizing Tensor Core usage.
    • Communication: Custom kernels utilize multimem.st (broadcast) and multimem.ld_reduce (in-switch reduction), outperforming NCCL on message sizes from 4KB to 96MB.
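
To illustrate the draft-and-verify idea, here is a toy greedy speculative-decoding loop with a single dense draft head. `target_model`, `mtp_head`, and the mean-pooled "hidden state" are stand-ins, and the real engine verifies all draft tokens in one batched forward pass of the full model rather than a Python loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, D = 100, 64
embed = nn.Embedding(VOCAB, D)
target_model = nn.Linear(D, VOCAB)   # stand-in for the full MoE model
mtp_head = nn.Linear(D, VOCAB)       # single dense draft head (cheap to run)

def hidden(ctx):
    # Toy "hidden state": mean-pooled embeddings of the context tokens.
    return embed(torch.tensor(ctx)).mean(dim=0)

@torch.no_grad()
def speculative_step(tokens, n_draft=3):
    """Draft n_draft tokens with the MTP head, then verify them greedily
    against the target model and keep the longest agreeing prefix."""
    ctx, draft = list(tokens), []
    for _ in range(n_draft):                       # cheap drafting
        draft.append(int(mtp_head(hidden(ctx)).argmax(-1)))
        ctx.append(draft[-1])
    ctx, accepted = list(tokens), []
    for t in draft:                                # verification
        v = int(target_model(hidden(ctx)).argmax(-1))
        if v != t:                                 # first mismatch: take the
            accepted.append(v)                     # target's token and stop
            break
        accepted.append(t)
        ctx.append(t)
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```

The draft head's acceptance rate determines how many target-model forward passes are amortized per accepted token, which is why a cheap dense head with >90% acceptance is an attractive operating point.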

5. Data & Agentic Capabilities

The model is positioned as a specialist in agentic tasks, supported by a specific data synthesis strategy.

  • Synthetic Data Construction: A multi-agent framework generates training data by manipulating difficulty across three axes: Information Processing, Tool Set Complexity (graph density), and User Interaction (simulating reluctant users requiring strategic questioning).
  • Performance:
    • Agentic: Outperforms comparable models on VitaBench (real-world business scenarios) and Meeseeks (iterative feedback).
    • General: MMLU 89.71, GSM8K 92.19, matching DeepSeek-V3.1.
    • Context: Validated up to 128k context length.