Jianyu Huang

Jul 20, 2026
DeepSeek-V4 on Blackwell: Model-Specific and Agentic Optimizations in TensorRT-LLM
Jul 19, 2026
The Mechanics of Reasoning Effort and Inference Scaling
Jul 18, 2026
Kimi K3: Open Frontier Intelligence at 2.8 Trillion Parameters
Jul 17, 2026
Dissecting Inkling: Thinking Machines Lab's 975B Open-Weights MoE
Jul 16, 2026
Inside TPU and GPU Clusters: The Anatomy of Collective Communication
Jul 15, 2026
ECHO: Training Terminal Agents via World-Model Objectives
Jul 14, 2026
CODA: Optimizing Transformers via GEMM-Epilogue Programming
Jul 11, 2026
The 4-bitter Lesson: NVFP4 in the RL Loop
Jul 10, 2026
SWE-1.7: Frontier Intelligence at a Fraction of the Cost
Jul 9, 2026
Harness Engineering for Self-Improvement
Jul 3, 2026
LongCat-2.0: A Trillion-Scale MoE on AI ASICs
Jul 2, 2026
Looped Transformers: From Programmable Computers to Length Generalization
Jul 1, 2026
Life is a Series of World Cups: A Last-Minute Miracle at Lumen Field
Jun 30, 2026
Creating the Nemotron 3 Ultra NVFP4 Checkpoint
Jun 29, 2026
DFlash & DSpark: Block Diffusion and Semi-Autoregressive Drafting for Flash Speculative Decoding
Jun 28, 2026
Preparing for the ML Research Job Search: A Field Guide
Jun 27, 2026
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Jun 26, 2026
Efficiency in LLMs: Mastering Fast Inference and Memory Bandwidth
Jun 25, 2026
The Architecture of Scaling Laws in Deep Learning
Jun 24, 2026
NCCL GIN & MSCCL++: Rethinking GPU Communication for Low-Latency AI
Jun 23, 2026
SOAP: Bridging First- and Second-Order Optimization via Shampoo's Eigenbasis
Jun 21, 2026
GLM-5.2: Advancing Long-Horizon Tasks with 1M Token Context
Jun 20, 2026
From One Person to a Family: Twelve Years of National Park Travels
Jun 19, 2026
RL Systems Mind the Gap: Matching Trainer and Generator Throughput
Jun 14, 2026
Scaling Video Training with Sequence Parallelism: From Multi-Modal SP to Balanced SP
Jun 13, 2026
The Distributional Lens of Post-Training: SFT, RL, and On-Policy Distillation
Jun 12, 2026
Metrics and Benchmarks Across LLM Training Stages
Jun 7, 2026
Deep Dive into MAI-Thinking-1: The Architecture of a Hill-Climbing Machine
Jun 7, 2026
The Shifting Yardsticks of Time
Jun 6, 2026
Deep Dive into Nemotron 3 Ultra: Hybrid Mamba-Attention and Agentic Reasoning at Scale
Jun 2, 2026
Echoes of Time and Space: On Serendipity, High-Density Spaces, and the Circles We Close
May 30, 2026
Deep Dive into MiniMax-M2: Unleashing Agentic Intelligence via Mini Activations
May 24, 2026
MLSys 2026 Poster Session Summary Notes
May 23, 2026
Demystifying Event Tensors and Dynamic Megakernels for LLM Inference
May 17, 2026
The Economics of a Token: A Roofline Tour of Frontier Inference
May 16, 2026
Native Interaction Models: Baking Real-Time Collaboration into the Weights
May 15, 2026
Cluster Launch Control: Hardware-Driven Dynamic Tile Scheduling on Blackwell
May 10, 2026
How to Think About GPUs for LLM Scaling
May 9, 2026
The Architecture of Resilience: On Mindset, Comparison, and the Long Game
Apr 26, 2026
DeepSeek-V4 Architecture & Training: Hybrid Attention, Muon, On-Policy Distillation
Apr 25, 2026
DeepSeek-V4 Infra: Overlap, TileLang, FP4 QAT, and Hybrid KV Cache
Apr 19, 2026
Revisiting On-Policy Distillation: Failure Modes and Local Support Matching
Apr 16, 2026
Experience Replay for LLM RL: Breaking the Generate-Then-Discard Paradigm
Apr 14, 2026
Meta-Harness: Automating LLM Context Engineering via Agentic Search
Apr 12, 2026
NVIDIA Blackwell SM100: TMEM, TMA, and the New Tensor Core Roofline
Apr 8, 2026
Training-Inference Parity in MoE Models: Where Numerics Drift
Apr 5, 2026
Gemma 4: Architecture and Multimodal Innovations
Apr 3, 2026
Residual Matrix Transformers – Scaling the Residual Stream
Mar 30, 2026
Nvidia Inference: Disaggregated Decode, LPU Integration, and Datacenter Macro-Architectures
Mar 29, 2026
Appointment with Spring: Our Seventh Year Among the Blossoms
Mar 16, 2026
Attention Residuals (AttnRes) – Generalizing Depth-wise Information Flow in LLMs
Mar 14, 2026
Reunion: A Gentle Reconciliation with Time
Mar 11, 2026
Scalable Training of Mixture-of-Experts Models with Megatron Core
Mar 10, 2026
A Two-Year-Old’s Milestone: A Heartfelt Reflection on Time
Mar 6, 2026
FlashAttention-4 and the Challenge of Asymmetric Hardware Scaling
Mar 3, 2026
CUDA Agent: Leap Forward in LLM-Driven GPU Kernel Optimization
Mar 2, 2026
Breaking the FSDP Bottleneck with veScale-FSDP for Structure-Aware Training
Feb 25, 2026
Claude Code Inflection Point
Feb 24, 2026
The State of Scaling LLM Inference (NV vs. AMD)
Feb 23, 2026
Milestones of Time: Reflections After a Birthday
Feb 20, 2026
Reading Note on GLM-5
Feb 16, 2026
The Journey | 旅途
Feb 15, 2026
The Horizon’s Calling: A Life Measured in Tides
Jan 31, 2026
Reading Note on LatentMoE
Jan 30, 2026
Kimi K2.5 Reading Note
Jan 29, 2026
Quantization-Aware Distillation (QAD) for NVFP4
Jan 28, 2026
The Weight of Thirty-Five: Reflections on Mortality and the Cycle of Life
Jan 27, 2026
DeepSeek-OCR 2 with DeepEncoder V2
Jan 26, 2026
Jet-RL and the Precision Mismatch in Reasoning Models
Jan 23, 2026
Open Source Model Quantization Strategies
Jan 18, 2026
MoE Parallel Folding
Jan 13, 2026
Engram: Scaling Large Language Models via Conditional Memory Lookup
Jan 6, 2026
Manifold-Constrained Hyper-Connections
Dec 29, 2025
High-Performance Matmul Kernels on NVIDIA Hopper
Dec 28, 2025
Rollout Routing Replay: Stabilizing MoE Reinforcement Learning
Dec 27, 2025
LLM Agent Memory
Dec 26, 2025
Self-Play SWE-RL: Superintelligent Agents via Autonomous Bug Discovery
Dec 25, 2025
LLMs as Improvement Operators & Parallel-Distill-Refine (PDR)
Dec 24, 2025
Reading Note on Performance Hints
Dec 19, 2025
Reading Note on SonicMoE
Dec 15, 2025
Reading Note on Nvidia Nemotron 3
Dec 14, 2025
Interplay of Training Stages
Dec 13, 2025
Linear Attention: Kimi Delta Attention
Dec 11, 2025
Pure BF16 Training via Stochastic Rounding
Dec 10, 2025
Adaptive NVFP4 Quantization
Dec 8, 2025
LongCat Flash
Dec 7, 2025
MXFP8 Training
Dec 6, 2025
Tokenizer Learning
Dec 5, 2025
LLM Architecture Evolution
Dec 2, 2025
First-Order Approximation for Stable LLM-RL Training
Dec 1, 2025
DeepSeek-V3.2 Reading Note
Nov 30, 2025
vLLM V1 Understanding
Nov 29, 2025
Smol Training Playbook Reading Note
Nov 28, 2025
Infra Math for LLM Training
Nov 23, 2025
Predictable Scaling of Reinforcement Learning for LLMs
Nov 22, 2025
Reasoning Limits of LLMs Under RLVR
Nov 21, 2025
LoRA, Manifolds, and OPD
Nov 21, 2025
NVFP4: Stable 4-Bit Training at 10 Trillion Tokens
Aug 4, 2025
Kimi-K2 Reading Note
Jun 25, 2025
Recent RL Infra Related Papers
Jun 23, 2025
High Precision Used for Reasoning Recipes
May 15, 2025
DeepSeek-V3's Hardware-Aware Design
May 5, 2025
Summary on Llama-Nemotron
May 4, 2025
Summary on StreamRL
May 1, 2025
DAPO Reading Note
Mar 30, 2025
Disaggregate Prefill and Decoding
Jan 20, 2025
Summary on DeepSeek R1 and Kimi k1.5
Jan 18, 2025
Summary on MiniMax-01
Dec 30, 2024
Summary on Zero Bubble
Dec 29, 2024
CUDA H100 GEMM Optimization
Dec 28, 2024
Summary of SemiAnalysis o1 Reasoning Report
Dec 27, 2024
People Retrospective: Communication, Growth, Collaboration, and Challenges
Dec 26, 2024
Summary on DeepSeek V3
Dec 23, 2024
Scaling Law
Dec 15, 2024
Understand Speculative Decoding for LLM Inference
Nov 12, 2024
Notes on Reading Hunyuan Model
Nov 11, 2024
Educational Materials for GEMM Optimizations on CPUs and GPUs