Open Source Model Quantization Strategies
This breakdown details quantization strategies across leading open-source model families and inference engines.
1. Model-Specific Quantization Matrix
This table details the primary quantization “recipes” used for specific model families in vLLM and SGLang; a minimal loading example follows the table.
| Model Family | Quantization Recipe | Bit Precision (Weights / Activations) | Target Modules | vLLM Implementation | SGLang Implementation |
|---|---|---|---|---|---|
| DeepSeek-V3 / R1 | FP8 Block-Wise | W8A8 (FP8 E4M3) | Experts (MoE) + Linear Layers | DeepGEMM (Integrated). vLLM now supports DeepGEMM for MoE and MQA logits. | DeepGEMM (Native). Uses this custom kernel to handle 128x128 block scaling efficiently. |
| Kimi K2 Thinking | Native INT4 (QAT) | W4A16 (INT4 / FP16) | All Linear Layers | Compressed-Tensors. Loaded via compressed-tensors format using optimized INT4 kernels. | Native Support. Day-0 support in v0.5.5; maps to FlashInfer INT4 kernels. |
| GPT-OSS (120B/20B) | MXFP4 | W4A8 (FP4 / FP8) | MoE Layers (weights in the E2M1 element format) | Marlin MXFP4. Uses a specialized Marlin kernel variant for MXFP4 MoE operations. | MXFP4. Supported via recent updates that handle the OCP Microscaling (MX) block format. |
| Llama 3.1 / 3.3 | FP8 (FBGEMM) | W8A8 (FP8) | Linear Layers (QKV, MLP) | CUTLASS FP8 (Hopper). Scaled-matmul FP8 GEMM kernels optimized for H100s (Machete covers the mixed-input W4A16 path). | FlashInfer / CUTLASS. Uses JIT-compiled kernels for FP8 GEMM. |
| Qwen 2.5 / Mistral | GPTQ / AWQ | W4A16 (INT4 / FP16) | Linear Layers | Marlin (Ampere). Standard highly-optimized INT4 kernel. | FlashInfer. Maps GPTQ/AWQ metadata to FlashInfer W4A16 kernels. |
| General (Any) | AutoRound | W4A8 (INT4 / INT8) | Linear Layers | AutoRound Plugin. Supported via llm-compressor integration. | AutoRound Native. Directly loads AutoRound checkpoints for low-bit W4A8 inference. |
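To make the table concrete, here is a minimal sketch of loading a W4A16 GPTQ checkpoint with vLLM's offline API. The model ID and generation prompt are only examples; in most cases vLLM reads the quantization method from the checkpoint's `quantization_config`, so the explicit `quantization=` argument is shown for clarity rather than necessity.

```python
# Minimal sketch: serving a pre-quantized W4A16 checkpoint with vLLM's offline API.
# The model ID below is an example; the quantization method is normally
# auto-detected from the checkpoint, so `quantization=` is optional.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",  # example GPTQ checkpoint
    quantization="gptq_marlin",                  # usually inferred automatically
)

prompts = ["Explain block-wise FP8 scaling in one sentence."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

SGLang follows the same pattern through its `sglang.launch_server` entry point, with the quantization method selected via the `--quantization` server argument.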
2. KV Cache Quantization Strategies
Quantizing the KV cache is independent of weight quantization and is critical for increasing usable context length and batch size; a configuration sketch follows the table.
| Strategy | Precision | Supported Engines | Hardware Requirement | Notes |
|---|---|---|---|---|
| FP8 E4M3 | 8-bit | vLLM & SGLang | Hopper (H100), Ada (L40S) | Preferred format. Higher precision (3 mantissa bits) preserves accuracy better than E5M2. |
| FP8 E5M2 | 8-bit | vLLM & SGLang | Hopper, Ampere (via cast) | Higher dynamic range but lower precision. Often used if E4M3 scaling factors are missing. |
| NVFP4 | 4-bit | Experimental | Blackwell (B200) | Emerging standard. Reduces cache size by another 50% vs FP8. |
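As a companion sketch, the snippet below enables FP8 E4M3 KV-cache quantization in vLLM independently of any weight quantization. The model ID is an illustrative placeholder; without calibrated per-layer scales a default scaling factor is applied, which is one reason the wider dynamic range of E5M2 is sometimes preferred.

```python
# Minimal sketch: FP8 KV-cache quantization in vLLM, independent of weight precision.
# The model ID is an example; "fp8_e5m2" can be substituted where E4M3 scales
# are unavailable (see the table above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example checkpoint
    kv_cache_dtype="fp8_e4m3",                 # or "fp8_e5m2"
)

out = llm.generate(
    ["Summarize the benefits of KV-cache quantization."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```

SGLang exposes the equivalent setting as a `--kv-cache-dtype` server argument on its `launch_server` entry point.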
3. Kernel Deep Dive
To understand why performance differs across engines, it helps to examine the specific kernels that execute the quantized matrix multiplications:
- DeepGEMM (SGLang/vLLM): Written specifically for DeepSeek-V3. Unlike standard kernels that scale per-tensor, DeepGEMM handles fine-grained block-wise scaling (a separate scale factor for every 128x128 weight block), and it accumulates in FP32 to compensate for the limited accumulation precision of low-precision FP8 additions (illustrated in the first sketch after this list).
- Machete (vLLM): A “mixed-input” kernel for NVIDIA Hopper. It enables W4A16 (4-bit weights, 16-bit activations) by dequantizing weights inside the kernel pipeline, and it uses the Tensor Memory Accelerator (TMA) to hide conversion costs behind memory transfers (the second sketch after this list shows the reference math).
- Marlin (vLLM): The standard for INT4 on Ampere (A100). It restructures the weight matrix in memory to perfectly align with the GPU’s access patterns, using asynchronous data movement (cp.async) to achieve near-theoretical peak bandwidth.
- FlashInfer (SGLang): A JIT (Just-In-Time) compiler. Instead of pre-shipping binaries, it generates optimal kernel code at runtime based on the specific batch size and head dimension. It also supports fused kernels (e.g., RoPE fused into attention) to minimize launch overhead.
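The following NumPy sketch illustrates the block-wise scaling idea behind DeepGEMM under simplifying assumptions: it models only the 128x128 per-block scales and an FP32-accumulated matmul, skips real FP8 rounding, and all function and constant names are invented for illustration rather than taken from the kernel.

```python
# Illustrative NumPy model of 128x128 block-wise FP8 scaling (not the real kernel).
import numpy as np

BLOCK = 128
E4M3_MAX = 448.0  # largest representable FP8 E4M3 magnitude

def blockwise_quantize(w: np.ndarray):
    """Return per-block scales and scaled weights mapped into the E4M3 range.

    True FP8 rounding is omitted; only the scaling scheme is modeled.
    """
    rows, cols = w.shape
    scales = np.zeros((rows // BLOCK, cols // BLOCK), dtype=np.float32)
    w_q = np.zeros_like(w, dtype=np.float32)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            block = w[i:i + BLOCK, j:j + BLOCK]
            scale = np.abs(block).max() / E4M3_MAX + 1e-12
            scales[i // BLOCK, j // BLOCK] = scale
            w_q[i:i + BLOCK, j:j + BLOCK] = np.clip(block / scale, -E4M3_MAX, E4M3_MAX)
    return w_q, scales

def blockwise_matmul(x: np.ndarray, w_q: np.ndarray, scales: np.ndarray):
    """Re-apply per-block scales, then run an FP32-accumulated GEMM."""
    w_deq = w_q.copy()
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            w_deq[i * BLOCK:(i + 1) * BLOCK, j * BLOCK:(j + 1) * BLOCK] *= scales[i, j]
    return x.astype(np.float32) @ w_deq.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
x = rng.normal(size=(4, 256)).astype(np.float32)
w_q, s = blockwise_quantize(w)
# Error is tiny here only because FP8 rounding is omitted in this model.
print(np.abs(blockwise_matmul(x, w_q, s) - x @ w).max())
```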
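Similarly, this second sketch shows the reference math behind the W4A16 pattern that mixed-input kernels such as Machete and Marlin implement: INT4 weights with a per-group scale are dequantized just before the higher-precision multiply. It is a NumPy model of the arithmetic only, with invented helper names and a placeholder group size, not how the fused kernels are actually written.

```python
# Illustrative NumPy model of group-wise W4A16 quantization and dequantization.
import numpy as np

GROUP = 128  # rows of W sharing one scale per output column (example group size)

def quantize_w4a16(w: np.ndarray):
    """Symmetric 4-bit quantization with one scale per (group, output column)."""
    k, n = w.shape
    scales = np.zeros((k // GROUP, n), dtype=np.float32)
    q = np.zeros_like(w, dtype=np.int8)
    for g in range(k // GROUP):
        rows = slice(g * GROUP, (g + 1) * GROUP)
        scale = np.abs(w[rows]).max(axis=0) / 7.0 + 1e-12  # INT4 range is [-8, 7]
        scales[g] = scale
        q[rows] = np.clip(np.round(w[rows] / scale), -8, 7).astype(np.int8)
    return q, scales

def w4a16_matmul(x: np.ndarray, q: np.ndarray, scales: np.ndarray):
    """Dequantize group by group, then multiply with higher-precision activations."""
    w_deq = np.zeros(q.shape, dtype=np.float32)
    for g in range(scales.shape[0]):
        rows = slice(g * GROUP, (g + 1) * GROUP)
        w_deq[rows] = q[rows].astype(np.float32) * scales[g]
    return x.astype(np.float32) @ w_deq

rng = np.random.default_rng(1)
w = rng.normal(size=(256, 64)).astype(np.float32)
x = rng.normal(size=(4, 256)).astype(np.float32)
q, s = quantize_w4a16(w)
print(np.abs(w4a16_matmul(x, q, s) - x @ w).max())  # residual quantization error
```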