vLLM V1 Understanding
These notes are based on a blog walkthrough of the vLLM V1 inference engine (commit 42172ad, August 9th, 2025), focusing on core system design, advanced features, distributed scaling, and performance modeling.
I. LLM Engine and Engine Core: System Fundamentals
The vLLM system achieves high-throughput inference through specialized scheduling, memory management, and GPU optimization implemented within the Engine Core.
A. Initialization and Optimization
The Worker construction process involves three key procedures:
- Init Device: Assigns the CUDA device, verifies model data type support (e.g., bf16), checks VRAM availability based on gpu_memory_utilization, and instantiates the model_runner and the CPU-side InputBatch object.
- Load Model: Instantiates the model architecture, loads weights, and sets PyTorch to inference mode (model.eval()).
- Initialize KV Cache: Determines the per-layer KV-cache specification (historically FullAttentionSpec, but complex for hybrid models like Jamba) and profiles the available VRAM to compute the number of allocatable KV cache blocks.
A critical performance optimization is the use of CUDA Graphs. During a dummy run across warmup batch sizes, the sequence of GPU work is recorded as a Directed Acyclic Graph (DAG). Replaying these pre-baked graphs during the forward pass significantly reduces kernel launch overhead, improving overall latency.
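For reference, the capture/replay pattern resembles the standard PyTorch CUDA Graphs recipe below. This is a minimal sketch on a toy module using torch.cuda.graph, not vLLM's actual capture code, and it requires a CUDA device:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.randn(8, 4096, device="cuda")

# Warm up on a side stream so lazy initialization happens before capture.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Record the GPU work for this batch shape once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# ...then replay it with new data copied into the static input buffer,
# avoiding per-kernel launch overhead on every forward pass.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```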
B. Paged Attention and Continuous Batching
The Scheduler determines which requests (Prefill or Decode) run in the next step(). Its core is the KV Cache Manager, which implements Paged Attention.
- KV Cache Management: The manager maintains a free_block_queue, a pool of available KV-cache blocks. These blocks (defaulting to 16 tokens) form the indexing structure for mapping tokens to their Key/Value vectors in memory. The size of a block for a standard transformer layer (non-MLA) is calculated as $2 \times \text{block_size} \times \text{num_kv_heads} \times \text{head_size} \times \text{dtype_num_bytes}$ (a worked example follows this list).
- Continuous Batching: The step() function iterates over Schedule, Forward pass, and Postprocess stages. Continuous batching is natively supported because the forward pass concatenates all sequences into a “super sequence”. Custom paged attention kernels utilize position indices and attention masks to ensure that self-attention is correctly restricted to tokens within the same sequence, eliminating the need for right-padding and improving memory efficiency.
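To make the block-size formula concrete, here is a small worked example; the helper is hypothetical and the 8-KV-head / 128-head-size / bf16 configuration is illustrative, not tied to a specific model:

```python
def kv_block_bytes(block_size: int, num_kv_heads: int, head_size: int,
                   dtype_num_bytes: int) -> int:
    # The factor of 2 accounts for the separate Key and Value tensors per token.
    return 2 * block_size * num_kv_heads * head_size * dtype_num_bytes

# Example: 16-token blocks, 8 KV heads, head size 128, bf16 (2 bytes)
# -> 2 * 16 * 8 * 128 * 2 = 65,536 bytes (64 KiB) per layer per block.
print(kv_block_bytes(block_size=16, num_kv_heads=8, head_size=128, dtype_num_bytes=2))
```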
C. Scheduling Logic (Prefill vs. Decode)
The V1 scheduler mixes Prefill requests (compute-bound, forward pass over all prompt tokens) and Decode requests (memory-bandwidth-bound, forward pass over the single most recent token). Decode requests, which prioritize latency (ITL), are processed first.
The core scheduling decision revolves around the allocate_slots function:
- It calculates the required number of new KV-cache blocks ($n = \lceil \text{new_tokens} / 16 \rceil$).
- It checks free_block_queue availability. If insufficient blocks exist, the scheduler may attempt recompute preemption by evicting low-priority requests (calling kv_cache_manager.free).
- Available blocks are fetched from the free_block_queue (a doubly linked list) and mapped to the request ID.
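The allocation decision above can be sketched as a toy model under assumed names; vLLM's kv_cache_manager uses a custom doubly linked list and also handles preemption, prefix caching, and reference counts, none of which are modeled here:

```python
import math
from collections import deque

BLOCK_SIZE = 16
free_block_queue = deque(range(1000))          # pool of free block ids (a deque stands in)
req_to_blocks: dict[str, list[int]] = {}       # request id -> allocated block ids

def allocate_slots(request_id: str, num_new_tokens: int) -> bool:
    n = math.ceil(num_new_tokens / BLOCK_SIZE)  # required number of new KV-cache blocks
    if len(free_block_queue) < n:
        return False                            # caller may preempt (recompute) and retry
    blocks = [free_block_queue.popleft() for _ in range(n)]
    req_to_blocks.setdefault(request_id, []).extend(blocks)
    return True

print(allocate_slots("req-0", num_new_tokens=100))  # needs ceil(100/16) = 7 blocks
```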
II. Advanced Features for Inference Optimization
A. Chunked Prefill and Prefix Caching
- Chunked Prefill: Prevents a single long prompt from monopolizing an engine step and increasing latency for other requests. If the number of requested tokens exceeds long_prefill_token_threshold, the prefill is truncated and executed over multiple engine steps.
- Prefix Caching: Enables reuse of KV cache blocks for shared prefixes across multiple requests.
  - The prefix is split into block-sized chunks (e.g., 16 tokens).
  - A hash (SHA-256 or built-in hash) is computed for each complete chunk, incorporating the previous block’s hash, current tokens, and optional metadata (e.g., LoRA ID, cache salt); a toy version is sketched after this list.
  - find_longest_cache_hit checks the cached_block_hash_to_block mapping.
  - If a match is found, the KV blocks are reused, and the corresponding reference count is managed by the KV cache manager (e.g., incremented if the original request is still live). Invalidation occurs only when a cached block is about to be reallocated from the free_block_queue.
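A toy version of the chained block hashing might look like the following; the helper name and payload encoding are assumptions, and vLLM's actual keying and serialization differ in detail:

```python
import hashlib
import pickle

BLOCK_SIZE = 16

def hash_prefix_blocks(token_ids: list[int], extra_key: str | None = None) -> list[str]:
    """Chained hashing of complete prefix blocks (illustrative sketch).

    Each block hash covers the parent block's hash, the block's tokens, and
    optional metadata (e.g., a LoRA ID or cache salt), so a hit on block i
    implies the entire prefix up to block i matches.
    """
    hashes, parent = [], ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        payload = pickle.dumps((parent, block, extra_key))
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes  # incomplete trailing blocks are never hashed or cached

print(len(hash_prefix_blocks(list(range(40)))))  # 40 tokens -> 2 complete blocks
```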
B. Guided Decoding (FSM)
Guided decoding enforces grammatical constraints on token generation using a Finite State Machine (FSM).
- A StructuredOutputManager compiles the grammar (using third-party backends like xgrammar).
- The backend generates a _grammar_bitmask tensor based on the current FSM state. For large vocabularies, this uses multiple 32-bit integers.
- After the forward pass produces logits, a function (e.g., xgr_torch_compile) expands the bitmask (32x expansion ratio) and masks disallowed logits by setting them to $-\infty$, enforcing constraints before sampling (a simplified version is sketched after this list).
- The FSM advances its state using accept_tokens after the next token is sampled.
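A simplified version of the bitmask expansion and logit-masking step, assuming a packed int32 mask where bit i of word j allows token j*32 + i; the function name is hypothetical, and vLLM delegates this work to xgrammar's kernels:

```python
import torch

def apply_grammar_bitmask(logits: torch.Tensor, bitmask: torch.Tensor) -> torch.Tensor:
    """Mask disallowed tokens to -inf. logits: (vocab,), bitmask: packed int32 words."""
    vocab_size = logits.shape[-1]
    shifts = torch.arange(32, dtype=torch.int32)
    # Unpack each 32-bit word into 32 allow/deny flags (the 32x expansion).
    bits = (bitmask.unsqueeze(-1) >> shifts) & 1
    allowed = bits.flatten()[:vocab_size].bool()
    return logits.masked_fill(~allowed, float("-inf"))

# Example: a 64-token vocabulary where only tokens 0, 1, and 35 are allowed.
mask = torch.tensor([0b11, 1 << 3], dtype=torch.int32)
print(apply_grammar_bitmask(torch.zeros(64), mask))
```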
C. Speculative Decoding (SpecDec)
Speculative decoding aims to reduce the effective per-token latency by verifying multiple proposed tokens in a single large-model forward pass. Rather than a full LLM draft model, V1 supports faster proposal schemes, including n-gram, EAGLE, and Medusa. The system flow involves:
- The large model runs the initial prefill.
- A drafter (e.g., NgramProposer) proposes $k$ draft tokens (request.spec_token_ids).
- In the next step, allocate_slots reserves KV blocks for the context plus the $k$ draft tokens.
- A single large-model forward pass runs over the augmented sequence.
- A custom rejection_sampler (partially implemented in Triton) performs left-to-right verification and produces output_token_ids. The statistical equivalence to standard autoregressive sampling is guaranteed by the accept/reject rule (accept based on the $P_{large} / P_{draft}$ ratio), sketched after this list.
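For intuition, a minimal (non-Triton) version of the accept/reject rule could look like the following. It omits the bonus token that full speculative sampling draws when every draft is accepted, and all names are illustrative rather than vLLM's rejection_sampler API:

```python
import torch

def verify_drafts(p_large: torch.Tensor, p_draft: torch.Tensor,
                  draft_tokens: torch.Tensor) -> list[int]:
    """Left-to-right verification of k draft tokens.

    p_large, p_draft: (k, vocab) probabilities from the target and draft models.
    Token t_i is accepted with probability min(1, p_large[i, t_i] / p_draft[i, t_i]);
    on the first rejection we resample from the residual distribution and stop.
    """
    out = []
    for i, tok in enumerate(draft_tokens.tolist()):
        if torch.rand(()) < p_large[i, tok] / p_draft[i, tok]:
            out.append(tok)                      # accept the draft token
        else:
            residual = (p_large[i] - p_draft[i]).clamp(min=0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break                                # everything after a rejection is discarded
    return out

k, vocab = 3, 8
p_large = torch.softmax(torch.randn(k, vocab), dim=-1)
p_draft = torch.softmax(torch.randn(k, vocab), dim=-1)
print(verify_drafts(p_large, p_draft, torch.randint(vocab, (k,))))
```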
D. Disaggregated Prefill/Decode (P/D)
Disaggregation separates compute-bound prefill from memory-bandwidth-bound decode workloads to achieve tighter latency control (TTFT and ITL).
- A Connector abstraction (e.g., SharedStorageConnector for illustration) manages the transfer of KVs.
- Prefill instances write KV data, while decode instances read from the external KV-cache service.
- During scheduling, connector methods (get_num_new_matched_tokens) check for externally cached tokens.
- The KV exchange occurs within a context manager surrounding the forward pass (sketched after this list):
  - On entry: kv_connector.start_load_kv loads external KV into paged memory (for decode).
  - On exit: kv_connector.wait_for_save blocks until the KV is uploaded (for prefill).
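Conceptually, the context manager wraps the forward pass like the sketch below; the method names come from the description above, but the signatures and wiring are assumptions, not vLLM's actual code:

```python
from contextlib import contextmanager

@contextmanager
def kv_transfer(kv_connector):
    # On entry: decode instances pull externally cached KV into paged GPU memory.
    kv_connector.start_load_kv()
    try:
        yield
    finally:
        # On exit: prefill instances block until their KV has been uploaded.
        kv_connector.wait_for_save()

# Hypothetical usage around the model execution step:
# with kv_transfer(connector):
#     hidden_states = model_runner.execute_model(scheduler_output)
```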
III. Scaling and Distributed Serving Architecture
A. Multi-GPU Parallelism (Scaling Up)
For models exceeding single-GPU VRAM, vLLM utilizes Tensor Parallelism (TP) within a node, potentially combined with Pipeline Parallelism (PP) across nodes. The MultiProcExecutor coordinates multiple GPU workers.
- Workers communicate using an rpc_broadcast_mq (implemented via shared memory) to receive work.
- Results are collected from the designated output rank via worker_response_mq.
- The executor transparently manages the partitioning and collection of work, allowing the EngineCore to call execute_model without awareness of the underlying parallelism complexity.
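The broadcast/collect pattern can be illustrated with ordinary multiprocessing queues. This is a toy stand-in: vLLM's rpc_broadcast_mq and worker_response_mq are shared-memory message queues, and execute_model runs a real model shard.

```python
import multiprocessing as mp

def worker(rank: int, work_q, response_q, output_rank: int = 0):
    # Each worker blocks on its work queue, runs its shard of the model,
    # and only the designated output rank reports results back.
    while True:
        req = work_q.get()
        if req is None:
            break
        result = f"rank {rank} processed {req}"   # stand-in for execute_model
        if rank == output_rank and response_q is not None:
            response_q.put(result)

if __name__ == "__main__":
    world_size = 2
    work_qs = [mp.Queue() for _ in range(world_size)]
    response_q = mp.Queue()
    procs = [mp.Process(target=worker,
                        args=(r, work_qs[r], response_q if r == 0 else None))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for q in work_qs:           # "broadcast" by enqueuing on every worker's queue
        q.put("request-0")
    print(response_q.get())     # result collected from the output rank only
    for q in work_qs:
        q.put(None)             # shutdown signal
    for p in procs:
        p.join()
```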
B. Distributed Serving (Scaling Out)
Scaling out involves Data Parallelism (DP), replicating the model across nodes, orchestrated by a distributed serving layer.
- Engine Core Processes: On each node, CoreEngineProcManager launches DPEngineCoreProc instances (DP replicas), each initializing its own EngineCore with the MultiProcExecutor (for TP).
- Steady State: Each DPEngineCoreProc runs three steady-state threads (sketched after this list):
  - Input Thread: Blocks on the input ZMQ socket, dequeues requests from the API server, and enqueues them onto the process’s input_queue.
  - Main Thread: Wakes on the input_queue, runs engine_core.step(), and enqueues results to the output_queue.
  - Output Thread: Wakes on the output_queue and sends results back through the output socket to the API server.
- Synchronization: If any DP replica is active, all replicas execute a step; idle replicas run dummy steps to maintain lockstep synchronization. This is necessary for architectures like MoE (Expert Parallelism, EP) but is currently applied across all DP setups.
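The three-thread steady state can be sketched with plain queues and threads; ZMQ sockets and engine_core.step() are replaced by stand-ins, and the structure is assumed from the description above:

```python
import queue
import threading

input_queue: "queue.Queue[str]" = queue.Queue()
output_queue: "queue.Queue[str]" = queue.Queue()

def input_thread(requests):
    for req in requests:          # stand-in for blocking on the input ZMQ socket
        input_queue.put(req)
    input_queue.put(None)         # sentinel: no more requests

def main_thread():
    while True:
        req = input_queue.get()   # wake on new work
        if req is None:
            output_queue.put(None)
            break
        output_queue.put(f"step() output for {req}")  # stand-in for engine_core.step()

def output_thread():
    while True:
        out = output_queue.get()
        if out is None:
            break
        print(out)                # stand-in for sending on the output socket

threads = [threading.Thread(target=input_thread, args=(["req-0", "req-1"],)),
           threading.Thread(target=main_thread),
           threading.Thread(target=output_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```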
C. Frontend and Load Balancing
The API server uses AsyncLLM and the DPLBAsyncMPClient for asynchronous, load-balanced, data-parallel operation.
- The DPCoordinator process mediates between the frontend and engine cores, periodically transmitting load-balancing information (e.g., queue sizes).
- Load balancing selects the engine core with the minimal load score: \(\text{score} = \text{len}(\text{waiting}) \times 4 + \text{len}(\text{running})\) (a small helper is sketched after this list).
- Frontend tasks (FastAPI routes) asynchronously call AsyncLLM.generate. After load balancing, the request is sent to the chosen engine’s input_socket. Output results stream back via asynchronous tasks (process_outputs_socket, output_handler).
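The scoring rule amounts to a one-liner; here is a hypothetical helper (not vLLM's API) that applies it:

```python
def pick_engine(stats: dict[str, tuple[int, int]]) -> str:
    """Pick the engine with the minimal load score.

    stats maps engine id -> (num_waiting, num_running); waiting requests are
    weighted 4x, matching the formula above.
    """
    return min(stats, key=lambda e: stats[e][0] * 4 + stats[e][1])

# Example: engine "b" wins (0*4 + 5 = 5 < 2*4 + 1 = 9).
print(pick_engine({"a": (2, 1), "b": (0, 5)}))
```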
IV. Benchmarks and Performance Modeling
Performance evaluation targets two competing objectives: Latency (time to response) and Throughput (tokens/requests per second). Key metrics include TTFT (Time to First Token), ITL (Inter-Token Latency), and Goodput (Throughput meeting Service-Level Objectives, SLOs).
A. The Latency-Throughput Tradeoff
The relationship is defined by the GPU Roofline Model:
- Below the saturation batch size ($B_{sat}$), the system is HBM bandwidth-bound (streaming weights layer-by-layer), meaning step latency remains relatively flat as batch size ($B$) increases.
- Above $B_{sat}$, the kernel becomes compute-bound. Step time increases roughly proportional to $B$, directly increasing Inter-Token Latency (ITL).
The tradeoff means optimizing for high $B$ increases throughput (due to amortized weight I/O) but eventually degrades ITL once $B > B_{sat}$.
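A back-of-the-envelope model of this behavior for decode, using rough published H100 figures and an assumed 8B-parameter bf16 dense model (illustrative only, not a measurement):

```python
def decode_step_time_s(batch_size: int,
                       num_params: float = 8e9,      # assumed 8B-parameter dense model
                       bytes_per_param: int = 2,     # bf16 weights
                       hbm_bw: float = 3.35e12,      # ~3.35 TB/s (H100 SXM HBM3)
                       peak_flops: float = 990e12) -> float:  # ~990 bf16 TFLOP/s, dense
    t_mem = num_params * bytes_per_param / hbm_bw          # stream all weights once per step
    t_compute = 2 * num_params * batch_size / peak_flops   # ~2 FLOPs per parameter per token
    return max(t_mem, t_compute)                           # roofline: the slower side dominates

# Step time (and hence ITL) stays flat below B_sat (~296 here), then grows with B.
for b in (1, 64, 256, 512, 1024):
    print(f"B={b:5d}  step={decode_step_time_s(b) * 1e3:6.2f} ms")
```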
B. Benchmarking Utilities
vLLM provides a vllm bench CLI for performance measurement:
- latency: Measures end-to-end latency for small batches (e.g., 8).
- throughput: Submits a large fixed workload (QPS = $\infty$ mode) and reports total tokens/requests per second.
- serve: Simulates real-world conditions using a probabilistic arrival distribution (Poisson/Gamma) and can enforce server-side concurrency limits, measuring metrics against SLOs.
An integrated auto-tune script drives the serve benchmark to automatically discover optimal argument settings that maximize throughput while adhering to specified latency SLOs (e.g., p99 E2E latency limits).