Summary

Disaggregating the prefill and decoding phases in LLM inference enables speedups by optimizing each phase for its distinct computational demands, resource requirements, and bottlenecks.

1. Phase-Specific Computational Characteristics

  • Prefill Phase (Compute-Bound):

Processes the entire input prompt in parallel, requiring heavy matrix operations (e.g., attention across all tokens).

Benefits from high-throughput compute resources (e.g., powerful GPUs/TPUs) to maximize parallel processing.

  • Decoding Phase (Memory-Bound):

Generates tokens sequentially, with each step dependent on prior outputs.

Limited by memory bandwidth (the model weights and the growing KV cache must be re-read for every generated token) rather than raw compute; the back-of-the-envelope sketch below illustrates the gap.
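
To make the contrast concrete, here is a rough back-of-the-envelope sketch. The dimensions (d_model=4096, 32 layers, a 2048-token prompt) and the ~12·d_model² weights per layer are illustrative assumptions for a mid-sized dense decoder-only model, and attention FLOPs are ignored; the point is only the orders-of-magnitude difference in FLOPs per byte between the two phases.

```python
# Back-of-the-envelope arithmetic intensity for a full forward pass.
# Illustrative assumptions: dense decoder-only model, fp16 weights,
# ~12 * d_model^2 weight parameters per layer, attention FLOPs ignored.

def phase_stats(num_tokens: int, d_model: int = 4096, n_layers: int = 32):
    """Return (GFLOPs, GB moved) for processing `num_tokens` tokens in one pass."""
    params = 12 * d_model**2 * n_layers      # weight count (illustrative)
    flops = 2 * params * num_tokens          # one multiply-add per weight per token
    bytes_moved = 2 * params                 # fp16: every weight read once per pass
    return flops / 1e9, bytes_moved / 1e9

prefill_flops, prefill_gb = phase_stats(num_tokens=2048)  # whole prompt in parallel
decode_flops, decode_gb = phase_stats(num_tokens=1)       # one new token per step

print(f"prefill: {prefill_flops:9.0f} GFLOPs / {prefill_gb:.1f} GB "
      f"= {prefill_flops / prefill_gb:5.0f} FLOPs per byte (compute-bound)")
print(f"decode : {decode_flops:9.0f} GFLOPs / {decode_gb:.1f} GB "
      f"= {decode_flops / decode_gb:5.0f} FLOPs per byte (memory-bound)")
```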

2. Resource Specialization

  • Prefill: Allocate high-FLOPS devices (e.g., A100/H100 GPUs) to exploit parallelism.

  • Decoding: Use memory-optimized hardware (e.g., inference accelerators with high memory bandwidth and large caches) or techniques like KV caching to reduce redundant memory access; a placement sketch follows below.
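
As an illustration of what this specialization can look like in a serving configuration, here is a hypothetical placement sketch. The PoolSpec type, device names ("H100-80GB", "L40S"), and replica counts are invented for the example and not tied to any particular serving framework.

```python
from dataclasses import dataclass

# Hypothetical placement config; device names, replica counts, and the PoolSpec
# type are invented for this sketch and not tied to any serving framework.

@dataclass
class PoolSpec:
    device: str     # accelerator type for this pool
    replicas: int   # how many workers to run
    role: str       # "prefill" (FLOPs-heavy) or "decode" (bandwidth-heavy)

PLACEMENT = [
    PoolSpec(device="H100-80GB", replicas=4, role="prefill"),
    PoolSpec(device="L40S", replicas=8, role="decode"),
]

def pool_for(phase: str) -> PoolSpec:
    """Route a request phase to the pool specialized for it."""
    return next(p for p in PLACEMENT if p.role == phase)

print(pool_for("prefill"))
print(pool_for("decode"))
```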

3. Batching Efficiency

  • Prefill: Batch multiple prompts to maximize GPU utilization (large, static workloads).

  • Decoding: Use smaller, dynamic batches tailored to latency-sensitive token generation, avoiding interference from prefill’s bulkier computations; the toy scheduling loop below contrasts the two batching policies.
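
The following toy scheduler loop shows the two policies side by side: large static prefill batches versus a decode batch rebuilt every step. run_prefill and run_decode_step are hypothetical stubs standing in for real model calls, and the batch sizes and stop condition are arbitrary.

```python
from collections import deque

# Toy scheduler sketch: prefill requests are grouped into large static batches,
# while the decode batch is rebuilt every step from whatever sequences are live.

def run_prefill(prompts):            # stub: would return per-request KV caches
    return [{"prompt": p, "generated": []} for p in prompts]

def run_decode_step(states):         # stub: appends one token per live sequence
    for s in states:
        s["generated"].append("<tok>")

waiting = deque(["prompt A", "prompt B", "prompt C", "prompt D"])
live = []                            # sequences currently generating
MAX_PREFILL_BATCH = 2                # large, throughput-oriented
MAX_DECODE_BATCH = 8                 # small enough to keep per-token latency low

while waiting or live:
    # Prefill: drain waiting prompts in big chunks to saturate the GPU.
    if waiting:
        batch = [waiting.popleft() for _ in range(min(MAX_PREFILL_BATCH, len(waiting)))]
        live.extend(run_prefill(batch))
    # Decode: one token for every live sequence, re-batched each iteration.
    run_decode_step(live[:MAX_DECODE_BATCH])
    live = [s for s in live if len(s["generated"]) < 4]   # toy stop condition

print("done")
```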

4. Memory and Latency Optimization

  • Prefill: Compute and cache the attention keys/values (the KV cache) while processing the prompt; decoding reuses this cache so attention over the prompt is never recomputed.

  • Decoding: Focus on minimizing latency by streamlining memory access (e.g., keeping the KV cache and hot weights in high-bandwidth memory close to the compute units); a minimal KV-cache sketch follows below.
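
A minimal single-head PyTorch sketch of the KV-cache pattern, assuming toy dimensions and random weights: prefill builds the cache for the whole prompt in one shot, and each decode step only appends one row and attends over the growing cache instead of recomputing it.

```python
import torch

# Single-head attention with a KV cache (illustrative shapes and random weights).

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def prefill(prompt_emb):                      # prompt_emb: [prompt_len, d]
    k, v = prompt_emb @ Wk, prompt_emb @ Wv   # computed once, in parallel
    return k, v                               # the KV cache handed to decoding

def decode_step(x, k_cache, v_cache):         # x: [1, d] embedding of newest token
    q = x @ Wq
    k_cache = torch.cat([k_cache, x @ Wk], dim=0)   # append one row only
    v_cache = torch.cat([v_cache, x @ Wv], dim=0)
    attn = torch.softmax(q @ k_cache.T / d**0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache   # output plus updated cache

k_cache, v_cache = prefill(torch.randn(128, d))   # parallel over 128 prompt tokens
x = torch.randn(1, d)
for _ in range(4):                                # sequential token generation
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
    x = out                                       # toy: feed the output back in
```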

5. Asynchronous and Scalable Workflow

  • Decouple prefill and decoding into separate systems, enabling:

  • Overlap: A request begins decoding as soon as its prefill finishes, while the prefill hardware immediately moves on to the next prompt, hiding prefill latency behind ongoing generation.

  • Scalability: Independently scale prefill (throughput-oriented) and decoding (latency-oriented) resources based on demand; the queue-based sketch below shows one way to wire this up.
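
One way to sketch this decoupling is with two worker types joined by a queue. The asyncio workers, the sleeps, and the "<kv>" placeholder below are stand-ins for separate services and real compute, chosen only to show the overlap and the independent scaling of each side.

```python
import asyncio

# Sketch of decoupling the two phases into independent workers joined by a queue.
# Sleeps stand in for real compute; worker counts can be tuned per side.

async def prefill_worker(prompts: asyncio.Queue, ready: asyncio.Queue):
    while True:
        prompt = await prompts.get()
        await asyncio.sleep(0.05)               # stands in for the compute-bound pass
        await ready.put({"prompt": prompt, "kv_cache": "<kv>"})
        prompts.task_done()

async def decode_worker(ready: asyncio.Queue):
    while True:
        state = await ready.get()
        for _ in range(3):                      # stands in for sequential token steps
            await asyncio.sleep(0.01)
        print("finished:", state["prompt"])
        ready.task_done()

async def main():
    prompts, ready = asyncio.Queue(), asyncio.Queue()
    # Scale each side independently: more prefill workers for throughput,
    # more decode workers for latency under load.
    workers = [asyncio.create_task(prefill_worker(prompts, ready)) for _ in range(2)]
    workers += [asyncio.create_task(decode_worker(ready)) for _ in range(4)]
    for p in ["A", "B", "C", "D"]:
        await prompts.put(p)
    await prompts.join()
    await ready.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(main())
```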

6. Reduced Contention

  • Avoid resource competition (e.g., GPU cores vs. memory bandwidth) by isolating phases, ensuring neither starves the other.

Key Technical Insights

  • KV Caching: Storing precomputed attention states during prefill drastically reduces decoding overhead.

  • Hardware Fit: Match compute-heavy prefill to GPUs and memory-bound decoding to optimized inference accelerators.

  • Pipeline Parallelism: Overlap prefill for one request with decoding for another, improving overall throughput.

Example Workflow

  • Prefill Server: Processes a batch of prompts in parallel, generates KV caches.

  • Decoding Server: Uses the cached KV states handed over by the prefill server to generate tokens efficiently, even at low batch sizes (a toy handoff is sketched below).
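
A toy version of this two-server handoff, where prefill_server and decode_server are hypothetical functions and an in-memory buffer stands in for the real transport (network, NVLink, RDMA). Only the shape of the workflow is the point: serialize the KV cache on one side, resume generation from it on the other.

```python
import io
import torch

# Toy "two server" handoff: the prefill side serializes the KV cache it produced
# and the decode side deserializes it and keeps generating from it.

def prefill_server(prompt_len: int = 128, d: int = 64) -> bytes:
    kv = {"k": torch.randn(prompt_len, d), "v": torch.randn(prompt_len, d)}
    buf = io.BytesIO()
    torch.save(kv, buf)                      # ship the cache, not the raw prompt
    return buf.getvalue()

def decode_server(payload: bytes, steps: int = 4) -> int:
    kv = torch.load(io.BytesIO(payload))
    k, v = kv["k"], kv["v"]
    for _ in range(steps):                   # each step only appends one row
        k = torch.cat([k, torch.randn(1, k.shape[1])])
        v = torch.cat([v, torch.randn(1, v.shape[1])])
    return k.shape[0]                        # prompt tokens + generated tokens

payload = prefill_server()
print("cache length after decoding:", decode_server(payload))
```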

Result

  • Higher Throughput: Efficient batching and parallelism in prefill.

  • Lower Latency: Optimized memory access and specialized hardware for decoding.

  • Cost Efficiency: Right-sizing resources for each phase reduces idle time and operational costs.