(Credits to Shikai Li)

Overview

Speculative decoding is a technique used to improve the latency of generating text with large language models. It involves using a smaller “draft” model to generate a sequence of tokens quickly, which are then verified by a larger, more accurate model.

Process

  • Draft Model Generation:
    • The draft model generates a sequence of tokens one at a time, up to a specified number (e.g., 10 tokens).
  • Verification by the Original Model:
    • These tokens are appended to the current sequence and passed to the original, larger model, which computes the next-token probabilities for every position in a single forward pass.
    • The larger model processes all tokens together, leveraging causal masks to ensure that each token prediction only depends on previous tokens.
  • Comparison and Validation:
    • The output tokens from the larger model are compared with the tokens generated by the draft model at the corresponding positions.
    • Draft tokens are accepted up to the first position where they disagree with the larger model's prediction; at that point the larger model's own token is used instead, and drafting resumes from there (see the code sketch after this list).
  • Efficiency Considerations:
    • The speculative decoding approach focuses on reducing latency rather than increasing throughput.
    • It is particularly beneficial at small batch sizes, where the overhead of running the draft model is outweighed by verifying several tokens in a single pass of the larger model.
    • For larger batch sizes, the benefits diminish as the draft model introduces additional overhead, and the larger model can already achieve good throughput.
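
A minimal sketch of this loop is given below. It uses greedy acceptance for clarity (real implementations typically use rejection sampling on the full probability distributions so that the output matches the target model's distribution); `draft_next` and `target_argmax` are hypothetical callables standing in for the two models, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List

def speculative_decode_greedy(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],           # hypothetical: draft model's greedy next token
    target_argmax: Callable[[List[int]], List[int]],  # hypothetical: target's greedy prediction at every position
    draft_len: int = 10,
    max_new_tokens: int = 50,
) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        # 1. Draft: generate draft_len tokens autoregressively with the cheap model.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: one forward pass of the target over prefix + draft. The causal mask
        #    means position i's prediction depends only on tokens before i, so a single
        #    pass yields the target's "next token" for every draft position (plus one
        #    bonus position after the last draft token).
        preds = target_argmax(tokens + draft)
        target_next = preds[len(tokens) - 1:]         # length draft_len + 1
        # 3. Accept draft tokens until the first disagreement, then append the target's
        #    own token there, so every verification step gains at least one token.
        n_accept = 0
        while n_accept < draft_len and draft[n_accept] == target_next[n_accept]:
            n_accept += 1
        tokens += draft[:n_accept]
        tokens.append(target_next[n_accept])
    return tokens[: len(prefix) + max_new_tokens]

# Toy stand-ins (assumptions, purely to make the sketch runnable): the "target" counts
# upward mod 100; the "draft" agrees except it stumbles on multiples of 7.
def toy_target_argmax(ids: List[int]) -> List[int]:
    return [(t + 1) % 100 for t in ids]

def toy_draft_next(ids: List[int]) -> int:
    nxt = (ids[-1] + 1) % 100
    return (nxt + 1) % 100 if nxt % 7 == 0 else nxt

print(speculative_decode_greedy([0, 1, 2], toy_draft_next, toy_target_argmax, max_new_tokens=20))
```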

Intuition and Limitations:

  • Latency vs. Throughput:
    • Speculative decoding is designed to minimize the time it takes to generate a sequence (latency), rather than maximizing the number of sequences processed simultaneously (throughput).
    • With small batch sizes, decoding is memory-bandwidth-bound, so quickly verifying and correcting several tokens per weight load pays off; this advantage shrinks as batches grow.
  • Overhead and Efficiency:
    • The draft model adds computational overhead; the main advantage comes from verifying several draft tokens in a single batched forward pass of the larger model.
    • In scenarios with large batch sizes, the speculative decoding process may become slower than traditional decoding due to this overhead.
  • KV Cache Savings:
    • Speculative decoding can save time on loading the key-value (KV) cache: verification runs over a single sequence (batch size 1), so only that sequence's KV-cache rows are loaded while several tokens are checked at once (see the worked comparison under Discussions).
    • This is particularly advantageous in long-context scenarios where the KV cache can become a bottleneck.
  • Use with Mixture of Experts (MoE):
    • The approach may be less effective with mixture-of-experts (MoE) models: on top of the draft model's overhead, verifying several tokens at once can route to many more distinct experts per layer than the single token of normal decoding, so more expert weights must be loaded during verification (a toy illustration follows this list).
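
As a toy illustration of the expert-loading effect, the snippet below estimates how many distinct experts one MoE layer has to load when routing 10 tokens versus 1 token. All numbers are assumptions (64 experts per layer, top-2 routing, uniformly random routing), not any real model's configuration.

```python
import random

random.seed(0)
N_EXPERTS, TOP_K = 64, 2  # assumed MoE config: 64 experts per layer, top-2 routing

def avg_experts_loaded(n_tokens: int, trials: int = 2000) -> float:
    """Average number of distinct experts one layer touches when routing n_tokens
    (uniformly random routing is enough to show the trend)."""
    total = 0
    for _ in range(trials):
        chosen = set()
        for _ in range(n_tokens):
            chosen.update(random.sample(range(N_EXPERTS), TOP_K))
        total += len(chosen)
    return total / trials

print(f"experts loaded per layer, 1 token  : {avg_experts_loaded(1):.1f}")   # ~2
print(f"experts loaded per layer, 10 tokens: {avg_experts_loaded(10):.1f}")  # ~17
```

With these toy numbers, verifying 10 draft tokens pulls roughly 17 of 64 experts into memory per layer instead of 2, which erodes the bandwidth savings that motivate speculative decoding in the first place.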

In summary, speculative decoding is a technique aimed at reducing latency for small batch sizes by using a draft model to quickly generate and verify token sequences, with the main benefits arising from efficient verification rather than increased throughput.

Discussions

  • Option 1: Speculative decoding / contiguous batching: 10 tokens = 1 query x 10 tokens per query
    • Activation: batch size = 1, seq len = 10
    • KV cache: batch size = 1, seq len = 1000
  • Option 2: Disaggregated (normal decoding): 10 tokens = 10 queries x 1 token per query
    • Activation: batch size = 10, seq len = 1
    • KV cache: batch size = 10, seq len = 1000

Activation memory access / bandwidth are the same in both cases. For the KV cache, however, Option 1 saves 10x in memory access.

  • 70B model, 2K context: the two options are close (weight loading dominates and the KV cache is small).
  • 70B model, 128K context: the KV-cache saving becomes significant.
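
A back-of-the-envelope script for this comparison is sketched below. All hyperparameters are assumptions chosen to be "70B-like" (80 layers, grouped-query attention with 8 KV heads, head dim 128, fp16 weights and cache), and the draft model's own cost is ignored; the point is only the relative size of weight traffic versus KV-cache traffic per 10 generated tokens.

```python
GB = 1e9
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2  # assumed "70B-like" config, fp16
WEIGHT_BYTES = 70e9 * BYTES                             # ~140 GB of fp16 weights, read once per pass

def kv_bytes(batch_size: int, context_len: int) -> float:
    """Bytes of KV cache read: K and V tensors for every layer and every cached token."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
    return batch_size * context_len * per_token

for context in (2_000, 128_000):
    # Option 1: verify 10 draft tokens for ONE query -> KV cache of a single sequence.
    opt1 = WEIGHT_BYTES + kv_bytes(batch_size=1, context_len=context)
    # Option 2: decode 1 token each for TEN queries -> KV caches of ten sequences.
    opt2 = WEIGHT_BYTES + kv_bytes(batch_size=10, context_len=context)
    print(f"context {context:>7}: option 1 ~{opt1/GB:5.0f} GB, option 2 ~{opt2/GB:5.0f} GB, "
          f"ratio {opt2/opt1:.2f}x")
```

With these assumptions the two options move a similar number of bytes at a 2K context (weights dominate), while at 128K the KV cache dominates and Option 2 reads roughly 3x more data per 10 generated tokens.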