LLM generation is memory-bandwidth bound: producing text token by token is slow because the GPU must stream the model's full weights from memory for every single token it generates. Speculative Decoding breaks this sequential bottleneck. A small, fast 'draft' model (e.g., 1B parameters) rapidly guesses the next several tokens (say, 5 to 10), and the guesses are passed to the large 'target' model (e.g., 70B parameters), which evaluates all of them in parallel in a single forward pass. Wherever the target model would have produced the same tokens, it accepts them immediately, effectively generating several tokens for roughly the computational cost of one.
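To put a rough number on that "several for the price of one" intuition: if each draft token is accepted with probability α (treated as independent, a simplifying assumption) and the draft proposes k tokens, the expected number of tokens emitted per target pass is (1 − α^(k+1)) / (1 − α), counting the correction or bonus token. A quick back-of-the-envelope check, where the 0.8 acceptance rate is purely illustrative:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha
    (includes the final correction/bonus token)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_pass(alpha=0.8, k=5))  # ~3.69 tokens per pass
```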

How It Works

  • Drafting: A fast, small model generates a sequence of k speculative tokens.
  • Verification: The large target model processes these k tokens in parallel.
  • Acceptance: The target model compares the draft's probabilities against its own and accepts drafted tokens up to the first one it would not have produced itself.
  • Correction: At that point, the target model discards the remaining draft tokens, emits its own correct token, and the loop restarts (see the sketch below).
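
Here is a minimal, self-contained sketch of that draft-verify-accept loop in plain Python. The `target_next` and `draft_next` functions are toy stand-ins for real model forward passes, and acceptance uses simple greedy token matching rather than the full probabilistic rejection-sampling rule, which keeps the example short:

```python
import random

VOCAB = 50          # toy vocabulary size
K = 5               # number of speculative tokens drafted per round
rng = random.Random(0)

def target_next(ctx):
    """Stand-in for the large target model's greedy next token.
    In a real system this is one forward pass of the big model."""
    return sum(ctx) % VOCAB

def draft_next(ctx):
    """Stand-in for the small draft model: it agrees with the target
    about 80% of the time and is otherwise off by one."""
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if rng.random() < 0.2 else guess

def speculative_decode(prompt, max_new_tokens):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Drafting: the draft model proposes K tokens autoregressively.
        ctx = list(tokens)
        draft = []
        for _ in range(K):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. Verification: the target model scores all K positions.
        #    On a GPU this is a single batched forward pass; here it is
        #    a plain loop for readability.
        target_passes += 1
        accepted = []
        for tok in draft:
            correct = target_next(tokens + accepted)
            if tok == correct:
                accepted.append(tok)        # 3. Acceptance
            else:
                accepted.append(correct)    # 4. Correction: keep the target's
                break                       #    token, discard the rest
        else:
            # Every drafted token matched, so the same verification pass
            # also yields one free "bonus" token from the target.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)

    new_tokens = tokens[len(prompt):][:max_new_tokens]
    print(f"{len(new_tokens)} tokens in {target_passes} target passes "
          f"(vs. {len(new_tokens)} passes for plain decoding)")
    return new_tokens

speculative_decode(prompt=[3, 7, 11], max_new_tokens=40)
```

Because the accepted sequence is always one the target model would have produced on its own, the output is unchanged; only the number of expensive target passes drops.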

Common Use Cases

  • Reducing per-token latency (and overall response time) in real-time conversational agents.
  • Increasing serving throughput for massive open-source models without sacrificing output quality.

Related Terms