LLM generation is memory-bandwidth bound: producing text token by token is slow because the GPU must stream the model's full weights from memory for every single token it generates. Speculative Decoding breaks this sequential bottleneck. A small, fast 'draft' model (e.g., 1B parameters) rapidly guesses the next several tokens (say, 5 to 10), and the guesses are passed to the large 'target' model (e.g., 70B parameters), which evaluates all of them in parallel in a single forward pass. Wherever the target model would have produced the same tokens, it accepts them immediately, effectively generating several tokens for roughly the computational cost of one.
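To put a rough number on that "several for the price of one" intuition: if each draft token is accepted with probability α (treated as independent, a simplifying assumption) and the draft proposes k tokens, the expected number of tokens emitted per target pass is (1 − α^(k+1)) / (1 − α), counting the correction or bonus token. A quick back-of-the-envelope check, where the 0.8 acceptance rate is purely illustrative:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha
    (includes the final correction/bonus token)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_pass(alpha=0.8, k=5))  # ~3.69 tokens per pass
```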

How It Works

  • Drafting: A fast, small model generates a sequence of k speculative tokens.
  • Verification: The large target model processes these k tokens in parallel.
  • Acceptance: The target model compares the draft's probabilities against its own and accepts drafted tokens up to the first one it would not have produced itself.
  • Correction: At that point, the target model discards the remaining draft tokens, emits its own correct token, and the loop restarts (see the sketch below).
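
Here is a minimal, self-contained sketch of that draft-verify-accept loop in plain Python. The `target_next` and `draft_next` functions are toy stand-ins for real model forward passes, and acceptance uses simple greedy token matching rather than the full probabilistic rejection-sampling rule, which keeps the example short:

```python
import random

VOCAB = 50          # toy vocabulary size
K = 5               # number of speculative tokens drafted per round
rng = random.Random(0)

def target_next(ctx):
    """Stand-in for the large target model's greedy next token.
    In a real system this is one forward pass of the big model."""
    return sum(ctx) % VOCAB

def draft_next(ctx):
    """Stand-in for the small draft model: it agrees with the target
    about 80% of the time and is otherwise off by one."""
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if rng.random() < 0.2 else guess

def speculative_decode(prompt, max_new_tokens):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Drafting: the draft model proposes K tokens autoregressively.
        ctx = list(tokens)
        draft = []
        for _ in range(K):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. Verification: the target model scores all K positions.
        #    On a GPU this is a single batched forward pass; here it is
        #    a plain loop for readability.
        target_passes += 1
        accepted = []
        for tok in draft:
            correct = target_next(tokens + accepted)
            if tok == correct:
                accepted.append(tok)        # 3. Acceptance
            else:
                accepted.append(correct)    # 4. Correction: keep the target's
                break                       #    token, discard the rest
        else:
            # Every drafted token matched, so the same verification pass
            # also yields one free "bonus" token from the target.
            accepted.append(target_next(tokens + accepted))

        tokens.extend(accepted)

    new_tokens = tokens[len(prompt):][:max_new_tokens]
    print(f"{len(new_tokens)} tokens in {target_passes} target passes "
          f"(vs. {len(new_tokens)} passes for plain decoding)")
    return new_tokens

speculative_decode(prompt=[3, 7, 11], max_new_tokens=40)
```

Because the accepted sequence is always one the target model would have produced on its own, the output is unchanged; only the number of expensive target passes drops.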

Common Use Cases

  • Reducing per-token latency (and overall response time) in real-time conversational agents.
  • Increasing serving throughput for massive open-source models without sacrificing output quality.

Related Terms