As context windows push past 100k tokens, the memory required to store the Key-Value (KV) cache grows linearly with sequence length and quickly overwhelms even the largest GPUs. KV cache eviction policies dynamically prune this memory during generation. They rest on the observation that attention within an LLM is extremely sparse: the model only 'looks at' a small fraction of the previous tokens to predict the next one, so tokens that consistently receive low attention scores can be dropped. Techniques like Heavy Hitter Oracle (H2O) retain only the most influential tokens (often the initial instructions plus the most recent turns), approximating an unbounded context window within a strict memory budget.
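To make the memory pressure concrete, here is a back-of-the-envelope calculation of KV cache size. The dimensions are Llama-2-7B's published ones (32 layers, 32 attention heads, head dimension 128, no grouped-query attention) stored in fp16; the function name and parameter defaults are illustrative:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, each of shape
    [n_heads, seq_len, head_dim], at dtype_bytes per element (2 for fp16)."""
    return 2 * n_layers * n_heads * head_dim * dtype_bytes * seq_len

# Llama-2-7B-style dimensions: roughly half a megabyte of cache per token.
per_token = kv_cache_bytes(1)
total = kv_cache_bytes(100_000)
print(f"{per_token / 1024:.0f} KiB/token, {total / 2**30:.1f} GiB at 100k tokens")
# → 512 KiB/token, 48.8 GiB at 100k tokens
```

At 100k tokens the cache alone exceeds the 40-80 GB of a single data-center GPU, before counting the model weights themselves, which is exactly the regime eviction policies target.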
How It Works
- Attention Scoring: During generation, the algorithm monitors the attention weights across all tokens in the cache.
- Heavy Hitters: Tokens that consistently accumulate high attention scores (often the initial instructions or specific entities) are marked as 'Heavy Hitters' and preserved.
- Eviction: Tokens with near-zero cumulative attention scores are permanently evicted from GPU memory, freeing space for new entries.
- Rolling Window: A fixed sliding window always preserves the most recent tokens, since they are needed to maintain immediate fluency.
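The steps above can be sketched as a toy policy. This is a minimal illustration, not the paper's implementation: the function name `h2o_keep_indices`, the budget parameters, and the plain-Python scoring are all assumptions made for clarity.

```python
def h2o_keep_indices(cumulative_scores, recent_window, heavy_budget):
    """Return sorted indices of cached tokens to KEEP.

    cumulative_scores: attention mass each cached token has accumulated
    so far (one float per token, oldest first).
    recent_window: number of most recent tokens always preserved.
    heavy_budget: number of older 'heavy hitter' tokens preserved.
    Everything else is evicted.
    """
    n = len(cumulative_scores)
    recent = set(range(max(0, n - recent_window), n))
    # Among the older tokens, keep only the highest-scoring ones.
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: cumulative_scores[i], reverse=True)
    return sorted(recent | set(heavy[:heavy_budget]))

# Six cached tokens; token 0 (e.g. an instruction) and token 2 have
# accumulated the most attention, tokens 4-5 are the sliding window.
scores = [5.0, 0.1, 3.0, 0.2, 0.3, 0.4]
print(h2o_keep_indices(scores, recent_window=2, heavy_budget=2))
# → [0, 2, 4, 5]
```

In a real serving stack the kept indices would be used to gather the corresponding K and V tensors in place, so the cache never grows beyond `recent_window + heavy_budget` entries per head.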
Common Use Cases
- Running infinite-length chatbots or agents on consumer-grade GPUs (e.g., LocalLLaMA).
- Processing massive codebases or books where only specific variables and structural boundaries actually matter for the final output.