Traditional caching (like Redis) requires an exact string match to return a cached result. If a user asks 'How do I reset my password?' and later asks 'What is the process to change my password?', a traditional cache misses. Semantic caching solves this by embedding the incoming user query into a vector and performing a fast nearest-neighbor search against previously answered queries. If the similarity score exceeds a defined threshold, the system immediately returns the cached response. This drastically reduces API costs, cuts latency to milliseconds, and protects backend LLMs from repetitive load.
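The exact-match failure described above can be seen in a minimal sketch. A traditional key-value cache is essentially a dictionary keyed on the literal query string, so a paraphrase never hits (the query strings and responses here are illustrative):

```python
# A traditional cache keyed on the exact query string.
cache = {
    "how do i reset my password?": "Visit the account settings page and click 'Reset'.",
}

# The exact string hits...
hit = cache.get("how do i reset my password?")

# ...but a semantically identical paraphrase misses, because the keys differ.
miss = cache.get("what is the process to change my password?")

print(hit)   # cached response
print(miss)  # None: a cache miss despite identical intent
```

Semantic caching replaces the string-equality lookup with a similarity search over query embeddings, so the paraphrase can still hit.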

How It Works

A typical semantic cache pipeline:
  • Embedding: The incoming user query is embedded into a dense vector using a fast, cheap model.
  • Vector Search: The vector is compared against a database of previous query vectors using cosine similarity.
  • Threshold Check: If the best similarity score exceeds a tuned threshold (commonly around 0.95), the cached LLM response is returned instantly.
  • Cache Miss: If no match clears the threshold, the query goes to the expensive LLM, and the new response and its vector are cached for future lookups.
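The pipeline above can be sketched end to end. This is a minimal, self-contained illustration, not a production implementation: the `embed` function is a toy character-bigram hash standing in for a real sentence-embedding model, and the `SemanticCache` class and its method names are assumptions for this example:

```python
import math

def embed(text):
    # Toy embedding: hash character bigrams into a small fixed-size vector.
    # A real system would call a fast sentence-embedding model here.
    vec = [0.0] * 64
    t = text.lower()
    for a, b in zip(t, t[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit-normalize for cosine similarity

def cosine(u, v):
    # Vectors are unit-norm, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_response)

    def lookup(self, query):
        # Linear nearest-neighbor scan; a real deployment would use a
        # vector database or ANN index instead.
        qv = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(qv, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response  # cache hit: skip the LLM entirely
        return None  # cache miss: caller queries the LLM, then calls store()

    def store(self, query, response):
        self.entries.append((embed(query), response))
```

Usage follows the miss-then-store pattern from the pipeline: on a miss, the caller forwards the query to the LLM and stores the result so subsequent similar queries hit the cache.

```python
cache = SemanticCache(threshold=0.95)
if cache.lookup("How do I reset my password?") is None:   # miss
    response = "Visit the account settings page."          # (LLM call goes here)
    cache.store("How do I reset my password?", response)
cache.lookup("How do I reset my password?")                # now a hit
```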

Common Use Cases

  • Customer support chatbots dealing with high volumes of repetitive questions.
  • Reducing API costs for enterprise RAG deployments.

Related Terms