Two architectural patterns dominate how LLM applications handle context: Retrieval-Augmented Generation (RAG) and Memory systems. Developers constantly confuse them, and that confusion leads to systems that hallucinate, forget, or leak context. Here is what actually separates these approaches, and when to reach for each one.
RAG pulls information from an external store at inference time. Memory keeps a running record of the conversation and injects it into the context window. The distinction sounds simple. The implementation details are where things go wrong.
How RAG Actually Works
RAG solves a specific problem: your LLM cannot know what it was not trained on. GPT-4o was not trained on your codebase, your product docs, or yesterday's Slack messages. RAG bridges that gap by retrieving relevant documents at query time and stuffing them into the prompt.
The pipeline has four stages. First, a chunking step splits your documents into pieces small enough to embed. Second, an embedding model converts each chunk into a dense vector. Third, those vectors live in a vector database like Pinecone, Weaviate, or pgvector. Fourth, at inference, the user's query gets embedded and similarity search finds the k nearest chunks. Those chunks get prepended to the prompt.
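As a concrete reference, here is a minimal sketch of those four stages in Python. It uses OpenAI's text-embedding-3-small with a plain in-memory NumPy array standing in for a real vector database; the chunk size, document list, and helper names are illustrative rather than prescriptive:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # One call embeds a batch of chunks or a single query
    resp = client.embeddings.create(input=texts, model="text-embedding-3-small")
    return np.array([d.embedding for d in resp.data])

# Stage 1: split documents into pieces small enough to embed
def chunk(text: str, size: int = 2000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

# Stages 2-3: embed every chunk and keep the vectors in an in-memory "index"
docs = ["...your product docs...", "...your runbooks..."]
chunks = [c for d in docs for c in chunk(d)]
index = embed(chunks)

# Stage 4: embed the query and take the k nearest chunks by cosine similarity
def retrieve(query: str, k: int = 5) -> list[str]:
    q = embed([query])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]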
Vector similarity search rarely returns perfect results. A query about "API rate limits" might surface chunks about authentication tokens if the embedding model thinks they are semantically similar. This is the dirty secret of RAG: retrieval noise is constant, and you need eval pipelines to measure it.
Most developers pick a chunk size of 512 tokens and call it done. That is lazy. Overlapping chunks (a 256-token overlap) improve recall significantly. So does reranking with a cross-encoder model like cross-encoder/ms-marco-MiniLM-L-6-v2 after the initial vector search. These two tweaks alone can move your recall metric from 0.61 to 0.89 on standard benchmarks.
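Here is a sketch of both tweaks, assuming the sentence-transformers package is available; the sizes, candidate counts, and helper names are illustrative. Retrieve a generous candidate set from the vector store (say 20 chunks), then let the cross-encoder rerank it down to the handful you actually put in the prompt:

from sentence_transformers import CrossEncoder

# Overlapping fixed-size chunks: each chunk shares `overlap` tokens with the previous one
def chunk_with_overlap(tokens: list[str], size: int = 512, overlap: int = 256) -> list[list[str]]:
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Rerank vector-search candidates with a cross-encoder before building the prompt
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [c for c, _ in ranked[:keep]]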
What Memory Systems Actually Do
Memory systems take a different approach. Instead of retrieving from a static corpus, they maintain a rolling history of the current conversation (or all conversations, depending on design). Each turn, the memory gets formatted and inserted into the context window alongside the new user message.
The simplest memory implementation is a fixed-size conversation buffer. You store the last N messages and truncate everything older. This works until N grows large enough to exceed your context window limit, and then you face the same chunking and retrieval problems that RAG has.
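A minimal sketch of such a buffer; the class name and the 20-message limit are arbitrary choices for illustration:

class BufferMemory:
    # Keep only the most recent messages; anything older is silently dropped
    def __init__(self, max_messages: int = 20):
        self.max_messages = max_messages
        self.messages: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        self.messages = self.messages[-self.max_messages:]

    def as_context(self) -> list[dict]:
        return list(self.messages)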
More sophisticated memory systems use summarization. Instead of storing every message verbatim, they periodically compress older messages into a distilled summary. This preserves the gist of the conversation while using fewer tokens. Anthropic's Claude Code, for example, compacts older turns into a running summary as a session approaches the 200K context window.
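One way to implement that compaction step, sketched against the OpenAI API; the threshold, the summarization model, and the prompt wording are all assumptions you would tune for your own application:

from openai import OpenAI

client = OpenAI()

def compact(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    # Fold everything except the most recent turns into a single summary message
    if len(messages) <= keep_recent:
        return messages
    older = messages[:-keep_recent]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, preserving names, decisions, "
                       f"and open questions:\n\n{transcript}",
        }],
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + messages[-keep_recent:]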
The critical difference from RAG: memory is stateful and temporal. RAG queries a static knowledge base. Memory evolves with the conversation. If your user says "let's schedule that for next Tuesday", memory captures it. RAG never will unless you write it back to the knowledge base explicitly.
The Core Tradeoffs
RAG wins when you need the model to access information it was never trained on. Memory wins when the model needs to track conversation state and reference things said earlier in the session.
RAG introduces retrieval latency. A vector search against 10 million chunks takes 40-120ms on pgvector with a good index (HNSW or IVFFlat). Add the time to embed the query (20-50ms via OpenAI's text-embedding-3-small) and you are looking at 80-200ms per query before the LLM even starts generating. Memory adds negligible latency since it is just string concatenation.
RAG is auditable. Every retrieval can be logged, inspected, and debugged. You know exactly which chunks the model saw and why. Memory is opaque by comparison: the history was in the prompt, but when the model "forgets" something you have no clear trace of why the attention over that long context ignored it.
Memory scales with conversation length. A 50-turn conversation with 200 tokens per turn is 10,000 tokens of memory. At GPT-4o-mini pricing ($0.15 per 1M input tokens) that is only about $0.0015 of input on a single call, but the full history is resent on every turn, so total token cost grows roughly quadratically with conversation length. RAG costs are dominated by retrieval infrastructure, not token counts.
RAG requires a separate data pipeline. Documents need to be chunked, embedded, and indexed before queries can touch them. If your data changes frequently, you need a re-indexing cadence. Memory requires no data pipeline, only a session store.
When To Use RAG
RAG is the right tool when your application needs to answer questions about a large, structured corpus that changes infrequently. Legal document Q&A, internal knowledge bases, and codebase search are canonical examples.
The corpus must be large enough that you cannot fit it all in the context window. If you have 50 pages of documentation, you could in theory paste it all in. But then you pay for 50 pages of tokens on every single query, most of which are irrelevant. RAG lets you pay for only the relevant chunks.
RAG also shines when you need citations. Chunks returned from the vector store can carry source metadata. The model can then attribute its answers to specific documents, which matters in compliance-heavy industries.
If you build a RAG system, invest early in evaluation. Build a golden dataset of Q&A pairs with expected chunks. Measure hit rate (did we retrieve the right chunks?) and answer accuracy (did the model use those chunks correctly?). Without this, you are flying blind.
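The hit-rate half of that evaluation can be a few lines of Python. This sketch assumes each golden example records the IDs of the chunks that should be retrieved, and that your retriever returns chunks carrying stable "id" fields; both are conventions you would define yourself:

def hit_rate(golden: list[dict], retriever, k: int = 5) -> float:
    # golden: [{"question": ..., "expected_chunk_ids": [...]}, ...]
    hits = 0
    for example in golden:
        retrieved_ids = {c["id"] for c in retriever(example["question"], k)}
        if retrieved_ids & set(example["expected_chunk_ids"]):
            hits += 1
    return hits / len(golden)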
When To Use Memory
Memory is the right tool for anything that resembles a dialogue agent: a chatbot, a coding assistant, a personal assistant. These applications derive their value from tracking state across turns.
Memory also matters for long-running tasks. If your user asks the model to refactor a large module and the task takes 30 minutes across dozens of tool calls, memory is what keeps the model coherent at the end. Without memory, the model loses track of what it was doing and starts generating irrelevant code.
If your application involves user preferences, session-specific context, or anything that should not be shared across users, you need memory isolation. Each user session gets its own memory store. This is not a feature you bolt on later; it is a fundamental security boundary.
The Hybrid Approach
Production systems rarely use only RAG or only memory. The hybrid pattern combines both: memory handles conversation state while RAG handles domain knowledge.
Here is a working implementation in Python:
from openai import OpenAI
client = OpenAI()
class HybridAssistant:
def __init__(self, vector_store, session_store):
self.vector_store = vector_store # e.g., Pinecone, Weaviate
self.session_store = session_store # e.g., Redis, PostgreSQL
self.max_memory_tokens = 4096
self.max_context_tokens = 128000
def build_prompt(self, session_id: str, user_query: str) -> list[dict]:
# Step 1: Retrieve from RAG knowledge base
query_embedding = client.embeddings.create(
input=user_query,
model="text-embedding-3-small"
).data[0].embedding
rag_chunks = self.vector_store.query(
vector=query_embedding,
top_k=5,
include_metadata=True
)
# Step 2: Load session memory
memory = self.session_store.get(session_id) or []
# Step 3: Format memory as a conversation history string
memory_text = self._format_memory(memory)
# Step 4: Build messages with system prompt, memory, RAG context, and query
system_prompt = {
"role": "system",
"content": "You are a helpful assistant. Use the retrieved context "
"to answer questions accurately. Cite sources when available."
}
messages = [system_prompt]
if memory_text:
messages.append({
"role": "system",
"content": f"Conversation history:\n{memory_text}"
})
if rag_chunks:
rag_text = self._format_rag_context(rag_chunks)
messages.append({
"role": "system",
"content": f"Retrieved knowledge:\n{rag_text}"
})
messages.append({"role": "user", "content": user_query})
# Step 5: Truncate to fit within model context limit
return self._truncate_messages(messages)
def _format_memory(self, memory: list[dict]) -> str:
if not memory:
return ""
lines = []
for turn in memory[-20:]: # last 20 turns
role = turn["role"]
content = turn["content"][:500] # truncate long messages
lines.append(f"{role}: {content}")
return "\n".join(lines)
def _format_rag_context(self, chunks: list[dict]) -> str:
sections = []
for i, chunk in enumerate(chunks):
sections.append(
f"[Source {i+1}] {chunk['metadata'].get('source', 'unknown')}\n"
f"{chunk['text']}"
)
return "\n\n".join(sections)
    def _truncate_messages(self, messages: list[dict]) -> list[dict]:
        # Rough token count: ~4 chars per token for English
        total = sum(len(m["content"]) // 4 for m in messages)
        budget = self.max_context_tokens - self.max_memory_tokens - 512
        # Drop the oldest context blocks first (conversation history, then
        # retrieved knowledge), always keeping the initial system prompt at
        # index 0 and the final user message in place
        while total > budget and len(messages) > 2:
            removed = messages.pop(1)
            total -= len(removed["content"]) // 4
        return messages
def chat(self, session_id: str, user_query: str) -> str:
messages = self.build_prompt(session_id, user_query)
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
temperature=0.3
)
# Store this exchange in session memory
memory = self.session_store.get(session_id) or []
memory.append({"role": "user", "content": user_query})
memory.append({
"role": "assistant",
"content": response.choices[0].message.content
})
self.session_store.set(session_id, memory)
return response.choices[0].message.content
The critical part of this implementation is the truncation logic. Without it, a long session will eventually exceed the context window and the API call will fail or content will be cut off unpredictably. The _truncate_messages method drops the oldest context blocks first (conversation history before retrieved knowledge), which keeps the most recent and most relevant material in the prompt.
Structuring Memory For Production
If you ship this to real users, basic session storage is not enough. You need structured memory layers.
The first layer is working memory: the last N turns of the current session. This is what the code above implements. It is fast, small, and ephemeral.
The second layer is episodic memory: summaries of past sessions with the same user. After each session ends, generate a summary and store it keyed by user ID. When a new session starts, inject the last 3-5 session summaries as context. This gives the model multi-session continuity.
The third layer is semantic memory: persistent facts about the user that the model has learned. If the user said "I am allergic to nuts", that fact should persist across all future sessions. Store these as structured records, not in free text, so they can be retrieved and updated reliably.
This three-layer model mirrors how human memory works, and it is far more robust than a single flat buffer.
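A sketch of how the episodic and semantic layers might be loaded at session start. It assumes a generic key-value store with a get() method; the key names and the Fact schema are illustrative, not a standard:

import json
from dataclasses import dataclass, asdict

@dataclass
class Fact:
    subject: str      # e.g. "user"
    attribute: str    # e.g. "allergy"
    value: str        # e.g. "nuts"
    updated_at: str   # ISO timestamp, used to resolve conflicting facts

def start_session_context(user_id: str, kv_store) -> list[dict]:
    # Episodic layer: summaries of the user's most recent sessions
    summaries = kv_store.get(f"user:{user_id}:session_summaries") or []
    # Semantic layer: persistent structured facts about the user
    facts = [Fact(**f) for f in (kv_store.get(f"user:{user_id}:facts") or [])]
    context = []
    if summaries:
        context.append({"role": "system",
                        "content": "Previous sessions:\n" + "\n".join(summaries[-5:])})
    if facts:
        context.append({"role": "system",
                        "content": "Known user facts:\n" + json.dumps([asdict(f) for f in facts])})
    return context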
Evaluation: The Part Nobody Does
RAG evaluation is well-documented. Tools like RAGAS and TruLens provide metrics for faithfulness (did the model stick to the retrieved content?), answer relevancy, and context precision. Run these on a monthly cadence and alert on regressions.
Memory evaluation is harder. There are no standard benchmarks. The best approach is to build synthetic test conversations that last 20+ turns and verify that key facts from turn 3 are still present in turn 20. Automate this with a second LLM that reads the full conversation and extracts all stated facts, then checks if the final responses are consistent with those facts.
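A sketch of that judge pass, again against the OpenAI API; the turn counts and prompt wording are illustrative, and a production version would parse structured output instead of returning a free-text report:

from openai import OpenAI

client = OpenAI()

def check_memory_consistency(conversation: list[dict], judge_model: str = "gpt-4o-mini") -> str:
    # Compare facts stated early in the conversation against the final responses
    early = "\n".join(f"{m['role']}: {m['content']}" for m in conversation[:6])
    late = "\n".join(f"{m['role']}: {m['content']}" for m in conversation[-6:])
    prompt = (
        "Below are the opening and closing turns of a conversation.\n"
        "List every fact stated in the opening turns, and say for each whether "
        "the closing turns contradict it, respect it, or never reference it.\n\n"
        f"Opening turns:\n{early}\n\nClosing turns:\n{late}"
    )
    return client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content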
If you skip evaluation, you will ship systems that pass internal demos and fail in production. The demo works because the conversation is three turns long. Production users have 50-turn conversations and the memory degrades silently.
Cost Implications
RAG infrastructure costs money. Pinecone charges based on vector count and dimension. A moderate knowledge base of 1 million chunks at 1536 dimensions (OpenAI's text-embedding-3-small outputs 1536 dims) runs about $70/month on Pinecone's serverless tier. Weaviate's managed cloud starts around $65/month for the same scale. pgvector on a $20/month VPS handles up to 5 million vectors comfortably if you tune the HNSW parameters.
Memory costs scale with usage. Each token in the context window costs money on every single API call. If your average request carries 5,000 tokens of memory and you serve 10,000 such requests per day, that is 50 million memory tokens per day, or 1.5 billion per month. At GPT-4o-mini pricing ($0.15 per 1M input tokens), that is $225/month in memory costs alone, before generation.
The hybrid approach lets you keep memory lean. Most queries only need the last 10-15 turns of working memory (roughly 2,000-3,000 tokens). RAG handles the heavy lifting of domain knowledge. The model does not need a full conversation transcript to answer "what is our refund policy?".
Common Failure Modes
RAG fails when the chunking strategy does not match the query pattern. If the answer to a typical question spans multiple chunks, retrieval returns half of it every time. This is a structural problem that reranking cannot fix. The solution is semantic chunking: split on meaningful boundaries (section headers, paragraph breaks) instead of fixed token counts.
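A sketch of semantic chunking for markdown-style documents; the header regex and the size cap are assumptions to adapt to your corpus:

import re

def semantic_chunks(markdown: str, max_chars: int = 3000) -> list[str]:
    # Split on section headers first, then pack paragraphs up to a size cap,
    # so a chunk never straddles two sections
    chunks = []
    for section in re.split(r"\n(?=#{1,3} )", markdown):
        current = ""
        for paragraph in section.split("\n\n"):
            if current and len(current) + len(paragraph) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += paragraph + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks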
Memory fails when it grows unbounded. Without truncation, a long conversation eventually exceeds the context window and the model starts dropping old content. This is not a bug; it is a fundamental constraint of transformer architectures. Plan for it from day one.
Memory also fails when sessions are not isolated. If you store memory in a shared Redis instance without proper key namespacing, one user's data leaks into another user's session. This is an authentication bypass equivalent in severity. Namespace every key by user ID and session ID.
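A sketch of that namespacing with redis-py; the key format and the 24-hour TTL are illustrative choices:

import json
import redis

r = redis.Redis()

def memory_key(user_id: str, session_id: str) -> str:
    # Every key carries the user ID, so one user's memory can never be
    # addressed with another user's identifiers
    return f"memory:{user_id}:{session_id}"

def save_memory(user_id: str, session_id: str, memory: list[dict]) -> None:
    r.set(memory_key(user_id, session_id), json.dumps(memory), ex=60 * 60 * 24)

def load_memory(user_id: str, session_id: str) -> list[dict]:
    raw = r.get(memory_key(user_id, session_id))
    return json.loads(raw) if raw else []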
Hybrid systems fail when RAG retrieval noise contaminates memory-grounded responses. If the vector store returns irrelevant chunks about a previous topic, the model may anchor on those chunks instead of the actual conversation. Use a two-pass approach: first pass determines whether the query is about domain knowledge (use RAG) or conversation state (use memory), and route accordingly.
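The first pass can be a cheap classification call. This sketch uses the OpenAI client; the router prompt and the two labels are assumptions you would refine against real traffic:

from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = (
    "Classify the user query. Reply with exactly one word:\n"
    "KNOWLEDGE if it asks about documentation, policies, or facts in the knowledge base.\n"
    "CONVERSATION if it refers to something said earlier in this conversation.\n\n"
    "Query: {query}"
)

def route(query: str) -> str:
    # Cheap first pass decides whether to attach RAG chunks or lean on memory
    label = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}],
        temperature=0,
    ).choices[0].message.content.strip().upper()
    return "rag" if label.startswith("KNOWLEDGE") else "memory"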
What This Means For Your Architecture
RAG and memory are not competing approaches. They solve different problems. RAG extends what the model knows. Memory extends what the model remembers within a session.
Build both from the start. The initial demo might only need one, but users will eventually ask for the other, and retrofitting is always more expensive than building right. The hybrid class above is a starting point, not a final design.
The evaluation infrastructure matters more than the retrieval algorithm. You can tune chunk sizes and embedding models forever. None of it matters if you cannot measure whether your users are getting correct answers. Invest in evals first.
Related Articles
- Context windows vs memory
- AI memory management for LLMs
- Short-term memory for AI agents
- How memory works in HyperAgents
- State of AI agent memory 2026
FAQ
Can I use RAG without a vector database?
Yes. You can use sparse lexical retrieval instead: TF-IDF with scikit-learn's TfidfVectorizer, or true BM25 with a dedicated library like rank_bm25. It underperforms dense retrieval on semantic queries but works well for exact keyword matching. SQL-based full-text search is another option for structured data.
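A minimal rank_bm25 example; the two-document corpus and whitespace tokenization are only for illustration:

from rank_bm25 import BM25Okapi

corpus = [
    "Rate limits are enforced per API key.",
    "Authentication uses short-lived bearer tokens.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "api rate limits".split()
top_docs = bm25.get_top_n(query, corpus, n=1)  # sparse keyword retrieval, no embeddings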
How do I handle data that changes in real-time? RAG does not handle real-time data well by default since indexing has latency. For streaming data, use a hybrid: RAG for static knowledge, a function-calling tool (like a webhook to an API) for live data. The model decides which to use based on the query.
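A sketch of the live-data half using OpenAI function calling; the tool name, its schema, and the metric it fetches are hypothetical:

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_live_metric",  # hypothetical tool backed by your own API
        "description": "Fetch a real-time operational value, e.g. current queue depth.",
        "parameters": {
            "type": "object",
            "properties": {"metric": {"type": "string"}},
            "required": ["metric"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Use retrieved documents for static questions; call tools for live data."},
        {"role": "user", "content": "How many jobs are queued right now?"},
    ],
    tools=tools,
)
# If the model chose the tool, response.choices[0].message.tool_calls holds the call to execute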
What embedding model should I use?
OpenAI's text-embedding-3-small is the practical default: 1536 dimensions, good performance, low cost. For multilingual or specialized domain data, text-embedding-3-large improves recall at double the dimension count and a higher per-token price. sentence-transformers' all-MiniLM-L6-v2 works well for on-premise deployments where you do not want to send data to OpenAI.
How many chunks should I retrieve per query? Usually 3-8 chunks. Fewer than 3 and you risk missing relevant context. More than 8 and you dilute the signal with retrieval noise while burning through your context window. Tune this number against your recall metric.
Does memory work with open-source models the same way? Yes, the mechanism is identical. Open-weight models like Llama 3.1 70B, and hosted models like Mistral Large, accept the same message format. The differences are context window size (Llama 3.1 supports 128K; the original Mistral Large supports 32K) and inference cost (self-hosting is cheaper at scale but requires GPU infrastructure).
How do I prevent memory from growing indefinitely? Implement a three-layer truncation strategy: keep the last N turns verbatim, summarize older turns into episodic summaries, and extract persistent facts into a structured semantic memory store. This mirrors the approach described in the production memory section above.
Should I use conversation history as RAG context? No. Conversation history belongs in the memory layer, not the RAG layer. RAG queries a static knowledge base. Feeding conversation history into the vector store creates retrieval noise and mixes session state with domain knowledge. Keep these pipelines separate.