I spent two weeks debugging an agent that kept forgetting what it had just done. The logs looked fine. The retrieval pipeline was fast. The context window was spacious. The problem turned out to be a category error: the agent was storing episodic memories in a semantic memory system, and semantic facts in episodic storage. Everything was there. Nothing was useful.
That distinction is what I want to map out here, because it is the root of most agent memory failures I see in code reviews and production postmortems.
The three memory types your agent is actually running
Working memory: the scratchpad
Working memory is what the model can attend to right now. In an LLM-based agent, this is the context window. I think of it as the agent's immediate scratchpad: what was said in the last few turns, what the current task is, what tools are available.
Working memory is fast and volatile. It disappears when the context resets. I ran into this firsthand when building a long-running customer support agent. Between turns, the agent would lose track of what the customer had already been told. The fix was not a better retrieval system. The fix was persisting a session summary back into working memory at the start of every turn.
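A minimal sketch of that fix, assuming a plain dict of per-session summaries and a summarize callable (an LLM call in practice). All names here are illustrative, not from any particular framework:

```python
def start_turn(session_summaries: dict, session_id: str, user_message: str) -> list[dict]:
    """Rebuild working memory at the start of a turn by prepending
    the persisted session summary to the fresh context."""
    summary = session_summaries.get(session_id, "No prior history for this session.")
    return [
        {"role": "system", "content": f"Summary of this session so far: {summary}"},
        {"role": "user", "content": user_message},
    ]


def end_turn(session_summaries: dict, session_id: str, transcript: str, summarize) -> None:
    """Compress the finished turn back into the summary so the next
    start_turn call picks it up. summarize is any callable, e.g. an LLM."""
    session_summaries[session_id] = summarize(transcript)
```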
Working memory has hard limits. The context window caps it. When I push an agent toward 200 turns in a single session, I start seeing degradation even when retrieval is working perfectly. The model simply has too much to reason over.
If you want to understand how context windows fit into this picture, I covered that in LLM context windows explained, including how they interact with different memory systems.
Episodic memory: what happened
Episodic memory stores specific experiences. I think of it as the agent's autobiographical record: this task was attempted on this date, failed with this error, the user corrected it this way.
When a DevOps agent recalls that "the last deployment on Friday failed because the database migration timed out," that is episodic retrieval. The agent is pulling a specific event from its history, not a general fact about deployments.
The failure mode I see most often is treating episodic memory like just another vector store. Engineers build retrieval pipelines that query episodic storage with semantic similarity. The user asks "what went wrong with my deployments" and the agent returns semantically similar past events, most of which are irrelevant because the user wants a chronological account, not a topic cluster.
I wrote about why this asymmetry breaks RAG pipelines in asymmetric retrieval and why it breaks your agent memory. The short version: episodic recall is directional in a way that semantic search cannot model.
Semantic memory: what is true
Semantic memory stores facts, knowledge, and learned abstractions. This is what most people mean when they say "knowledge base" or "world model." It is not tied to a specific time or experience.
A coding agent's semantic memory contains facts like "Python's list.sort() is stable" or "React components re-render when state changes." These are general truths that apply across contexts.
When semantic memory is wrong, the agent confidently misinforms. When episodic memory is wrong, the agent contradicts itself. The error signatures are completely different, and debugging them requires different tools.
I keep coming back to the memory hierarchy in AI systems when thinking about where semantic memory sits relative to retrieval. The hierarchy framing helps because it makes clear that semantic memory is a layer, not the whole system.
Why agents get the types confused
The problem is that most agent frameworks expose a single "memory" interface. You call agent.add_to_memory(message) and the framework decides what to do with it. That decision is often implicit, based on token budget or retrieval latency, not on the nature of the information.
When I reviewed a popular agent framework last year, I found that every user message was being stored in episodic memory regardless of content. The agent's learned knowledge about Python, Git, and deployment pipelines was being treated the same as session logs. The retrieval results were noisy because episodic memory had been flooded with transient session data.
The fix was a content-type router: factual statements and learned patterns went into semantic memory, session events and task outcomes went into episodic memory. The framework did not expose this distinction, so we patched it at the tool layer.
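Roughly, the router looked like the sketch below. The store objects and the keyword classifier are hypothetical stand-ins: in our case the classification was an LLM call, and keyword matching is only a cheap first pass that misroutes edge cases.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the two stores.
semantic_store: list[dict] = []
episodic_store: list[dict] = []

EVENT_MARKERS = ("failed", "succeeded", "deployed", "corrected", "attempted")

def classify(content: str) -> str:
    """Label content as a timeless fact or a session event.
    A keyword pass is shown for brevity; an LLM judge is more robust."""
    lowered = content.lower()
    return "event" if any(marker in lowered for marker in EVENT_MARKERS) else "fact"

def route_memory(content: str) -> None:
    """Write path: route by content type, not by token budget or latency."""
    if classify(content) == "fact":
        semantic_store.append({"content": content})  # timeless, no timestamp
    else:
        episodic_store.append({
            "content": content,
            "at": datetime.now(timezone.utc),  # events keep a timestamp
        })
```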
What this means for your retrieval pipeline
If your agent retrieves from a single vector store, you are almost certainly mixing memory types. Semantic queries like "what is the architecture" return episodic results. Episodic queries like "what happened in the last session" return semantically similar but temporally irrelevant results.
I found that splitting retrieval by memory type improved accuracy significantly. Semantic queries went to a knowledge base with relatively static embeddings, updated weekly. Episodic queries went to an event store with recency weighting, queried by time range or session ID rather than embedding similarity.
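A sketch of that split read path, under stated assumptions: semantic documents carry an embedding field, episodic documents carry a timezone-aware at timestamp, and both stores are plain lists. The names are illustrative.

```python
import math
from datetime import datetime, timezone

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_query(store: list[dict], query_emb: list[float], top_k: int = 5) -> list[dict]:
    """Knowledge lookup: rank purely by embedding similarity."""
    return sorted(store, key=lambda d: cosine(query_emb, d["embedding"]), reverse=True)[:top_k]

def episodic_query(store: list[dict], since: datetime, half_life_hours: float = 24.0) -> list[dict]:
    """Event lookup: filter by time range, rank by recency with
    exponential decay. Topic similarity never enters the ranking."""
    now = datetime.now(timezone.utc)
    recent = [d for d in store if d["at"] >= since]

    def recency(d: dict) -> float:
        age_hours = (now - d["at"]).total_seconds() / 3600
        return math.exp(-age_hours * math.log(2) / half_life_hours)

    return sorted(recent, key=recency, reverse=True)
```

An episodic query could also filter by session ID instead of a time range; the point is that its ranking signal is time, not topic.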
This is not a novel architecture. It is how human memory works. We do not use the same cognitive process to recall a phone number as we use to recall what we did yesterday.
A practical implementation pattern
Here is the pattern I settled on after testing several variations.
Three separate storage systems with a routing layer in front. The router inspects each piece of information before storing it and decides which system handles it.
Factual knowledge that the agent has confirmed: semantic store. Task outcomes, user corrections, session events: episodic store. Current working context, summaries of recent events: working memory, refreshed every turn.
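Putting it together, a compressed sketch of the pattern. The store objects are hypothetical, assumed to expose an add method; summarize is the same kind of LLM-backed callable as before:

```python
class AgentMemory:
    """Three backends behind one object, with routing on the write path
    and a working-memory refresh on every turn. Interfaces are sketches."""

    def __init__(self, semantic_store, episodic_store, summarize):
        self.semantic = semantic_store  # confirmed facts, updated rarely
        self.episodic = episodic_store  # timestamped task outcomes and events
        self.summary = ""               # working-memory seed for the session
        self._summarize = summarize

    def remember(self, content: str, kind: str) -> None:
        """Write path: the router decides the destination, not the framework."""
        store = self.semantic if kind == "fact" else self.episodic
        store.add(content)

    def working_context(self, user_message: str) -> str:
        """Read path for working memory, rebuilt at the start of each turn."""
        return f"Session so far: {self.summary}\n\nUser: {user_message}"

    def end_turn(self, transcript: str) -> None:
        self.summary = self._summarize(transcript)
```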
The routing rules are not perfect. I still see cross-contamination. But the error surface is dramatically smaller than a single-store approach, and the failure modes are easier to debug.
The failure signatures to watch for
When working memory is too small, the agent loses track of the current task mid-execution. This is the most common failure I see in agent demos. The fix is usually not a better model. It is a session summary that gets prepended to the context every turn.
When episodic memory retrieval is wrong, the agent contradicts itself across sessions. It says it completed a task it never finished, or claims a previous approach failed when it actually succeeded. This is a retrieval ranking problem, not a storage problem.
When semantic memory is wrong, the agent generates plausible-sounding incorrect information with high confidence. The user catches this before the agent does. The fix requires updating the knowledge base, not the retrieval system.
These three failure modes look completely different. If you are debugging with a single memory system, you cannot tell them apart. That is the real cost of the category error.
Conclusion
The three memory types are not academic distinctions. They correspond to different information, different retrieval mechanisms, and different failure modes. An agent that treats a learned fact the same as a past event will fail in predictable ways. An agent that routes information by type will be easier to debug and more reliable in production.
I wrote this because the memory hierarchy framing in my earlier post did not go far enough in separating these types. The state of AI agent memory in 2026 covers the landscape. This post is about the specific architectural decisions that determine whether your agent's memory actually works.