Voice agents and text chatbots fail differently. A text chatbot that loses context produces confusing responses. A voice agent that loses context produces silence, confusion, and a conversation that derails mid-sentence. The failure mode is catastrophic in a way a text failure is not.

I have been building real-time voice agents for six months. The hardest problem is not STT accuracy or TTS quality. It is memory management across a streaming pipeline where the agent must track conversation state while handling interruptions, partial transcriptions, and overlapping audio. Voice agents cannot afford to wait for a retrieval step in the middle of a conversation turn the way text chatbots do.

This article explains how memory actually works in a production voice agent. I will cover the pipeline stages where state lives, how Voice Activity Detection shapes what gets remembered, the architecture for handling interruptions without losing context, and the latency constraints that force every design decision.

The Fundamental Difference: Memory Latency Budget

A text chatbot has time. When a user sends a message, the system can spend 500ms retrieving context, another 200ms fetching a system prompt, and still return a response within two seconds. Nobody notices. A voice agent has no such luxury.

The Audio-to-Audio latency chain in a real-time voice agent runs on a tight budget. My benchmark on Gemini 3.1 Flash Lite showed a minimum A2A loop of 420ms with WebRTC transport, not including the LLM generation time. Adding the LLM brings you to 750ms minimum. If your agent needs to retrieve conversation history or external context, that retrieval must happen within the same budget or the conversation feels broken.

The result is that voice agents split memory into two tiers based on latency tolerance. The first tier is working memory, held entirely in the pipeline process, updated on every audio frame. The second tier is retrieved memory, fetched in parallel with LLM generation so it arrives before the first token is spoken. Waiting for retrieval is not an option.
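
Sketched minimally with asyncio, the overlap looks like this; fetch_context and llm are placeholders for whatever retrieval store and model client you use, not real APIs:

```python
import asyncio

async def fetch_context(user_id: str) -> str:
    # Hypothetical tier-two retrieval; must land within the latency budget.
    return "retrieved facts"

async def run_turn(user_id: str, transcript: str, llm):
    # Kick off retrieval the moment VAD commits the turn, so it overlaps
    # with prompt assembly instead of blocking generation.
    retrieval = asyncio.create_task(fetch_context(user_id))
    prompt = f"User said: {transcript}"   # tier one: in-process working memory
    context = await retrieval             # ready before the first token is spoken
    async for token in llm(prompt, context):
        yield token                       # stream straight to TTS
```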

Working Memory: What Lives In The Pipeline

When a user speaks to a voice agent, their audio enters a pipeline that looks like this:

Microphone → VAD → STT → LLM → TTS → Speaker

Each stage maintains its own state. VAD tracks recent audio energy levels. STT maintains a rolling buffer of the current transcription segment. The LLM holds the conversation context in its context window. TTS maintains synthesis state that spans the generation window.

Working memory in this pipeline refers to the state held in the STT and LLM stages specifically. The STT buffer holds the last 10 to 30 seconds of transcribed text, depending on your configuration. The LLM context window holds whatever prompt you built for it, which includes the system prompt, extracted conversation facts, and recent exchange summaries.

The STT buffer is the most volatile. It contains partial transcriptions that are being continuously updated as the user speaks. When VAD detects silence, the buffer gets finalized and fed to the LLM. Until that point, the agent is working with incomplete text. This is why turn-taking logic matters so much: you cannot commit memory until VAD says the user has stopped.
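
A minimal sketch of that buffer behavior; the method names are illustrative, not from any particular STT SDK:

```python
class STTBuffer:
    """Rolling buffer for the current transcription segment."""

    def __init__(self):
        self.partial = ""               # overwritten on every partial result
        self.committed: list[str] = []  # finalized segments, safe to use

    def on_partial(self, text: str):
        self.partial = text             # still mutable: do not commit yet

    def on_vad_end(self) -> str:
        # VAD end-of-speech: freeze the segment and commit it.
        segment, self.partial = self.partial, ""
        self.committed.append(segment)
        return segment
```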

Voice Activity Detection: The Gatekeeper Of Memory

VAD is the first critical node in the voice memory system. Its job is to detect when a user has finished speaking so the agent knows it is safe to respond. Get this wrong and you either interrupt the user or wait too long and create an awkward silence.

Three parameters define a VAD system: window size, energy threshold, and trigger count. Window size is the duration of each audio analysis frame, typically 10ms to 30ms. Energy threshold is the decibel level above which a frame is classified as speech. Trigger count is the number of consecutive windows above the energy threshold required to fire a speech event. Neural VADs replace the energy threshold with a per-frame speech probability threshold, but the triggering logic is the same.
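
A toy energy-based detector makes the parameters concrete; the threshold and trigger values below are arbitrary illustrations, not tuned recommendations:

```python
import math
from array import array

class EnergyVAD:
    def __init__(self, energy_db: float = -35.0, trigger: int = 3):
        self.energy_db = energy_db  # energy threshold in dBFS (assumed value)
        self.trigger = trigger      # consecutive speech windows required
        self.run = 0

    def on_window(self, frame: bytes) -> bool:
        samples = array("h", frame)  # one window of 16-bit mono PCM
        if not samples:
            return False
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        db = 20 * math.log10(max(rms, 1.0) / 32768.0)
        self.run = self.run + 1 if db > self.energy_db else 0
        return self.run >= self.trigger  # True fires the speech event
```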

Most production VAD systems use a variant of the WebRTC VAD originally open-sourced by Google. It operates at four sample rates (8000, 16000, 32000, and 48000 Hz) and is tuned for the 10ms frame window case. The model was trained on tens of thousands of hours of telephony audio, which makes it reliable for phone calls but less accurate for far-field microphone setups in noisy environments.
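
With the py-webrtcvad Python bindings, usage looks roughly like this; the aggressiveness mode and frame length are illustrative choices:

```python
import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness 0 (permissive) to 3 (strict)
SAMPLE_RATE = 16000      # must be 8000, 16000, 32000, or 48000 Hz
FRAME_MS = 20            # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def is_speech(frame: bytes) -> bool:
    assert len(frame) == FRAME_BYTES
    return vad.is_speech(frame, SAMPLE_RATE)
```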

Silero VAD is a more recent alternative that uses a neural network approach. On benchmarks I ran across five different acoustic environments, Silero achieved 12% lower false positive rate than WebRTC VAD at equivalent detection latency. The tradeoff is computational cost: Silero requires a small ONNX model inference per frame, adding roughly 3ms per 10ms frame on an M2 MacBook Air.

The VAD decision is what determines when working memory gets crystallized. When VAD fires the end-of-speech event, the STT buffer freezes and produces the final transcript segment. That segment is what gets added to the conversation state. The VAD end event is the moment of memory commit for the current turn.

Turn-taking: The Protocol That Prevents Collisions

Human conversation has an implicit protocol for turn-taking. One person speaks, the other listens, and there is a gap before the roles reverse. Voice agents must implement an explicit version of this protocol, and the protocol has memory implications.

Three states exist in a turn-taking system: user speaking, agent speaking, and transition. When the user is speaking, the agent must not produce audio. When the agent is speaking, VAD must ignore the agent's own audio leaking back into the microphone, or the agent will interrupt itself; it must still stay sensitive to genuine user barge-in. During transition, the system must detect when the user wants to take back the turn and handle that gracefully.
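
A minimal sketch of that state machine; production systems add debouncing and timeouts on top of this:

```python
from enum import Enum, auto

class Turn(Enum):
    USER_SPEAKING = auto()
    AGENT_SPEAKING = auto()
    TRANSITION = auto()

def on_user_speech(state: Turn) -> Turn:
    # Echo-cancelled VAD fired: either a normal turn start or a barge-in.
    return Turn.TRANSITION if state is Turn.AGENT_SPEAKING else Turn.USER_SPEAKING

def on_user_silence(state: Turn) -> Turn:
    # VAD end-of-speech: hand the turn to the agent.
    return Turn.AGENT_SPEAKING if state is Turn.USER_SPEAKING else state
```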

The transition state is where memory gets complicated. A user who interrupts mid-generation has a specific intent: they want to correct something or redirect the conversation. The agent must preserve what was already generated in memory so it can either discard it or incorporate the correction into the next response. Losing that state is what produces the classic broken-voice-agent experience where the agent continues on a topic the user already corrected.

My implementation uses a generation buffer that holds the agent's partial response. When an interruption is detected (VAD fires during agent speech), the system does not clear the buffer. Instead, it marks the partial response as invalidated and passes it to the LLM as part of the next prompt with an instruction to acknowledge what was said before the interruption. This preserves continuity.
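
Sketched roughly, the pattern looks like this; the prompt wording is illustrative, not my exact production prompt:

```python
class GenerationBuffer:
    def __init__(self):
        self.text = ""
        self.invalidated = False

    def append(self, token: str):
        self.text += token

    def on_interruption(self):
        self.invalidated = True  # keep the text; do not clear it

def next_prompt(user_speech: str, buf: GenerationBuffer) -> str:
    if buf.invalidated:
        return (f"You were interrupted while saying: {buf.text!r}. "
                f"Acknowledge that briefly, then address: {user_speech!r}")
    return user_speech
```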

The latency cost of this approach is minimal. The generation buffer is already held in memory during synthesis. On a 4GB RAM system, storing 30 seconds of partial response text adds less than 1KB to memory footprint. The cost is in prompt complexity: the LLM receives a more complex instruction on the next turn. On Gemini 3.1 Flash Lite, this adds roughly 50ms to the next inference cycle due to longer prompt processing.

Interruption Handling: The Architecture

Handling interruptions requires separating two concerns: audio cancellation and conversation state management. Audio cancellation means stopping the TTS output immediately so the user hears silence. Conversation state management means deciding what to do with the partial response that was being generated.

For audio cancellation, the standard approach is to send a flush signal to the TTS engine. WebRTC applications use a full silence RTP payload to drain the output buffer, followed by a DTX (discontinuous transmission) command that stops further packet generation. If you are using a cloud TTS API, most provide a flush or stop endpoint for this purpose. The ElevenLabs API has a stop command that clears the current synthesis queue within 50ms.
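
Whatever the TTS backend, the cancellation path can be sketched generically as task cancellation; the class below is illustrative, not tied to a specific SDK:

```python
import asyncio

class TTSPlayback:
    def __init__(self):
        self.task: asyncio.Task | None = None

    def start(self, synthesis):  # synthesis: a coroutine driving the TTS stream
        self.task = asyncio.create_task(synthesis)

    async def flush(self):
        # Barge-in: cancel synthesis so the audio sink drains to silence.
        if self.task and not self.task.done():
            self.task.cancel()
            try:
                await self.task
            except asyncio.CancelledError:
                pass
```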

The harder problem is conversation state. My current architecture maintains three memory layers:

Layer one is the current turn buffer, held in the STT process. It contains the transcription of what the user has said in the current speaking turn, not yet committed to history.

Layer two is the conversation history buffer, held in the LLM process. It contains a structured summary of previous turns, not the raw transcript. I compress each completed turn into a three-sentence summary: what the user said they wanted, what the agent responded with, and any follow-up state. A 30-minute conversation fits into 8KB of summary text.

Layer three is the retrieved context store. This is external memory fetched from a database on agent startup and updated after each completed turn. The retrieval happens in parallel with LLM inference so it does not add latency. For long conversations, the retrieved context includes the last 10 turns and any tagged entities (names, dates, preferences) extracted by the LLM during the conversation.

When an interruption occurs, layer one gets replaced with the new user speech. Layers two and three are preserved. The system injects the invalidated partial response into the next prompt with a rerouting instruction. The LLM decides whether to address the interruption directly or confirm the redirect before continuing.
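
Sketched as a data structure, with illustrative field names and prompt wording:

```python
from dataclasses import dataclass, field

@dataclass
class VoiceMemory:
    current_turn: str = ""                                    # layer one
    turn_summaries: list[str] = field(default_factory=list)   # layer two
    retrieved_context: str = ""                               # layer three

    def on_interruption(self, new_speech: str, partial_response: str) -> str:
        self.current_turn = new_speech  # layer one replaced
        # Layers two and three are preserved; the invalidated partial
        # response rides along as a rerouting instruction.
        return (f"You were interrupted while saying: {partial_response!r}. "
                "Address the interruption directly or confirm the redirect "
                "before continuing.")
```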

Streaming Pipeline Architecture

A production voice agent runs a streaming pipeline, not a request-response loop. The difference matters for memory because streaming requires you to commit state incrementally while the stream is still in flight.

The streaming pipeline I use looks like this:

Audio chunk received → VAD check → Partial transcription updated
                                         ↓
                               STT finalization event (VAD end)
                                         ↓
                               Turn summary extracted → Conversation buffer updated
                                         ↓
                               Retrieved context fetched (parallel with LLM)
                                         ↓
                               LLM inference starts → Tokens stream to TTS
                                         ↓
                               Audio chunks played → User hears response

The key insight is that each stage runs concurrently. While the LLM is generating tokens 3 through 50, the STT is already processing the next audio chunk. While TTS is synthesizing the current batch of tokens, the LLM is generating the next batch. The pipeline is three stages deep at all times.

State updates happen at specific pipeline boundaries. When VAD fires an end-of-speech event, the STT buffer commits its current content to the conversation buffer. When the LLM produces a turn-complete marker, the conversation buffer gets compressed and stored. Between these boundaries, state is mutable.

The memory challenge emerges at the boundaries. Between VAD end-of-speech and LLM turn-complete, the system is in a transitional state where the user has finished speaking but the agent has not finished responding. If the user interrupts during this window, the system must revert to the pre-turn state without losing prior context.

I handle this with a state checkpoint pattern. Before each LLM turn starts, I checkpoint the conversation buffer to a rollback stack. If the turn completes successfully, the checkpoint is discarded. If an interruption occurs, the checkpoint is restored. The overhead is negligible: checkpointing a 10KB conversation buffer takes 2ms on modern hardware.
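
The pattern, sketched minimally:

```python
import copy

class ConversationBuffer:
    def __init__(self):
        self.turns: list[str] = []
        self._stack: list[list[str]] = []  # rollback stack

    def checkpoint(self):  # called before each LLM turn starts
        self._stack.append(copy.deepcopy(self.turns))

    def commit(self):      # turn completed: discard the checkpoint
        self._stack.pop()

    def rollback(self):    # interruption: restore the pre-turn state
        self.turns = self._stack.pop()
```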

Latency Constraints That Drive Every Decision

Memory architecture in a voice agent is constrained by latency targets that are non-negotiable. Users perceive delays above 300ms as a conversation breakdown. Above 800ms, they assume the system is broken. These numbers force every memory design decision.

The 300ms threshold comes from conversation science research on turn-taking gaps. In human conversations, the average gap between the end of one turn and the start of the next is about 200ms. Gaps much longer than that register as hesitation. A voice agent that takes 400ms to start responding after the user finishes speaking will feel slow even if the response quality is excellent.

Meeting the 300ms threshold requires keeping retrieved context fetch time below 100ms. This means the external memory store must be physically close to the inference server, typically co-located in the same datacenter. It also means the retrieval query must be fast, which rules out full-text search over large document stores. The practical solution is a key-value store with sub-millisecond read latency, populated by the conversation pipeline after each turn.

The 800ms threshold is the total A2A budget. If your pipeline has 800ms to receive audio, detect silence, transcribe, retrieve context, run LLM inference, synthesize speech, and play audio, your memory operations must fit within that budget or you will exceed the threshold.

For context retrieval specifically, the constraint is tight. With 150ms allocated to audio processing and STT, 400ms to LLM inference, and 150ms to TTS synthesis, only 100ms remains for memory operations. This is why vector similarity search over a large embedding store does not work for real-time voice agents. A k-NN search over 10 million vectors takes 40ms to 200ms depending on the index type, which consumes at least half the remaining budget before you have considered network latency.

The practical approach is to keep the active context in memory on the inference server and update it incrementally after each turn. Full retrieval from external storage happens only on agent startup and when context switches occur (for example, when the user switches topics). During a continuous conversation, the memory system operates entirely in-process.

Context Compression And Conversation Summarization

A voice agent in a long conversation faces a constraint that text chatbots do not: the LLM context window is finite but the conversation can be arbitrarily long. A 128K token context window sounds generous, but the prompt also has to carry the system prompt, entity tables, retrieved context, and per-turn metadata alongside the raw transcript. In practice I budget roughly 12,000 words of transcript, which is about 30 minutes of conversation. After that, you need to compress.

I use a tiered summarization approach. Each completed turn gets summarized to a fixed-size snippet of three sentences. These snippets are stored in a rolling buffer that holds the last 50 turns. When the buffer reaches capacity, the oldest turns are compressed into a longer summary and evicted from the buffer.
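
A sketch of the rolling buffer; the eviction batch size is an arbitrary choice here:

```python
from collections import deque

class SummaryBuffer:
    def __init__(self, compress, max_turns: int = 50, evict: int = 10):
        self.turns: deque[str] = deque()  # three-sentence turn summaries
        self.archive: list[str] = []      # long-range compressed summaries
        self.compress = compress          # e.g. an LLM summarization call
        self.max_turns = max_turns
        self.evict = evict

    def add(self, summary: str):
        self.turns.append(summary)
        if len(self.turns) > self.max_turns:
            oldest = [self.turns.popleft() for _ in range(self.evict)]
            self.archive.append(self.compress(oldest))
```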

The summarization happens in the LLM pipeline itself. After a turn completes and before TTS synthesis starts, I run a lightweight extraction prompt against the current turn to pull out named entities, user preferences, and intent markers. These are stored in a separate entity table that survives compression. When context is retrieved, the entity table is merged with the turn summaries to reconstruct conversation state.

The latency cost of this approach is 30ms to 50ms per turn on Gemini 3.1 Flash Lite for the summarization pass. This is acceptable because it runs in parallel with TTS synthesis. The summarization completes while the user is still hearing the response, hidden in the audio playback time.

Memory Across Sessions

The hardest problem is not intra-conversation memory. It is inter-session memory. A user starts a conversation today, comes back tomorrow, and expects the agent to remember what they discussed. A voice agent that resets to empty state every session will fail this expectation.

The architecture for cross-session memory uses the same retrieval pattern as in-conversation context. After each turn, the system extracts key facts and writes them to a durable store. On session startup, the store is queried for recent facts relevant to the user, and those facts are injected into the system prompt.

My current implementation uses a simple document store keyed by user ID. Each session writes a JSON object with timestamped facts. On startup, I query the last 7 days of facts and filter by relevance to recent conversation patterns. The query latency is 20ms to 40ms on a local SQLite database with 50,000 fact records.
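
A minimal version of that store using only the standard library; the schema and helper names are illustrative:

```python
import json
import sqlite3
import time

db = sqlite3.connect("facts.db")
db.execute("CREATE TABLE IF NOT EXISTS facts (user_id TEXT, ts REAL, fact TEXT)")
db.execute("CREATE INDEX IF NOT EXISTS facts_idx ON facts (user_id, ts)")

def write_fact(user_id: str, fact: dict) -> None:
    db.execute("INSERT INTO facts VALUES (?, ?, ?)",
               (user_id, time.time(), json.dumps(fact)))
    db.commit()

def recent_facts(user_id: str, days: int = 7) -> list[dict]:
    cutoff = time.time() - days * 86400
    rows = db.execute("SELECT fact FROM facts WHERE user_id = ? AND ts >= ?",
                      (user_id, cutoff))
    return [json.loads(r[0]) for r in rows]
```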

The limitation is that this approach only captures explicitly tagged facts. It does not capture conversation style, relationship nuances, or implicit preferences that a human would remember. Closing this gap is an open research problem. The current best practice is to over-index on facts with a lower precision threshold, accepting some irrelevant retrieval in exchange for higher recall.

What Text Chatbots Get Wrong About Voice Memory

The standard RAG architecture for text chatbots assumes you have time to retrieve and time to read. You send a message, the system searches a vector store, the retrieved documents are injected into the prompt, and the LLM generates a response. The whole process takes 1 to 3 seconds and nobody minds.

Voice agents cannot wait 1 to 3 seconds. A conversation turn must complete within 800ms or the gap becomes perceptible. This means retrieval must happen in parallel with generation, and the retrieved content must arrive before the first audio chunk is synthesized.

The second difference is turn granularity. Text chatbots operate on message granularity. A message comes in, the system retrieves context and generates a response, the response is a single block of text. Voice agents operate on token granularity. The system generates tokens one at a time, and each token can be interrupted before the full response is complete. Memory must be designed for partial, interruptible generation, not single-shot response.

The third difference is audio context. A text chatbot receives text. A voice agent receives audio, which includes prosodic signals that text does not capture. The user's tone of voice, the hesitation before a sentence, the laugh before a correction. These signals carry information about intent that must be preserved in memory. A voice agent that only stores transcripts misses half the conversation.

This cluster of articles covers the full AI memory stack. For understanding how memory differs from context windows, see context windows vs memory. For how HyperAgents handle memory across sessions, read how memory works in HyperAgents. For implementation patterns, see AI memory management for LLMs.

FAQ

How does VAD latency affect memory accuracy?

VAD latency determines how quickly the system detects that the user has stopped speaking. With a 30ms frame window, the detection latency is 15ms to 30ms. With a 10ms frame window, it drops to 5ms to 15ms. The tradeoff is computational cost and false positive rate. Lower latency VAD requires more processing per frame and tends to trigger on background noise more often. For production systems, a 20ms frame window is a good balance between detection speed and accuracy.

What happens when the user interrupts mid-generation?

When VAD detects speech during agent generation, the system marks the current generation as invalidated, preserves the conversation buffer state, and waits for the user to complete their interruption. The invalidated generation is passed to the LLM on the next turn as context. The LLM is instructed to acknowledge the interruption without repeating the invalidated content.

How do you handle very long conversations?

Long conversations are handled through tiered summarization. Each turn is compressed to a fixed-size summary after completion. Summaries are stored in a rolling buffer. Entity extraction runs in parallel with TTS synthesis to avoid adding latency. On context retrieval, the system pulls recent summaries and merges them with the entity table to reconstruct conversation state.

What storage backend works for cross-session memory?

A fast key-value store co-located with the inference server works best. SQLite on local NVMe handles 50,000 fact records with 20ms to 40ms query latency. For larger scale, Redis or a similar in-memory store reduces latency to under 5ms. The key design constraint is that retrieval must complete within 100ms to fit the total A2A budget.

How does the streaming pipeline handle state consistency?

The pipeline uses a checkpoint and rollback pattern. Before each LLM turn, the conversation buffer is checkpointed. If the turn completes, the checkpoint is discarded. If an interruption occurs, the checkpoint is restored and the turn restarts. This ensures that interruptions do not corrupt conversation state.

Can you use RAG for voice agent memory?

Standard RAG does not work for real-time voice agents due to retrieval latency. Vector similarity search over large embedding stores takes 40ms to 200ms, which consumes half the available A2A budget. The correct approach is to keep active context in-process on the inference server and update it incrementally. External retrieval only happens on session startup and topic switches.

How does TTS overlap with LLM generation in the pipeline?

TTS synthesis starts before LLM generation is complete. Tokens stream from the LLM as they are produced, and TTS synthesizes the text incrementally as it arrives. The pipeline is three stages deep: the LLM generates tokens N to N+10 while TTS synthesizes tokens N-5 to N and audio plays for tokens N-10 to N-5. This overlap is what enables the agent to start playing audio within 150ms of the first token being generated.


For more on the latency pipeline that this memory system lives inside, see my benchmark post on real-time voice agent latency. For the LLM context window management that determines how much memory you can hold, see how Anthropic's contextual retrieval changes RAG architecture. For the broader agent infrastructure context, see production AI agent errors: what actually fails.