The first time I split a single agent into two, I thought I was solving a parallelism problem. What I actually did was create two new failure modes I did not have before.
The single-vs-multi-agent decision gets treated like an architecture purity question. Engineers debate it in absolutes: one agent is too monolithic, so many agents must be better. Or: multiple agents introduce too much coordination overhead, so one agent is always the right call. Both framings are wrong.
The real question is what kind of failure you can afford.
What you actually get with a single agent
A single agent loop is synchronous by default. Perceive, think, act, remember. Repeat. When it works, it works cleanly. When it fails, you get one trace, one error, one place to look.
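In code, that loop is nothing exotic. Here is a toy sketch; the `think` and `act` stand-ins are placeholders for whatever model call and tool layer you actually use, not any particular framework's API:

```python
# A minimal, synchronous single-agent loop (toy sketch, not a real framework).
# The shape is the point: perceive -> think -> act -> remember, one trace.

def think(task: str, history: list[str]) -> dict:
    # Stand-in for a model call that sees the task plus the full history.
    step = len(history)
    return {"action": f"step-{step}", "done": step >= 2}

def act(decision: dict) -> str:
    # Stand-in for a tool call; in practice this is where most failures live.
    return f"result of {decision['action']}"

def run_agent(task: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        decision = think(task, history)   # perceive + think: full history in context
        result = act(decision)            # act: one tool call at a time, in sequence
        history.append(result)            # remember: append to the running trace
        if decision["done"]:
            break
    return history

print(run_agent("summarize the logs"))
```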
I have run single-agent systems in production for months at a stretch with very few incidents. The failures I did hit were almost always in tool-call reliability, not in the loop architecture itself. The agent would try to use a tool that returned nothing, or a tool that timed out, and it would either recover or hang. The failure surface was contained.
The hard limit I ran into was context. A single agent working on a complex task accumulates history inside its context window. After a certain point, the cost of passing that history to every reasoning step becomes prohibitive. The agent starts forgetting the beginning of the task by the time it reaches the end. This is not a software bug. It is a physics problem with the memory hierarchy in AI systems.
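The shape of the mitigation is always some version of trimming. A rough sketch of one approach, a rolling window over recent history with a placeholder for what was dropped; the token counting here is a crude word-count stand-in, not a real tokenizer:

```python
# Toy illustration of the context problem: every reasoning step re-reads the
# accumulated history, so prompt size and cost grow with each step. One common
# mitigation, sketched here, keeps only the most recent entries within a budget.

def estimate_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer.
    return len(text.split())

def build_context(history: list[str], budget: int = 2000) -> str:
    kept: list[str] = []
    used = 0
    for entry in reversed(history):          # walk newest-first
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    dropped = len(history) - len(kept)
    summary = f"[{dropped} earlier steps summarized]" if dropped else ""
    return "\n".join([summary] + list(reversed(kept))).strip()
```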
A single agent also serializes your work. If the task requires doing X, then Y, then Z, and then the whole cycle again, a single agent does them in sequence. You can batch prompts and get some throughput, but the agent loop itself is not concurrent.
Where multi-agent systems pay off
I moved to multi-agent architectures for two reasons. The first was specialization. A coding agent that also has to manage a file browser and a shell and a PR reviewer is juggling too many tool schemas. When I split those into separate agents, each one knew exactly what its tools did and the tool schemas stayed clean. This is the pattern behind what I later recognized as the supervisor agent pattern in production.
The second reason was isolation. When a research agent goes off the rails, I do not want it to corrupt the state of a code generation agent working on the same task. Separate agents mean separate memory, separate context, separate failure boundaries.
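A sketch of what that split looks like, with hypothetical names: each specialist owns a small tool set and its own memory, and a supervisor routes work between them. This is illustrative, not a specific framework's API:

```python
# Each agent owns a coherent tool set and isolated memory; a supervisor
# sequences the specialists. A failure in one agent stays in its own state.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    tools: dict[str, Callable[[str], str]]
    memory: list[str] = field(default_factory=list)  # isolated per agent

    def run(self, task: str) -> str:
        # A real agent would pick a tool via a model call; here we pick the first.
        tool_name, tool = next(iter(self.tools.items()))
        result = tool(task)
        self.memory.append(f"{tool_name}: {result}")
        return result

coder = Agent("coder", {"write_code": lambda t: f"patch for {t}"})
reviewer = Agent("reviewer", {"review_pr": lambda t: f"review of {t}"})

def supervisor(task: str) -> str:
    # The supervisor hands off between specialists with clean tool schemas.
    patch = coder.run(task)
    return reviewer.run(patch)

print(supervisor("fix the login bug"))
```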
The speed gains are real but conditional. If two agents can work on independent sub-problems simultaneously, I see latency drop to roughly the time of the slower one rather than the sum of both. For a task decomposed into three parallel workstreams, this can mean close to a 3x improvement in wall-clock time. That math only holds if the work is actually parallelizable and the agents are not waiting on each other for shared state.
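A toy demonstration of that conditional math, with `asyncio.sleep` standing in for independent agent work:

```python
# Three independent sub-tasks dispatched concurrently finish in roughly the
# time of the slowest one, not the sum. Nothing here touches shared state.
import asyncio
import time

async def sub_agent(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)   # stand-in for an independent workstream
    return f"{name} done in {seconds}s"

async def main() -> None:
    start = time.perf_counter()
    results = await asyncio.gather(
        sub_agent("research", 1.0),
        sub_agent("codegen", 1.5),
        sub_agent("tests", 1.2),
    )
    elapsed = time.perf_counter() - start
    print(results)
    print(f"wall clock ~{elapsed:.1f}s (the max, not the 3.7s sum)")

asyncio.run(main())
```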
The failure modes nobody talks about
Here is what the architecture debates skip over: multi-agent systems fail in transit. Between agents.
A single agent fails inside a loop you control. A multi-agent system fails across message boundaries. The supervisor sends a task to a sub-agent and gets back something unexpected. The sub-agent completes its work but the memory update never reaches the shared store. The coordination agent dispatches tasks to three workers and one of them silently drops off the network.
I ran into this repeatedly when building a pipeline that used event-driven agent architectures. The event bus was reliable in testing. In production, under load, message delivery became non-deterministic. Agents would complete work and publish results that nobody consumed because the consumer had restarted and re-subscribed. The system looked alive. The work was not happening.
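One way to make that failure visible is to keep published results pending until a consumer explicitly acknowledges them, so a consumer that restarts and re-subscribes can replay anything it never processed. Here is a toy in-memory sketch of that idea, not the actual event bus I used:

```python
# Toy at-least-once result store: a result stays pending until acked, so a
# restarted consumer replays unacked messages instead of assuming it saw them.
import uuid

class ResultBus:
    def __init__(self) -> None:
        self.pending: dict[str, str] = {}   # message id -> payload, until acked

    def publish(self, payload: str) -> str:
        msg_id = str(uuid.uuid4())
        self.pending[msg_id] = payload
        return msg_id

    def replay_unacked(self) -> list[tuple[str, str]]:
        # A restarted consumer calls this instead of trusting its subscription.
        return list(self.pending.items())

    def ack(self, msg_id: str) -> None:
        self.pending.pop(msg_id, None)

bus = ResultBus()
bus.publish("research findings v1")
# ...consumer crashes and re-subscribes here...
for msg_id, payload in bus.replay_unacked():
    print("reprocessing", payload)
    bus.ack(msg_id)
```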
Error propagation is the other trap. In a single-agent system, an error stays local. In a multi-agent system, one agent's error can cascade if other agents trust its outputs without validation. I wrote about this in the context of production AI agent errors and it applies directly here. A planner agent that receives bad output from a research agent will build its entire subsequent plan on that bad foundation.
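One guard that blunts the cascade is validating a sub-agent's output at the boundary before anything downstream builds on it. A minimal sketch, with illustrative field names:

```python
# The planner checks the research agent's output against a minimal schema and
# fails loudly at the boundary, instead of planning on a bad foundation.

REQUIRED_FIELDS = {"sources", "summary", "confidence"}

def validate_research_output(output: dict) -> dict:
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        raise ValueError(f"research output missing fields: {sorted(missing)}")
    if not output["sources"]:
        raise ValueError("research output cites no sources; refusing to plan on it")
    return output

def plan(research: dict) -> list[str]:
    research = validate_research_output(research)  # fail here, not three steps later
    return [f"investigate {source}" for source in research["sources"]]

print(plan({"sources": ["queue metrics"], "summary": "...", "confidence": 0.7}))
```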
Memory becomes exponentially more complex
Single-agent memory is one problem. Multi-agent memory is several interconnected problems.
If agents share memory, I have to reason about consistency. When Agent A writes to the shared store and Agent B reads from it, at what point does B see A's write? If I use event-driven message passing instead, each agent has its own view of state and I have to manage synchronization manually.
The failure mode I hit was this: a planning agent would read the shared memory expecting the research agent to have finished and written its findings. The research agent had finished. It had written its findings to its own isolated memory, not the shared store. The planning agent proceeded with stale data.
I eventually had to build explicit memory synchronization steps into the workflow. Each agent would confirm its writes had propagated before the next agent was allowed to proceed. This added latency and made the workflow less concurrent, which undercut the original reason for splitting into multiple agents.
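The shape of that synchronization step, sketched as a toy in-memory store: writes are versioned, and the next agent blocks until it can read at least the version it was told to expect. In production the same idea sat in front of an external store, which is exactly where the extra latency came from:

```python
# Versioned writes plus a blocking read: the next agent waits until the
# expected write is visible, or fails loudly instead of proceeding on stale data.
import threading

class SharedStore:
    def __init__(self) -> None:
        self._data: dict[str, tuple[int, str]] = {}   # key -> (version, value)
        self._cond = threading.Condition()

    def write(self, key: str, value: str) -> int:
        with self._cond:
            version = self._data.get(key, (0, ""))[0] + 1
            self._data[key] = (version, value)
            self._cond.notify_all()
            return version   # the writer hands this version to the next step

    def read_at_least(self, key: str, version: int, timeout: float = 5.0) -> str:
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._data.get(key, (0, ""))[0] >= version, timeout
            )
            if not ok:
                raise TimeoutError(f"{key} never reached version {version}")
            return self._data[key][1]

store = SharedStore()
v = store.write("research.findings", "queue drops under load")
print(store.read_at_least("research.findings", v))
```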
When I look at shared vs isolated memory in multi-agent workflows, the tradeoff is not which is better. It is which failure mode is easier to debug.
When to make the call
I use a simple heuristic. If the task has one clear goal and the steps to reach it are linear or lightly branched, a single agent with good context window management will get you there faster and with fewer surprises.
If the task has multiple independent workstreams, requires different tool sets that do not share schemas, or needs true concurrency for latency reasons, I split into multiple agents.
The decision is not architectural. It is operational. Ask yourself: which failure modes am I choosing between? A single agent that fails completely is often easier to recover from than a multi-agent system that fails partially and keeps running in a broken state.
I have shipped both. The multi-agent systems felt more impressive going in. The single-agent systems were easier to debug when things went wrong.