Pass State, Not Story
Agent coordination fails not from missing context but from summaries that strip confidence metadata, so downstream agents treat narrative as operational truth.
This is the third article in the agent coordination series, following The Agent Demo Ends Too Early and A Five-Agent Stack Is Not a Company.
The summary that ate the nuance
The default handoff between agents today is some version of a summary. One agent finishes its work, compresses what happened into a few paragraphs, and passes it downstream. The receiving agent reads the summary the way anyone reads a summary: as a statement of current reality.
This works fine when the summary is accurate. It breaks when the summary is confident about things the original conversation was not. A question becomes a known limitation. A suggestion becomes a decision. A risk someone flagged in passing becomes a risk that was evaluated and accepted. Each of these upgrades is small, locally reasonable, and invisible to anyone who wasn't in the original conversation.
The problem compounds across handoffs. Agent B summarizes what it received from Agent A, and the result is now two layers of compression deep. Agent C gets a version of reality that reads as settled, sourced, and current, because that is what summaries look like. By the time the work reaches an agent that can act on it, the provenance of every claim in the handoff has been stripped away. The confident summary has become the operating manual.
This is particularly dangerous because it looks exactly like coordination working well. The handoff is clean, the context is there, the downstream agent has everything it needs. The failure only becomes visible when someone tries to trace a decision back to its origin and discovers that the decision was never made. It was inferred, compressed, and promoted, one summary at a time.
Why transcripts don't fix this
The intuitive response is to pass the full transcript instead. If the summary loses nuance, give the next agent everything.
A 200,000-token transcript contains the truth somewhere inside it. It also contains abandoned approaches, superseded decisions, stale objections, off-hand comments, and ideas that were raised and quietly dropped. The next agent has to figure out which parts of that record are current, which are historical, and which were never serious in the first place. A longer context window makes this harder, not easier, because it gives every possible interpretation more supporting evidence.
Anyone who has worked in a Slack channel with a long decision thread recognizes the dynamic. Everyone has read the thread. Everyone has a different understanding of what was decided. The thread can justify any reading because the thread contains everything, and "everything" includes contradictions.
A larger context window doesn't grant decision rights. It grants the ability to remember more without the ability to distinguish which memories are still operative.
The confidence laundering problem
The specific mechanism that makes summaries dangerous is what you might call confidence laundering.
In step one, an agent notes that a feature "might need additional security review." In step two, another agent's summary records that "additional security review was flagged." In step three, a downstream agent reads that line and infers that the review happened, because flagged issues in a well-run process get addressed. In step four, the next summary says "security review complete," because from that agent's perspective the topic was raised and handled upstream.
Each agent in the chain faithfully represented what it received. The chain still produced a false claim. The original tentative observation moved through four steps, gaining confidence at each one, until it arrived as a verified fact at the point where someone was ready to act on it.
The mechanism matters because it is distinct from hallucination. The agents did not invent information. They compressed it, and compression without confidence metadata is lossy in a specific direction: toward certainty. Summaries do not naturally say "this part is a guess" or "this was a question, not a conclusion." They present everything at the same level of authority, because that is what a well-written summary does.
Anyone who has used a long-running LLM session recognizes a version of this from context compaction. The model works with you for an hour, holding the full messy context: the false starts, the half-decisions, the thing you were mid-thought on when the context window filled up. Then compaction fires, and what comes out is a neat, past-tense, declarative summary. The post-compaction model reads that summary as settled history. If the summary said "approach identified," the fresh model treats it as "approach decided." The thread of active exploration becomes a closed chapter.
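One way to picture the missing metadata is a claim that carries its own status, plus a relay step that refuses to raise that status unless new evidence comes with it. The sketch below is illustrative only: the names Status, Claim, and relay, the three status levels, and the evidence string are all assumptions made for this example, not part of any existing framework.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Status(IntEnum):
    """Hypothetical confidence levels, ordered from least to most settled."""
    QUESTION = 0   # someone wondered or asked
    FLAGGED = 1    # raised as needing attention, not yet acted on
    VERIFIED = 2   # backed by an artifact someone can check


@dataclass(frozen=True)
class Claim:
    text: str
    status: Status
    evidence: Optional[str] = None  # pointer to a checkable artifact, if any


def relay(claim: Claim, new_text: str,
          new_status: Optional[Status] = None,
          evidence: Optional[str] = None) -> Claim:
    """Reword a claim for the next agent without silently upgrading it."""
    status = claim.status if new_status is None else new_status
    # Raising the status is only allowed when this call supplies new evidence.
    if status > claim.status and evidence is None:
        raise ValueError(
            f"cannot promote {claim.status.name} to {status.name} without evidence"
        )
    kept_evidence = evidence if evidence is not None else claim.evidence
    return Claim(new_text, status, kept_evidence)


# The chain from the example above: the wording may change, the status may not.
note = Claim("feature might need additional security review", Status.QUESTION)
step_two = relay(note, "additional security review was flagged")

try:
    relay(step_two, "security review complete", Status.VERIFIED)
except ValueError as err:
    print(err)
```

Under these assumptions the second summary is free to rephrase the note, but it still arrives as a question. The jump to "security review complete" fails until someone attaches something checkable, which is the point where the laundering chain would otherwise have closed.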
Three things that should never collapse into one sentence
The handoff summary fuses three things that need to stay separate.
"The API test passed" is evidence: a verifiable claim about an observable event. "It looks ready to ship" is confidence: a judgment call based on the evidence, filtered through the agent's understanding of what "ready" means. "Ship it" is authority: a decision made by someone with the standing to make it, under whatever approval process governs the work.
Most agent handoff summaries blend all three into a single paragraph that reads as if the work has been evaluated, judged, and authorized. The receiving agent has no way to tell which parts are reporting facts, which are expressing opinions, and which are granting permissions. A useful handoff keeps these layers visible so the downstream agent knows whether it is inheriting a measurement, an assessment, or a directive.
This does not require a rigid schema. It requires the summary to be honest about what it knows and what it is guessing. Two sentences at the bottom of a natural-language handoff ("These claims are verified against test output; this assessment of readiness is my judgment, not a stakeholder approval") would prevent most confidence laundering chains before they start.
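A light-touch version of this might keep the narrative free-form and only generate the closing sentences from three short lists. The following is a sketch under that assumption; the Handoff class, its field names, and the footer wording are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Handoff:
    summary: str                                        # free-form narrative, untouched
    evidence: List[str] = field(default_factory=list)   # observed, checkable facts
    judgment: List[str] = field(default_factory=list)   # the sender's own assessments
    approved_by: Optional[str] = None                   # who authorized, if anyone did

    def render(self) -> str:
        """Return the summary followed by a footer that keeps the layers apart."""
        lines = [self.summary, ""]
        if self.evidence:
            lines.append("Verified: " + "; ".join(self.evidence) + ".")
        if self.judgment:
            lines.append("My judgment, not an approval: "
                         + "; ".join(self.judgment) + ".")
        if self.approved_by:
            lines.append(f"Authority: approved by {self.approved_by}.")
        else:
            lines.append("Authority: none granted; the next action is not approved.")
        return "\n".join(lines)


handoff = Handoff(
    summary="The checkout API change is implemented and merged to the test branch.",
    evidence=["the API test passed"],
    judgment=["it looks ready to ship"],
)
print(handoff.render())
```

The summary itself stays as loose as the sender likes. What changes is that an absent approval is stated outright instead of being left for the reader to infer.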
The flexibility objection
The strongest counterargument is genuinely compelling: the reason LLMs work as agents is precisely that they can handle messy, informal, natural-language context. Forcing structured handoff packets strips out the flexibility that makes agents better than traditional software in the first place.
This is a real tension, and the answer is not to replace natural language with JSON schemas. What matters is consequence. If the downstream agent is drafting a document or brainstorming options, a loose summary is fine. If it can spend money, contact customers, deploy code, publish claims, or change records, the system needs more than a story. It needs the summary to carry provenance: which claims are grounded in evidence, which are inferences, what questions remain open, who owns the next action, and what approval governs it.
The right handoff is a receipt, not a constitution. Small enough to use, explicit enough to audit.
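The consequence test can be sketched as a small gate in front of the action: loose summaries pass for low-stakes work, and consequential actions are held until the provenance fields are filled in. Everything named here (the action labels, the Provenance fields, may_proceed) is a hypothetical shape chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical labels for the consequential actions listed above.
CONSEQUENTIAL = {
    "spend_money", "contact_customer", "deploy_code",
    "publish_claim", "change_record",
}


@dataclass
class Provenance:
    evidence: List[str] = field(default_factory=list)        # grounded claims
    inferences: List[str] = field(default_factory=list)      # labeled guesses
    open_questions: List[str] = field(default_factory=list)  # still unresolved
    owner: Optional[str] = None                              # who owns the next action
    approval: Optional[str] = None                           # what approval governs it


def may_proceed(action: str, provenance: Optional[Provenance]) -> Tuple[bool, str]:
    """Decide whether an action can run on the handoff it was given."""
    if action not in CONSEQUENTIAL:
        return True, "low consequence: a loose summary is enough"
    if provenance is None:
        return False, "blocked: consequential action with no provenance"
    missing = []
    if not provenance.evidence:
        missing.append("evidence")
    if provenance.owner is None:
        missing.append("owner")
    if provenance.approval is None:
        missing.append("approval")
    if provenance.open_questions:
        missing.append("answers to open questions")
    if missing:
        return False, "blocked, missing: " + ", ".join(missing)
    return True, "provenance complete"


print(may_proceed("draft_document", None))
print(may_proceed("deploy_code", Provenance(evidence=["the API test passed"])))
```

Whether open questions should block outright or merely warn is a policy choice. This sketch blocks, on the assumption that a deploy is harder to undo than a delay.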
Making failure reconstructable
Human organizations solved this problem with boring infrastructure: tickets, chart notes, warehouse receiving slips, signoff sheets, commit messages. When something goes wrong in a hospital, someone can trace the chart. When a shipment is lost, someone can trace the receiving log. The record does not prevent every error, but it makes errors reconstructable, which means they can be understood, attributed, and corrected.
Agent systems today mostly lack this layer. When a multi-agent workflow produces a bad outcome, the forensic question is: which agent made the call, based on what information, under whose authority? If the answer requires reading thousands of tokens of unstructured transcript and guessing where the confidence upgrade happened, the system cannot learn from its failures. It can only be restarted in the hope that it does better next time.
The receipt is the primitive that makes coordination accountable. It records what moved, when, from whom to whom, based on what evidence, under what authority. Agents need the same boring machinery that human organizations have been building for centuries, because when work moves between people, or between agents, someone has to be able to reconstruct why.
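As a rough illustration, a receipt can be a small record that points at the receipt it was derived from, so a bad outcome can be walked back to the handoff where the evidence ran out. The field names, identifiers, and helper functions below are assumptions for this sketch, not an established format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Receipt:
    receipt_id: str
    what: str                        # what moved
    sender: str                      # from whom
    receiver: str                    # to whom
    evidence: List[str]              # based on what evidence
    authority: Optional[str]         # under what authority, if any
    parent_id: Optional[str] = None  # the receipt this one was derived from
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def trace(log: Dict[str, Receipt], receipt_id: str) -> List[Receipt]:
    """Walk from a receipt back to its origin, newest first."""
    chain: List[Receipt] = []
    seen = set()
    current = log.get(receipt_id)
    while current is not None and current.receipt_id not in seen:
        seen.add(current.receipt_id)
        chain.append(current)
        current = log.get(current.parent_id) if current.parent_id else None
    return chain


def first_unsupported(chain: List[Receipt]) -> Optional[Receipt]:
    """Return the earliest receipt in the chain that carried no evidence."""
    for receipt in reversed(chain):  # oldest first
        if not receipt.evidence:
            return receipt
    return None


receipts = [
    Receipt("r1", "feature might need security review", "agent-a", "agent-b",
            evidence=["design note"], authority=None),
    Receipt("r2", "security review was flagged", "agent-b", "agent-c",
            evidence=[], authority=None, parent_id="r1"),
    Receipt("r3", "security review complete", "agent-c", "agent-d",
            evidence=[], authority=None, parent_id="r2"),
]
log = {r.receipt_id: r for r in receipts}

chain = trace(log, "r3")
print([r.receipt_id for r in chain])
print(first_unsupported(chain).receipt_id)
```

In this toy log, the walk from "r3" shows that nothing after the first handoff was backed by evidence, which answers the forensic question without rereading a transcript.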
The transcript still matters. When something breaks, someone will need the full messy record: the hedging, the abandoned ideas, the moment a question became an assumption.
But the next agent in the chain does not need the full record. It needs to know what is true right now, how confident that claim is, and who is responsible for what happens next. A shared story can make every agent feel informed. Only shared state can make the next action accountable.