Receipts Before Autonomy

When a system acts on your behalf, what record should exist?

Watercolor illustration of an institutional hallway. A security camera watches a closed door while a signed receipt sits under a warm desk lamp.
Image generated with Nano Banana 2

This is the fourth piece in Future Shock's agent coordination series, following The Agent Demo Ends Too Early, A Five-Agent Stack Is Not a Company, and Pass State, Not Story.

The agent finished the job, and the trace looked clean. Model call, tool call, retrieval, another model call, final answer. Latency was acceptable, cost was logged, and no exception fired. From the outside, the workflow behaved exactly the way a production system is supposed to behave.

Then someone asked who approved the answer.

Most people cross the line into agency without noticing it. At first, AI looks things up, drafts a reply, or summarizes a pile of notes. Then it starts doing things: sending the email, changing the file, booking the flight, submitting the form.

The shift feels smooth because the interface barely changes. You are still typing into a chat window, but the relationship has changed underneath it. The system has gone from a tool you operate to something acting on your behalf.

Once something acts on your behalf, the first question is whether anyone can reconstruct what happened. Success is only part of the record. You need to see what the system decided, what it skipped, which information it used or ignored, and whether any person or policy approved the action before it moved forward.

Banks have transaction histories, hospitals have charts, and courts have transcripts. When someone acts on your behalf, whether a lawyer, broker, or contractor, there is almost always a paper trail. Paperwork is annoying right up until the first dispute; after that, it is the thing trust leans on.

Agents are starting to act before the record-keeping has caught up.

The industry is building the wrong half first

The agent tooling world has noticed part of this problem. A real industry now exists around agent observability, built on the idea that if you can trace every step an agent takes, you can replay what happened when something goes wrong.

OpenAI's Agents SDK ships with tracing on by default, recording model calls, tool calls, handoffs, and guardrail checks. Platforms like LangSmith, Braintrust, and Langfuse capture tool arguments, reasoning chains, state transitions, memory operations, costs, and latency. That is genuinely useful. A year ago, most agent runs were black boxes; now you can at least see the steps.

A trace is still a recording of activity. It can tell you that the agent called the database, searched the web, drafted a response, and sent it. It cannot, by itself, tell you whether the database result was current, whether the search was enough to act on, whether anyone approved the response, or whether the action can be undone.

A trace is the security camera; a receipt is the signed document.

The security camera is useful, but when the question is "who authorized this wire transfer," footage of someone walking into the bank is not the answer. The industry is building better cameras while the signed document is still missing.

What a receipt actually carries

A receipt is smaller than a trace and heavier than a summary.

Take a concrete example. A customer asks in a support chat to update their billing address, and an agent does it. The trace records the sequence: user message received, intent classified, database lookup, address validation API called, database updated, confirmation sent. Everything looks fine.

A receipt for the same action would need to carry a few things the trace does not: what changed, from the old billing address to the new one; what evidence supported the change, including the customer's message and any identity check; how confident the system was about the request; who or what authorized the update; and whether the action can be reversed and, if so, for how long.

That is less data than the trace itself produces, but it answers a different question. The trace asks what the system did. The receipt asks whether it should have done it, and whether anyone can fix it if it should not have.
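As a rough illustration, the receipt fields listed above could be captured in a small record like the one below. This is a minimal sketch, not a standard format: the class name, field names, the example addresses, the confidence value, and the 30-day reversal window are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone
from typing import Optional
import json


@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt for one action taken on a user's behalf."""
    action: str                           # what kind of change was made
    before: str                           # state before the change
    after: str                            # state after the change
    evidence: tuple[str, ...]             # what supported the change
    confidence: float                     # how sure the system was (0.0 to 1.0)
    authorized_by: str                    # person or policy that said yes
    performed_at: datetime                # when the action happened
    reversible_until: Optional[datetime]  # None means it cannot be undone

    def can_reverse(self, now: datetime) -> bool:
        """True while the reversal window is still open."""
        return self.reversible_until is not None and now <= self.reversible_until


# Illustrative values only; none of these come from a real system.
performed = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
receipt = Receipt(
    action="update_billing_address",
    before="12 Old Road",
    after="34 New Street",
    evidence=("customer chat message", "identity check passed"),
    confidence=0.93,
    authorized_by="policy:self-service-address-change",
    performed_at=performed,
    reversible_until=performed + timedelta(days=30),
)

# Serialize for storage next to, not instead of, the trace.
print(json.dumps(asdict(receipt), default=str, indent=2))
```

The record is frozen on purpose: a receipt that can be edited after the fact answers a weaker question than one that cannot.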

The weight of the receipt should scale with the consequence of the action. A chatbot answering a FAQ needs almost nothing. An agent changing a medical record needs everything. The threshold is what breaks when the agent is wrong, not how sophisticated the agent looks while doing it.
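One way to make that scaling concrete is to tie the required receipt fields to a consequence tier. The sketch below assumes three tiers and a particular set of fields per tier; both are hypothetical, and a real system would draw those lines based on what breaks when the agent is wrong.

```python
from enum import IntEnum


class Consequence(IntEnum):
    """Hypothetical tiers, ordered by what breaks if the agent is wrong."""
    LOW = 1     # e.g. answering a FAQ
    MEDIUM = 2  # e.g. changing a billing address
    HIGH = 3    # e.g. changing a medical record


# Assumed mapping from tier to the fields a receipt must carry.
REQUIRED_FIELDS = {
    Consequence.LOW: {"action"},
    Consequence.MEDIUM: {"action", "before", "after", "evidence", "authorized_by"},
    Consequence.HIGH: {
        "action", "before", "after", "evidence",
        "confidence", "authorized_by", "reversible_until",
    },
}


def missing_fields(receipt: dict, consequence: Consequence) -> set[str]:
    """Return the required fields that the receipt does not contain."""
    return REQUIRED_FIELDS[consequence] - set(receipt)


faq_receipt = {"action": "answer_faq"}

# Enough for a low-stakes action, far too thin for a high-stakes one.
print(missing_fields(faq_receipt, Consequence.LOW))
print(sorted(missing_fields(faq_receipt, Consequence.HIGH)))
```

A check like this could run before the action rather than after it, so that a thin receipt blocks a high-consequence change instead of merely documenting it.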

We kept finding the same gap in different rooms

Future Shock has spent the last several months running agents through simulated environments: a startup software build, a crisis governance scenario, and a series of multi-agent handoff experiments. The setups, stakes, and models changed, but the same problem kept surfacing.

In the startup build, a five-agent team produced a working software artifact. Under one launch standard, zero out of fifteen agents voted to ship it. Under another standard, with the same file sitting in the same folder, all fifteen voted to ship. The artifact had not changed; the authorization rule had. The system had no durable record of what "done" meant, only a vote that changed with the question.

In the handoff experiments, agent summaries turned tentative observations into settled facts. A risk someone flagged in passing became, three handoffs later, a risk that had been evaluated and accepted. The summary compressed uncertainty out of the record, and the next agent downstream read confidence where there had been ambiguity.
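A handoff can guard against that drift by carrying claims as structured records whose status only changes through an explicit, attributed step. The following is a minimal sketch of the idea, not the setup used in the experiments; the `Claim`, `Status`, `hand_off`, and `promote` names, the agent labels, and the example risk are all hypothetical.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Status(Enum):
    FLAGGED = "flagged"      # noticed in passing, not yet assessed
    EVALUATED = "evaluated"  # someone looked at it
    ACCEPTED = "accepted"    # someone decided to live with it


@dataclass(frozen=True)
class Claim:
    text: str
    status: Status
    raised_by: str
    decided_by: Optional[str] = None  # set only by an explicit promotion


def hand_off(claims: list[Claim], summary: str) -> dict:
    """Pass free-text summary along, but copy the claims unchanged."""
    return {"summary": summary, "claims": list(claims)}


def promote(claim: Claim, status: Status, decided_by: str) -> Claim:
    """The only way a claim's status changes, and it needs a name."""
    if not decided_by:
        raise ValueError("a status change needs a named decider")
    return replace(claim, status=status, decided_by=decided_by)


risk = Claim("Payment retries may double-charge", Status.FLAGGED, raised_by="agent-2")

# Three handoffs; the prose summary grows more confident each time.
summaries = [
    "One payment risk noted.",
    "Payment risk reviewed.",
    "Payment risk accepted.",
]
packet = {"summary": "", "claims": [risk]}
for text in summaries:
    packet = hand_off(packet["claims"], text)

# The summary drifted, but the structured claim did not.
print(packet["summary"], "|", packet["claims"][0].status.value)
```

The summary is still free to say whatever the last agent wrote; the point is that a downstream reader has a second, stricter field to check it against.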

In the crisis scenario, nearly every run stabilized the simulated station. That looked like capability until we asked what kind of survival the scoring system had rewarded. The outcome was fine; the process record was too thin to show whether it was earned or lucky.

Each case had the same receipt problem. The system did things, but the record of why it did them, under what authority, and with what uncertainty did not survive the work moving between agents.

Autonomy is earned, not assumed

The agent conversation keeps circling the same question: how much autonomy should agents get? The better question is how much autonomy a system has earned through the quality of its records.

A system that can show what it did, what evidence it used, what it was uncertain about, who approved the action, and how to reverse it if something went wrong can probably handle broader action rights. The model may or may not be smarter, but the infrastructure around it has become trustworthy enough to check.

A system that produces clean outputs and leaves no record of how it got there is a system you are trusting on vibes. That may be fine for low-stakes tasks. It is a terrible bargain for anything that touches money, health, legal status, or someone else's data.

The useful version of the agent future probably looks less like a swarm of autonomous digital workers and more like intake queues, approval gates, receipts, escalation paths, and rollback procedures. The same boring machinery that makes every other system acting on your behalf trustworthy enough to use is also what makes agents production-ready.

The next time someone shows you an agent demo, watch for when it ends. If it stops when the agent finishes the task, you saw a capability demo. If it keeps going long enough to show what changed, who approved it, and how to undo it, you saw something closer to a system.

Most demos still end too early.
