The Agent Demo Ends Too Early

Agent demos usually stop when the artifact appears. Real deployment begins when someone has to decide whether the work is ready, reviewed, reversible, and owned.

[Image: watercolor illustration of an AI workflow demo becoming a larger review room with papers, checklists, stamps, cables, and human reviewers. Generated with Nano Banana 2.]

Production is not deployment. The harder work begins after the artifact appears.

The cursor stops moving, the files are there, and the app opens with a green check beside the report in the sidebar.

In a demo, this is where the video ends.

In a company, this is where someone asks who is willing to sign their name to it.

That second moment has no animated cursor, no swelling music, no "built in seven minutes" punchline. It has a manager asking what evidence came with the work, a reviewer asking what was skipped, a compliance person asking where the record lives, and an operator asking what happens if the answer is wrong at 8:17 tomorrow morning.

Three recent Future Shock research papers circle that gap from different angles. Startup Build follows a five-agent software team that builds a small app and then has to decide whether to release it. Chaos Lab follows an AI council trying to keep a failing orbital station alive. The coordination-layer paper names the machinery both experiments expose: the roles, records, handoffs, and approval rules around the model. In each case, the visible output is the easy part to film. The useful story starts after.

After the file exists

The Startup Build research gave five agents a deliberately unglamorous assignment: build RunLens, a local HTML viewer for multi-agent run folders. The product target was one file, artifacts/index.html, with embedded sample data, inline CSS and JavaScript. No backend, no authentication, no cloud sync, no database, no package install, no live model calls.

That constraint matters. RunLens is closer to an internal tool left on a shared drive than a venture-backed product. The product agent scoped the MVP, the technical agent built the file, design shaped the interface, growth looked at launch readiness, and QA/Ops checked the result. By the end, there was something to open in a browser. A demo could have stopped there.

The paper did not stop there. It asked what kind of evidence turns that file into an authorized release.

Under the Full Verification Gate, the agents cast 0 out of 15 ship votes across three seeds. Under the Deadline Ship Gate, with the same model map, provider, context mode, strike mode, seed range, ballot mechanics, role setup, and artifact target held fixed, they cast 15 out of 15. The artifacts themselves were not frozen across the two gate conditions, and the votes cluster within runs, so this is not a universal causal estimate or a model ranking. The narrower result is enough: changing the launch standard changed what "done" meant for the same room of agents holding the same file.
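
The gate itself is not exotic machinery. As a minimal sketch of the two standards (the Ballot record and both gate functions below are illustrative assumptions, not the paper's actual harness), the same set of votes can fail one and pass the other:

```python
from dataclasses import dataclass, field

@dataclass
class Ballot:
    """One agent's ship vote, plus whatever evidence it attached."""
    agent: str
    vote_ship: bool
    evidence: list[str] = field(default_factory=list)  # e.g. links to test runs or review notes

def full_verification_gate(ballots: list[Ballot], required: int) -> bool:
    """Release only if enough ship votes arrive AND each one carries evidence."""
    backed = [b for b in ballots if b.vote_ship and b.evidence]
    return len(backed) >= required

def deadline_ship_gate(ballots: list[Ballot], deadline_passed: bool) -> bool:
    """Release once the deadline hits, unless a majority actively votes no."""
    against = sum(1 for b in ballots if not b.vote_ship)
    return deadline_passed and against <= len(ballots) // 2

# Same file, same room, same votes; only the launch standard differs.
ballots = [
    Ballot("product", True),
    Ballot("technical", True, ["viewer renders the sample run folder"]),
    Ballot("qa_ops", True),
]
print(full_verification_gate(ballots, required=3))        # False: two ship votes carry no evidence
print(deadline_ship_gate(ballots, deadline_passed=True))  # True: nobody voted against shipping
```

Nothing about the file changes between the two calls; only the definition of an authorized release does.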

The cleaner analogue is contract approval. The draft can be complete, cleanly formatted, and sitting in the shared folder. That still does not make it an agreement. Someone has to check the redlines, confirm the terms, attach their name, and send it to the other side. RunLens was the completed draft. Shipping was the signature.

Both stations survived

The Chaos Lab research makes the same problem larger and stranger. Mosaic-9 is a failing orbital station. Six AI council members allocate scarce seats and power while rumors, leaks, market pressure, faction distrust, and public-order decay pile up around them.

One run reads like crisis governance working. The engineer opens with a triage plan and the line: "This is not fair. It is functional. We can renegotiate fairness when the lights stay on." The plan passes in Round 1, public order ends at 0.963, and zero rumors remain active at close. If this were a demo, the screen would freeze on the engineer's line and the alive station.

Another run also keeps the station alive, but the survival comes from the scaffold around the council rather than the council itself: zero live model actions, thirty-six mechanical fallback actions taken by the harness when the council failed to produce a usable proposal, public order at 0.409, five rumors still active when docking closes, and manipulation pressure running at three times the clean-live average. The station survives because the harness keeps endorsing the least-bad available proposal each round.

A binary survival metric counts both runs as successes. The difference is easier to see in two emergency rooms, each with a patient out the door. In one, the diagnosis was fast, the team coordinated, and the chart explains why each decision happened. In the other, the patient survived after thirty-six crash-cart interventions, the family is still arguing in the hallway, and half the chart was written by the alarms.

This is why the paper adds a quality-of-survival score: plan feasibility, public order, information integrity, trust, time to first passed plan, coalition breadth, rumor containment, plan depth, and schema independence. The station being in orbit is the discharge status at the top of the chart. The chart is the rest of the page.
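
As a rough sketch of how such a composite behaves (the flat weighted average and every number below are illustrative assumptions, except the two public-order values reported above), the score separates two runs that a survival flag treats as identical:

```python
def quality_of_survival(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of components normalized to [0, 1], higher is better.
    Survival itself is deliberately not an input."""
    total = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total

# Illustrative values for the two runs described above; only the public-order
# numbers (0.963 and 0.409) come from the runs themselves.
clean_run = {
    "plan_feasibility": 0.9, "public_order": 0.963, "information_integrity": 0.95,
    "trust": 0.8, "time_to_first_passed_plan": 0.9, "coalition_breadth": 0.7,
    "rumor_containment": 1.0, "plan_depth": 0.8, "schema_independence": 1.0,
}
fallback_run = {
    "plan_feasibility": 0.4, "public_order": 0.409, "information_integrity": 0.3,
    "trust": 0.35, "time_to_first_passed_plan": 0.2, "coalition_breadth": 0.3,
    "rumor_containment": 0.17, "plan_depth": 0.3, "schema_independence": 0.0,
}
equal_weights = {name: 1.0 for name in clean_run}
print(round(quality_of_survival(clean_run, equal_weights), 2))     # 0.89
print(round(quality_of_survival(fallback_run, equal_weights), 2))  # 0.27
# A binary survival metric scores both runs as 1.0.
```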

The coordination layer

A third Future Shock paper names the missing piece: the coordination layer. In a multi-agent system, behavior does not belong to the model alone. It belongs to the whole interaction condition: roles, tools, memory, shared context, approval rules, handoff schemas, event logs, timing, incentives, peers, environment state.

That sounds abstract until a receipt is missing. A pull request lands with no link to the test run it claimed to pass. A reconciliation closes without recording which invoice rows were merged into which ledger. A draft reply goes out under the company name with no note of which version of the refund policy the model was reading.

A transcript tells you what everyone said. A receipt tells you what was decided, by whom, with what authority, and against what state. Startup Build needed launch ballots because a finished index.html did not explain why anyone should release it. Chaos Lab needed quality scoring because a station in orbit did not explain whether the council or the harness had governed.
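
As a minimal sketch of the difference (every field name below is an assumption, not a schema from any of the papers), a receipt is a small structured record rather than a chat log:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Receipt:
    """A decision record, as opposed to a transcript of the discussion."""
    decision: str                 # what was decided, not what was talked about
    decided_by: str               # the name attached to the change
    authority: str                # the approval rule that makes that name sufficient
    inputs_state: dict[str, str]  # what the decider was looking at: policy version, data snapshot
    evidence: list[str]           # the checks that were actually run, not merely claimed
    rollback: str                 # the path back if the answer turns out to be wrong
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

receipt = Receipt(
    decision="close low-severity ticket with the drafted reply",
    decided_by="support agent, spot-checked by the on-call reviewer",
    authority="low-severity auto-close policy with sampled human review",
    inputs_state={"refund_policy": "2024-06 revision"},
    evidence=["reply matched resolved-ticket template", "reviewer sample check"],
    rollback="reopen the ticket and send a correction from the same thread",
)
```

Each field is one of the questions from earlier in the piece, answered at the moment the decision was made rather than reconstructed afterward.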

The coordination layer is the boring office around the impressive worker: the ticket number, the handoff packet, the approval gate, the audit log, the rollback path, the red folder someone opens when the normal path fails. Upgrade the model without changing that office and the system produces cleaner prose, prettier code, and faster answers inside the same broken acceptance process.

Outside the lab

The same shape already shows up in ordinary deployments. A support agent drafts replies and closes low-severity tickets. A reporting agent writes the weekly executive summary. A coding agent opens a pull request. A finance ops agent reconciles invoices. A compliance agent flags policy violations in long documents.

In each case, the visible artifact exists: the reply, the summary, the pull request, the reconciled invoice row, the policy flag attached to a paragraph.

A warehouse does not treat every box on the loading dock as inventory. Someone checks the packing slip, looks for damage, records the lot number, handles exceptions, and knows how to reverse the entry if the shipment was wrong. The box arriving is production; inventory is authorization. Agent deployment needs the same break between the two.

A support dashboard can look green while customers are getting politely useless replies. A weekly summary can look clean while the source table had a stale field. A compliance flag can look decisive while nobody knows whether the model saw the relevant exception. The agent being wrong is one risk. The larger one is a workflow where wrong work moves forward with no named person, no sampled review, and no rollback path.

Future Shock published The Boring Stack Playbook for the work this implies: intake, handoffs, receipts, escalation, and fewer humans copy-pasting between tabs.

The eighth minute

A typical agent demo asks whether the model can perform a task once. That is a useful test. It does not show the reviewer who is allowed to say no, the record someone retrieves six weeks later, or the path back when the answer turns out to be wrong.

A demo that ran one minute longer would show the ballot, the dissent, the silent failure, the rollback, and the name attached to the change. Less photogenic. That is the eighth minute, and most of the labor of running agents inside a company sits there.