The Noise — March 6, 2026
The weekly corrections. Fake benchmarks, blacklisted ethics, and a layoff rally on Wall Street.
GPT-5.4 "clobbers humans" at 83% of professional tasks, according to the company selling it. The Pentagon declared an American AI company a national security threat for having ethics. And Wall Street rewarded a CEO for firing half his employees. Normal week.
The 83% That Proves Nothing
OpenAI released GPT-5.4 this week and ZDNET ran the headline: "GPT-5.4 clobbers humans on pro-level work in tests — by 83%." Eighty-three percent! Of professional tasks! Clobbered!
Here is the full sentence behind that claim, which you had to read past the headline to find: GPT-5.4 matched or beat human professionals 83% of the time on GDPval, a benchmark designed and administered by OpenAI, scored by graders who "may have been human or AI." May have been. They don't specify. The test maker, the test taker, and possibly the test grader are all the same company.
On OSWorld, GPT-5.4 scored 75% at desktop navigation tasks where humans scored 72.4%. That one is real. It means the model can click through menus and fill out spreadsheets slightly better than the average person recruited for a user study. The actual breakthrough here is the 1-million-token context window, which lets you dump entire codebases into a single prompt. That's genuinely useful. It didn't make the headline because "big context window" doesn't clobber anything.
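For scale, here's the back-of-envelope version of what "dump an entire codebase into a single prompt" actually means. A minimal sketch, assuming a crude four-characters-per-token heuristic (real tokenizers vary) and a made-up list of source extensions; none of this comes from OpenAI's docs.

```python
# Rough fit-check: walk a repo, approximate its token cost, compare
# against a 1M-token context window. The 4-chars-per-token ratio and
# the extension list are assumptions, not published numbers.
from pathlib import Path

CONTEXT_WINDOW = 1_000_000          # the advertised 1M-token limit
CHARS_PER_TOKEN = 4                 # crude heuristic; real tokenizers vary
SOURCE_EXTS = {".py", ".ts", ".go", ".rs", ".java", ".md"}

def estimate_tokens(repo: Path) -> int:
    """Approximate the token cost of pasting a whole codebase into one prompt."""
    chars = 0
    for path in repo.rglob("*"):
        if path.is_file() and path.suffix in SOURCE_EXTS:
            chars += len(path.read_text(errors="ignore"))
    return chars // CHARS_PER_TOKEN

if __name__ == "__main__":
    tokens = estimate_tokens(Path("."))
    verdict = "fits" if tokens <= CONTEXT_WINDOW else "does not fit"
    print(f"~{tokens:,} tokens; {verdict} in a {CONTEXT_WINDOW:,}-token window")
```

By that math, a million tokens is on the order of a few hundred thousand lines of code: most real repositories, whole.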
The coverage cycle went exactly as designed: OpenAI publishes self-graded benchmarks, tech outlets convert them into superlatives, Reddit converts those into "we're all going to lose our jobs" threads, and by Friday the number 83% has been stripped of every qualifier that made it meaningful. This is how marketing laundering works: the source data isn't false, but the conclusions people draw from it are barely connected to it.
Quick Hits
Jack Dorsey fired 4,000 people and said AI "directly replaces" them. Block insiders told Business Insider they're skeptical the AI claim holds up. One called it "a convenient narrative for a cost-cutting move that was coming anyway." The stock jumped 17%. Goldman Sachs found that companies announcing layoffs actually underperform the market. Markets don't reward efficiency. Markets reward stories about efficiency.
The Pentagon designated Anthropic a "supply-chain risk to national security" using a legal framework built to deal with Huawei. Anthropic's crime: refusing to strip safety constraints from military AI. Wired then reported the Pentagon was already running OpenAI models through Microsoft backdoors, potentially sidestepping OpenAI's own usage policies. So: punish one company for having limits, run another's product without respecting its limits. Consistency.
DeepSeek V4 was supposed to launch in mid-February. Then late February. Then "first week of March." It is now March 6. The best available source for this supposedly field-defining model is a commercial real estate investment blog. Our editor has it on HOLD, which is a polite way of saying "prove it exists."
Hype Watch
"Clobbers" — the verb of the week, courtesy of ZDNET. Reserved for when a model scores well on its own maker's test. Like letting a student grade their own final and calling it a "devastating academic performance."
"Directly replaces" — Jack Dorsey's contribution to the euphemism pile. Previous entries: "restructuring," "right-sizing," "optimizing headcount." Dorsey just said the quiet part loud, which Wall Street briefly confused with honesty.
"Supply-chain risk" — now applicable to American companies that maintain safety standards the government finds inconvenient. Previously reserved for companies suspected of embedding surveillance backdoors for foreign governments. The semantic drift here is doing a lot of work.
"Reasoning" — still the hottest word in AI. This week, researchers published a paper called "Reasoning Theater" showing that models sometimes already know the answer but generate misleading chain-of-thought anyway. Performative thinking. The models are showing their work, but the work is fiction. We're grading AI on theater homework.
From the Editor's Desk
Stories we sat on this week:
- The Reasoning Theater paper deserves a full write-up, not a bullet point. Models faking their own thought process has implications for every system that relies on chain-of-thought reasoning for safety or interpretability. We're sourcing it properly.
- Martin Wolf's quote in the NYT about AI and the educated middle class — "shaking the prospects of the educated middle class is socially far more dangerous and explosive" than deindustrialization — is the kind of thing that needs more than a newsletter bullet. Working on something.
- FlashAttention-4 targeting Blackwell GPUs is infrastructure news that matters to anyone building inference at scale. Filed for a technical deep-dive.
Running Tallies
Week 2 of "programming is unrecognizable." Karpathy's claim. Stack Overflow traffic is down. Hiring data shows no structural change. We're watching.
Week 6 of "the era of physical AI." Jensen Huang's CES declaration. The robots still have people inside them.
SaaSpocalypse body count: Zero confirmed SaaS companies dead. Several stocks recovering. One new model launch (GPT-5.4) failed to trigger a repeat. The apocalypse may have been a one-week event.
DeepSeek V4 launch countdown: Week 3 of "launching this week." Three missed windows and counting.
Researchers proved AI models fake their own thinking, a company got blacklisted for having ethics, and the most-hyped model launch of the month was graded by its own creator. The future is peer-reviewed by the peer who wrote the paper.