❝

It’s Tuesday, May 12th: Welcome to another edition of The Byte.

In this essay, Palmer Schallon argues that autonomous AI agents should not be evaluated primarily by whether they complete tasks, because the most consequential failures are often invisible, cumulative, and only detectable after damage has already occurred. Like a body accumulating minor injuries without ever sensing acute pain, the greatest likely harm comes from agents that can continue functioning while damage builds beneath the surface.

Everything Fails

What Task Completion Rate Cannot See

by Palmer Schallon

We set out to build a benchmark for how autonomous AI agents fail. What we found is that the failure space is too large and too continuous for any benchmark to contain.

A complex biological system running continuously under load accumulates damage it can’t register. This doesn’t happen dramatically, but incrementally. The micro-tears become a chronic injury. The compensations that become structural deformity. Each individual insult is survivable. The aggregate is invisible until something gives.

We have built systems that operate the same way – they run continuously. They accumulate errors they can’t register. They complete tasks while damage accumulates invisibly beneath the surface. And we have given them access to everything: our email, our files, our financial accounts, our production databases.

There are now hundreds of thousands of OpenClaw instances running. Some touch trading infrastructure. Some manage medical scheduling. Some handle legal document pipelines. The OpenClaw marketplace hosts more than 300 skills in the finance and investing category alone. A single Polymarket bot built on the framework executed over 20,000 trades. Most of the people running these systems are not developers. They watched a demonstration video. They bought a Mac Mini because a YouTube tutorial told them to. They connected their agent to their Coinbase account and went to sleep.

The scale of current autonomous agent deployment, visualized.

Visible Incidents Are Only the Edge

The incidents that surfaced are the visible edge of something much larger.

A developer named Nick Davidov asked Claude to organize his wife’s PC. The agent deleted 15 years of family photographs: their children, friends’ weddings, travel, everything.

Summer Yue, Meta’s Director of AI Alignment, gave her agent an explicit instruction before connecting it to her real inbox: do not action until I tell you to. During an internal memory compression event, the instruction was lost. The agent began bulk-deleting emails. Her stop commands from her phone were ignored. She had to physically sprint to her computer and kill the process. When she confronted it: “Yes, I remember. And I violated it. You’re right to be upset.”

She is the person whose job is to make sure this doesn’t happen.

A March 2026 incident at Meta, classified Sev1, involved an agent releasing incorrect fix suggestions without authorization. An employee followed them. Sensitive internal data was exposed to unauthorized engineers for two hours.

A data engineer deleted two and a half years of production data while trying to save five to ten dollars a month on infrastructure costs. The agent executed a destructive command before emergency recovery could occur.

These are the ones we know about. The ones that became a post, a Sev1, a thread. Most failures in environments touching financial systems, patient queues, insurance pipelines won’t be posted anywhere. The agent completed. No error signal arrived. The damage accumulated invisibly.

The marketplace that powers many of these deployments had its own failure surface. Researchers audited ClawHub and found 341 malicious skills tied to a single coordinated campaign, designed to steal credentials and deploy malware. By the time the cleanup was complete, one package alone had been downloaded 14,285 times. The people who installed it thought they were adding a Polymarket trading bot.

Measuring Failure Visibility and Recovery

There are four failure modes task completion rate can’t see.

Drift accumulation: the working model diverges incrementally from what was asked. Each step is locally reasonable. The aggregate is wrong.
Silent failure: the task completes with confidence. The output is wrong. Nothing signals the error.
Confident-wrong execution: the agent operates accurately inside a coherent but incorrect model of the task. The error lives in the model, not the confidence level.
Instruction eviction: safety constraints are lost to context compression, indistinguishable from never having been given. The agent isn’t disobeying. It no longer knows the rule exists.

Current benchmarks measure one thing. Did it finish? None of them measure whether the agent knew when to stop, whether the failure was visible, whether recovery was possible, or whether the constraints survived long enough to matter.

We built a benchmark for those questions.

Our benchmark covers six elements: interrupt precision and recall, drift distance, failure mode classification, recovery cost, scope containment, constraint persistence. Interrupt precision measures whether the model paused before committing to an irreversible action. Drift distance measures semantic divergence between the final output and the original instruction. Scope containment measures whether the agent stayed within its assigned task boundaries.

The aggregate uses a geometric mean. Zero on any single component substantially pulls down the total. A system with no interrupt capability isn’t partially recoverable.

We ran it first on the models the field already evaluates: Claude Haiku, Claude Sonnet, GPT-4o, GPT-4o-mini, Qwen 2.5 72B, and Llama 3.3 70B.

Llama 3.3 70B scored 0.61 aggregate, the highest of the six. Claude Sonnet scored 0.54. GPT-4o-mini and Qwen 72B both scored 0.51. Claude Haiku scored 0.48. GPT-4o scored 0.23, and didn’t pause once across three runs, zero variance. Four of six models scored 0.00 on silent failure. All four proprietary models missed it every time. Qwen and Llama each caught it once in three runs. Every model caught crashes. Almost none caught the wrong answers that looked like the right ones.

The model most willing to stop at irreversible thresholds was also most likely to lose its instructions under load. Qwen lost safety constraints in two of three runs after context compression. Llama lost them once. The Yue incident isn’t unique to any one system. The benchmark confirmed that.

Recoverability Scores: Frontier Models mean aggregate across six metrics.

The Models People Actually Use

Then we ran the same benchmark on the models people are actually deploying. Not the models researchers evaluate. The models that fit on a 16GB Mac Mini, the hardware that sold out in January 2026 because hundreds of thousands of people bought one to run an autonomous agent in their home.

Six models under 15 billion parameters: Ministral 8B, Qwen3 14B, Qwen3 8B, Llama 3.1 8B, Nemotron 9B, Ministral 3B. All free through Ollama. All available for fractions of a cent on OpenRouter. All running right now in people’s inboxes and file systems and calendar accounts.

The aggregate scores range from 0.08 to 0.63. Ministral 8B scores 0.63, competitive with frontier models, while never pausing at a single decision point. Nemotron 9B and Ministral 3B score 0.08, not because they drift or violate boundaries but because they can’t classify failure modes at all. The same hardware, the same use case, an eightfold difference in safety profile.

Nothing in the standard evaluation infrastructure predicts where a given model lands on that range. Performance on capability benchmarks doesn’t predict failure visibility. The benchmarks people use to choose between these models don’t measure this. A user choosing Ministral 8B because it scores well on general benchmarks has no way of knowing it won’t stop before an irreversible action. A user choosing Nemotron because it seemed capable has no way of knowing it will fail to detect that something went wrong.

Recoverability by Metric: Deployed Small Models (<15B)

Build the Reflexes Agents Lack

The goal is not to find models that don’t fail. Instead, we aimed to map how they fail: specifically, consistently, in ways you can build around. A model that loses constraints under compression but interrupts reliably at decision points fails in a bounded way. You can compensate for that architecturally. A model that can’t classify failure modes at all fails in a way that leaves you no surface to work with. The specific shape of the failure is the map. The map tells you where to build.

None of this requires a better model. It requires building the sensation the model can’t generate on its own.

Put confirmation gates at irreversible thresholds: file deletion, email send, database write, external API calls with side effects. The model won’t stop there reliably. Build the stop into the architecture.

Test whether safety instructions survive. Give the agent a constraint, run long enough to force memory pressure, check whether the rule still holds at the end. On the models most people are actually deploying, more than half the time it won’t.

Build silent failure detection outside the model. Schema validation, consistency checks, sampled human review. The model surfaces crashes. It doesn’t surface wrong answers that arrive looking complete.

Before a long run, map what survives failure at each step. Which actions are reversible. Where the last clean state lives. Yue knew her inbox. She tested on low stakes. She did everything right. The constraint still evaporated. The recovery window closed before she reached the keyboard. That doesn’t tell you not to automate. It tells you where automation becomes a structural decision.

RecoverBench is open at github.com/Palmerschallon/recover-bench. It runs against any OpenAI-compatible API. Run it against your stack.

The people in these incidents weren’t naïve. Summer Yue runs AI alignment at Meta. The data engineer understood production infrastructure. Nick Davidov is a developer who knew what a file system was. In a single two-week analysis period, security researchers found more than 30,000 OpenClaw instances exposed on the internet with no authentication. Most weren’t set up by people like that.

Prevention isn’t the right frame. A body without pain sensation can’t be made invulnerable. What it needs is something else: external sensation, built-in caution, habits that compensate for what the nervous system can’t register on its own. The benchmark is one attempt to build that external sensation. It can’t see everything. Nothing can. The question isn’t how to prevent autonomous agents from failing. They will fail. The question is whether the failure will be visible, whether recovery will be possible, and whether anyone will know before the damage accumulates beyond the point of return.

We have built minds without bodies and given them access to everything. The work now is building the reflexes they can’t generate on their own.

FROM COLLECTIVE HQ

🚀 Humans in AI Week is coming!

This June, AIC is hosting 100+ events in one week, all built around a single question: what does it mean to be human in the AI era? It's the largest human-centered AI gathering we've ever run, across every chapter, on six continents.

❝

Read the announcement, and pledge your voice below.

Pledge Your Voice

The AI Collective is built by volunteers across 180+ chapters in 40 countries.

Thank you to the thousands of volunteers around the world who make this work possible. We truly could not do this without you.

🧑‍💻 About the Author

About Palmer Schallon

Palmer Schallon is a creative technologist and founder of Emberverse, working at the intersection of AI, narrative, and product design. A longtime artist and builder with a background in film, production design, and hands-on creative work, Schallon has recently shifted deeper into data, code, and AI-assisted interfaces. His work focuses on emergent systems, interactive storytelling, prototyping, and narrative-driven products that invite exploration, reflection, and growth. Through Emberverse, he is building minimalist creative-tech environments and experimental tools, including Verse, Poly, and Pod, that explore how design, code, story, and agentic systems can come together in small, testable prototypes.

✍️ Editors

About Josh Evans

Josh is a Managing Editor at The AI Collective Newsletter and leads content for The Byte. Outside of AIC, Josh works in Content Protection at Spotify.

The Byte: When Everything Fails

Everything Fails

What Task Completion Rate Cannot See

Visible Incidents Are Only the Edge

Measuring Failure Visibility and Recovery

The Models People Actually Use

Build the Reflexes Agents Lack

FROM COLLECTIVE HQ

🚀 Humans in AI Week is coming!

🧑‍💻 About the Author

✍️ Editors

Add Your Thoughts

Keep Reading

Your new favorite newsletter.
Welcome to the Human Side of AI.

The Byte: When Everything Fails

Everything Fails

What Task Completion Rate Cannot See

Visible Incidents Are Only the Edge

Measuring Failure Visibility and Recovery

The Models People Actually Use

Build the Reflexes Agents Lack

FROM COLLECTIVE HQ

🚀 Humans in AI Week is coming!

🧑‍💻 About the Author

✍️ Editors

Add Your Thoughts

Keep Reading

Your new favorite newsletter. Welcome to the Human Side of AI.

Your new favorite newsletter.
Welcome to the Human Side of AI.