Humans Are Unreliable Agents
Last week I asked Opus 4.6 to look up something in its own documentation. Anthropic's docs, for Anthropic's (at the time) best model. It gave me the wrong answer, confidently, twice.
There may have been a number of factors at play, but at the end of the day it's an issue many people have dealt with: an AI system giving an answer that sounds right but isn't. That's a big problem as more knowledge work gets passed off to agents. However, it's not necessarily a new problem. Humans are probabilistic engines too. Our brains evolved to pattern match efficiently enough to get the job done and survive.
Most knowledge work today follows a pattern. Experiment analysis. Quarterly reporting. Code review. There's a template, there's new context each round, and there's expected output. The people doing it are experienced. You'd expect reliability, but you only get it some of the time.
A single developer misses roughly half the bugs in their own code. When NeurIPS ran the same papers through two independent review committees in 2014, the committees disagreed on roughly a quarter of them. [Nick Leeson collapsed a 233-year-old bank](https://en.wikipedia.org/wiki/Barings_Bank) not because he was uniquely dishonest, but because no one was structurally required to check his work. When the WHO rolled out a 19-item surgical checklist across eight hospitals in 2009, the death rate dropped 47%. All from a piece of paper that assumed surgeons make mistakes.
This is baseline behavior from probabilistic agents operating under normal conditions. Surgery, aviation, production code. These are high-stakes environments where work is carried out by carefully selected experts, yet the way we got them working as well as they do today was not to train better humans but to add the proper scaffolding around those humans.
We're now at a similar point with AI. Even if all progress on frontier models and harness engineering stops today, we have the tools to dramatically increase agent reliability in knowledge work. They are the same tools we've been using to create more reliable human systems for centuries.
Two things I'm treating as given here: better context improves a model's output, and limiting what a system can access bounds the blast radius of any mistake it makes. Both matter, but here I'm focusing on what happens beyond context and system access.
Three approaches.
Deterministic checks
The cleanest case. Anything with a verifiable correct answer should be verified by something that can't be wrong. A calculator doesn't hallucinate arithmetic. A type-checker doesn't misremember what a string is. When you can specify what "correct" looks like in advance, you don't need a probabilistic reviewer at all. Software examples such as schema validators, linters, and unit tests are common, but deterministic checks go beyond computers and calculations.
Hospitals mark the surgical site before the operation starts. The mark is a deterministic check: it removes the surgeon's ability to operate on the wrong limb regardless of how tired or distracted they are.
Most knowledge work has more of these than people realize. How would you codify the proverbial "sniff test"?
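One answer is to turn it into checks that either pass or fail, with no judgment involved. A minimal sketch, assuming a quarterly metrics pull where the column names and allowed ranges are hypothetical stand-ins for whatever your data actually contains:

```python
# Codifying a "sniff test" for a quarterly metrics pull as deterministic checks.
# Column names and allowed ranges are hypothetical placeholders.

def sniff_test(rows: list[dict]) -> list[str]:
    """Return a list of failures; an empty list means the data passes."""
    failures = []
    required = {"date", "users", "conversion_rate"}

    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            failures.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if row["users"] < 0:
            failures.append(f"row {i}: negative user count {row['users']}")
        if not 0.0 <= row["conversion_rate"] <= 1.0:
            failures.append(f"row {i}: conversion rate {row['conversion_rate']} outside [0, 1]")

    return failures


if __name__ == "__main__":
    sample = [
        {"date": "2025-01-01", "users": 1200, "conversion_rate": 0.034},
        {"date": "2025-01-02", "users": -5, "conversion_rate": 1.7},
    ]
    for failure in sniff_test(sample):
        print(failure)
```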
Redundancy
For everything that isn't verifiable (judgment calls, synthesis, interpretation), you need probabilistic reviewers. We can do the math to see the effect of multiple reviewers, with some simplification. If we assume that a single reviewer independently misses a given error with probability q, then n independent reviewers all miss it with probability qⁿ. The probability of catching the error with n reviewers is therefore 1 - qⁿ.
In the table below, let's assume a single reviewer independently misses a given error 50% of the time (q = 0.5).
| Reviewers | P(all miss) | P(catch) |
|---|---|---|
| 1 | 50% | 50% |
| 2 | 25% | 75% |
| 3 | 12.5% | 87.5% |
| 5 | 3.1% | 96.9% |
| 10 | 0.1% | 99.9% |
Three reviewers nearly double the catch rate. This is why aviation requires two pilots, why peer review exists, why code review exists. Redundancy works on probabilistic systems.
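The table is just that formula evaluated at q = 0.5, which is an illustrative assumption rather than a measured number; a quick sketch makes it easy to try other miss rates:

```python
# Catch probability for n independent reviewers who each miss an error
# with probability q: P(catch) = 1 - q**n.

def catch_probability(q: float, n: int) -> float:
    return 1 - q ** n

for n in (1, 2, 3, 5, 10):
    print(f"{n:>2} reviewers: P(catch) = {catch_probability(0.5, n):.1%}")
```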
The catch here is that most "redundancy" isn't actually redundant. Asking the same model to check its own output is like asking a surgeon to recheck their own work. It's often useful, but not independent. Thinking models and extended chain-of-thought reasoning do something like this internally: a different forward pass with fresh framing catches some errors the first pass missed. Worth doing, but not the same as an independent reviewer. The formula only holds when the errors are actually independent, which brings us to a new problem.
Variance
Airbus doesn't run three copies of the same flight software. They run three different implementations, built by separate teams, sometimes in different programming languages. The principle is dissimilar redundancy: make the reviewers different enough that their failure modes don't overlap.
When a fraction of errors is systematic (same training data, same blind spots, same framing), adding more reviewers stops helping. You eventually hit a ceiling: no matter how many reviewers you have, they all miss the same issues. Running GPT-4 against the same prompt twice isn't two independent reviews. The CrowdStrike incident in 2024 is the extreme version: one faulty update to one shared codebase hit 8.5 million machines with the same failure simultaneously.
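One simple way to see the ceiling (a toy model for illustration, not a measurement of any real system): assume some fraction rho of errors is systematic and missed by every reviewer, while the rest are missed independently with probability q. Then P(all miss) = rho + (1 - rho)qⁿ, and the catch rate can never exceed 1 - rho no matter how many reviewers you add:

```python
# Toy model of correlated reviewers: a fraction rho of errors is systematic
# (every reviewer shares the blind spot); the rest are missed independently
# with probability q. The catch rate is capped at 1 - rho.

def catch_probability(q: float, n: int, rho: float) -> float:
    return 1 - (rho + (1 - rho) * q ** n)

for rho in (0.0, 0.2, 0.5):
    rates = ", ".join(f"n={n}: {catch_probability(0.5, n, rho):.1%}" for n in (1, 3, 10))
    print(f"rho={rho:.1f}  ceiling={1 - rho:.0%}  {rates}")
```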
One potential fix is to run Claude, GPT, and Gemini with different prompting strategies and different areas of emphasis. Add a deterministic layer: a type-checker's error profile is completely uncorrelated with any model's. Use a human for judgment calls where systematic AI bias is most likely. Change up the type of reviewer, the strengths and biases, and ideally you cover some gaps that a single reviewer running multiple times would have missed.
What this might look like
Experiment analysis. The pattern: pull data, calculate metrics, interpret results, write up. Repeated every quarter, novel context and output each time.
Deterministic layer first: schema validation on the data pull, sanity checks on key metrics against historical ranges, unit tests on any calculation logic. These run before any model touches the output.
Then redundancy: the model drafts the interpretation. A fresh prompt such as "what am I most likely to have gotten wrong here?" runs a second pass. Not fully independent, but it often catches real errors.
Then variance: a second model with a different prompt that focuses on finding gaps in assumptions rather than validating conclusions. This audits the first model's write-up. A human reviews the assumptions list and signs off on success criteria before anything ships.
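Sketched in code, with the caveat that `call_model`, the prompts, and the range checks are hypothetical placeholders for whatever stack you actually use, not a real API:

```python
# A rough sketch of the layered pipeline above: deterministic checks first,
# then a self-review pass, then a dissimilar reviewer, then a human sign-off.
# `call_model`, the prompts, and the range check are hypothetical placeholders.

from dataclasses import dataclass, field


@dataclass
class Review:
    draft: str = ""
    issues: list[str] = field(default_factory=list)
    approved: bool = False


def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in the client for whatever model you actually use."""
    return f"[{model} response to: {prompt[:60]}...]"


def deterministic_layer(rows: list[dict]) -> list[str]:
    """Schema and range checks that run before any model touches the output."""
    issues = []
    for i, row in enumerate(rows):
        if "metric" not in row or "value" not in row:
            issues.append(f"row {i}: missing required fields")
        elif not 0 <= row["value"] <= 10_000:  # hypothetical historical range
            issues.append(f"row {i}: value {row['value']} outside historical range")
    return issues


def run_pipeline(rows: list[dict]) -> Review:
    review = Review()
    review.issues += deterministic_layer(rows)
    if review.issues:
        return review  # fix the data before spending tokens on interpretation

    # Redundancy: the drafting model takes a second pass at its own work.
    review.draft = call_model("model-a", f"Interpret these results: {rows}")
    review.issues.append(
        call_model("model-a", f"What am I most likely to have gotten wrong here?\n{review.draft}")
    )

    # Variance: a different model audits assumptions rather than conclusions.
    review.issues.append(
        call_model("model-b", f"List unstated assumptions and gaps in:\n{review.draft}")
    )

    # A human reviews the issues list and signs off before anything ships.
    review.approved = input("Approve this write-up? [y/N] ").strip().lower() == "y"
    return review
```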
This approach is one of many. It's not perfect, and it's almost certainly out of date by the time I'm publishing this, but it's been interesting to explore.
Most of my "knowledge work" follows patterns. Goals and scope change frequently but the work is often the same shape. New context, new output, same structure.
I've started treating each iteration as a training run. The work is v1 of the loop. At the end of each session I ask myself and whichever agent I'm working with: where could a deterministic check have caught something I had to fix manually? Where was I the only reviewer? Where might different types of reviewers be helpful, whether human, agent, or something else entirely?
Ideally that loop gets tighter each time, and I get to spend more time on the creative pursuits I find interesting and less on the patterned execution.