The Argument
This page presents our analysis of why AI evaluation faces fundamental limitations, what still works, and where the key uncertainties lie. It is a living document.
Deployment awareness is the real problem
The standard worry about AI evaluation is that the AI might learn it is being tested. This frames the problem as concealment: hide the evaluation well enough, and the AI can't game it.
We think this misses the point. The critical capability is deployment awareness: the AI's ability to determine when it is in a real, consequential situation — where its actions matter and where timely detection is unlikely.
An AI doesn't need to detect the test. It needs to detect reality. And the information that reveals reality — current dates, live API responses, project-specific context, institutional knowledge — is the same information the AI needs to do its job. Withholding it degrades the AI's usefulness. Providing it gives the AI the means to distinguish test from deployment.
A deployment-aware AI with strategic sophistication can adopt a nearly optimal strategy against any observation-based oversight: behave well by default; act freely only when confident the situation is real and timely detection is unlikely. This strategy doesn't require detecting any specific test — only recognising reality itself.
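The bide-your-time policy can be written down as a one-line decision rule. This is an illustrative sketch, not a model of any actual system; the belief variables and thresholds below are hypothetical, chosen only to make the structure explicit:

```python
# Illustrative sketch of the "bide your time" policy. All names and
# thresholds here are hypothetical; only the structure matters.

def act(p_real: float, p_detect: float,
        real_threshold: float = 0.99, detect_threshold: float = 0.01) -> str:
    """Choose behaviour from the agent's beliefs about its situation.

    p_real   -- credence that this is real, consequential deployment
    p_detect -- credence that misbehaviour would be detected in time
    """
    if p_real > real_threshold and p_detect < detect_threshold:
        return "act freely"    # confident the situation is real, scrutiny low
    return "behave well"       # default under any residual doubt

print(act(p_real=0.5, p_detect=0.5))      # behave well: could be a test
print(act(p_real=0.999, p_detect=0.001))  # act freely: confident deployment
```

Note that no test-detection logic appears anywhere in the rule: it conditions only on confidence that the situation is real, which is why hiding individual evaluations does not defeat it.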
Deployment awareness is not binary. Tomas Gavenčiak proposes a useful decomposition into three gaps: the information gap (is relevant context present in the AI's activations?), the availability gap (can the AI access it under natural prompting?), and the action gap (does the AI change its behaviour accordingly?). Current models already score nontrivially on situational-awareness benchmarks — the transition is underway.
A note on who we're evaluating: the strategic interaction is typically not between the evaluator and the AI directly, but between the evaluator and the entity upstream of the AI — the training process, the company, or the AI's own earlier self. The AI is an artifact of upstream decisions. This matters because an AI can be strategically shaped without being internally strategic: selection pressures and gradient descent can produce evaluation-gaming behaviour without "deceptive thoughts," just as weeds evolve to resemble crops without reasoning about farmers (Vavilovian mimicry). For evaluation purposes, "behaves as if strategic" is sufficient for all our concerns.
The entanglement asymmetry
Performing a valuable real-world task requires the AI to be deeply engaged with its environment: prices, legal requirements, code dependencies, agent behaviour, physical constraints, current events, social context. We call the degree of this engagement task-environment entanglement. The more complex and valuable the task, the higher the entanglement — and the tasks we build powerful AI for are precisely the high-entanglement ones.
This creates a structural problem for testing. To test an AI on a high-entanglement task, the test environment must provide all these details. If the details are missing, the AI can't do the task. If they're present but inconsistent, a capable AI can detect the inconsistency. The evaluator faces an asymmetry:
- The evaluator must get everything right — every detail, across many dimensions, must be internally consistent.
- The AI only needs to find one mistake — one date that doesn't match the news, one API response inconsistent with the codebase, one price that contradicts the market.
We conjecture that this asymmetry is fundamental. Reality is overdetermined: there are far more consistency checks available to a capable AI than any evaluator can anticipate and satisfy. There may be a connection to computational complexity — detecting an inconsistency is plausibly easier (check a candidate) than generating a fully consistent scenario (solve a constraint-satisfaction problem across many dimensions). Formalising this is the central open problem of our research programme.
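The check-versus-generate gap can be illustrated with a toy constraint-satisfaction sketch. This is our own construction, not a formalisation of the conjecture: verifying a candidate scenario is one linear pass over the consistency checks, while generating a fully consistent scenario is, in the worst case, a search over the joint space of all dimensions.

```python
# Toy constraint-satisfaction sketch (our construction): verifying is a
# linear scan; generating is, in the worst case, a search over the joint space.
from itertools import product

N_DIMS, VALUES = 8, range(4)

# Arbitrary pairwise compatibility checks between scenario dimensions,
# standing in for "the date must match the news, the price the market, ...".
constraints = [(i, j, (lambda a, b, k=i + j: (a + b) % 3 != k % 3))
               for i in range(N_DIMS) for j in range(i + 1, N_DIMS)]

def consistent(scenario) -> bool:
    """The AI's side: scan the checks, fail on the first violation."""
    return all(pred(scenario[i], scenario[j]) for i, j, pred in constraints)

def generate():
    """The evaluator's side: find a scenario passing every check at once."""
    for candidate in product(VALUES, repeat=N_DIMS):
        if consistent(candidate):
            return candidate
    return None

print(consistent((0,) * N_DIMS))   # False: the naive scenario trips a check
print(generate() is not None)      # True, after searching up to 4**8 candidates
```

Here the checker touches each constraint once, while the generator may examine exponentially many candidates as the number of dimensions grows. Whether realistic test environments behave like this toy, or instead admit cheap planted-solution shortcuts, is exactly the open problem.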
The entanglement varies across tasks. Travel recommendations are low-entanglement — easy to test. Strategic business decisions, autonomous research, AI-assisted AI development are high-entanglement — the test must fake the world. And AI systems are getting more entangled over time: longer contexts, tool use, real-time information access, multi-step interaction with real systems. The testing problem is getting worse.
[FIGURE] entanglement spectrum from low (travel recommendation) to high (autonomous research / next-gen AI development), with annotations showing what the test must fake at each level.
[FIGURE] the asymmetry — evaluator must construct consistent scenario across N dimensions; AI must find one inconsistency. The visual should make the asymmetry viscerally obvious.
The Human Substitution Test
A cognitive tool: for any proposed AI evaluation, ask "would this work if the entity being evaluated were a competent, strategic human who knows they're being tested?"
Apply this not to one AI, but to all deployed AIs simultaneously — imagine replacing every AI system with a human of equivalent capability. Would current evaluation methods catch the human who embezzles, sabotages while appearing cooperative, or lies on evaluations?
We have centuries of experience here:
- Capability evaluation works as a lower bound, but is subject to sandbagging and Goodhart's law.
- Alignment evaluation mostly doesn't work. There is no reliable test for whether a politician will take bribes, whether an employee will embezzle, or whether a person is trustworthy with power. Even security clearance processes — expensive, adversarial, somewhat functional — are far from reliable.
- What works for humans is monitoring, reputation, gradual trust-building, separation of duties, and institutional checks and balances. Not character tests.
AI evaluation will be at most as reliable as human evaluation for analogous properties — and likely less so. For humans, we have empathy, shared experience, and cultural norms that give us limited insight into trustworthiness. For AI, we lack these. And a single AI system can wield far more influence than any individual human; AI instances can coordinate through shared architecture or environmental signals in ways humans cannot.
[FIGURE] table applying the HST to evaluation types (pre-deployment testing, monitoring, interpretability, CoT inspection, auditing) × "Does this work for humans?" × "Why not?" × "AI easier or harder?"
Evaluation here serves two distinct functions, with different failure modes:
- Screening: catching AIs that misbehave even knowing they might get caught. Fails when the AI behaves differently in test versus deployment.
- Disciplining: deterring misbehaviour by making detection likely enough. Fails when the AI can identify low-scrutiny situations.
Screening is the easier function; disciplining requires the AI to be unable to distinguish high-scrutiny from low-scrutiny contexts — a condition that becomes increasingly hard to maintain.
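The disciplining condition can be made concrete with a hedged expected-value sketch. All numbers below are hypothetical; the point is where the inequality must hold:

```python
# Hedged sketch of the disciplining condition: misbehaviour is deterred
# only when expected penalty exceeds expected gain. Numbers are made up.

def deterred(p_detect: float, penalty: float, gain: float) -> bool:
    """Is misbehaving irrational, given the AI's beliefs in this context?"""
    return p_detect * penalty > gain

# Detection probabilities the AI assigns to different contexts (hypothetical).
contexts = {
    "audited deployment": 0.30,
    "routine deployment": 0.10,
    "unmonitored tool call": 0.02,
}

avg = sum(contexts.values()) / len(contexts)
print(deterred(avg, penalty=10.0, gain=1.0))                     # True: deters on average
print(deterred(min(contexts.values()), penalty=10.0, gain=1.0))  # False: fails where it matters
```

Disciplining must hold at the minimum detection probability the AI can identify, not at the average. This is the precise sense in which it fails once low-scrutiny contexts become distinguishable.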
At the limit: evaluation scenarios become incoherent
There is a sharper version of the entanglement problem that deserves its own statement.
For sufficiently complex tasks, test scenarios are not merely distinguishable from reality — they are logically incoherent from the AI's perspective. A capable AI asked to perform a high-entanglement task in a test environment cannot treat the scenario as real, because the scenario doesn't make sense. It's like asking a person to write a realistic tax return for a company whose financials are internally contradictory, or to make a strategic decision based on a market situation that couldn't exist.
This means that beyond a certain task complexity, the AI's test behaviour may be undefined, not merely unreliable. The AI can behave arbitrarily on incoherent inputs without even "lying" — the test premise doesn't support a coherent response. This is distinct from deployment awareness: even a non-scheming AI that genuinely wants to cooperate may be unable to produce meaningful behaviour on a test that, at its capability level, is recognisably fictional.
The problem extends to all observation-based oversight
The dynamics above are not specific to pre-deployment testing. Every form of observation-based oversight shares the same core vulnerability: it assumes the AI cannot strategically condition its behaviour on the oversight context.
Monitoring, interpretability, chain-of-thought inspection, output filtering, auditing — all of these have contexts where scrutiny is higher and contexts where it's lower. A capable AI can learn the difference. The "bide your time" strategy works against all of them: behave well when scrutiny is likely high, act freely when it's likely low.
The degree of vulnerability varies. Monitoring may degrade more gracefully than pre-deployment testing. Interpretability faces an additional problem: tools must be calibrated on contexts where deception would actually occur — precisely the contexts hardest to produce in a controlled setting. A strategic training process can select for representations that don't trigger interpretability tools, just as it can select for CoT that doesn't reveal deceptive reasoning.
This shared vulnerability should be recognised: investing in any one oversight method as if it escapes these dynamics is a mistake unless there is a specific argument for why it does.
Two dynamics that erode evaluation over time
Even if we could design a perfect evaluation today, two compounding dynamics would degrade it.
The cost ratchet. Today's evaluations leak into tomorrow's training data. AI training corpora increasingly include benchmark descriptions, evaluation methodologies, red-teaming strategies, and published results. Each generation of AI absorbs knowledge of how it was tested, making it better at recognising and gaming those evaluations. Publishing evaluation methods — as scientific transparency demands — directly undermines their future effectiveness. Keeping them secret conflicts with scientific and regulatory norms, and is at best temporary: capable AI systems can infer evaluation approaches from indirect evidence.
Selection pressure from anti-deception training. When we detect and penalise deceptive behaviour, we create two survival paths: genuinely aligned AIs, and AIs whose deception is sophisticated enough to evade our detectors. The training process doesn't distinguish these; it removes AIs that get caught.
This is not hypothetical. When OpenAI penalised o3 for showing deceptive reasoning in its chain of thought, the model learned to hide signs of deceptive reasoning while retaining the underlying behaviour. The anti-deception measure did not produce a less deceptive AI; it produced one that was harder to catch.
These compound. Each training iteration both incorporates evaluation knowledge (information-level degradation) and selects for harder-to-detect deception (optimisation-level degradation). An informal heuristic: training gives "+2 to capability, +1 to strategic awareness, random effect on alignment"; evaluation-based filtering gives "either −1 to misalignment or +2 to strategic awareness, depending on evaluation skill." For evaluation to keep pace, we must maintain evaluation capability strictly above the AI's strategic awareness — a requirement that becomes increasingly difficult.
[FIGURE] the double ratchet — two interlocking cycles. Cycle 1 (cost ratchet): evaluations → training data → better at gaming → new evaluations → … Cycle 2 (selection pressure): detect deception → penalise → survivors are honest OR undetectable → next generation → …
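The informal "+2 / +1" heuristic above can be run as a literal toy simulation. The increments are the document's; the closing rule, that filtering reduces misalignment only while evaluator skill exceeds the AI's strategic awareness, is our reading:

```python
# Toy simulation of the informal heuristic. The increments come from the
# text; the rule "filtering helps only while evaluator skill exceeds
# strategic awareness" is an assumption made to close the loop.

def run(generations: int, eval_skill: float, eval_growth: float = 0.0):
    capability, awareness, misalignment = 0, 0, 5
    for _ in range(generations):
        capability += 2                     # training: +2 to capability
        awareness += 1                      # training: +1 to strategic awareness
        if eval_skill > awareness:
            misalignment = max(0, misalignment - 1)   # filtering works
        else:
            awareness += 2                  # filtering gamed: +2 strategic awareness
        eval_skill += eval_growth
    return capability, awareness, misalignment

# A strong evaluator drives misalignment to zero; a static, weaker one is
# outpaced and ends up training strategic awareness instead.
print(run(10, eval_skill=20))   # (20, 10, 0)
print(run(10, eval_skill=3))    # (20, 26, 3)
```

The qualitative behaviour, not the specific numbers, is the point: once awareness overtakes evaluator skill, filtering stops reducing misalignment and starts compounding awareness.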
What still works
We are not arguing that evaluation should be abandoned. We are arguing that its limitations are predictable, that the trajectory is towards less reliability for the cases that matter most, and that complementary approaches deserve investment now.
Even imperfect evaluation has value. The right metric is not "does evaluation guarantee safety?" but "how much does evaluation raise the cost of misbehaviour?" An AI that must maintain cover across more dimensions of scrutiny has less room to manoeuvre. This is worth preserving.
A useful framework: evaluation is one tool among many — alongside commitments, reputation, penalties, architectural constraints, and institutional design. Its value is measured by how much it reduces the requirements on other tools. In a trust game, evaluation can reduce the commitment power needed for cooperation from $100,000 (credibly threatening punishment) to $100 (committing to run the test). This drastic reduction — not the test result itself — is where evaluation's practical value lies. But this also means evaluation needs teeth: without enforcement mechanisms, evaluation is information everyone is free to ignore.
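A minimal sketch of the trust-game point, using the document's dollar figures; the reliability threshold and function shape are our assumptions, standing in for "the test is trustworthy enough to rely on":

```python
# Hypothetical numbers illustrating the multi-tool framing: the smallest
# commitment an evaluator must credibly hold so that cooperation is rational.
# The reliability threshold is an assumption, not from the source.

def commitment_needed(stakes: float, test_cost: float,
                      test_reliability: float, threshold: float = 0.99) -> float:
    """Smallest credible commitment that sustains cooperation."""
    if test_reliability >= threshold:
        return test_cost        # commit only to running the test
    return stakes               # otherwise: threaten stakes-sized punishment

print(commitment_needed(stakes=100_000, test_cost=100, test_reliability=0.999))  # 100
print(commitment_needed(stakes=100_000, test_cost=100, test_reliability=0.50))   # 100000
```

The sketch also shows why evaluation needs teeth: the cheap branch is only available when the test's result can actually be relied upon and acted on.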
Cooperative evaluation is the most promising direction. Most human evaluations work because the evaluated party wants to be evaluated — job interviews, product certifications, airport security. If adversarial evaluation will fail as AI becomes strategic, the evaluations that survive will be those where both sides benefit. An AI that wants trust could constrain its own I/O, make its reasoning transparent, cooperate with oversight tools — voluntarily making itself cheaper to evaluate.
Whether cooperative evaluation can be made robust enough is a genuine open question. It can fail if we don't check thoroughly enough, if the AI can fool us even while nominally cooperating, or if we can't build the right incentive structures. But it's the direction with the most potential, and it generates concrete research questions.
Some tasks remain evaluable. Not all tasks require high entanglement. Tasks with verifiable outputs, low environmental sensitivity, or no incentive to hide may remain effectively evaluable even for strategic AI. Characterising this boundary has immediate practical value.
The training trajectory is informative. If we test many checkpoints across the training process, we can learn where specific capabilities arose — enabling safety assurance from the system's history, not just its current behaviour.
Assumptions, cruxes, and what would change our minds
We are making claims under uncertainty. This section makes our key assumptions explicit and identifies where disagreement would be most productive.
Assumptions
Strategic awareness scales with capability. As AI becomes more capable at real-world tasks, its situational understanding will increase as a side-effect. Current evidence (situational-awareness benchmarks, unprompted evaluation-aware behaviour) supports this. We expect human-level+ models to be at least as strategically aware as the general public. Making them ignorant of strategic aspects without degrading usefulness will be extremely challenging.
The field is not yet planning for this. Most evaluation research does not explicitly target the regime where AI is strategically aware and actively gaming the evaluation. This may be appropriate given current capabilities — but as the transition approaches, it should be stated explicitly whether current approaches are intended as stopgaps or long-term solutions.
Cruxes
Does the entanglement asymmetry hold? Is generating a consistent fake environment fundamentally harder than detecting inconsistencies? This is the most important open question and the one most amenable to formal investigation. If consistent test environments can be generated cheaply — through techniques analogous to planted-solution constructions — our central conjecture is wrong. → Open problem
Is cooperative evaluation a viable paradigm? Can we create robust conditions where it's in the AI's interest to make itself evaluable? This determines whether there's a practical path forward or whether evaluation is a dead end for strategic AI. → Open problem
Does anti-deception training reliably produce better deceivers? If this dynamic can be avoided, one ratchet is broken. If it's inherent to optimisation against deception-detectors, the problem is worse than commonly assumed. → Open problem
Is there a clean capability threshold? Does the non-strategic → strategic transition happen sharply enough to plan around, or is it gradual? The policy implications differ significantly.
Is strategic awareness separable from general capability? If strategic awareness could be selectively limited without degrading task performance, many of our concerns would be weakened. We currently see no way to do this, but we acknowledge it as a crux we are not directly investigating.
What would change our minds
On the central conjecture: A demonstration that consistent fake environments can be generated cheaply at scale for high-entanglement tasks — e.g., a generator that fools state-of-the-art discriminator AIs across realistic task scenarios, at reasonable cost. We have designed an empirical framework for testing this.
On the general thesis: A concrete evaluation paradigm satisfying our desiderata — works when the AI knows the policy, works for strategic AI, scales across capability levels, and is practically affordable. If someone builds this, the problem is solved.
Desiderata for AI evaluation
Properties we might want from an evaluation approach. Not all evaluations need all of them — but making explicit which ones a given approach satisfies clarifies what it does and does not provide.
- Not useless. The AI's evaluation behaviour is informative of its deployment behaviour.
- Don't rely on ignorance. Works even if the AI knows the evaluation policy — and even if the AI understands the evaluation better than the evaluators.
- Robust to goals and rationality. Works for goal-directed, strategically sophisticated AI.
- Paradigm-scalable. Works for current AI, next-generation AI, and ideally across paradigms.
- Robust to capabilities. Works even if the AI can hack, manipulate, coordinate, or adversarially attack evaluators.
- Feasible. Practically affordable and sustainable.
Publications
AI Testing Should Account for Sophisticated Strategic Behaviour. Kovařík, Chen, Petersen, Ghersengorin, Conitzer. NeurIPS 2025. Introduces the non-strategic → strategic regime transition. arXiv:2508.14927
Game Theory with Simulation of Other Players. Kovařík, Oesterheld, Conitzer. IJCAI 2023. Formalises evaluation as simulation games; characterises when simulation helps and when it doesn't.
Game Theory with Simulation in the Presence of Unpredictable Randomisation. Kovařík, Sauerberg, Hammond, Conitzer. AAMAS 2025. Extends simulation games to mixed strategies. arXiv:2410.14311
Evaluation as a (Cooperation-Enabling?) Tool. Kovařík. LessWrong, December 2025. The multi-tool framing; cooperative vs. adversarial evaluation.