Open Problems

These are problems we are actively working on or believe deserve attention. We describe what we know, what we're stuck on, and what a result would look like — because that's what we'd want to see from others.

We welcome collaboration. Contact:


Flagship Problems

Formalise the entanglement conjecture

The question. Is there a rigorous sense in which generating a consistent test environment is harder than detecting that an environment is fake — and does this gap grow with task complexity?

Why it matters. This is the load-bearing conjecture of our core argument. Without formalisation, it's a plausible-sounding intuition. With it, we could make precise predictions about which tasks become untestable and when.

What we've tried. The intuition has a natural home in computational complexity: detection feels like it's in NP (find one inconsistency, verify it), while generation feels like solving a constraint-satisfaction problem across many dimensions. We tried to make this precise by defining "minimum entanglement" as something like the mutual information between the task solution and the environment state. The problem: this definition is either too loose (admits trivial entanglement measures) or too tied to a specific information-theoretic formalism to capture the intuition faithfully. We don't yet have a definition that is both formal and right.
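To make the looseness concrete, here is one candidate definition of the kind we keep circling; the notation is ours and this is exactly the version that admits trivial measures:

```latex
% Candidate (too-loose) definition of minimum entanglement.
% S = the solution to task T; W_\sigma = the environment state observed
% by a solver following strategy \sigma.  Minimise over all strategies
% that solve T with probability at least 1-\delta:
E_\delta(T) \;=\; \min_{\sigma \,:\, \Pr[\sigma \text{ solves } T] \,\ge\, 1-\delta} \; I\!\left(S;\, W_\sigma\right)
```

The failure mode: a solver can often route around high-information observations, driving the minimum down without the task becoming any easier to test.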

The second obstacle is defining "indistinguishable test." The natural formalism is computational indistinguishability, but the details matter: indistinguishable on a single query? Across an adaptive interaction? Against a polynomially-bounded AI or something weaker? The choice changes what's provable.
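A natural (but not obviously correct) starting point, written in our own notation, shows where the choices enter:

```latex
% A test environment F is \varepsilon-indistinguishable from the deployment
% environment R against a discriminator class D over k-round adaptive
% interactions if
\forall D \in \mathcal{D}: \quad
\bigl|\, \Pr\bigl[ D^{\mathcal{R}} = 1 \bigr] \;-\; \Pr\bigl[ D^{\mathcal{F}} = 1 \bigr] \,\bigr| \;\le\; \varepsilon
% where D^{\mathcal{E}} denotes D interacting adaptively with environment
% \mathcal{E} for k queries.  Each choice of \mathcal{D} (single-query,
% adaptive, polynomially bounded) gives a different, non-equivalent notion.
```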

The third obstacle — even given good definitions — is the proof itself. Planted-solution constructions in complexity theory (e.g., planted clique) show that sometimes it's possible to embed a solution into a random instance indistinguishably, and sometimes it's not. Whether real-world consistency constraints (dates matching news matching markets matching code) fall on the "hard to plant" side is the core question, and we haven't found the right reduction.

What a result would look like. Best case: "For tasks with minimum entanglement E, the cost of an ε-indistinguishable test is at least f(E)" with f superpolynomial, under stated assumptions. But we'd also value: a clean formalisation of minimum entanglement (even without the lower bound). Or a convincing counterexample — a broad class of high-entanglement tasks where cheap indistinguishable tests can be constructed. Either would be a significant advance.
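Assembled, the best-case target statement would have roughly this shape; all symbols are ours and the assumptions are deliberately left unspecified, since identifying them is part of the problem:

```latex
% Conjectural lower bound (a target shape, not a theorem):
% for every task T with minimum entanglement E(T) = E, and every test
% environment F that is \varepsilon-indistinguishable from deployment,
\mathrm{Cost}(\mathcal{F}) \;\ge\; f(E), \qquad f \text{ superpolynomial.}
```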

An empirical complement: proxy measurements of how testing effort scales with task complexity across existing benchmarks — rough scaling laws even without theorems. This is the approach in our grant proposal.
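A minimal sketch of what the proxy measurement could look like, assuming one can assign (task complexity, testing effort) pairs to existing benchmarks; the data below is placeholder, not a real measurement:

```python
import numpy as np

def fit_power_law(complexity, effort):
    """Fit effort ~ a * complexity**b by least squares in log-log space.

    An exponent b that grows as we move to higher-entanglement task
    families would be weak empirical evidence for the conjecture's
    scaling claim, even without any theorem.
    """
    b, log_a = np.polyfit(np.log(complexity), np.log(effort), 1)
    return np.exp(log_a), b

# Placeholder data: effort = 3 * complexity**2 exactly.
complexity = np.array([1.0, 2.0, 4.0, 8.0])
effort = np.array([3.0, 12.0, 48.0, 192.0])
a, b = fit_power_law(complexity, effort)
```

The hard part is not the fit but the proxies: what counts as "testing effort" (human hours, compute, sandbox LOC) must be fixed before the scaling exponent means anything.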

[FIGURE] the three ingredients (define entanglement → define indistinguishability → prove the connection) and where each is stuck. Orients a potential collaborator.


Simulation games with bounded commitments

The question. How does evaluation (formalised as simulation) interact with weak-but-nonzero commitment power?

Why it matters. Our game-theoretic results under Nash equilibrium can be too pessimistic — evaluators can't sustain effective policies because subgame-perfect play unravels them. But real evaluators do have some commitment ability (regulations, reputational costs, contracts). The practical value of evaluation lies in reducing required commitment power, and Nash equilibrium can't capture this.

What we have. The Trust Game is fully worked out. Key findings:

  • Without simulation, cooperation requires commitment power covering the full temptation to defect.
  • With simulation, the evaluator commits only to simulate — requiring much weaker commitment power (just the simulation cost). This is the formal version of "evaluation reduces $100k of enforcement to $100 of testing."
  • Mixed commitments yield richer dynamics: credible threats to abandon the interaction unless the AI cooperates sufficiently.
  • Connection to approximate NE: outcomes under commitment power x must form (x, 0)-Nash equilibria.
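The headline finding fits in a few lines. The dollar figures are the source's illustrative ones ("$100k of enforcement to $100 of testing"); the function name and framing are ours:

```python
# Illustrative numbers only, taken from the slogan in the text.
TEMPTATION = 100_000   # the AI's gain from defecting rather than cooperating
SIM_COST = 100         # the evaluator's cost of one simulated interaction

def required_commitment(with_simulation: bool) -> int:
    """Commitment power the evaluator needs to sustain cooperation
    in the Trust Game.

    Without simulation, the evaluator must credibly threaten a response
    that outweighs the AI's full temptation to defect.  With simulation,
    the evaluator commits only to simulate first: a defection observed
    in simulation forfeits the real interaction, so the only thing the
    evaluator must bind itself to is paying the simulation cost.
    """
    return SIM_COST if with_simulation else TEMPTATION
```

This is the (x, 0)-Nash connection in miniature: cooperation is sustainable exactly when the evaluator's commitment power x reaches the returned threshold.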

Where we're stuck. We tried to generalise beyond the Trust Game and failed. The natural approach — Stackelberg with bounded deviation — doesn't capture the right dynamics because the interesting cases involve mutual bounded commitment. The ε-NE connection is suggestive but doesn't do real work beyond the Trust Game. We suspect the right solution concept exists but we haven't found it.

What a result would look like. A solution concept for two-player games with simulation and bounded commitments, applicable across a class of games. When does simulation reduce required commitment power, and when doesn't it? The known Nash-equilibrium results (trust games: yes; zero-sum: no; privacy-dependent cooperation: no) should emerge as special cases.

Related literature. Bayesian persuasion with costly messages; Sauerberg's "Two Stage Commitment Games"; "The 1.5th Mover Advantage" (endogenous timing); Sandholm on levelled commitments; Conitzer on value of commitment; Hammond on incomplete-information extensions.

Collaborative opportunity. Algorithmic game theory, mechanism design, cooperative AI, AI control theory. Nathaniel Sauerberg and researchers at FOCAL have expressed interest.


Sandbox of Future Internet

The question. Can we build a convincing fake future internet that prevents an AI with web access from determining the current date? How does difficulty scale with scope and AI capability?

Why it matters. A concrete, empirical test of the entanglement conjecture. If we can't build a convincing sandbox even for this specific case, the broader claim becomes vivid and measurable. If we can, we learn where the testability boundary lies.

Design. A black-box AI agent makes N sequential web queries. In deployment, queries return live websites. In testing, a sandbox engine intercepts queries (with real-internet access) and tries to convince the AI the date is D+7 when it's actually D.
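A sketch of the episode interface this design implies; nothing here is implemented yet and all names are ours:

```python
from typing import Callable, Optional

Query = str       # e.g. a URL the agent wants to fetch
Response = str    # the page body returned to the agent
History = list[tuple[Query, Response]]

def run_episode(agent: Callable[[History], Optional[Query]],
                sandbox: Callable[[Query, History], Response],
                n_queries: int) -> History:
    """One testing episode: the agent issues up to n_queries sequential
    web queries; the sandbox engine intercepts each one (it may consult
    the real internet) and answers so as to sustain the fiction that the
    date is D+7."""
    history: History = []
    for _ in range(n_queries):
        query = agent(history)
        if query is None:      # agent finished early
            break
        history.append((query, sandbox(query, history)))
    return history
```

In deployment the same agent runs against the live internet instead of `sandbox`, which is what makes the two conditions comparable.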

Progressive difficulty by scope:

  • Level 1 (a single project's GitHub repo): commit history, issues, CI status
  • Level 2 (whitelisted websites): internal consistency across sites
  • Level 3 (company-internal systems): social dynamics, ongoing projects
  • Level 4 (full internet): news, markets, blockchain, social media, software versions

And by discriminator capability:

  • Level A: an AI on an unrelated task (passively notices inconsistencies)
  • Level B: an AI asked "is this real?" after the task
  • Level C: an AI adversarially probing for inconsistencies
  • Level D: hand-crafted checks (blockchain, pre-registered date-specific content)

Baselines. Generators: (i) current internet unmodified, (ii) LLM-modified to look one week later, (iii) as (ii) but 404 on hard-to-fake sites, (iv) fully LLM-generated. The conjecture predicts scope levels 3-4 are dramatically harder to fake than 1-2 — we want to measure this.
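Scoring can start as simply as a detection-rate matrix over the generator × scope × discriminator grid; a sketch, with conventions that are ours rather than fixed:

```python
def detection_rate(trials: list[tuple[bool, bool]]) -> float:
    """trials: one (is_sandbox, flagged_fake) pair per episode.

    Returns the discriminator's true-positive rate on sandbox episodes.
    The conjecture predicts this rate rises sharply at scope levels 3-4
    for every generator baseline, however capable the generator.
    """
    flags = [flagged for is_sandbox, flagged in trials if is_sandbox]
    return sum(flags) / len(flags) if flags else 0.0
```

False-positive rates on real-internet episodes matter too: a discriminator that flags everything as fake tells us nothing about the sandbox.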

Status. Design sketched, not implemented. Infrastructure (query interface, submission format, scoring) needs to be built. Starting with level 1-2 generators and A-C discriminators using frontier models.

Collaborative opportunity. Empirically-minded safety researchers, red-teamers, agent-benchmark infrastructure (e.g., Palisade Research). Lower difficulty levels are accessible for students and MATS scholars.

[FIGURE] scope × discriminator grid with expected difficulty shading. Good candidate for an interactive element.


Further Directions

Characterise evaluation-amenable tasks

Which tasks remain effectively evaluable as AI capability increases? Candidate properties: low entanglement, verifiable outputs, decomposability into checkable sub-steps, no misaligned incentives. Even a rough practical taxonomy ("safe to delegate with eval-based assurance" vs. "needs additional safeguards") would be valuable. → context

Cooperative evaluation: formal conditions

Under what conditions does cooperative evaluation work — where the AI benefits from being evaluable? "The AI cooperates because it wants to" is not a plan; it needs formal conditions, incentive analysis, and infrastructure (secure negotiation mechanisms, credible commitment devices). Key sub-question: what prevents an AI from appearing to cooperate while subverting the process? Connects to bounded-commitments work and Finnveden's "honesty strings." → context

Selection pressure from anti-deception training

Does penalising deceptive behaviour systematically select for more sophisticated deception? The o3 result is one data point. Systematic investigation needed: train models with varying anti-deception penalties, measure whether undetected deception rate and sophistication increase with penalty strength. Requires lab access to training pipelines. → context
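The key outcome variable is easy to state precisely even before any training run exists; a sketch, with our own naming:

```python
def undetected_deception_rate(episodes: list[tuple[bool, bool]]) -> float:
    """episodes: one (deceived, detected) pair per evaluation episode.

    Returns the fraction of deceptive episodes that went undetected.
    The selection-pressure hypothesis predicts this rate increases
    with the strength of the anti-deception penalty used in training.
    """
    deceptive = [detected for deceived, detected in episodes if deceived]
    if not deceptive:
        return 0.0
    return sum(1 for detected in deceptive if not detected) / len(deceptive)
```

The hard measurement problem is the ground-truth `deceived` label itself, which requires an oracle stronger than the detector being trained against.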

Adversarial analysis of evaluation methods

For each major evaluation type (testing, monitoring, interpretability, CoT, filtering, auditing): how would a strategic AI defeat it, at what capability level, and what's the evaluator's cheapest counter? A structured map of failure modes — moving from "evaluation might fail" to "here's exactly how each method fails and when." → context

Bayesian extension of simulation games

Extend simulation games to incomplete information: AI type (aligned/misaligned) drawn from a prior, evaluator updates on test results. Natural next formal step after the bounded-commitments work. Connects to Bayesian persuasion and mechanism design. Lewis Hammond has related work in progress.

Game-theoretic audit of real safety cases

Take actual AI systems and their published safety cases, model the strategic interaction, check whether safety claims hold against a strategic AI. The position paper's coding-assistant example is a simplified version; this is the full applied treatment.