References

Our publications

AI Testing Should Account for Sophisticated Strategic Behaviour. Kovařík, Chen, Petersen, Ghersengorin, Conitzer. NeurIPS 2025 (position paper). Introduces the non-strategic → strategic regime transition and the game-theoretic framing of evaluation. arXiv:2508.14927

Game Theory with Simulation of Other Players. Kovařík, Oesterheld, Conitzer. IJCAI 2023. Formalises evaluation as simulation games; characterises when simulation helps and when it doesn't. arXiv:2305.11261

Game Theory with Simulation in the Presence of Unpredictable Randomisation. Kovařík, Sauerberg, Hammond, Conitzer. AAMAS 2025. Extends simulation games to mixed strategies with unpredictable randomisation. arXiv:2410.14311

Recursive Joint Simulation in Games. Kovařík, Oesterheld, Conitzer. Working paper, presented at LOFT 2024. arXiv:2402.08128

Characterising Simulation-Based Program Equilibria. Cooper, Oesterheld, Conitzer. AAAI 2025. arXiv:2412.14570

Evaluation as a (Cooperation-Enabling?) Tool. Kovařík. LessWrong, December 2025. The multi-tool framing; evaluation as one instrument among commitments, reputation, penalties, and institutional design.

Situational awareness and evaluation awareness

Berglund, Stickland, Balesni, Kaufmann, Tong, Korbak, Kokotajlo, Evans. Taken Out of Context: On Measuring Situational Awareness in LLMs. 2023. arXiv:2309.00667

Laine, Chughtai, Betley, Hariharan, Scheurer, Balesni, Hobbhahn, Meinke, Evans. Me, Myself and AI: The Situational Awareness Dataset (SAD) for LLMs. NeurIPS 2024. arXiv:2407.04694

Phuong, Zimmermann, Wang, Lindner, Krakovna, Cogan, Dafoe, Ho, Shah. Evaluating Frontier Models for Stealth and Situational Awareness. 2025. arXiv:2505.01420

Fan, Zhang, Pan, Yang. Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems. 2025. arXiv:2505.17815

Ivanov. What is an Evaluation and Why this Definition Matters. LessWrong.

Ivanov. Call for Science of Eval Awareness Research Directions. LessWrong.

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs. LessWrong.

Scheming, alignment faking, and deception

Meinke, Schoen, Scheurer, Balesni, Shah, Hobbhahn. Frontier Models Are Capable of In-Context Scheming. 2025. arXiv:2412.04984

Greenblatt, Denison, Wright, Roger, MacDiarmid, Marks, Treutlein et al. Alignment Faking in Large Language Models. 2024. arXiv:2412.14093

Sheshadri, Hughes, Michael, Mallen, Jose, Roger et al. Why do Some Language Models Fake Alignment While Others Don't? 2025. arXiv:2506.18032

Hubinger, Denison, Mu, Lambert, Tong, MacDiarmid, Lanham et al. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training. 2024. arXiv:2401.05566

Baker, Huizinga, Gao, Dou, Guan, Madry, Zaremba, Pachocki, Farhi. Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. 2025. arXiv:2503.11926

Sharkey. Circumventing Interpretability: How to Defeat Mind-Readers. LessWrong.

AI control

Greenblatt, Shlegeris, Sachan, Roger. AI Control: Improving Safety Despite Intentional Subversion. 2024. arXiv:2312.06942

Bhatt et al. Ctrl-Z: Controlling AI Agents via Resampling. Redwood Research, 2025.

Griffin, Thomson, Shlegeris, Abate. Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols. 2024. arXiv:2409.07985

Taylor et al. Auditing Games for Sandbagging. December 2025.

Safety cases

Balesni, Hobbhahn, Lindner, Meinke, Korbak, Clymer, Shlegeris, Scheurer, Stix, Shah, Goldowsky-Dill, Braun, Chughtai, Evans, Kokotajlo, Bushnaq. Towards Evaluations-Based Safety Cases for AI Scheming. 2024. arXiv:2411.03336

Buhl, Sett, Koessler, Schuett, Anderljung. Safety Cases for Frontier AI. 2024. arXiv:2410.21572

Clymer, Gabrieli, Krueger, Larsen. Safety Cases: How to Justify the Safety of Advanced AI Systems. 2024. arXiv:2403.10462

Goemans, Buhl, Schuett, Korbak, Wang, Hilton, Irving. Safety Case Template for Frontier AI: A Cyber Inability Argument. 2024. arXiv:2411.08088

Bowen et al. AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations. 2025. arXiv:2503.17388

Cooperative AI and game theory

Conitzer, Oesterheld. Foundations of Cooperative AI. AAAI 2023.

Oesterheld. Robust Program Equilibrium. Theory and Decision, 86:143–159, 2019.

Chen, Ghersengorin, Petersen. Imperfect Recall and AI Delegation. Global Priorities Institute, Working Paper 30-2024.

Making Deals with Early Schemers. LessWrong.

Evaluation methodology

Recht, Roelofs, Schmidt, Shankar. Do ImageNet Classifiers Generalize to ImageNet? ICML 2019. arXiv:1902.10811

Measuring What Matters: Construct Validity in Large Language Model Benchmarks. Oxford Internet Institute et al. NeurIPS 2025.

Both Eyes Open: Vigilant Incentives Help Auditors Improve AI Safety. Journal of Physics: Complexity, 2025.

Soares. On How Various Plans Miss the Hard Bits of the Alignment Challenge. LessWrong.