References
Our publications
AI Testing Should Account for Sophisticated Strategic Behaviour. Kovařík, Chen, Petersen, Ghersengorin, Conitzer. NeurIPS 2025 (position paper). Introduces the non-strategic → strategic regime transition and the game-theoretic framing of evaluation. arXiv:2508.14927
Game Theory with Simulation of Other Players. Kovařík, Oesterheld, Conitzer. IJCAI 2023. Formalises evaluation as simulation games; characterises when simulation helps and when it doesn't. arXiv:2305.11261
Game Theory with Simulation in the Presence of Unpredictable Randomisation. Kovařík, Sauerberg, Hammond, Conitzer. AAMAS 2025. Extends simulation games to mixed strategies with unpredictable randomisation. arXiv:2410.14311
Recursive Joint Simulation in Games. Kovařík, Oesterheld, Conitzer. Working paper, presented at LOFT 2024. arXiv:2402.08128
Characterising Simulation-Based Program Equilibria. Cooper, Oesterheld, Conitzer. AAAI 2025. arXiv:2412.14570
Evaluation as a (Cooperation-Enabling?) Tool. Kovařík. LessWrong, December 2025. The multi-tool framing; evaluation as one instrument among commitments, reputation, penalties, and institutional design.
Situational awareness and evaluation awareness
Berglund, Stickland, Balesni, Kaufmann, Tong, Korbak, Kokotajlo, Evans. Taken Out of Context: On Measuring Situational Awareness in LLMs. 2023. arXiv:2309.00667
Laine, Chughtai, Betley, Hariharan, Scheurer, Balesni, Hobbhahn, Meinke, Evans. Me, Myself and AI: The Situational Awareness Dataset (SAD) for LLMs. NeurIPS 2024. arXiv:2407.04694
Phuong, Zimmermann, Wang, Lindner, Krakovna, Cogan, Dafoe, Ho, Shah. Evaluating Frontier Models for Stealth and Situational Awareness. 2025. arXiv:2505.01420
Fan, Zhang, Pan, Yang. Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems. 2025. arXiv:2505.17815
Ivanov. What is an Evaluation and Why this Definition Matters. LessWrong.
Ivanov. Call for Science of Eval Awareness Research Directions. LessWrong.
Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs. LessWrong.
Scheming, alignment faking, and deception
Meinke, Schoen, Scheurer, Balesni, Shah, Hobbhahn. Frontier Models Are Capable of In-Context Scheming. 2025. arXiv:2412.04984
Greenblatt, Denison, Wright, Roger, MacDiarmid, Marks, Treutlein et al. Alignment Faking in Large Language Models. 2024. arXiv:2412.14093
Sheshadri, Hughes, Michael, Mallen, Jose, Roger et al. Why do Some Language Models Fake Alignment While Others Don't? 2025. arXiv:2506.18032
Hubinger, Denison, Mu, Lambert, Tong, MacDiarmid, Lanham et al. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training. 2024. arXiv:2401.05566
Baker, Huizinga, Gao, Dou, Guan, Madry, Zaremba, Pachocki, Farhi. Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. 2025. arXiv:2503.11926
Sharkey. Circumventing Interpretability: How to Defeat Mind-Readers. LessWrong.
AI control
Greenblatt, Shlegeris, Sachan, Roger. AI Control: Improving Safety Despite Intentional Subversion. 2024. arXiv:2312.06942
Bhatt et al. Ctrl-Z: Controlling AI Agents via Resampling. Redwood Research, 2025.
Griffin, Thomson, Shlegeris, Abate. Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols. 2024. arXiv:2409.07985
Taylor et al. Auditing Games for Sandbagging. December 2025.
Safety cases
Balesni, Hobbhahn, Lindner, Meinke, Korbak, Clymer, Shlegeris, Scheurer, Stix, Shah, Goldowsky-Dill, Braun, Chughtai, Evans, Kokotajlo, Bushnaq. Towards Evaluations-Based Safety Cases for AI Scheming. 2024. arXiv:2411.03336
Buhl, Sett, Koessler, Schuett, Anderljung. Safety Cases for Frontier AI. 2024. arXiv:2410.21572
Clymer, Gabrieli, Krueger, Larsen. Safety Cases: How to Justify the Safety of Advanced AI Systems. 2024. arXiv:2403.10462
Goemans, Buhl, Schuett, Korbak, Wang, Hilton, Irving. Safety Case Template for Frontier AI: A Cyber Inability Argument. 2024. arXiv:2411.08088
Bowen et al. AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations. 2025. arXiv:2503.17388
Cooperative AI and game theory
Conitzer, Oesterheld. Foundations of Cooperative AI. AAAI 2023.
Oesterheld. Robust Program Equilibrium. Theory and Decision, 86:143–159, 2019.
Chen, Ghersengorin, Petersen. Imperfect Recall and AI Delegation. Global Priorities Institute, Working Paper 30-2024.
Making Deals with Early Schemers. LessWrong.
Evaluation methodology
Recht et al. Do ImageNet Classifiers Generalize to ImageNet? ICML 2019. arXiv:1902.10811
Measuring What Matters: Construct Validity in Large Language Model Benchmarks. Oxford Internet Institute et al. NeurIPS 2025.
Both Eyes Open: Vigilant Incentives Help Auditors Improve AI Safety. Journal of Physics: Complexity, 2025.
Soares. On How Various Plans Miss the Hard Bits of the Alignment Challenge. LessWrong.