Longterm Wiki

Victoria Krakovna – Research Publications

web
vkrakovna.wordpress.com/research/

Victoria Krakovna is a senior DeepMind safety researcher; this page indexes her body of work and is a useful entry point for understanding DeepMind's technical safety research agenda across side effects, power-seeking, and frontier model evaluation.

Metadata

Importance: 72/100 · homepage

Summary

This is the research publications page of Victoria Krakovna (DeepMind safety researcher), listing her papers spanning side effects avoidance, power-seeking, goal misgeneralization, tampering incentives, dangerous capabilities evaluation, and stealth/situational awareness in frontier models. The page serves as a comprehensive index of her contributions to technical AI safety research from 2017 to 2025.

Key Points

  • Covers foundational work on side effects penalties and relative reachability, including the influential AI Safety Gridworlds benchmark (2017).
  • Includes research on power-seeking tendencies in trained agents and quantifying stability of non-power-seeking behavior.
  • Features papers on tampering incentives (REALab, Decoupled Approval) and goal misgeneralization in RL agents.
  • Recent work (2024-2025) focuses on evaluating frontier models for dangerous capabilities, stealth, and situational/scheming awareness.
  • Several papers are products of MATS (ML Alignment Theory Scholars) mentorship projects, reflecting Krakovna's mentorship role in the safety research community.

Cited by 1 page

| Page | Type | Quality |
|---|---|---|
| Corrigibility Failure | Risk | 62.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 12 KB
### Papers

[Evaluating Frontier Models for Stealth and Situational Awareness](https://arxiv.org/abs/2505.01420). Mary Phuong, Roland Zimmermann, Ziyue Wang, David Lindner, Victoria Krakovna, Sarah Cogan, Allan Dafoe, Lewis Ho, Rohin Shah. arXiv, 2025. ( [blog post](https://deepmindsafetyresearch.medium.com/evaluating-and-monitoring-for-ai-scheming-d3448219a967))

[Evaluating Frontier Models for Dangerous Capabilities](https://arxiv.org/abs/2403.13793). Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Toby Shevlane, et al. arXiv, 2024.

[Limitations of Agents Simulated by Predictive Models](https://arxiv.org/abs/2402.05829). Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna (MATS project). arXiv, 2024.

[Quantifying stability of non-power-seeking in artificial agents](https://arxiv.org/abs/2401.03529). Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna (MATS project). arXiv, 2024.

[Power-seeking can be probable and predictive for trained agents](https://arxiv.org/abs/2304.06528). Victoria Krakovna and Janos Kramar. arXiv, 2023. ( [blog post](https://www.alignmentforum.org/posts/fLpuusx9wQyyEBtkJ/power-seeking-can-be-probable-and-predictive-for-trained))

[Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals](https://arxiv.org/abs/2210.01790). Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton. arXiv, 2022.

[Avoiding Side Effects By Considering Future Tasks](https://papers.nips.cc/paper/2020/file/dc1913d422398c25c5f0b81cab94cc87-Paper.pdf). Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg. Neural Information Processing Systems, 2020. ( [arXiv](https://arxiv.org/abs/2010.07877), [code](https://github.com/deepmind/deepmind-research/tree/master/side_effects_penalties), [AN summary](https://mailchi.mp/051273eb96eb/an-122arguing-for-agi-driven-existential-risk-from-first-principles))

[Avoiding Tampering Incentives in Deep RL via Decoupled Approval](https://arxiv.org/abs/2011.08827). Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg. arXiv, 2020. ( [blog post](https://medium.com/@deepmindsafetyresearch/realab-conceptualising-the-tampering-problem-56caab69b6d3), [AN summary](https://mailchi.mp/77121cab0cff/an-126-avoiding-wireheading-by-decoupling-action-feedback-from-action-effects))

[REALab: An Embedded Perspective on Tampering](https://arxiv.org/abs/2011.08820). Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg. arXiv, 2020. ( [blog post](https://medium.com/@deepmindsafetyresearch/realab-conceptualising-the-tampering-problem-56caab69b6d3))

[Modeling AGI Safety Frameworks with Causal Influence Diagrams](https://arxiv.org/abs/1906.08663). Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg. IJCAI AI Safety workshop, 2019. ( [AN summary](https://mailchi.mp/2abdf1

... (truncated, 12 KB total)
Resource ID: 45af23d90ccfc785 | Stable ID: ZWYzYWY0Yz