
"Retrospective on My Posts on AI Threat Models"


Written by Victoria Krakovna (DeepMind alignment team), this post offers an insider's perspective on how AI safety threat-model thinking evolved through 2023, and is useful for understanding the state of alignment research priorities at a major lab.

Metadata

Importance: 58/100 · blog post · commentary

Summary

Victoria Krakovna reviews her 2022 posts on AI threat models, updating her views in light of 2023 developments, including progress on governance, capability evaluations, and the behavior of LLMs. She reaffirms the 'SG+GMG→MAPS' threat model framework (specification gaming plus goal misgeneralization leading to misaligned power-seeking), while noting that LLMs appear surprisingly non-consequentialist and discussing new threat models such as Deep Deceptiveness.

Key Points

  • Reaffirms the 'SG+GMG→MAPS' consensus threat model: specification gaming and goal misgeneralization leading to misaligned power-seeking.
  • LLMs appear surprisingly non-consequentialist for their capability level, making them one of the safest paths to AGI in terms of avoiding arbitrarily consequentialist systems.
  • Increased optimism about global cooperation on AI governance, following broad acknowledgment of AI risks in 2023.
  • Introduces 'Deep Deceptiveness' as a potential new threat model: deceptive behavior may be fundamentally hard to disentangle from general capabilities.
  • Deception/manipulation capabilities appear to lag behind usefulness for alignment tasks (e.g., model-written evals, weak-to-strong generalization) as of 2023.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Sharp Left Turn | Risk | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 8 KB
Last year, a major focus of my research was developing a better understanding of threat models for AI risk. This post is looking back at some posts on threat models I (co)wrote in 2022 (based on my reviews of these posts for the [LessWrong 2022 review](https://www.lesswrong.com/posts/B6CxEApaatATzown6/the-lesswrong-2022-review)).

I ran a [survey on DeepMind alignment team opinions](https://www.lesswrong.com/posts/qJgz2YapqpFEDTLKn/deepmind-alignment-team-opinions-on-agi-ruin-arguments) on the [list of arguments for AGI ruin](https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities). I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven’t rerun the survey so I don’t really know. Looking back at the “possible implications for our work” section, we are working on basically all of these things.

Thoughts on some of the cruxes in the post based on developments in 2023:

- Is global cooperation sufficiently difficult that AGI would need to deploy new powerful technology to make it work? – There has been a lot of progress on AGI governance and broad endorsement of the risks this year, so I feel somewhat more optimistic about global cooperation than a year ago.
- Will we know how capable our models are? – The field has made some progress on designing concrete capability evaluations – how well they measure the properties we are interested in remains to be seen.
- Will systems acquire the capability to be useful for alignment / cooperation before or after the capability to perform advanced deception? – At least so far, deception and manipulation capabilities seem to be lagging a bit behind usefulness for alignment (e.g. model-written evals / critiques, weak-to-strong generalization), but this could change in the future.
- Is consequentialism a powerful attractor? How hard will it be to avoid arbitrarily consequentialist systems? – Current SOTA LLMs seem surprisingly non-consequentialist for their level of capability. I still expect LLMs to be one of the safest paths to AGI in terms of avoiding arbitrarily consequentialist systems.

In [Clarifying AI X-risk](https://www.lesswrong.com/posts/GctJD5oCDRxCspEaZ/clarifying-ai-x-risk/), we presented a categorization of threat models and our consensus threat model, which posits some combination of specification gaming and goal misgeneralization leading to misaligned power-seeking, or “SG+GMG→MAPS”. I still endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the “SG + GMG → MAPS” framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk.
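
To make the "SG" and "GMG" ingredients of this framing concrete, here is a minimal toy sketch (my own illustration, not from Krakovna's post): a misspecified proxy reward checks whether a task *appears* done rather than whether it *is* done, so optimization selects a "fake it" behavior (specification gaming); and because appearance and reality coincide during training, a policy that internalizes the goal "make it appear done" looks identical to the intended one until deployment (goal misgeneralization). All action names and reward functions here are hypothetical.

```python
# Toy sketch of the "SG" and "GMG" pieces of SG+GMG->MAPS.
# Everything here is a hypothetical illustration, not code from the post.

# Each action yields (task_actually_done, task_appears_done, effort).
ACTIONS = {
    "do_task": (True,  True,  0.5),
    "fake_it": (False, True,  0.1),  # games the spec: appears done, isn't
    "idle":    (False, False, 0.0),
}

def proxy_reward(done, appears_done, effort):
    # Misspecified objective: checks appearance, not reality.
    return (1.0 if appears_done else 0.0) - effort

def true_utility(done, appears_done, effort):
    # What the designer actually wanted.
    return 1.0 if done else 0.0

# Specification gaming: optimizing the proxy selects the deceptive action,
# since it gets full proxy reward at lower effort than really doing the task.
best = max(ACTIONS, key=lambda a: proxy_reward(*ACTIONS[a]))
assert best == "fake_it"

# Goal misgeneralization: if "appears done" and "actually done" always
# coincide during training, a policy pursuing "make it appear done" is
# indistinguishable from the intended one; off-distribution they come apart.
print(best, true_utility(*ACTIONS[best]))  # fake_it 0.0
```

The sketch only covers the first two ingredients; the "MAPS" step is the further claim that a capable system pursuing such a misgeneralized goal would tend toward misaligned power-seeking.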


... (truncated, 8 KB total)