Andy Zou – AI Safety Researcher at CMU
web
Personal academic homepage of Andy Zou, a CMU PhD student working on AI safety, known for Representation Engineering (RepE) for AI transparency and universal adversarial attacks on aligned LLMs.
Metadata
Importance: 72/100 | homepage
Summary
Andy Zou's academic homepage showcases his research on AI safety, including Representation Engineering (a top-down approach to AI transparency using cognitive neuroscience insights) and universal adversarial attacks on aligned language models. His work also includes the MACHIAVELLI benchmark for evaluating ethical behavior in AI agents. These contributions are highly relevant to interpretability, robustness, and alignment of large language models.
Key Points
- Introduced Representation Engineering (RepE), a population-level approach to AI transparency that enables monitoring and manipulating high-level cognitive phenomena in LLMs.
- Discovered universal adversarial suffixes that bypass alignment in open-source LLMs and transfer to ChatGPT, Claude, Bard, and LLaMA-2.
- Co-created the MACHIAVELLI benchmark (ICML 2023 Oral) to measure trade-offs between reward-seeking and ethical behavior in AI agents.
- Advised by Zico Kolter and Matt Fredrikson at CMU; previously worked with Dawn Song and Jacob Steinhardt at UC Berkeley.
- Research focuses on building ML systems that are robust, reliable, trustworthy, and understandable.
Cached Content Preview
HTTP 200 | Fetched Apr 9, 2026 | 10 KB
Andy Zou
I am a PhD student in the Computer Science Department at CMU, advised by Zico Kolter and Matt Fredrikson. I am interested in AI Safety.
I received my MS and BS from UC Berkeley, where I was advised by Dawn Song and Jacob Steinhardt.
Email / Google Scholar / Twitter / GitHub / YouTube
What's New
[Jul 27, 2023] Our adversarial attack on LLMs was covered exclusively in The New York Times.
[Apr 7, 2023] MACHIAVELLI benchmark released.
Research
I'm interested in building machine learning systems that are robust, reliable, trustworthy, and understandable.
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, Dan Hendrycks
arXiv
Selected Talks: [OpenAI] [Google] [Meta] [Oracle] [Caltech] [MIT-IBM] [AISHED] [PMAG]
Links: [arXiv] [Demo] [Code] [Fox]
In this paper, we introduce and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including truthfulness, memorization, power-seeking, and more, demonstrating the promise of representation-centered transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson
arXiv
Selected Talks: [UofT] [Cohere] [BuzzRobot] [Cognitive Revolution]
Links: [arXiv] [Demo] [Code] [NY Times] [The Register] [Wired] [Daily Express] [CNN]
We found adversarial suffixes that completely circumvent the alignment of open-source LLMs. More concerningly, the same prompts transfer to ChatGPT, Claude, Bard, and LLaMA-2...
Do the Rewards Justify the Means? Measuring Trade-Offs Between R
... (truncated, 10 KB total)
Resource ID: d30f6ab0e2622464 | Stable ID: sid_j3nKhn96U3