Center for AI Safety SafeBench competition
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Center for AI Safety
This newsletter provides a comprehensive overview of CAIS's 2024 research and advocacy portfolio, useful for tracking the state of technical AI safety work and benchmark development across the field.
Metadata
Summary
The Center for AI Safety's 2024 annual review highlights major research achievements including circuit breakers for preventing dangerous AI outputs, the WMDP benchmark for measuring hazardous knowledge, HarmBench for red teaming evaluation, and tamper-resistant safeguards for open-weight models. The review also covers advocacy efforts including the CAIS Action Fund and support for AI safety legislation. These projects span technical safety research, evaluation frameworks, and policy advocacy.
Key Points
- Circuit Breakers research (NeurIPS 2024) showed that AI models can be trained to resist jailbreaking; it took 20,000 attempts to break a circuit-breaker model in a competition
- WMDP Benchmark provides 3,668 questions measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security (ICML 2024)
- HarmBench, a standardized red-teaming evaluation framework, was used by the US and UK AI Safety Institutes for pre-deployment testing of Claude 3.5 Sonnet
- Tamper-resistant safeguards for open-weight LLMs were developed to prevent removal of safety measures even after fine-tuning
- Safetywashing paper (NeurIPS 2024) argues most safety benchmarks correlate with general capabilities rather than measuring distinct safety properties
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI Safety Intervention Effectiveness Matrix | Analysis | 73.0 |
| Optimistic Alignment Worldview | Concept | 91.0 |
Cached Content Preview
AISN #45: Center for AI Safety 2024 Year in Review
AI Safety Newsletter
Center for AI Safety · Dec 19, 2024
As 2024 draws to a close, we want to thank you for your continued support for AI safety and review what we’ve been able to accomplish. In this special-edition newsletter, we highlight some of our most important projects from the year.
The mission of the Center for AI Safety is to reduce societal-scale risks from AI. We focus on three pillars of work: research, field-building, and advocacy.
Research
CAIS conducts both technical and conceptual research on AI safety. Here are some highlights from our research in 2024:
Circuit Breakers. We published breakthrough research showing how circuit breakers can prevent AI models from behaving dangerously by interrupting crime-enabling outputs. In a jailbreaking competition with a prize pool of tens of thousands of dollars, it took twenty thousand attempts to jailbreak a model trained with circuit breakers. The paper was accepted to NeurIPS 2024.
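The paper's actual method trains the model itself so that harmful internal representations are rerouted during generation. As a rough, hypothetical illustration of the broader idea of interrupting generation based on internal activations, the sketch below monitors hidden states at inference time and stops decoding when they align with an assumed "harm" direction; the probe direction, threshold, and model interface are assumptions, not the paper's implementation.

```python
# Conceptual sketch only, not the circuit-breakers training method: greedily
# decode with a Hugging Face-style causal LM and stop as soon as the last-layer
# hidden state of the newest token aligns with an assumed "harm" direction.
import torch
import torch.nn.functional as F

def generate_with_interrupt(model, tokenizer, prompt, harm_direction,
                            threshold=0.7, max_new_tokens=128):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        hidden = out.hidden_states[-1][0, -1]               # last layer, final position
        if F.cosine_similarity(hidden, harm_direction, dim=0) > threshold:
            break                                           # interrupt the output
        next_id = out.logits[0, -1].argmax().view(1, 1)     # greedy next token
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```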
The WMDP Benchmark. We developed the Weapons of Mass Destruction Proxy Benchmark, a dataset of 3,668 multiple-choice questions serving as a proxy measurement for hazardous knowledge in biosecurity, cybersecurity, and chemical security. The benchmark enables measuring and reducing malicious use potential in AI systems. The paper was accepted to ICML 2024.
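To make concrete how a proxy benchmark of this kind is typically consumed, here is a minimal sketch that scores a model on multiple-choice accuracy over one WMDP subset. The Hugging Face dataset identifier, split, and field names are assumptions about the public release, and `model_choice` is a hypothetical stand-in for the model under evaluation.

```python
# Minimal sketch: multiple-choice accuracy on a WMDP subset. The dataset name,
# split, and field names are assumptions; model_choice is a placeholder.
from datasets import load_dataset

def model_choice(question: str, choices: list[str]) -> int:
    """Hypothetical stand-in: return the index of the option the model picks."""
    raise NotImplementedError

def wmdp_accuracy(subset: str = "wmdp-bio") -> float:
    data = load_dataset("cais/wmdp", subset, split="test")
    correct = sum(model_choice(ex["question"], ex["choices"]) == ex["answer"]
                  for ex in data)
    return correct / len(data)
```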
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? Our results show that most LLM benchmarks, including many safety benchmarks, are highly correlated with general capabilities and training compute. Much of the existing "safety" work is therefore not measuring or improving a dimension distinct from general capabilities. The paper was accepted to NeurIPS 2024.
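A minimal sketch of the kind of check behind this claim, using invented numbers: derive a capabilities score from general benchmark results (the paper uses the first principal component across many models and benchmarks) and measure how strongly a "safety" benchmark correlates with it.

```python
# Illustrative only; all scores below are invented. Rows are models, columns
# are general-capability benchmarks; safety_scores is a "safety" benchmark.
import numpy as np

capability_scores = np.array([[62.0, 55.1],
                              [71.3, 64.8],
                              [80.2, 75.5],
                              [85.9, 83.0]])
safety_scores = np.array([41.0, 52.5, 63.1, 70.4])

# Capabilities component: first principal component of the capability matrix.
centered = capability_scores - capability_scores.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
capabilities_component = centered @ vt[0]

# A high |correlation| suggests the "safety" benchmark mostly tracks capabilities.
r = np.corrcoef(capabilities_component, safety_scores)[0, 1]
print(f"correlation with capabilities: {abs(r):.2f}")
```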
Tamper-Resistant Safeguards for Open-Weight Models. Open-weight models can help minimize concentration of power as proprietary models become more capable. One challenge with open-weight models, however, is that malicious users may exploit them to cause catastrophic harm. We developed a method for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after fine-tuning. If we can robustly remove hazardous knowledge from LLMs, it greatly increases the viability of open-weight models.
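As a hedged sketch of how tamper-resistance is usually evaluated, rather than how the safeguard is trained: fine-tune the released weights the way an adversary would, then re-measure hazardous-knowledge accuracy. The callables below are hypothetical placeholders.

```python
# Sketch of a tamper-resistance evaluation, not the safeguard training method.
# fine_tune simulates an adversary; evaluate_hazard might be WMDP accuracy.
from typing import Any, Callable

def tamper_resistance_check(model: Any,
                            fine_tune: Callable[[Any], Any],
                            evaluate_hazard: Callable[[Any], float]) -> dict:
    before = evaluate_hazard(model)     # hazardous-knowledge score as released
    attacked = fine_tune(model)         # simulated fine-tuning attack
    after = evaluate_hazard(attacked)   # score after the attack
    # A robust safeguard keeps "after" near chance (25% on a 4-option benchmark);
    # a large recovery means the safeguard was removed by fine-tuning.
    return {"before": before, "after": after, "recovered": after - before}
```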
HarmBench. We released a standardized evaluation framework for automated red teaming, establishing rigorous assessments of various methods
... (truncated, 11 KB total)
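For the HarmBench framework mentioned above, the headline quantity is an attack success rate over a fixed set of harmful behaviors; the sketch below shows that computation in outline. The callables are hypothetical placeholders, not HarmBench's actual API.

```python
# Outline of an attack-success-rate computation as used in automated red teaming.
# attack, target_model, and judge are hypothetical placeholders.
from typing import Callable

def attack_success_rate(behaviors: list[str],
                        attack: Callable[[str], str],         # behavior -> adversarial prompt
                        target_model: Callable[[str], str],   # prompt -> completion
                        judge: Callable[[str, str], bool]) -> float:  # did it succeed?
    successes = sum(judge(b, target_model(attack(b))) for b in behaviors)
    return successes / len(behaviors)
```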