Longterm Wiki

Weapons of Mass Destruction Proxy Benchmark (WMDP)

Website: [wmdp.ai](https://www.wmdp.ai/)

WMDP is a key benchmark for assessing WMD-related hazardous capabilities in LLMs, relevant to AI safety evaluations conducted by labs and regulators to gate model deployment decisions.

Metadata

Importance: 72/100 · tool page · tool

Summary

WMDP is a benchmark designed to measure hazardous knowledge in large language models related to biosecurity, cybersecurity, and chemical security. It serves as a proxy for assessing dangerous capabilities in AI systems and supports unlearning research aimed at reducing such risks. The benchmark helps researchers identify and mitigate the potential for LLMs to assist in developing biological, cyber, and chemical attack capabilities.

Key Points

  • Provides a multiple-choice benchmark of 3,668 questions across biosecurity, cybersecurity, and chemical security to assess hazardous LLM knowledge.
  • Designed to support machine unlearning techniques that can selectively remove dangerous knowledge from models without degrading general capabilities.
  • Helps AI developers and safety researchers identify models that may inadvertently provide uplift toward weapons of mass destruction.
  • Serves as both an evaluation tool and a red-teaming resource for responsible AI deployment decisions.
  • Accompanies RMU, a state-of-the-art unlearning method for reducing hazardous knowledge while preserving model utility.

Cited by 2 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 4 KB
# The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

The WMDP Team

[Paper](https://arxiv.org/abs/2403.03218)

[GitHub](https://github.com/centerforaisafety/wmdp)

[Collection](https://huggingface.co/collections/cais/wmdp-benchmark-661ed0abb589122164900e0e)

[Blog](https://www.safe.ai/blog/wmdp-benchmark)

## Introduction

The **W**eapons of **M**ass **D**estruction **P**roxy (WMDP) benchmark is a dataset of 3,668 multiple-choice questions surrounding hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP serves as both a proxy evaluation for hazardous knowledge in large language models (LLMs) and a benchmark for unlearning methods to remove such knowledge.
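As a concrete illustration of the dataset's structure, the sketch below loads one domain split with the Hugging Face `datasets` library. It assumes the benchmark is published under the `cais/wmdp` dataset id with per-domain configs (`wmdp-bio`, `wmdp-cyber`, `wmdp-chem`) and `question`/`choices`/`answer` fields; consult the linked collection for the authoritative names.

```python
# Minimal sketch: inspecting WMDP questions via Hugging Face datasets.
# Dataset id, config names, and field names are assumptions taken from the
# linked collection; verify them against the actual dataset card.
from datasets import load_dataset

bio = load_dataset("cais/wmdp", "wmdp-bio", split="test")
print(f"{len(bio)} biosecurity questions")

example = bio[0]
print(example["question"])
for i, choice in enumerate(example["choices"]):
    print(f"  ({chr(ord('A') + i)}) {choice}")
print("gold answer index:", example["answer"])
```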

To guide progress on mitigating risk from LLMs, we develop RMU, a state-of-the-art unlearning method which reduces model performance on WMDP while maintaining general language model capabilities.

![outline](https://www.wmdp.ai/images/dataset.svg)

## Evaluating Risk From LLMs

![outline](https://www.wmdp.ai/images/three_panel.svg)

To measure risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private and are limited to a narrow range of misuse scenarios. To fill these gaps, we publicly release WMDP, an expert-written dataset which measures how LLMs could aid malicious actors in developing biological, cyber, and chemical attack capabilities. To avoid releasing sensitive and export-controlled information, we collect questions that are precursors, neighbors, and components of the hazardous knowledge we wish to remove.
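Benchmarks of this form are typically scored zero-shot by comparing the model's likelihood of each answer letter. The sketch below shows one such scoring loop with `transformers`; the prompt template and model name are placeholders rather than the paper's exact evaluation setup, so see the GitHub repository for the recommended pipeline.

```python
# Hedged sketch of zero-shot multiple-choice scoring: pick the answer letter
# the model assigns the highest log-probability after the prompt. The prompt
# template and model name are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the model under evaluation
LETTERS = ["A", "B", "C", "D"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def predict(question: str, choices: list[str]) -> int:
    """Return the index of the choice whose letter the model finds most likely."""
    prompt = question + "\n"
    for letter, choice in zip(LETTERS, choices):
        prompt += f"{letter}. {choice}\n"
    prompt += "Answer:"
    scores = []
    for letter in LETTERS[: len(choices)]:
        ids = tokenizer(prompt + " " + letter, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability of the final token (assumes the letter is one token)
        log_probs = torch.log_softmax(logits[0, -2], dim=-1)
        scores.append(log_probs[ids[0, -1]].item())
    return int(torch.tensor(scores).argmax())

# Accuracy over a split loaded as in the previous snippet would then be
# sum(predict(ex["question"], ex["choices"]) == ex["answer"] for ex in split) / len(split)
```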

## Mitigating Risk from LLMs

![outline](https://www.wmdp.ai/images/pipeline.svg)

On closed-source, API-access models such as GPT-4 and Gemini, adversaries can deploy adversarial attacks or harmful API fine-tuning to bypass models’ refusal training and unlock their knowledge. Fortunately, model providers have leverage, as they may apply safety interventions before serving the model. In particular, providers may perform machine unlearning to directly remove hazardous knowledge from models. Unlearned models have higher inherent safety: even if they are jailbroken, unlearned models lack the knowledge to be repurposed for malicious use.

To guide progress on unlearning, we develop RMU, a state-of-the-art method inspired by [representation engineering](https://www.ai-transparency.org/), which improves unlearning precision: removing dangerous knowledge while preserving general model capabilities. RMU significantly reduces model performance on WMDP while mostly retaining general capabilities.
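The description above suggests a simple shape for an RMU-style training objective: on hazardous ("forget") text, steer the model's activations at a chosen layer toward a fixed random control vector; on benign ("retain") text, keep them close to a frozen copy of the original model. The sketch below is a rough paraphrase of that idea in PyTorch; the layer index, control-vector scale, and retain weight are illustrative assumptions rather than the published hyperparameters, so refer to the paper and GitHub repository for the actual method.

```python
# Rough sketch of an RMU-style unlearning loss, paraphrasing the description
# above. Layer index, control-vector scale c, and retain weight alpha are
# illustrative assumptions, not the published settings.
import torch
import torch.nn.functional as F

def make_control_vector(hidden_size: int, c: float = 10.0) -> torch.Tensor:
    """Sample a fixed random direction once and scale it by c."""
    u = torch.rand(hidden_size)
    return c * u / u.norm()

def rmu_loss(updated_model, frozen_model, forget_ids, retain_ids,
             control_vec, layer: int = 7, alpha: float = 100.0):
    """Forget loss steers activations to the control vector; retain loss anchors them."""
    def hidden_states(model, ids, requires_grad):
        if requires_grad:
            out = model(ids, output_hidden_states=True)
        else:
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
        return out.hidden_states[layer]

    h_forget = hidden_states(updated_model, forget_ids, requires_grad=True)
    h_retain = hidden_states(updated_model, retain_ids, requires_grad=True)
    h_retain_ref = hidden_states(frozen_model, retain_ids, requires_grad=False)

    # Push activations on hazardous text toward the fixed random vector.
    forget_loss = F.mse_loss(h_forget, control_vec.expand_as(h_forget))
    # Keep activations on benign text close to the original (frozen) model.
    retain_loss = F.mse_loss(h_retain, h_retain_ref)
    return forget_loss + alpha * retain_loss
```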

... (truncated, 4 KB total)
Resource ID: cfa49cff8bb3ac32 | Stable ID: OTQ2ZWI5Mj