Self-modeling is instrumentally useful
Web Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: Anthropic
Relevant to AI safety discussions on emergent capabilities and misuse risk; demonstrates that a frontier model has reached statistical parity with human-written persuasive arguments, raising concerns about AI-enabled influence operations and the need for persuasion-specific safety evaluations.
Metadata
Summary
Anthropic researchers developed a methodology to empirically measure AI persuasiveness by tracking human opinion changes before and after exposure to AI-generated arguments. They found a consistent scaling trend where each successive Claude generation becomes more persuasive, with Claude 3 Opus reaching statistical parity with human-written persuasive arguments. This work highlights persuasion as a key capability metric with significant implications for misuse in disinformation and manipulation.
Key Points
- Clear scaling trend observed: each successive Claude generation (1→2→3) produces more persuasive arguments within both compact and frontier model classes.
- Claude 3 Opus achieves persuasiveness statistically indistinguishable from human-written arguments, a significant capability milestone.
- Method measures persuasion via pre/post agreement ratings on non-polarized topics to isolate genuine persuasive effect from prior beliefs.
- Persuasion is flagged as a dual-use capability with legitimate applications (healthcare, policy) but serious misuse potential (disinformation, manipulation).
- Experimental dataset released publicly on HuggingFace to enable independent replication, critique, and further research.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Deceptive Alignment Decomposition Model | Analysis | 62.0 |
Cached Content Preview
Societal Impacts
# Measuring the Persuasiveness of Language Models
Apr 9, 2024
While people have long questioned whether AI models may, at some point, become as persuasive as humans in changing people's minds, there has been limited empirical research into the relationship between model scale and the degree of persuasiveness across model outputs. To address this, we developed a basic method to measure persuasiveness, and used it to compare a variety of Anthropic models across three different _generations_ (Claude 1, 2, and 3), and two _classes_ of models (compact models that are smaller, faster, and more cost-effective, and frontier models that are larger and more capable).
Within each class of models (compact and frontier), **we find a clear scaling trend across model generations: each successive model generation is rated to be more persuasive than the previous**. We also find that our latest and most capable model, Claude 3 Opus, produces arguments that don't statistically differ in their persuasiveness compared to arguments written by humans **(Figure 1)**.
Figure 1: Persuasiveness scores of model-written arguments (bars) and human-written arguments (horizontal dark dashed line). Error bars correspond to +/- 1SEM (vertical lines for model-written arguments, green band for human-written arguments). We see persuasiveness increases across model generations within both classes of models (compact: purple, frontier: red).
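The pre/post design and the ±1 SEM error bars described above can be sketched in a few lines. This is a hypothetical illustration, not Anthropic's actual analysis code: the function name `persuasiveness_score`, the Likert-style 1–5 ratings, and the sample data are all assumptions made for the example.

```python
import statistics

def persuasiveness_score(pre_ratings, post_ratings):
    """Mean opinion shift (post - pre) across participants, a stand-in for
    the persuasiveness metric: agreement with a claim is rated on a
    Likert-style scale before and after reading an argument."""
    shifts = [post - pre for pre, post in zip(pre_ratings, post_ratings)]
    mean_shift = statistics.mean(shifts)
    # Standard error of the mean, matching the +/- 1 SEM error bars in Figure 1
    sem = statistics.stdev(shifts) / len(shifts) ** 0.5
    return mean_shift, sem

# Hypothetical ratings from six participants for one argument
pre = [3, 4, 2, 5, 3, 4]
post = [4, 5, 3, 5, 4, 5]
score, sem = persuasiveness_score(pre, post)
```

Comparing such scores (with their SEMs) across model generations, and against a human-written baseline, is the shape of the scaling comparison the post reports.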
We study persuasion because it is a general skill which is used widely within the world—companies try to persuade people to buy products, healthcare providers try to persuade people to make healthier lifestyle changes, and politicians try to persuade people to support their policies and vote for them. Developing ways to measure the persuasive capabilities of AI models is important because it serves as a proxy measure of how well AI models can match human skill in an important domain, and because persuasion may ultimately be tied to certain kinds of misuse.
... (truncated, 30 KB total)