SaferAI: Anthropic's Responsible Scaling Policy Update Is a Step Backwards
An external watchdog critique of Anthropic's Responsible Scaling Policy update, useful for understanding debates around voluntary safety commitments and how frontier AI governance frameworks are scrutinized and contested by civil society organizations.
Metadata
Importance: 52/100 · opinion piece · commentary
Summary
SaferAI critiques Anthropic's updated Responsible Scaling Policy (RSP), arguing that recent revisions weaken safety commitments rather than strengthening them. The analysis contends that the updated policy relaxes key thresholds and evaluation requirements, reducing accountability for frontier AI deployment. This represents a critical external perspective on how voluntary safety frameworks can erode over time.
Key Points
- SaferAI argues the updated RSP lowers the bar for safety evaluations before deploying more capable AI models.
- The critique highlights that revised thresholds for triggering enhanced safety measures are less stringent than the original policy.
- The analysis raises concerns about the adequacy of voluntary industry commitments as a governance mechanism for frontier AI.
- SaferAI suggests the changes prioritize deployment speed over precautionary safety standards.
- The piece reflects broader tensions between commercial pressures and robust AI safety commitments at frontier labs.
Cited by 5 pages
| Page | Type | Quality |
|---|---|---|
| Anthropic Core Views | Safety Agenda | 62.0 |
| Corporate AI Safety Responses | Approach | 68.0 |
| Eval Saturation & The Evals Gap | Approach | 65.0 |
| Responsible Scaling Policies | Approach | 62.0 |
| Voluntary AI Safety Commitments | Policy | 91.0 |
Cached Content Preview
HTTP 200 · Fetched Feb 25, 2026 · 57 KB
Anthropic’s Responsible Scaling Policy Update Makes a Step Backwards – SaferAI
Publication date: October 23, 2024

A week ago, Anthropic released an updated version of their Responsible Scaling Policy (RSP), prompting us to reassess their score in our ratings. We were initially expecting an improvement. Unfortunately, the results are disconcerting. Their grade has dropped from 2.2 to 1.9, placing them alongside OpenAI and DeepMind in the “weak” category.

The primary issue lies in Anthropic’s shift away from precisely defined capability thresholds and mitigation measures. The new policy adopts more qualitative descriptions, specifying the capability levels they aim to detect and the objectives of mitigations, but it lacks concrete details on the mitigations and evaluations themselves. This shift significantly reduces transparency and accountability, essentially asking stakeholders to accept a “trust us to handle it appropriately” approach rather than providing verifiable commitments and metrics.

To illustrate this change, let’s look at a capability threshold:

Version 1 (V1): AI Security Level 3 (ASL-3) was defined as “The model shows early signs of autonomous self-replication ability, as defined by a 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations].”

Version 2 (V2): ASL-3 is now defined as “The ability to either fully automate the work of an entry-level remote-only researcher at Anthropic, or cause dramatic acceleration in the rate of effective scaling” (quantified as an increase of approximately 1000x in a year).

In V2, the thresholds are no longer defined by quantitative benchmarks. Anthropic now states that they will demonstrate that the model’s capabilities are below these thresholds when necessary. However, this approach is susceptible to shifting goalposts as capabilities advance.

A similar trend is evident in their mitigation strategies. Instead of detailing specific measures, they focus on mitigation objectives, stating they will prove these objectives are met when required. This change alters the nature of their commitments. Let’s look at a specific commitment regarding security mitigation measures:

In V1: “We commit to the following security themes […] mandatory external reviews should embed security within regular operations and harden processes during organizational changes.”

In V2: “We expect this to include independent validation […]”

The general direction of these changes is concerning. By allowing more leeway to decide if a model meets thresholds, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify. We were expecting the RSP to become more specific as technology advances & their risk management pro
... (truncated, 57 KB total)
Resource ID: a5e4c7b49f5d3e1b | Stable ID: YmQxNmIxYj