Longterm Wiki

can worsen with model size


This OpenReview paper examines the scaling behavior of alignment techniques, which bears on debates about whether larger models are automatically safer and whether alignment interventions like RLHF become more costly or less effective at scale. The page was temporarily unavailable at the time of analysis.

Metadata

Importance: 55/100 · conference paper · primary source

Summary

This paper investigates how alignment techniques such as RLHF may exhibit scaling problems, in which safety-relevant behaviors or alignment costs worsen rather than improve as models grow larger. Because the page itself could not be fetched, details beyond the title are inferred from context.

Key Points

  • Alignment properties or costs may not improve monotonically with model scale, potentially degrading with larger models
  • RLHF and human feedback-based training may introduce unexpected scaling challenges
  • Larger models could exhibit worse alignment behavior in certain metrics despite improved general capabilities
  • Results suggest caution about assuming scale automatically improves safety or alignment outcomes
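The non-monotonicity the bullets warn about can be made concrete with a small check over evaluation scores at increasing model scales (a minimal sketch; all scores and model sizes below are hypothetical illustrations, not results from the paper):

```python
# Sketch: test whether an alignment metric improves monotonically with
# model scale, as the key points caution it may not.

def is_monotonically_improving(scores):
    """Return True if each successive score is >= the previous one
    (higher score = better aligned)."""
    return all(b >= a for a, b in zip(scores, scores[1:]))

# Hypothetical alignment scores for models of increasing parameter count.
scores_by_scale = {
    "1B": 0.62,
    "10B": 0.71,
    "100B": 0.68,  # hypothetical regression at the largest scale
}

print(is_monotonically_improving(list(scores_by_scale.values())))  # → False
```

A `False` result at the largest scale is exactly the pattern the paper's title suggests: capability gains from scale do not guarantee alignment gains.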

Cited by 1 page

| Page | Type | Quality |
|------|------|---------|
| RLHF | Research Area | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Feb 26, 2026 · 0 KB

#### The requested page could not be loaded.

```
OpenReview is temporarily unavailable. Please try again later.
```

If you'd like to report this error to the developers, please send an email
to [info@openreview.net](mailto:info@openreview.net).
Resource ID: 7712afe39f75a44c | Stable ID: NzJhMjI2Mj