Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
Antoine Maier, Aude Maier, Tom David
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
A theoretically grounded argument for hard limits on optimization pressure in general-purpose AI systems, relevant to debates on reward hacking, misalignment, and AI governance frameworks.
Paper Details
Abstract
A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer's intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, we argue that, absent a mathematical characterization of these gaps, they are indistinguishable from gaps that collapse into Goodhart's law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.
Summary
This paper argues that the common assumption that ML models satisfy their specified objectives (the Objective Satisfaction Assumption) systematically fails due to approximation, estimation, and optimization errors, as well as inevitable misspecification. Drawing on Goodhart's Law and recent mathematical results, the authors conclude that because these unavoidable gaps are indistinguishable from gaps that collapse into catastrophic Goodhart failure modes under strong optimization pressure, general-purpose AI needs a principled cap on optimization pressure to avert an irreversible loss of control.
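For intuition, the three technical error sources the paper names line up with the standard excess-risk decomposition from statistical learning theory. The display below is a common textbook form, shown in illustrative notation for orientation only; the paper's own formalism may differ:

```latex
% Illustrative notation (not the paper's): R is the true risk under the
% specified objective, f^* the global optimum over all functions, f^*_H the
% best model in the hypothesis class H, f_n the empirical optimum on n
% samples, and \hat{f} the model the optimizer actually returns.
R(\hat{f}) - R(f^{*})
  = \underbrace{R(f^{*}_{\mathcal{H}}) - R(f^{*})}_{\text{approximation}}
  + \underbrace{R(f_{n}) - R(f^{*}_{\mathcal{H}})}_{\text{estimation}}
  + \underbrace{R(\hat{f}) - R(f_{n})}_{\text{optimization}}
```

Each term is generically nonzero in realistic training runs, which is the technical half of the paper's case against the OSA; misspecification of the objective itself is the other half.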
Key Points
- Introduces the 'Objective Satisfaction Assumption' (OSA) and argues it systematically fails in realistic ML training due to approximation, estimation, and optimization errors.
- Perfectly translating developer intent into formal objectives is practically impossible, making some degree of misspecification structurally inevitable.
- Without a mathematical characterization of objective gaps, these deviations are indistinguishable from Goodhart's Law failure modes under strong optimization pressure.
- The 'Goodhart breaking point', where optimization becomes harmful, cannot be identified in advance (see the sketch after this list), necessitating a principled limit on optimization strength.
- Argues that absent such limits, continued optimization of general-purpose AI systems predictably risks irreversible loss of control.
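The Goodhart dynamic behind the last two points is easy to reproduce in a toy model. The sketch below is not from the paper, and every function and constant in it is an assumption chosen for illustration: a proxy objective that tracks the true objective for typical candidates is optimized by best-of-n selection, a crude stand-in for optimization pressure.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # What the developer actually wants: improves with x at first,
    # then degrades as x becomes extreme.
    return x - 0.1 * x**2

def proxy(x):
    # The specified objective: agrees with the true objective for typical x,
    # but omits the penalty on extreme x (the misspecification).
    return x

for n in (10, 100, 1_000, 10_000, 100_000):
    # Optimization-pressure dial: keep the proxy-best of n random candidates,
    # averaged over 100 independent runs.
    candidates = rng.normal(loc=0.0, scale=2.0, size=(100, n))
    winners = candidates[np.arange(100), proxy(candidates).argmax(axis=1)]
    print(f"n={n:>6}  mean proxy={proxy(winners).mean():5.2f}  "
          f"mean true utility={true_utility(winners).mean():5.2f}")
```

As n grows, the mean proxy score climbs monotonically while mean true utility rises, peaks, and then collapses. The peak's location depends on the misspecification term, which is unknown by hypothesis, so the breaking point cannot be read off in advance — the structure of the paper's argument for an ex-ante limit.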
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reward Hacking Taxonomy and Severity Model | Analysis | 71.0 |
Cached Content Preview
# Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
Antoine Maier¹, Aude Maier², Tom David¹
Corresponding author: amaier@gpaipolicylab.org
###### Abstract
A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer’s intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, we argue that, absent a mathematical characterization of these gaps, they are indistinguishable from gaps that collapse into Goodhart’s law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.
## I Introduction
In the machine learning community, reinforcement learning (RL) is commonly described as producing agents whose goal is to maximize the expected cumulative reward, or return. This formulation even appears in Reinforcement Learning: An Introduction\[ [1](https://ar5iv.labs.arxiv.org/html/2510.02840#bib.bib1 "")\], widely regarded as the most influential introductory RL textbook, whose authors received the 2024 Turing Award for “developing the conceptual and algorithmic foundations of reinforcement learning”\[ [2](https://ar5iv.labs.arxiv.org/html/2510.02840#bib.bib2 "")\]. This view is not specific to RL, but reflects a broader and often overlooked assumption in machine learning: that training results in a model that satisfies the specified learning objective. In other words, it is commonly believed that if a model is trained to minimize a loss function or maximize a reward signal, the resulting model will indeed act in a way that optimizes that objective. We refer to this belief as the Objective Satisfaction Assumption (OSA). In this paper, we argue that in realistic conditions, the OSA fails due to approximation, estimation, and optimization errors. Although this failure is often acknowledged in the literature\[ [3](https://ar5iv.labs.arxiv.org/html/2510.02840#bib.bib3 "")\], taken seriously it implies that we are unable to robustly implement the intended goal in an AI system, independently of the quality of its specification.
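For reference, the textbook objective alluded to above is the expected discounted return, and the OSA amounts to assuming the trained policy attains its maximum. The display below uses standard notation and is added for orientation; it is not quoted from the paper:

```latex
% Standard RL objective in Sutton–Barto notation (added for reference):
% \pi is the policy, \gamma \in [0, 1) the discount factor, R_{t+1} the reward.
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right],
\qquad
\text{OSA: } \hat{\pi} \in \arg\max_{\pi} J(\pi)
```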
But the challenge begins even before training. In most real-world applications, the specified objective fu
... (truncated, 76 KB total)