From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

paper

2025·arXiv·arxiv.org/abs/2508.09224

Authors

Yuan Yuan·Tina Sriskandarajah·Anna-Luisa Brakman·Alec Helyar·Alex Beutel·Andrea Vallone·Saachi Jain

Credibility Rating

3/5

Good(3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

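Proposes safe-completions, an output-centric safety-training approach that replaces the binary comply-or-refuse boundary with responses that maximize helpfulness within the safety policy's constraints. The authors report that the approach, incorporated into GPT-5, improves safety on dual-use prompts, reduces the severity of residual safety failures, and increases helpfulness.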

Paper Details

Citations

5 influential

Year

2025

arXiv:2508.09224 DOI:10.48550/arXiv.2508.09224 Semantic Scholar

Metadata

arXiv preprint·primary source

Abstract

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.

Cited by 1 page

Page	Type	Quality
Why Alignment Might Be Hard	Argument	69.0

Cached Content Preview

HTTP 200·Fetched Apr 9, 2026·0 KB

Conversion to HTML had a Fatal error and exited abruptly. This document may be truncated or damaged.
 
 

Resource ID: abf5f3cc624bed21 | Stable ID: sid_5pZOGrQpOz