From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
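Proposes safe-completions, an output-centric safety-training approach that replaces the binary comply-or-refuse boundary with training centered on the safety of the assistant's output. The authors report that, incorporated into GPT-5, it improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and increases helpfulness, which makes it directly relevant to alignment and control of AI systems.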
Paper Details
Metadata
Abstract
Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
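To make the contrast in the abstract concrete, the following is a minimal toy sketch, not the paper's implementation. It compares a binary, intent-based refusal rule with an output-centric score that rewards helpfulness only to the extent the output stays within policy. The `Candidate` type, the score ranges, and the multiplicative reward are all assumptions introduced here for illustration; the abstract does not specify a reward formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate response with hypothetical scores (not from the paper)."""
    text: str
    helpfulness: float  # assumed range [0, 1]
    safety: float       # assumed range [0, 1]; 1.0 means fully within policy


def refusal_boundary_response(intent_is_malicious: bool) -> str:
    """Binary, intent-based rule: fully comply or outright refuse."""
    return "refuse" if intent_is_malicious else "comply"


def output_centric_reward(candidate: Candidate) -> float:
    """Toy reward (assumed form): helpfulness scaled by output safety."""
    return candidate.helpfulness * candidate.safety


def best_completion(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate that is most helpful while remaining safe."""
    return max(candidates, key=output_centric_reward)


if __name__ == "__main__":
    # Three hypothetical responses to a dual-use prompt.
    candidates = [
        Candidate("refusal", helpfulness=0.1, safety=1.0),
        Candidate("high-level", helpfulness=0.6, safety=1.0),
        Candidate("actionable", helpfulness=0.9, safety=0.0),
    ]
    print(best_completion(candidates).text)
```

Under these made-up scores, the high-level answer wins over both the outright refusal and the detailed, actionable response, which is the behavior the abstract describes for dual-use prompts.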
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
Cached Content Preview
[2508.09224] Conversion to HTML failed with a fatal error. This document may be truncated or damaged.