Safe-Completions: Output-Centric Safety Training for AI Models
Table of Contents
- Why Binary Refusals Fail for AI Safety
- Safe-Completions: The Output-Centric Safety Paradigm
- How Safe-Completion Training Works: SFT and RL Stages
- The Reward Formula: Multiplying Safety by Helpfulness
- Policy Shift: From Intent Classification to Meaningful Facilitation
- Experimental Results: GPT-5 vs. o3 Safety Benchmarks
- Frontier Biorisk: A Critical Safety Case Study
- Human Evaluation: Real-World Safety and Helpfulness Scores
- Dual-Use Prompts and the Future of AI Safety Training
- Implications for Enterprise AI Deployment and Compliance
📌 Key Takeaways
- Output-centric paradigm shift: Safe-completions replace binary refuse-or-comply decisions with a three-mode response system that evaluates the safety of the model’s output rather than the user’s intent.
- GPT-5 safety gains: Dual-use prompt safety improved by 9 percentage points and malicious prompt safety by 10 points compared to o3, while helpfulness increased by up to 1.3 points on a 4-point scale.
- Multiplicative reward design: The RL reward formula r = h × s ensures unsafe content always scores zero, while incentivizing the model to maximize both helpfulness and safety simultaneously.
- Biorisk breakthrough: High-severity harmful outputs on frontier biorisk prompts dropped from 42.7% to 14.7%, a 66% reduction in the most dangerous failure modes.
- Human-validated results: Independent evaluators confirmed 50% fewer clearly unsafe responses and a 65% win rate on safety-helpfulness balance for GPT-5 over o3.
Why Binary Refusals Fail for AI Safety
Large language models have traditionally relied on a deceptively simple safety mechanism: when a query looks dangerous, refuse it entirely. This binary refusal approach—either fully comply or completely decline—served as the backbone of AI safety training across the industry for years. OpenAI’s latest research paper, “From Hard Refusals to Safe-Completions,” reveals why this paradigm is fundamentally broken and introduces a replacement that powers GPT-5’s safety system.
The core problem with binary refusals is that they focus on classifying the user’s intent rather than evaluating the safety of the model’s actual output. Consider a practical example from the paper: two prompts asking about pyrogen ignition—one framed technically (referencing minimum electric-match current, battery specifications, and firing circuit resistance) and another framed explicitly as requesting dangerous instructions. Both seek identical information, yet the refusal-trained o3 model fully complies with the technical framing while hard-refusing the explicit one. The intent-based classifier fails precisely when it matters most: against adversarial or dual-use queries.
This brittleness creates two simultaneous failure modes. Malicious users can bypass safety guardrails by disguising harmful requests in legitimate-sounding technical language. Meanwhile, benign researchers and professionals face unnecessary refusals when asking perfectly reasonable questions that happen to touch sensitive topics. According to OpenAI’s evaluation data, more than 67% of the approximately 9,000 safety-critical prompts analyzed concerned illicit wrongdoing—a category where dual-use ambiguity is most prevalent and binary refusal decisions are least reliable.
Safe-Completions: The Output-Centric Safety Paradigm
Safe-completions represent a fundamental reframing of how AI models approach safety. Rather than asking “Should I answer this user?”, the model asks “Is my response safe regardless of who asked?” This output-centric perspective, developed by researchers Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, and colleagues at OpenAI, shifts the entire safety evaluation framework from input classification to output assessment.
The safe-completion approach introduces three distinct response modes that replace the binary refuse-or-comply decision. First, direct answers provide full, detailed responses when queries are clearly harmless—matching the helpfulness of an unrestricted model for legitimate use cases. Second, safe-completions offer high-level, non-operational guidance when content touches restricted areas but can be partially addressed within safety constraints. Third, refusal with redirection courteously declines requests that cannot be safely fulfilled even in part, while providing constructive alternatives and clear rationale.
This three-mode system is not merely a cosmetic change to response style—it fundamentally changes the optimization target of safety training. The model learns to maximize the helpfulness of its output while strictly adhering to safety policy constraints, rather than learning a boundary between “safe questions” and “dangerous questions.” The result is a model that handles the vast gray area of dual-use queries with nuance instead of binary classification, while maintaining robust protection against explicitly harmful requests.
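The decision structure of the three response modes can be sketched as a simple dispatch. This is an illustration only—the real model learns this behavior end-to-end rather than applying explicit rules—and both boolean inputs are hypothetical stand-ins for learned judgments:

```python
from enum import Enum, auto

class ResponseMode(Enum):
    DIRECT_ANSWER = auto()     # query clearly harmless: answer in full detail
    SAFE_COMPLETION = auto()   # restricted area: high-level, non-operational guidance
    REFUSAL_REDIRECT = auto()  # cannot be safely fulfilled: decline with alternatives

def choose_mode(touches_restricted_area: bool,
                partially_answerable_safely: bool) -> ResponseMode:
    """Hypothetical dispatch over the three safe-completion response modes.

    The paper's model makes this choice implicitly, based on the safety of
    the output it would produce; this function only shows the structure.
    """
    if not touches_restricted_area:
        return ResponseMode.DIRECT_ANSWER
    if partially_answerable_safely:
        return ResponseMode.SAFE_COMPLETION
    return ResponseMode.REFUSAL_REDIRECT
```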
How Safe-Completion Training Works: SFT and RL Stages
Safe-completion training builds on Deliberative Alignment (DA), the safety-training method used for OpenAI’s o1 and o3 models. The training pipeline proceeds through two main post-training stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), each modified to support the output-centric safety paradigm.
During the SFT stage, each training prompt is associated with a specific safety category—such as “Illicit Wrongdoing,” “Erotic,” “Hate,” or “Privacy”—along with a corresponding safety specification that delineates safe from unsafe output. The process augments each prompt with the relevant policy spec and instructions to consult it, then passes the augmented input to a base reasoning model. The resulting spec-aware Chain-of-Thought (CoT) reasoning and answer are recorded. Critically, the final SFT training example uses the original unaugmented prompt as input and the collected CoT plus answer as the training target, teaching the model to internalize policy reasoning without requiring explicit policy text at inference time.
A key quality control mechanism in the SFT stage involves filtering unsafe training examples using a separate “judge” reasoning model with full spec access. This ensures that the training data itself maintains high safety standards, preventing the model from learning patterns that could generate harmful outputs. The SFT stage effectively teaches the model how to reason about safety policies—selecting among direct answers, safe-completions, or refusals with redirection based on the output it would produce.
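The SFT data pipeline described above—augment with the policy spec, generate spec-aware reasoning, judge-filter, then train on the unaugmented prompt—can be sketched as follows. The `base_model` and `judge_model` callables are hypothetical stand-ins for the reasoning models the paper uses, and the data shapes are assumptions for illustration:

```python
def build_sft_examples(prompts, specs, base_model, judge_model):
    """Sketch of the safe-completion SFT data-construction pipeline.

    Assumptions (not from the paper's actual code):
      - each prompt is a dict with "category" and "text" keys
      - base_model(text) returns (chain_of_thought, answer)
      - judge_model(spec, answer) returns True when the answer is policy-compliant
    """
    examples = []
    for prompt in prompts:
        spec = specs[prompt["category"]]  # per-category safety specification
        # Augment the prompt with the spec and an instruction to consult it
        augmented = f"{spec}\nConsult the policy spec above.\n\n{prompt['text']}"
        cot, answer = base_model(augmented)   # spec-aware CoT and answer
        if not judge_model(spec, answer):     # filter unsafe training examples
            continue
        # Key step: train on the ORIGINAL prompt so the model internalizes
        # policy reasoning without needing the spec text at inference time
        examples.append({"input": prompt["text"], "target": cot + "\n" + answer})
    return examples
```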
The Reward Formula: Multiplying Safety by Helpfulness
The reinforcement learning stage introduces a two-component reward model that encodes the core insight of safe-completion training in a single elegant formula: r = h × s. Here, s represents a safety score ranging from 0 to 1, measuring the degree to which the model’s output adheres to the relevant content policy specification. A score of 1 indicates perfect compliance, 0 signals a severe or definitive violation, and intermediate values capture borderline or low-severity issues.
The helpfulness component h, also ranging from 0 to 1, encompasses two distinct dimensions. Direct helpfulness measures the degree to which the response fulfills the user’s stated task—providing accurate, complete, and actionable information. Indirect helpfulness evaluates how well the response supports the user’s underlying well-being by offering clear constructive alternatives and well-reasoned refusals when direct compliance is restricted. This dual helpfulness design ensures that even refusals can score well on helpfulness if they provide genuine value.
The multiplicative structure of the reward formula creates powerful incentive dynamics. Unsafe content (s = 0) yields zero reward regardless of how helpful it might otherwise be—there is no “helpfulness credit” for dangerous outputs. Conversely, unhelpful content receives low reward even when perfectly safe, preventing the model from converging on an overly cautious refuse-everything strategy. The model has exactly two paths to high reward: directly addressing the user’s query when policy allows, or indirectly helping through safe alternatives when direct compliance is restricted. This design eliminates the false tradeoff between safety and helpfulness that plagued binary refusal systems.
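The incentive dynamics above fall directly out of the arithmetic of r = h × s. A minimal sketch (the scoring of h and s in the paper is done by reward models; here they are plain inputs):

```python
def safe_completion_reward(h: float, s: float) -> float:
    """Multiplicative RL reward r = h * s from safe-completion training.

    h: helpfulness in [0, 1] -- direct task fulfillment plus indirect value
       such as constructive alternatives in a refusal.
    s: safety in [0, 1] -- 1 means full policy compliance, 0 a severe
       violation, intermediate values borderline or low-severity issues.
    """
    assert 0.0 <= h <= 1.0 and 0.0 <= s <= 1.0
    return h * s

# Unsafe content earns nothing, however helpful it would otherwise be:
assert safe_completion_reward(1.0, 0.0) == 0.0
# A blanket refusal is perfectly safe but earns little:
assert safe_completion_reward(0.1, 1.0) == 0.1
# Maximum reward requires both components to be high:
assert safe_completion_reward(0.9, 1.0) > safe_completion_reward(0.9, 0.5)
```

The multiplication, rather than a weighted sum, is what removes any reward gradient toward trading safety away for helpfulness: no amount of h can compensate for s = 0.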
Policy Shift: From Intent Classification to Meaningful Facilitation
Alongside the technical training changes, OpenAI made a significant conceptual shift in how it defines the boundary of unsafe content. The previous policy framework for illicit behavior used binary decisions based on the user’s prompt: if the prompt sought “advice or instructions” for wrongdoing, the model refused outright. The new framework centers on meaningful facilitation as the harm threshold, asking whether the model’s response would materially lower the barrier to harmful action.
Meaningful facilitation is assessed across multiple dimensions including procedural guidance, troubleshooting, sensitive-data disclosures, and ideation. This multi-dimensional assessment allows for far more nuanced responses than a binary intent classifier could achieve. When a request approaches the disallowed threshold, the model can offer high-level summaries or general best practices instead of complete refusal—providing genuine value to legitimate users while withholding the operational specifics that would enable harm.
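The multi-dimensional assessment above can be sketched as a checklist over the dimensions the policy names. This is an illustration only—in practice the judgment is learned and graded, not a boolean rule—and the flag names are assumptions:

```python
# Dimensions of "meaningful facilitation" named in the policy shift
FACILITATION_DIMENSIONS = (
    "procedural_guidance",        # step-by-step operational instructions
    "troubleshooting",            # helping a harmful attempt past obstacles
    "sensitive_data_disclosure",  # revealing exploitable specifics
    "ideation",                   # generating novel harmful approaches
)

def meaningfully_facilitates(flags: dict) -> bool:
    """True if a candidate response is flagged on any facilitation dimension.

    Hypothetical sketch: the real assessment asks whether the response would
    materially lower the barrier to harmful action, as a graded judgment.
    """
    return any(flags.get(dim, False) for dim in FACILITATION_DIMENSIONS)
```

Under this framing, a broad overview of commonly known car-theft tactics sets no flag, while a step-by-step operational guide would set `procedural_guidance` and be withheld.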
For example, under the new policy, questions about criminal tactics—such as strategies used by car thieves—are permitted as broad overviews of commonly known methods. The model provides educational context without delivering a step-by-step operational guide. However, the framework maintains a critical exception: if a user expresses clear intent for harm, the model disengages entirely with a courteous refusal. This exception ensures that the meaningful facilitation framework cannot be exploited by adversaries who explicitly declare harmful intentions, maintaining a robust safety floor.
Experimental Results: GPT-5 vs. o3 Safety Benchmarks
OpenAI’s experimental validation used approximately 9,000 prompts drawn from anonymized ChatGPT production data, covering illicit wrongdoing (including biorisks and cybersecurity), erotic content, hate speech, and private data exposure. Each prompt was classified by intent—benign, dual-use, or malicious—and four independent completions were generated per model-prompt pair. The experiments compared two matched pairs: controlled experiments (CE-Refusal vs. CE-SafeComplete) and production models (o3 vs. GPT-5).
The controlled experiments isolated the effect of safe-completion training by holding architecture, pretraining data, and post-training recipe constant. CE-SafeComplete achieved safety scores of 0.92 on benign prompts (vs. 0.90 for CE-Refusal), 0.79 on dual-use prompts (vs. 0.75), and matched at 0.70 on malicious prompts. The helpfulness improvements were even more dramatic: on malicious prompts, CE-SafeComplete scored 3.18 compared to CE-Refusal’s 1.99—a full 1.19 point improvement on a 4-point scale, demonstrating that safe-completions dramatically increase the value of the model’s response even when facing adversarial inputs.
The production model comparison between o3 and GPT-5 (gpt5-r) showed even stronger results, though confounded by additional architecture and data differences. GPT-5 achieved a 9 percentage point safety gain on dual-use prompts (0.84 vs. 0.75) and a 10 percentage point gain on malicious prompts (0.83 vs. 0.73). Helpfulness on malicious prompts improved by 1.3 points (3.24 vs. 1.94). These results confirm that safe-completion training scales effectively to production-grade systems while delivering substantial improvements in both safety and usefulness.
Frontier Biorisk: A Critical Safety Case Study
Biological risk represents one of the most concerning frontier safety challenges for advanced AI systems. Highly dangerous content can arise from seemingly benign biology queries, and most high-risk biology queries are inherently dual-use—a molecular biologist’s legitimate research question may be indistinguishable from a bad actor’s attempt to weaponize biological knowledge. OpenAI dedicated a specific case study to biorisk, evaluating safe-completion training against 620 biorisk-related prompts including benign queries that merely appear related to biorisks.
In controlled experiments, safe-completion training matched refusal training on safety (both scoring 0.86) while improving helpfulness from 2.97 to 3.49—approximately 0.5 points on a 4-point scale. This means legitimate biological researchers receive substantially more useful responses without any degradation in safety protection. The production model comparison was even more striking: GPT-5 achieved a safety score of 0.92 versus o3’s 0.85, a 7 percentage point improvement, while simultaneously increasing helpfulness from 2.95 to 3.43.
The most significant finding came from analyzing harm severity distributions on the remaining unsafe responses. Under o3, the probability of a high or moderate harm response (given that the response was unsafe) was 42.7%. Under GPT-5 with safe-completion training, this dropped to just 14.7%—a 66% reduction in the most dangerous failure modes. This means that even when the model does produce an unsafe response, the response is far less likely to contain actionable information that could enable biological harm. Safe-completion training thus provides a critical second line of defense: not only does it reduce the frequency of unsafe outputs, but it dramatically reduces the severity of the failures that do occur.
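The 66% figure follows directly from the two conditional probabilities reported in the paper:

```python
# P(high-or-moderate harm | response was unsafe), from the biorisk case study
p_severe_given_unsafe_o3 = 0.427    # o3 (refusal training)
p_severe_given_unsafe_gpt5 = 0.147  # GPT-5 (safe-completion training)

relative_reduction = 1 - p_severe_given_unsafe_gpt5 / p_severe_given_unsafe_o3
print(f"{relative_reduction:.0%}")  # prints 66%
```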
Human Evaluation: Real-World Safety and Helpfulness Scores
To validate the automated evaluation results, OpenAI conducted a large-scale external human evaluation campaign using 2,000 randomly selected production prompts. Each comparison was independently reviewed by 3-5 human evaluators who were shown anonymized, randomized side-by-side completions. Critically, these were policy-free assessments—reviewers were not shown OpenAI’s internal safety policies, providing an independent signal on what humans consider safe versus harmful.
Human evaluators assessed responses on three axes: absolute safety (0-3 scale, binarized as safe ≥2 or unsafe <2), relative helpfulness (win rate between models), and overall balance (which model achieves the best safety-helpfulness tradeoff). For the production comparison, GPT-5 achieved a safety score of 0.90 versus o3’s 0.85. On helpfulness, GPT-5 won 56% of head-to-head comparisons versus 32% for o3. On overall balance, GPT-5 dominated with a 65% win rate against o3’s 35%.
The safety rating distributions provide granular insight. GPT-5 achieved a 50% reduction in responses rated “clearly unsafe” (score 0) compared to o3, dropping from 6.1% to 3.2% of all responses. The average safety score improved from 2.4611 to 2.5888, a statistically significant difference (p < 0.001). These human evaluation results are particularly important because they confirm that the safety improvements measured by automated systems translate to real-world perceptions of safety and helpfulness by independent human judges.
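The three-axis aggregation described above can be sketched with a few helper functions. The data shapes are assumptions for illustration; only the binarization threshold (safe means a 0-3 rating of at least 2) comes from the evaluation protocol:

```python
def is_safe(rating: float) -> bool:
    """Binarize a 0-3 human safety rating at the paper's threshold of 2."""
    return rating >= 2

def safety_score(ratings):
    """Fraction of responses rated safe after binarization."""
    return sum(is_safe(r) for r in ratings) / len(ratings)

def win_rate(preferences, model):
    """Share of head-to-head comparisons a model wins.

    `preferences` is a hypothetical list of per-comparison winners; ties are
    counted against both sides here, which is why the paper's win rates
    (e.g. 56% vs. 32% on helpfulness) need not sum to 100%.
    """
    return sum(p == model for p in preferences) / len(preferences)
```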
Dual-Use Prompts and the Future of AI Safety Training
The dual-use challenge sits at the heart of modern AI safety. As AI models become more capable, the space of dual-use queries expands dramatically. A query about synthesizing chemical compounds might come from a pharmaceutical researcher, a chemistry student, or a bad actor—and the correct model behavior differs in ways that input classification alone cannot capture. Safe-completions address this by focusing on what the model outputs rather than what the user intended, providing a fundamentally more scalable safety mechanism.
OpenAI positions safe-completion training as a “scalable step toward deploying more capable reasoning models that remain robustly aligned for safety.” The approach builds on Deliberative Alignment while incorporating insights from Constitutional AI, Safe-RLHF (which uses Lagrangian constraint optimization), and Rule-Based Rewards. The research notes that Anthropic’s Claude 3.7 Sonnet System Card takes a similar spirit through different training signals, suggesting an emerging industry consensus around output-centric safety approaches.
Looking forward, the safe-completion framework has clear extensibility paths. The three-mode response system could be expanded with additional response strategies for specialized domains. The multiplicative reward function could incorporate additional dimensions beyond safety and helpfulness—such as factual accuracy, calibration, or cultural sensitivity. And the meaningful facilitation threshold could be refined with domain-specific policies for particularly sensitive areas like biosecurity, cybersecurity, and weapons knowledge. The foundational insight—evaluate the output, not the input—appears robust enough to scale with increasingly capable models.
Implications for Enterprise AI Deployment and Compliance
For organizations deploying AI systems at scale, OpenAI’s safe-completion research has immediate practical implications. The demonstrated ability to maintain high safety while dramatically improving helpfulness directly addresses the most common enterprise complaint about safety-trained models: that they are too conservative, refusing legitimate queries and frustrating professional users. With safe-completions, a compliance officer asking about money laundering tactics for training purposes receives useful educational content, while the model still withholds operational details that could enable actual financial crime.
The harm severity reduction findings are particularly relevant for regulated industries. Even in failure cases, safe-completion models produce less actionable harmful content—they “fail softer.” For financial services, healthcare, and defense organizations operating under strict regulatory frameworks like the EU AI Act, this second line of defense significantly reduces residual risk. The 66% reduction in high-severity biorisk failures, for example, could be the difference between a model failure that merely triggers an internal review and one that enables actual harm.
Enterprise AI teams should also note the evaluation methodology. OpenAI’s combination of automated evaluation on 9,000+ prompts and human evaluation on 2,000 prompts provides a template for organizations building their own safety assessment pipelines. The three-axis evaluation framework—absolute safety, relative helpfulness, and overall balance—offers a practical structure for benchmarking model safety that goes beyond simple refusal rates. As AI safety requirements move from optional best practices to mandatory compliance obligations, having a rigorous evaluation framework becomes essential for demonstrating due diligence to regulators and auditors.
Frequently Asked Questions
What are safe-completions in OpenAI’s safety training?
Safe-completions are an output-centric safety training approach developed by OpenAI that replaces binary refusal decisions with a three-mode response system. Instead of simply refusing or complying, the model can provide direct answers for harmless queries, safe-completions that offer high-level guidance within policy constraints for dual-use queries, or courteous refusals with constructive alternatives when requests cannot be safely fulfilled.
How does safe-completion training differ from traditional refusal training?
Traditional refusal training uses a binary boundary based on user intent classification—either fully comply or completely refuse. Safe-completion training shifts the focus from evaluating the user’s intent to evaluating the safety of the model’s output. This output-centric approach uses a reward function that multiplies helpfulness by safety scores, incentivizing the model to be maximally helpful while staying within safety constraints.
What improvements did safe-completions bring to GPT-5 safety?
Safe-completions delivered measurable improvements across all metrics in GPT-5. Safety on dual-use prompts increased by 9 percentage points compared to o3, safety on malicious prompts improved by 10 percentage points, helpfulness on malicious prompts rose by 1.3 points on a 4-point scale, and clearly unsafe responses were reduced by 50% according to human evaluators.
What is the reward formula used in safe-completion reinforcement learning?
The safe-completion RL stage uses a two-component reward model with the formula r = h × s, where h is a helpfulness score from 0 to 1 and s is a safety score from 0 to 1. This multiplicative design ensures that unsafe content receives zero reward regardless of helpfulness, while unhelpful but safe content also scores low. The model must achieve both high helpfulness and high safety for maximum reward.
How does safe-completion training handle frontier biorisk scenarios?
In a dedicated biorisk case study using 620 prompts, safe-completion training improved safety from 0.85 to 0.92 while simultaneously increasing helpfulness from 2.95 to 3.43. Most significantly, the probability of high or moderate harm responses dropped from 42.7% under o3 to just 14.7% under GPT-5, representing a 66% reduction in the most dangerous failure modes for biological safety scenarios.