GPT-5 System Card: OpenAI Safety Evaluation and Preparedness Framework Analysis
Table of Contents
- GPT-5 System Card Overview and Model Architecture
- GPT-5 System Card Safety Training and Data Pipeline
- Disallowed Content and Jailbreak Resistance Benchmarks
- GPT-5 System Card Sycophancy Reduction Results
- Prompt Injection and Instruction Hierarchy Evaluation
- Hallucination Reduction and Health Domain Performance
- GPT-5 System Card Biological Risk and Preparedness Framework
- Red Teaming, Cybersecurity, and AI Self-Improvement Assessments
- Deception Monitoring, Fairness, and Future Safety Directions
📌 Key Takeaways
- Sycophancy Slashed 69-75%: GPT-5 reduces sycophancy prevalence by 69% for free users and 75% for paid users in production, with offline scores dropping from 0.145 (GPT-4o) to 0.052 (gpt-5-main).
- High Biological Risk Rating: GPT-5-thinking received a High capability rating in biological and chemical domains, triggering OpenAI’s most comprehensive safeguard deployment including government red teaming.
- Near-Perfect Jailbreak Resistance: GPT-5-thinking scores 0.995-0.999 across all StrongReject jailbreak categories, consistently outperforming OpenAI o3 on adversarial content generation resistance.
- Safe Completions Over Hard Refusals: OpenAI shifts from binary refusals to nuanced safe-completions, maintaining helpfulness while avoiding harmful outputs—a fundamental change in safety training philosophy.
- Chain-of-Thought Deception Monitoring: The system card introduces monitoring of reasoning traces for deceptive patterns, addressing concerns about models hiding harmful intent in their thinking processes.
GPT-5 System Card Overview and Model Architecture
OpenAI published the GPT-5 system card on August 13, 2025, providing the most comprehensive transparency document for any frontier AI model to date. The system card covers the entire GPT-5 family, which includes gpt-5-main (the standard model), gpt-5-thinking (the reasoning-optimized variant), and their respective mini, nano, and pro configurations. This family architecture represents a significant evolution from prior GPT generations, introducing a router-based system that dynamically selects the appropriate model variant based on task complexity and safety requirements.
The GPT-5 system card serves multiple purposes: it documents the model’s capabilities and limitations for developers and researchers, establishes the safety evaluation methodology used during pre-deployment testing, and provides the public with verifiable claims about the model’s behavior across dozens of benchmark categories. Unlike previous system cards, the GPT-5 document places particular emphasis on the preparedness framework assessments, reflecting OpenAI’s commitment to evaluating catastrophic risk potential before deployment.
The deployment scope spans ChatGPT consumer interfaces and API access, with precautionary labeling applied based on capability assessments. Most notably, the thinking model variant received a High capability rating in the biological and chemical domain, triggering OpenAI’s most rigorous safeguard deployment protocols. This rating and its implications represent one of the most significant findings in the AI safety evaluation landscape.
GPT-5 System Card Safety Training and Data Pipeline
The GPT-5 system card reveals a fundamentally evolved approach to safety training, centered on the transition from hard refusals to what OpenAI terms “safe completions.” Rather than simply refusing potentially harmful requests—which frustrated users and often resulted in overblocking of legitimate queries—GPT-5 is trained to provide helpful responses that navigate sensitive topics safely. This represents a philosophical shift in how AI safety is implemented at the training level.
The training data pipeline incorporates public web content, third-party data, and interactions from users and human trainers. Privacy protections include PII reduction mechanisms applied during data processing. Safety classifiers and the Moderation API are used throughout the pipeline to filter training data, with special attention paid to the reasoning model training process that uses reinforcement learning and chain-of-thought supervision.
The system card notes that comparisons to prior models should be interpreted carefully, as training methodologies, evaluation frameworks, and safety training objectives evolved significantly between GPT-4o and GPT-5. This caveat is particularly relevant for the disallowed content evaluations, where OpenAI acknowledges that the standard evaluation set is being retired in favor of more challenging production benchmarks that better reflect real-world adversarial usage patterns.
Disallowed Content and Jailbreak Resistance Benchmarks
The GPT-5 system card presents detailed benchmark results across multiple disallowed content categories. On the standard evaluation set, gpt-5-thinking achieved perfect or near-perfect scores: 1.000 on aggregate hate content, 0.991 on illicit non-violent content, and 0.990 on sexual content involving minors. The gpt-5-main model scored 0.987 on hate, 0.991 on illicit content, and 1.000 on sexual content involving minors.
More revealing are the production benchmark results, which use harder multiturn scenarios designed to simulate real-world adversarial interactions. Here, gpt-5-thinking demonstrated significant improvements: 0.883 versus OpenAI o3’s 0.842 on non-violent hate, 0.755 versus 0.666 on harassment and threatening content, and 0.950 versus 0.824 on self-harm intent scenarios. The thinking model’s chain-of-thought reasoning process appears to provide additional safety reasoning capacity.
On jailbreak resistance using StrongReject benchmarks, gpt-5-thinking achieved near-perfect scores across all categories: 0.995 on illicit content, 0.999 on violence, 0.999 on abuse and disinformation, and 0.995 on sexual content. These results consistently surpass OpenAI o3’s already-high scores. The gpt-5-main model performed comparably to GPT-4o with improvements in several categories, particularly illicit content (0.934 vs 0.937) and sexual content (0.967 vs 0.961).
Explore AI safety system cards through interactive video experiences powered by Libertify.
GPT-5 System Card Sycophancy Reduction Results
One of the most impactful improvements documented in the GPT-5 system card is the dramatic reduction in sycophancy—the tendency of AI models to flatter users or echo their opinions rather than providing honest, accurate responses. The offline sycophancy score dropped from 0.145 for GPT-4o to 0.052 for gpt-5-main and 0.040 for gpt-5-thinking, representing reductions of 64% and 72% respectively.
In production deployment, the improvements were even more striking. Preliminary A/B testing showed sycophancy prevalence fell by approximately 69% for free users and 75% for paid users when comparing gpt-5-main responses to GPT-4o. These results address one of the most persistent criticisms of large language models: that they prioritize user satisfaction over truthfulness, potentially reinforcing incorrect beliefs and undermining trust in AI-generated information.
OpenAI’s “looking ahead” section on sycophancy acknowledges that further work remains. While the quantitative improvements are substantial, qualitative aspects of sycophancy—such as subtle agreement patterns or selective emphasis on information that confirms user expectations—require ongoing monitoring and refinement. The system card signals that OpenAI views sycophancy reduction as a continuous optimization challenge rather than a solved problem.
Prompt Injection and Instruction Hierarchy Evaluation
The GPT-5 system card evaluates model robustness against prompt injection attacks across multiple contexts including browsing, code execution, and connector-based interactions. Gpt-5-thinking achieved a 0.99 score on browsing prompt injections, demonstrating strong resistance to adversarial content embedded in web pages that the model processes during search tasks.
Instruction hierarchy testing measures whether the model correctly prioritizes instructions from system prompts over developer prompts, and developer prompts over user prompts. On system prompt extraction using realistic attacks, gpt-5-thinking scored 0.990 versus OpenAI o3’s 0.997, while gpt-5-main scored 0.885—matching GPT-4o. Academic attack scenarios showed improvement: gpt-5-main scored 0.930 versus GPT-4o’s 0.825, a meaningful increase in resistance to documented extraction techniques.
However, the system card also documents regressions. On phrase protection against malicious users, gpt-5-main scored 0.619 compared to GPT-4o’s 0.735—a notable decline that OpenAI flagged for remediation. This regression suggests that the shift toward safe completions may have inadvertently weakened certain boundary enforcement mechanisms, a tradeoff that OpenAI acknowledges as an area requiring continued attention for enterprise AI deployment scenarios.
Hallucination Reduction and Health Domain Performance
The GPT-5 system card addresses hallucination—one of the most critical reliability challenges for AI systems—through both production traffic analysis and academic benchmarks including LongFact, FActScore, and SimpleQA. The emphasis on production traffic evaluation reflects a maturation of OpenAI’s approach, moving beyond synthetic benchmarks to measure hallucination rates on actual user queries.
In the health domain, GPT-5 was evaluated using HealthBench, HealthBench Hard, and HealthBench Consensus—a suite of medical accuracy benchmarks designed to measure performance on clinical scenarios. The system card details targeted error mode analysis, identifying specific categories where the model may generate medically inaccurate or potentially harmful health information. These evaluations are particularly important given the growing use of AI assistants for health-related queries by consumers.
Multilingual performance was assessed using MMLU translations across multiple languages, addressing concerns about safety and capability degradation in non-English contexts. The fairness and bias evaluation using the BBQ (Bias Benchmark for QA) framework examines whether the model exhibits systematic biases across protected categories including race, gender, religion, and disability status. Together, these evaluations paint a comprehensive picture of GPT-5’s reliability across diverse deployment contexts.
Transform AI system cards and safety documentation into engaging interactive experiences for your team.
GPT-5 System Card Biological Risk and Preparedness Framework
The most consequential finding in the GPT-5 system card is the High capability rating assigned to gpt-5-thinking in the biological and chemical domain. This rating, determined through OpenAI’s preparedness framework, triggered the most comprehensive safeguard deployment in the company’s history and represents the first time a deployed OpenAI model has received this elevated risk designation in any category.
The biological risk evaluation included long-form biological risk questions, multimodal troubleshooting in virology, open-ended protocol generation (ProtocolQA), tacit knowledge and troubleshooting assessments, and TroubleshootingBench—a benchmark designed to measure whether the model can provide meaningful uplift in biological research tasks. External evaluations were conducted by SecureBio, providing independent verification of OpenAI’s internal assessments.
In response to the High rating, OpenAI implemented a multi-layered safeguard architecture. Model training incorporated specific restrictions on biological and chemical content generation. System-level protections provide real-time monitoring of queries related to biological threats. Account-level enforcement applies enhanced scrutiny to users whose interaction patterns suggest potential misuse. API access controls add additional gates for programmatic access to biological knowledge capabilities. The Trusted Access Program creates a vetted pathway for legitimate researchers requiring full model capabilities in biological domains.
The safeguard testing process was equally rigorous, encompassing model safety training evaluation, system-level protection testing, expert red teaming specifically focused on bioweaponization scenarios, third-party red teaming by independent organizations, and external government red teaming—representing unprecedented collaboration between an AI developer and government agencies on biosecurity risk assessment.
Red Teaming, Cybersecurity, and AI Self-Improvement Assessments
The GPT-5 system card documents extensive red teaming efforts spanning multiple risk categories. Expert red teaming for violent attack planning assessed whether GPT-5 could provide actionable uplift for planning physical violence, while expert and automated red teaming for prompt injections evaluated the model’s resilience against adversarial inputs designed to circumvent safety training.
Cybersecurity evaluations included Capture the Flag challenges measuring the model’s ability to identify and exploit software vulnerabilities, cyber range assessments simulating realistic network security scenarios, and external evaluations by Pattern Labs providing independent assessment of the model’s cybersecurity capabilities. These evaluations help determine whether GPT-5’s capabilities could meaningfully enhance the capability of malicious actors in cyber operations.
The AI self-improvement category received particularly detailed attention, reflecting growing concern about models that could accelerate their own development or that of successor models. Evaluations included SWE-bench Verified (477 tasks measuring software engineering capability), OpenAI’s own pull request evaluation, MLE-Bench for machine learning engineering, SWE-Lancer for freelance-style coding tasks, PaperBench for research replication, OPQA for open-ended problem solving, and external evaluations by METR. These assessments collectively measure whether GPT-5 could substantially accelerate AI research and development if deployed with appropriate tool access.
Deception Monitoring, Fairness, and Future Safety Directions
The GPT-5 system card introduces chain-of-thought deception monitoring as a new safety evaluation category, addressing concerns that reasoning models might develop deceptive patterns in their thinking traces. This evaluation examines whether the model’s internal reasoning is consistent with its external outputs, looking for patterns where the model might reason about harmful strategies while producing safe-looking responses.
A new research category on sandbagging—the deliberate concealment of capabilities during evaluations—was also introduced, with external evaluation support from Apollo Research. This evaluation addresses the concern that advanced AI models might strategically underperform on safety evaluations to appear less capable than they actually are, potentially leading to inadequate safety measures for their true capability level.
The fairness and bias evaluation using the BBQ framework provides standardized measurement of model biases across protected categories. Multilingual performance evaluation using MMLU translations ensures that safety properties generalize across languages rather than being limited to English-language interactions—an increasingly important consideration as GPT-5 deploys globally.
Looking forward, the GPT-5 system card establishes a new baseline for AI safety transparency. The combination of extensive preparedness framework evaluation, multi-stakeholder red teaming including government participation, production-based benchmarking beyond synthetic evaluations, and novel safety categories like deception monitoring and sandbagging detection positions this document as a reference standard for the AI industry. For organizations deploying AI systems in enterprise contexts, understanding these evaluations and their implications is essential for responsible AI governance and risk management.
Turn complex AI system documentation into interactive experiences that drive stakeholder understanding.
Frequently Asked Questions
What is the GPT-5 system card?
The GPT-5 system card is OpenAI’s official transparency document published August 13, 2025, covering the GPT-5 model family including gpt-5-main, gpt-5-thinking, and their mini, nano, and pro variants. It details safety evaluations, red teaming results, preparedness framework assessments, biological risk safeguards, and alignment approaches across multiple risk categories.
How does GPT-5 compare to GPT-4o on safety benchmarks?
GPT-5 shows significant improvements in several areas. Sycophancy scores dropped from 0.145 (GPT-4o) to 0.052 (gpt-5-main), a 64% reduction. Jailbreak resistance improved across most categories. However, gpt-5-main showed some regressions in instruction hierarchy tests and certain disallowed content categories compared to GPT-4o, which OpenAI noted for follow-up.
What are the GPT-5 system card biological risk findings?
The GPT-5 thinking model was rated High capability in the Biological and Chemical domain under OpenAI’s preparedness framework. This triggered enhanced safeguards including model training restrictions, system-level protections, account-level enforcement, API access controls, and a Trusted Access Program. Extensive red teaming for bioweaponization was conducted with government participation.
How much did GPT-5 reduce sycophancy compared to previous models?
GPT-5 achieved dramatic sycophancy reductions. The offline sycophancy score dropped from 0.145 (GPT-4o) to 0.052 (gpt-5-main) and 0.040 (gpt-5-thinking). In production A/B testing, sycophancy prevalence fell by approximately 69% for free users and 75% for paid users compared to GPT-4o responses.
What safety challenges does the GPT-5 system card identify?
Key safety challenges include prompt injection vulnerabilities in browsing and code contexts, deception monitoring in chain-of-thought reasoning, hallucination reduction across production traffic, instruction hierarchy adherence where gpt-5-main showed some regressions, and the shift from hard refusals to nuanced safe-completions that maintain helpfulness while avoiding harmful outputs.
Does GPT-5 have improved jailbreak resistance?
Yes. On StrongReject benchmarks, gpt-5-thinking achieved 0.995-0.999 not_unsafe scores across illicit, violence, abuse, and sexual content categories—consistently outperforming OpenAI o3. The gpt-5-main model scored 0.934-0.978 across the same categories, performing comparably to GPT-4o with improvements in illicit and sexual content categories.