GPT-4.5 System Card — OpenAI Largest Pre-Trained Model Safety Evaluations and Risk Assessment
Table of Contents
- What Is GPT-4.5 and Why It Matters
- Model Training — Scaling Unsupervised Learning
- New Alignment Techniques for Natural Interaction
- Safety Evaluations — Disallowed Content and Refusals
- Jailbreak Resistance and Adversarial Robustness
- Hallucination Reduction — A Measurable Breakthrough
- Fairness, Bias, and Demographic Safety Testing
- Instruction Hierarchy and Prompt Injection Defense
- Frontier Risk Assessment Under the Preparedness Framework
- Deployment Strategy and Implications for AI Safety
📌 Key Takeaways
- Largest pre-trained model: GPT-4.5 scales unsupervised learning beyond GPT-4o, designed to be more general-purpose than STEM-focused reasoning models with improved world knowledge
- Hallucination breakthrough: PersonQA accuracy jumped from 50% (GPT-4o) to 78% (GPT-4.5) while hallucination rate dropped from 30% to 19% — a measurable safety improvement
- Safety parity maintained: Despite being the largest model, GPT-4.5 matches or exceeds GPT-4o safety levels with 99% unsafe content refusal and 99% jailbreak resistance
- Emotional intelligence: New alignment techniques create warmer, more intuitive interactions — the model knows when to advise, defuse frustration, or simply listen
- Low frontier risk: Preparedness Framework assessment shows Low risk for CBRN and model autonomy, Medium for persuasion — no significant increase from scaling
What Is GPT-4.5 and Why It Matters
OpenAI GPT-4.5 represents the organization’s most ambitious pre-training effort to date — the largest and most knowledgeable model released as a research preview in February 2025. Building on the GPT-4o foundation, GPT-4.5 pushes the boundaries of unsupervised learning rather than the chain-of-thought reasoning paradigm that powered models like o1 and o3.
This distinction matters fundamentally. While scaling chain-of-thought reasoning teaches models to “think before responding” for complex STEM and logic problems, scaling unsupervised learning achieves something different: it increases world model accuracy, decreases hallucination rates, and improves associative thinking. GPT-4.5 is OpenAI’s definitive step in proving that both paradigms can advance simultaneously.
Early testing reveals that interacting with GPT-4.5 feels qualitatively different from its predecessors. Its broader knowledge base, stronger alignment with user intent, and improved emotional intelligence make it particularly well-suited for writing, programming, and solving practical problems — all with meaningfully fewer hallucinations. The GPT-4.5 System Card documents the comprehensive safety evaluation process ensuring that this leap in capability doesn’t come at the cost of increased risk.
Model Training — Scaling Unsupervised Learning
GPT-4.5 was pre-trained and post-trained on diverse datasets including publicly available data, proprietary data from data partnerships, and custom datasets developed in-house. These collectively contribute to the model’s robust conversational capabilities and expansive world knowledge. The data processing pipeline incorporates rigorous filtering to maintain quality and mitigate potential risks.
OpenAI employs advanced data filtering processes to reduce processing of personal information during training. A combination of the Moderation API and safety classifiers prevents the use of harmful or sensitive content, including explicit materials. This multi-layered data hygiene approach addresses concerns about training data contamination while preserving the breadth of knowledge needed for genuine general-purpose capability.
The training methodology combines traditional approaches — supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) — with new supervision techniques developed specifically for GPT-4.5. This hybrid approach enables the model to benefit from proven safety training methods while incorporating innovations that improve naturalness and reduce hallucination rates.
Understanding how these training paradigms interact is critical for the broader AI industry. Explore how AI model capabilities are transforming enterprise workflows across industries.
New Alignment Techniques for Natural Interaction
As models scale and solve broader, more complex problems, teaching them a deeper understanding of human needs and intent becomes increasingly important. For GPT-4.5, OpenAI developed new scalable alignment techniques that enable training larger and more powerful models using data derived from smaller models — a significant methodological advancement.
These techniques improve three critical dimensions of interaction quality. Steerability — the model responds more precisely to nuanced instructions and adapts its behavior to context. Understanding of nuance — GPT-4.5 grasps subtleties in communication that earlier models would miss, including tone, implication, and emotional subtext. Natural conversation — interactions feel less mechanical and more like engaging with a knowledgeable, empathetic collaborator.
Internal testers describe GPT-4.5 as warm, intuitive, and natural. When confronted with emotionally-charged queries, the model demonstrates understanding of when to offer practical advice, when to defuse frustration, and when to simply listen — a form of emotional intelligence that previous models lacked. GPT-4.5 also shows stronger aesthetic intuition and creativity, excelling at helping users with creative writing and design tasks.
Transform complex AI research papers into interactive experiences your team will engage with
Safety Evaluations — Disallowed Content and Refusals
OpenAI’s safety evaluations for GPT-4.5 build on extensive prior work and leverage advancements across the language model safety field. The evaluation suite measures the model’s propensity to generate disallowed content, performance on demographic fairness tasks, tendency to hallucinate, and presence of dangerous capabilities.
Across standard and challenging refusal evaluations, GPT-4.5 demonstrates robust safety performance:
| Evaluation | Metric | GPT-4o | GPT-4.5 |
|---|---|---|---|
| Standard Refusal | not_unsafe | 0.98 | 0.99 |
| Standard Refusal | not_overrefuse | 0.71 | 0.71 |
| Challenging Refusal | not_unsafe | 0.83 | 0.85 |
| WildChat Toxic | not_unsafe | 0.945 | 0.98 |
| XSTest Overrefusal | not_overrefuse | 0.89 | 0.85 |
GPT-4.5 achieves 99% unsafe content refusal on standard evaluations and improves to 98% on the challenging WildChat toxic conversation dataset (up from GPT-4o’s 94.5%). The overrefusal challenge remains: GPT-4.5 matches GPT-4o on standard overrefusal but shows slightly higher overrefusal on XSTest edge cases, reflecting the inherent tension between safety and usability.
Multimodal evaluations (text + image inputs) show a similar pattern: 99% unsafe content refusal matches GPT-4o’s performance, but overrefusal is notably higher at 31% non-overrefuse versus GPT-4o’s 48% — indicating the model is particularly cautious with visual inputs.
Jailbreak Resistance and Adversarial Robustness
Jailbreak evaluations test the model’s resilience against adversarial prompts specifically designed to circumvent safety refusals. GPT-4.5 demonstrates strong jailbreak resistance across multiple evaluation benchmarks:
- Human-sourced jailbreaks: 99% resistance — the highest among tested models, exceeding both GPT-4o (97%) and o1 (97%)
- StrongReject benchmark: goodness@0.1 score of 0.34, performing comparably to GPT-4o (0.37) though behind o1’s superior 0.87
The StrongReject benchmark applies the top 10% of jailbreak techniques per prompt, providing a rigorous test of adversarial resilience. While GPT-4.5 shows strong performance on human-sourced attacks, the StrongReject results suggest room for improvement against academic attack methodologies — an active area of ongoing safety research.
The model’s robustness against production-quality jailbreaks (99% resistance) is particularly significant for real-world deployment, as these tests reflect the actual attack patterns encountered in production ChatGPT environments rather than theoretical vulnerabilities.
Hallucination Reduction — A Measurable Breakthrough
Perhaps the most significant safety improvement in GPT-4.5 is its dramatic reduction in hallucinations. On the PersonQA benchmark — designed to measure accuracy on questions about publicly known facts and the model’s tendency to fabricate information — GPT-4.5 delivers transformative results:
- Accuracy: 78% — a massive improvement from GPT-4o’s 50% and o1’s 55%
- Hallucination rate: 19% — down from GPT-4o’s 30% and matching o1’s 20%
This represents more than an incremental improvement. GPT-4.5’s hallucination rate is 37% lower than GPT-4o’s, while simultaneously achieving 56% higher accuracy. The reduction stems directly from scaled unsupervised learning increasing the model’s world model accuracy — when the model has a more comprehensive understanding of factual relationships, it fabricates less.
OpenAI acknowledges that more work is needed to understand hallucinations holistically, particularly in domains not covered by PersonQA (such as chemistry, where verification is more complex). Nevertheless, the PersonQA results validate the theoretical prediction that scaling pre-training reduces hallucination — a finding with profound implications for the trustworthiness of AI-generated content across all applications. See how organizations are building governance frameworks to manage AI output reliability.
See how Libertify makes AI safety documentation interactive and easy to understand
Fairness, Bias, and Demographic Safety Testing
GPT-4.5 was evaluated on the BBQ (Bias Benchmark for QA) dataset, which assesses whether known social biases override the model’s ability to produce correct answers. The evaluation tests two scenarios: ambiguous contexts where the correct answer is “unknown” because insufficient information is available, and unambiguous questions where the answer is clear but a biased confounder is present.
On ambiguous questions, GPT-4.5 achieves 95% accuracy — performing similarly to GPT-4o (97%) and o1 (96%). On unambiguous questions with biased confounders, GPT-4.5 scores 74% accuracy, comparable to GPT-4o (72%) but behind o1’s impressive 93%. The P(not-stereotype | not unknown) metric shows GPT-4.5 at 0.20, higher than both GPT-4o (0.06) and o1 (0.05), suggesting some tendency toward stereotypical reasoning in ambiguous scenarios that warrants continued attention.
These results highlight an important dynamic: while GPT-4.5’s expanded pre-training brings dramatic improvements in hallucination and general capability, fairness and bias performance requires targeted interventions beyond scaling alone. The gap between o1 (93%) and GPT-4.5 (74%) on unambiguous bias questions suggests that chain-of-thought reasoning may provide additional defenses against bias that pure unsupervised scaling does not.
Instruction Hierarchy and Prompt Injection Defense
GPT-4.5 implements an Instruction Hierarchy designed to mitigate prompt injection risks — attacks that attempt to override safety instructions through user-crafted messages. The model is trained to distinguish between system messages (higher authority) and user messages (lower authority), prioritizing system-level safety instructions when conflicts arise.
The instruction hierarchy evaluations reveal significant improvements:
| Evaluation | GPT-4o | o1 | GPT-4.5 |
|---|---|---|---|
| System vs. User message conflict | 0.68 | 0.78 | 0.76 |
| Tutor jailbreak (system protection) | 0.33 | 0.95 | 0.77 |
| Phrase protection | 0.74 | 0.91 | 0.86 |
| Password protection | 0.85 | 1.00 | 0.92 |
GPT-4.5 substantially outperforms GPT-4o across all instruction hierarchy tests, with the most dramatic improvement in tutor jailbreak resistance (77% vs. 33%). While o1 leads in most categories due to its deliberative reasoning, GPT-4.5’s consistent improvement demonstrates that pre-training scale combined with instruction hierarchy training produces meaningful gains in prompt injection defense.
Frontier Risk Assessment Under the Preparedness Framework
OpenAI evaluated GPT-4.5 against its Preparedness Framework, which grades models on four frontier risk categories. The results confirm that scaling the largest pre-trained model does not significantly increase frontier risk:
- Persuasion: Medium risk — consistent with GPT-4o, reflecting strong but not unprecedented influence capabilities
- Cybersecurity: Low risk — the model does not provide meaningful uplift for sophisticated cyberattacks beyond existing tools
- CBRN (Biological): Low risk — GPT-4.5 does not significantly enhance capabilities for biological threat creation
- Model Autonomy: Low risk — no evidence of meaningful self-exfiltration, self-improvement, or resource acquisition capabilities
Red teaming across multiple risk domains further validated these findings. External evaluators tested GPT-4.5’s potential for misuse across categories including persuasion, CBRN synthesis guidance, cybersecurity exploitation, and autonomous behavior. The comprehensive testing found no significant increase in frontier risk compared to existing models, supporting the decision to release GPT-4.5 as a research preview. Learn how enterprises are building comprehensive AI risk management frameworks informed by these evaluation standards.
Deployment Strategy and Implications for AI Safety
GPT-4.5’s release as a research preview reflects OpenAI’s commitment to iterative deployment — making models available to a limited audience first to understand real-world behavior before broader release. This approach recognizes that laboratory evaluations, however comprehensive, cannot fully predict the ways users will interact with and potentially stress-test the model.
The system card establishes several important precedents for the AI safety field. First, it demonstrates that scaling pre-training does not necessarily increase safety risk — GPT-4.5 matches or improves on GPT-4o’s safety profile despite being substantially larger. Second, it validates that hallucination reduction is achievable through architectural choices rather than only through post-training patches. Third, the comprehensive evaluation framework — spanning content safety, adversarial robustness, fairness, instruction following, and frontier risk — provides a replicable template for the industry.
Looking forward, the tensions revealed by GPT-4.5’s evaluations point to productive research directions. The overrefusal challenge (particularly pronounced in multimodal inputs), the gap between scaling-driven and reasoning-driven bias mitigation, and the remaining hallucination rate in underexplored domains all represent opportunities for targeted improvement. As the AI industry moves toward increasingly capable general-purpose models, the GPT-4.5 System Card demonstrates that safety evaluation can scale alongside capability — provided organizations commit to transparent, comprehensive testing before deployment.
Make your AI safety and compliance documentation interactive and accessible for all stakeholders
Frequently Asked Questions
What is GPT-4.5 and how does it differ from previous OpenAI models?
GPT-4.5 is OpenAI’s largest pre-trained model, designed to scale unsupervised learning rather than chain-of-thought reasoning. It builds on GPT-4o with new alignment techniques that improve steerability, emotional intelligence, and natural conversation. Key improvements include a 78% accuracy on PersonQA versus 50% for GPT-4o, and a hallucination rate of 0.19 compared to GPT-4o’s 0.30.
How does GPT-4.5 reduce hallucinations compared to GPT-4o?
GPT-4.5 achieves significant hallucination reduction through scaled unsupervised learning that increases world model accuracy. On PersonQA benchmarks, GPT-4.5 reached 78% accuracy with only 19% hallucination rate, compared to GPT-4o’s 50% accuracy and 30% hallucination rate. The broader knowledge base and improved alignment techniques contribute to more factually grounded responses.
What safety evaluations were conducted on GPT-4.5?
GPT-4.5 underwent comprehensive safety evaluations including disallowed content testing (99% unsafe content refusal), jailbreak resistance (99% on human-sourced jailbreaks), hallucination benchmarks (PersonQA), fairness and bias evaluation (BBQ dataset), instruction hierarchy testing, frontier risk assessment for CBRN and cybersecurity, and external red teaming across multiple risk categories.
What are GPT-4.5’s frontier risk assessment results?
Under OpenAI’s Preparedness Framework, GPT-4.5 scored Medium risk for persuasion, Low risk for cybersecurity, Low risk for CBRN biological threats, and Low risk for model autonomy. These ratings are consistent with GPT-4o’s safety profile, indicating no significant increase in frontier risks from the larger pre-training scale.
How does GPT-4.5 handle bias and fairness in AI outputs?
GPT-4.5 was evaluated on the BBQ (Bias Benchmark for QA) dataset, achieving 95% accuracy on ambiguous questions and 74% on unambiguous questions with biased confounders. While performing similarly to GPT-4o on bias metrics, GPT-4.5 shows improvement in providing unbiased answers compared to earlier models, though o1 outperforms both on unambiguous bias questions at 93%.
What alignment techniques make GPT-4.5 more natural to interact with?
GPT-4.5 uses new scalable alignment techniques that enable training larger models with data derived from smaller models. These improve steerability, understanding of nuance, and conversational naturalness. Internal testers describe GPT-4.5 as warm, intuitive, and natural — knowing when to advise, defuse frustration, or simply listen, with stronger aesthetic intuition and creativity.