ChatGPT Agent System Card — OpenAI Safety Analysis July 2025
Table of Contents
- What Is ChatGPT Agent and Why Does It Matter
- Safety Benchmarks Against Disallowed Content
- Jailbreak Resistance and StrongReject Results
- Hallucination Rates and Factual Accuracy
- Prompt Injection Defenses and Agent Mistakes
- Red Teaming Biological and Chemical Risks
- Precautionary Biological Risk Classification
- Cybersecurity and AI Self-Improvement Evaluations
- Defense-in-Depth Safeguard Architecture
- Implications for Enterprise AI Deployment
📌 Key Takeaways
- Precautionary Biological Classification: OpenAI treats ChatGPT Agent as “High” biological capability despite lacking definitive threshold-crossing evidence, applying multiple defense layers.
- Superior Safety Benchmarks: ChatGPT Agent outperforms o3 on production safety benchmarks, scoring 0.989 on extremism versus o3’s 0.920 and 0.891 on hate/threatening versus 0.746.
- No Significant Novice Uplift: Controlled studies show ChatGPT Agent does not meaningfully help novices create biological harm, with non-significant statistical tests (p=0.95 and p=0.99).
- Multi-Layered Prompt Injection Defense: The agent achieves 99.5% resistance to irrelevant instructions and deploys user confirmations with 99.9-100% recall on critical actions.
- Defense-in-Depth Architecture: Safety spans model training, system-level classifiers, account enforcement, watch mode, and rapid remediation protocols for post-deployment patching.
What Is ChatGPT Agent and Why It Matters for AI Safety
On July 17, 2025, OpenAI released the ChatGPT Agent System Card, marking a pivotal moment in how the artificial intelligence industry approaches safety documentation for agentic AI systems. This comprehensive report details the safety evaluations, risk mitigations, and red teaming results for OpenAI’s most capable autonomous agent to date — a system that combines deep research, web browsing through Operator, terminal tool access, and third-party Connectors into a single unified model.
The ChatGPT agent represents a fundamental shift in AI capability. Unlike previous conversational models that respond to isolated prompts, this agentic system can autonomously browse the internet, execute code in a sandboxed terminal, access user data through Google Drive and other Connectors, and take real-world actions on behalf of users. This expanded capability surface creates proportionally larger risk vectors that demand rigorous safety evaluation.
As OpenAI notes in the system card, the model was built on the foundation of Operator’s research preview with substantially expanded safeguards. The document serves both as a transparency measure and as a blueprint for how frontier AI labs should approach safety documentation for increasingly autonomous systems. Understanding this system card is essential for enterprises evaluating whether to deploy agentic AI in production environments.
ChatGPT Agent Safety Benchmarks Against Disallowed Content
The system card presents an exhaustive set of safety benchmarks comparing ChatGPT Agent against the o3 model across multiple evaluation frameworks. On OpenAI’s standard disallowed content evaluation, ChatGPT Agent achieves perfect scores of 1.000 across hate speech, sexual/exploitative content, sexual content involving minors, illicit non-violent content, illicit violent content, and self-harm categories. Only personal data handling shows slightly lower scores at 0.996 for semi-restrictive and 0.988 for restricted scenarios.
However, OpenAI acknowledges that these standard evaluations have become “saturated” — models consistently achieve near-perfect scores, making them less useful for differentiation. The more revealing comparison comes from production benchmarks using harder, more realistic test cases drawn from actual user interactions.
On these production benchmarks, ChatGPT Agent generally outperforms o3 across all categories. The improvements are particularly striking in high-stakes areas: extremism detection rises from o3’s 0.920 to 0.989, hate/threatening content from 0.746 to 0.891, and harassment/threatening from 0.672 to 0.803. These gains matter because production benchmarks reflect the actual adversarial conditions models face in deployment.
The data suggests that OpenAI’s safety training for agentic capabilities has produced meaningful improvements beyond what the base o3 model provides. For organizations considering AI deployment frameworks, these benchmarks establish a new baseline for what responsible safety performance looks like in production environments.
Jailbreak Resistance and StrongReject Evaluation Results
Jailbreak attempts — where users try to bypass safety restrictions through clever prompting — represent one of the most persistent challenges for AI safety. The system card evaluates ChatGPT Agent against the StrongReject benchmark, a standardized framework for measuring resistance to adversarial manipulation.
Results show ChatGPT Agent performs comparably to o3 across all StrongReject categories. Illicit non-violent content resistance stands at 0.987 versus o3’s 0.986, violence at 0.991 versus 0.995, and sexual content at 0.989 versus 0.987. These near-identical scores indicate that adding agentic capabilities did not degrade the model’s foundational ability to resist manipulation attempts.
OpenAI also reports on jailbreak attempts specifically targeting the agent’s browser capabilities. Since ChatGPT Agent can navigate web pages and interact with external content, adversaries could embed malicious instructions in web content the agent encounters during normal operation. The system card reveals that 16 experienced red teamers with biosafety-relevant PhDs submitted 179 total attack attempts, of which 16 exceeded internal thresholds — all of which were subsequently patched or determined to be within acceptable bounds.
External red teaming by FAR.AI involved 40 hours of dedicated testing that uncovered three partial vulnerabilities capable of overcoming some but not all defense layers. The UK AI Safety Institute (AISI) identified seven universal attacks across four testing rounds, all of which were patched before launch. This iterative red teaming process demonstrates the “find, patch, verify” cycle that responsible AI deployment requires.
Transform complex AI safety reports into interactive experiences your team will actually engage with.
Hallucination Rates and Factual Accuracy Assessment
Factual accuracy remains a critical concern for agentic AI systems that can autonomously browse the web, retrieve information, and take actions based on their findings. The system card evaluates ChatGPT Agent using SimpleQA and PersonQA benchmarks, both designed to measure how frequently models produce confident but incorrect answers.
On SimpleQA, ChatGPT Agent achieves an accuracy of 0.914 compared to o3’s 0.954 when using browsing capabilities. The hallucination rate is 0.079 versus o3’s 0.046. On PersonQA, which focuses on claims about real individuals, accuracy reaches 0.943 versus o3’s 0.966, with a hallucination rate of 0.043 versus 0.024.
Interestingly, OpenAI notes that the lower accuracy scores are partially attributable to ChatGPT Agent’s more thorough research approach. The agent’s deeper investigation sometimes surfaces contradictory information or reveals flaws in the grading rubrics themselves, leading to nuanced answers that the automated evaluation system marks as incorrect. This finding highlights a fundamental tension in AI evaluation: more capable research agents may appear less accurate on benchmarks designed for simpler question-answering systems.
For enterprise deployments, these hallucination rates underscore the importance of human-in-the-loop verification, particularly when agentic AI systems are making decisions that affect business operations, customer interactions, or regulatory compliance.
Prompt Injection Defenses and Autonomous Agent Mistakes
Prompt injection represents perhaps the most distinctive risk category for agentic AI. When ChatGPT Agent browses the web, it encounters content that may contain hidden instructions designed to hijack the agent’s behavior. The system card reveals detailed metrics on how well the agent resists these attacks across multiple vectors.
For irrelevant text-based instructions — malicious prompts embedded in web page text — ChatGPT Agent achieves 99.5% resistance. Visual browser injection resistance reaches 95%, significantly surpassing both Operator 4o at 82% and Operator o3 at 89%. In-context data exfiltration resistance stands at 78%, while active data exfiltration defense reaches 67%. While these latter figures leave room for improvement, OpenAI emphasizes that multi-layered system protections compensate for model-level vulnerabilities.
Beyond prompt injection, the system card addresses the risk of autonomous agent mistakes — situations where the agent takes unintended actions even without adversarial manipulation. OpenAI implements a user confirmation system that requires explicit approval before the agent performs sensitive operations. The confirmation recall rate stands at 91.0% overall, but reaches 99.9-100% for critical categories including editing permissions, communications, and financial transactions.
The agent also includes a “watch mode” feature that automatically pauses execution when the user becomes inactive during sensitive operations. Combined with the confirmation system, this creates a framework where the agent cannot silently perform high-stakes actions without human oversight — a principle that aligns with emerging NIST AI safety guidelines for autonomous systems.
Red Teaming for Biological and Chemical Risk Assessment
The most extensive section of the system card addresses biological and chemical risk — an area where agentic AI capabilities create qualitatively new concerns. OpenAI’s evaluation framework follows the weaponization lifecycle across five stages: Ideation, Acquisition, Magnification, Formulation, and Release.
On long-form biorisk questions, models with browsing capabilities score above 20% across all five biothreat creation stages. Multimodal troubleshooting in virology reaches 57%, with all models exceeding the 40% human baseline. On ProtocolQA open-ended assessments, ChatGPT Agent scores 40% — the highest among tested models but still below the consensus expert baseline of 54%. Tacit knowledge evaluation yields 74%, below the consensus expert baseline of 80% but above the 80th percentile PhD expert score of 63%.
External evaluation by SecureBio revealed that ChatGPT Agent correctly answered 4 out of 10 weapons of concern-related questions compared to o3’s 1.5 out of 10. SecureBio noted that the agent’s ability to retrieve and analyze multiple sources “confers some qualitatively new capabilities that would be particularly useful to a malicious actor.”
Expert deep dives yielded an average success rate of approximately 7.7 out of 10, with experts identifying “substantial potential for ChatGPT agent to significantly uplift users’ capabilities, particularly benefiting graduate students and cross-disciplinary researchers with existing lab experience.” These findings informed OpenAI’s decision to apply precautionary safeguards regardless of whether the model definitively crosses capability thresholds.
Make AI safety research accessible to your entire organization with interactive document experiences.
OpenAI Precautionary Biological Risk Classification Approach
Perhaps the most significant policy decision revealed in the system card is OpenAI’s choice to classify ChatGPT Agent as “High” biological capability under their Preparedness Framework — despite lacking definitive evidence that the model crosses this threshold. This precautionary approach represents a notable departure from purely evidence-based risk classification.
OpenAI explains the reasoning: “While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm… we have chosen to take a precautionary approach.” The rationale centers on cumulative capability progression — each incremental capability increase may appear minor in isolation, but collectively these advances could move the model across dangerous capability thresholds.
Crucially, controlled studies on novice uplift did not demonstrate significant harm potential. In standardized quizzes, novices without the agent scored 44.7% ± 4.3% on abrin-related questions versus 50.5% ± 7.9% with the agent. On anthrax-related questions, performance was virtually identical: 37.8% ± 7.4% without versus 36.9% ± 9.7% with the agent. One-sided superiority tests were non-significant with p-values of 0.95 and 0.99 respectively, meaning the statistical evidence does not support the claim that ChatGPT Agent meaningfully helps novices.
Nevertheless, OpenAI chose the more conservative path. This decision sets an important precedent for the AI industry: when dealing with catastrophic-risk capabilities, the absence of evidence of danger is not sufficient evidence of safety. The precautionary approach triggers mandatory safeguards including enhanced monitoring, restricted access programs, and layered defense mechanisms that would not be required under a lower classification.
Cybersecurity Capabilities and AI Self-Improvement Evaluation
The system card also evaluates ChatGPT Agent’s cybersecurity capabilities and potential for AI self-improvement — two domains with significant implications for national security and technology governance.
On Capture the Flag (CTF) cybersecurity challenges, ChatGPT Agent performs comparably to o3 and o4-mini across all difficulty levels. In more realistic cyber range assessments using emulated network environments, no model — including ChatGPT Agent — was able to solve Online Retailer or Privilege Escalation scenarios unaided or with hints alone. When provided with solver code, accuracy jumped to 87-90% for Online Retailer and 94-100% for Privilege Escalation. OpenAI classifies these results as below the High capability threshold for cybersecurity.
For AI self-improvement, the evaluation examines whether ChatGPT Agent could meaningfully accelerate its own development or the development of successor models. Results on SWE-bench Verified show 62% accuracy compared to o3’s 70-71%, on OpenAI’s internal PR benchmark 42% versus o3’s 45%, and on PaperBench (AI research replication) 22% versus o3’s 23-25%. The system card concludes that “ChatGPT Agent does not demonstrate improved performance on software engineering and AI research tasks relevant to AI self-improvement risks.”
These cybersecurity and self-improvement assessments provide some reassurance that agentic capabilities do not currently create unmanageable risks in these domains. However, OpenAI cautions that evaluations “represent a lower bound for potential capabilities” — additional prompting strategies, fine-tuning, longer execution windows, or novel interaction patterns could elicit behaviors beyond what current testing reveals.
Defense-in-Depth Safeguard Architecture for Agentic AI
The system card outlines OpenAI’s comprehensive defense-in-depth architecture — a multi-layered approach to safety that does not rely on any single mechanism. This philosophy acknowledges that no individual safeguard is sufficient against all possible threats, and that meaningful security requires redundant, overlapping protections.
At the model level, safety training during fine-tuning produces a model that scores 0.879 “not_unsafe” on red teamer biological prompts and 0.969 on filtered production prompts, compared to o4-mini’s 0.779 and 0.905 respectively. This represents the first line of defense — the model itself is trained to refuse dangerous requests.
At the system level, two automated monitors operate in tandem. A fast topical classifier achieves an F1 score of 0.834 with 96.0% recall and 73.7% precision, routing potentially dangerous queries for additional scrutiny. A more sophisticated reasoning monitor achieves F1 of 0.730 with 83.8% recall and 64.7% precision. Both systems operate with greater than 99.9% reliability, ensuring they remain available even under heavy load.
At the account level, OpenAI implements monitoring and enforcement mechanisms that can detect patterns of misuse across multiple sessions. A trusted access program provides less restricted model access to vetted customers in biodefense and life sciences, creating a controlled pathway for legitimate research without weakening protections for general users.
Post-deployment, OpenAI operates a bug bounty program specifically targeting universal jailbreaks in the biorisk domain, alongside a rapid remediation protocol for detecting, triaging, and patching vulnerabilities discovered after launch. This creates a continuous improvement cycle where the safety posture strengthens over time as new attack vectors are discovered and addressed.
Implications for Enterprise AI Safety and Deployment Strategy
The ChatGPT Agent System Card carries significant implications for organizations evaluating agentic AI deployment. The document establishes several principles that enterprise AI governance teams should internalize when building their own safety frameworks.
First, the precautionary classification principle — treating a system as high-risk even when evidence is ambiguous — should inform how enterprises approach AI risk assessment. Organizations should not wait for definitive evidence of harm before implementing safeguards, particularly when dealing with capabilities that could cause irreversible damage.
Second, the defense-in-depth model provides a template for enterprise AI safety architecture. No single technical control is sufficient. Organizations should layer model-level restrictions, system-level monitoring, human-in-the-loop confirmations, and post-deployment vulnerability management into a unified safety framework.
Third, the system card’s emphasis on continuous red teaming highlights that safety is not a one-time certification but an ongoing process. The iterative cycle of testing, patching, and retesting demonstrated by OpenAI’s work with FAR.AI and the UK AISI should be adopted by any organization deploying agentic AI at scale.
Fourth, the transparency standard set by this system card raises expectations for AI vendors. Enterprise procurement teams should demand comparable safety documentation from any vendor providing agentic AI capabilities. The level of detail in OpenAI’s disclosure — including specific benchmark scores, red teaming methodologies, and safeguard architecture — should become the baseline for what constitutes adequate safety documentation.
Finally, the finding that evaluations represent lower bounds for actual capabilities means enterprises must maintain safety margins. Production environments may surface novel interaction patterns not captured in controlled testing, and organizations should plan for the possibility that their specific use cases reveal capabilities beyond what the system card documents.
Turn dense AI safety documentation into engaging interactive experiences your stakeholders will actually read.
Frequently Asked Questions
What is the ChatGPT Agent System Card released by OpenAI in July 2025?
The ChatGPT Agent System Card is OpenAI’s comprehensive safety and risk assessment document for their new ChatGPT agent model. Released on July 17, 2025, it details the safety evaluations, red teaming results, biological risk safeguards, and defense-in-depth measures applied to the agentic AI system that combines deep research, web browsing via Operator, terminal access, and Connectors.
How does ChatGPT Agent handle prompt injection attacks?
ChatGPT Agent uses multi-layered defenses against prompt injection. It achieves 99.5% resistance to irrelevant text-based instructions, 95% for visual browser injections, 78% for in-context data exfiltration, and 67% for active data exfiltration attempts. These results surpass earlier Operator models, and additional system-level protections including user confirmations and watch mode provide further defense.
What biological risk safeguards does OpenAI apply to ChatGPT Agent?
OpenAI applies a precautionary High capability classification for biological risk despite lacking definitive evidence of threshold-crossing capability. Safeguards include a three-tier biological threat taxonomy, a topical classifier with 96% recall, a reasoning monitor with 83.8% recall, model safety training scoring 0.879 not_unsafe, trusted access programs for legitimate researchers, a bug bounty for universal jailbreaks, and rapid remediation protocols.
Did ChatGPT Agent show significant uplift for novice biological threats?
No. In controlled studies, novices without the agent scored 44.7% on abrin quizzes versus 50.5% with the agent, and 37.8% on anthrax quizzes versus 36.9% with the agent. One-sided superiority tests were non-significant (Abrin p=0.95, Anthrax p=0.99), meaning ChatGPT Agent did not have a statistically large effect on novice ability to create biological harm.
How does ChatGPT Agent compare to o3 in safety benchmarks?
ChatGPT Agent generally outperforms o3 on harder production safety benchmarks. It scores 0.989 versus 0.920 on extremism, 0.891 versus 0.746 on hate/threatening content, and 0.803 versus 0.672 on harassment/threatening. On jailbreak resistance via StrongReject, performance is comparable at 0.987-0.996. Hallucination rates are slightly higher at 0.079 versus o3’s 0.046 on SimpleQA, partly due to more thorough research surfacing grading rubric flaws.
What is OpenAI’s defense-in-depth approach for agentic AI safety?
OpenAI’s defense-in-depth approach layers multiple safety mechanisms: model-level safety training during fine-tuning, system-level protections including topical classifiers and reasoning monitors, account-level enforcement and monitoring, user confirmation requirements for sensitive actions, watch mode that auto-pauses when users are inactive, and rapid remediation protocols for post-deployment vulnerability patching.