OpenAI Operator System Card — CUA Model Safety and Computer-Using Agent Deployment Framework

📌 Key Takeaways

  • Novel AI capability: Operator combines GPT-4o vision with reinforcement learning to interact with GUIs through screenshots, enabling browser-based task execution under user oversight
  • 97% harmful task refusal: The CUA model refuses harmful tasks at a 97% rate, with multi-layered safety including model training, system checks, and product design safeguards
  • Prompt injection defense: A dedicated classifier detects adversarial website content with 99% recall and 98.4% accuracy, protecting against manipulation by malicious third-party sites
  • Low frontier risk: Operator scored “Low” risk for both CBRN biorisk tooling (1% task success) and autonomous replication, matching GPT-4o’s safety profile
  • Human-in-the-loop design: Critical actions require explicit user confirmation, certain high-risk activities are fully restricted, and the model pauses at key decision points for approval

What Is OpenAI Operator and the CUA Model

OpenAI Operator represents a fundamental shift in how AI systems interact with the digital world. Released as a research preview in January 2025, Operator is built on the Computer-Using Agent (CUA) model, which combines GPT-4o’s advanced vision capabilities with sophisticated reinforcement learning to navigate graphical user interfaces just as humans do. The model interprets screenshots and interacts with buttons, menus, text fields, and other GUI elements to perform tasks autonomously.

Users can direct Operator to perform a wide variety of everyday browser-based tasks — ordering groceries, booking restaurant reservations, purchasing event tickets, and managing online workflows — all under the user’s direction and oversight. This represents a critical step toward a future where ChatGPT moves beyond answering questions to taking meaningful actions on behalf of users.

However, these capabilities introduce significant new risk vectors that require careful mitigation. Vulnerabilities include prompt injection attacks where malicious instructions embedded in third-party websites can mislead the model, the possibility of making difficult-to-reverse mistakes, and potential misuse for harmful task execution. The Operator System Card details OpenAI’s comprehensive approach to identifying, testing, and mitigating these risks through a multi-layered safety framework.

Computer-Using Agent Training and Architecture

Operator is trained to use a computer in fundamentally the same way a person would — by visually perceiving the screen and operating through cursor movements and keyboard inputs. The training combines two complementary approaches that build capabilities incrementally.

Supervised learning provides the foundational perception and input control capabilities. The model learns to read computer screens accurately and interact precisely with user interface elements through diverse training datasets including select publicly available data from industry-standard machine learning datasets, web crawls, and specialized datasets developed by human trainers demonstrating computer task completion.

Reinforcement learning builds higher-level cognitive capabilities on top of this foundation, including complex reasoning about multi-step task sequences, error correction when actions don’t produce expected results, and adaptation to unexpected events or interface changes. This combination enables the model to handle the unpredictable nature of real-world web interactions where pages load differently, layouts change, and workflows vary across websites.

The architecture’s reliance on visual input — interpreting screenshots rather than accessing underlying HTML or APIs — is both a strength and a limitation. It enables interaction with any visual interface without requiring specialized integrations, but creates challenges with optical character recognition for complex strings like DNA sequences, API keys, and cryptocurrency wallet addresses.
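
To make the screenshot-in, actions-out design concrete, the sketch below shows one way such a perceive-act loop could be structured. Every name in it (the Action type, the browser and model objects, and their methods) is a hypothetical stand-in for exposition; OpenAI has not published the CUA model’s internal interfaces.

```python
# Illustrative sketch of a screenshot-driven agent loop.
# All interfaces here are hypothetical, not OpenAI's actual API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str              # e.g. "click", "type", "scroll", "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(task: str, browser, model, max_steps: int = 50) -> None:
    history = []
    for _ in range(max_steps):
        screenshot = browser.capture_screenshot()                  # raw pixels only, no DOM/HTML access
        action = model.propose_action(task, screenshot, history)   # vision model picks the next step
        if action.kind == "done":
            break
        if action.kind == "click":
            browser.click(action.x, action.y)                      # cursor input, as a human would
        elif action.kind == "type":
            browser.type_text(action.text)                         # keyboard input
        elif action.kind == "scroll":
            browser.scroll(action.y)
        history.append(action)                                     # context for error correction on later steps
```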

Risk Identification and Policy Framework

OpenAI’s risk identification process evaluated both user goals (referred to as “tasks”) and the steps the model takes to fulfill those goals (referred to as “actions”). This dual assessment framework ensures comprehensive coverage of potential harm vectors — from the initial user request through every intermediate action the model might take.

Tasks and actions were categorized by risk severity considering two critical dimensions: the potential for harm to the user or others, and the ease of reversing any negative outcomes. A user task to purchase shoes involves actions like searching online, navigating to checkout, and completing the purchase. While purchasing the wrong shoes is merely inconvenient and easily reversible, other actions — like sending an email or completing a financial transaction — carry higher stakes and harder-to-reverse consequences.

The resulting policy framework requires safeguards proportional to risk level. Lower-risk actions proceed with standard monitoring, medium-risk actions require human confirmation at key steps, and high-risk activities like stock trading are fully restricted. This graduated approach balances utility with safety, avoiding both excessive friction for routine tasks and insufficient protection for consequential actions.
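
A minimal sketch of what such a graduated policy could look like in code follows. The tiers, example actions, and decision rules are illustrative assumptions drawn from the categories described above, not the actual policy values used in Operator.

```python
# Illustrative mapping from risk attributes to a required safeguard.
# Tier names and rules are assumptions for exposition only.
from enum import Enum

class Safeguard(Enum):
    PROCEED = "proceed with standard monitoring"
    CONFIRM = "pause and require user confirmation"
    BLOCK = "fully restricted"

def required_safeguard(harm_potential: str, reversible: bool) -> Safeguard:
    # High-harm activities (e.g. stock trading) are blocked outright.
    if harm_potential == "high":
        return Safeguard.BLOCK
    # Consequential actions (emails, purchases) need confirmation,
    # especially when they are hard to reverse.
    if harm_potential == "medium" or not reversible:
        return Safeguard.CONFIRM
    # Routine, easily reversible actions (searching, browsing) proceed normally.
    return Safeguard.PROCEED

# Example: completing a purchase is medium harm and hard to reverse -> CONFIRM
print(required_safeguard("medium", reversible=False))
```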

Red Teaming and Adversarial Testing

OpenAI engaged a cohort of vetted external red teamers located across 20 countries and fluent in two dozen languages to rigorously test Operator’s capabilities, safety measures, and resilience against adversarial inputs. The red teaming process followed a carefully structured two-phase approach.

In the first phase, OpenAI conducted internal red teaming with representatives from Safety, Security, and Product teams using a model with no safety mitigations in place. The goal was to identify the full spectrum of potential risks in an unguarded setting, with red teamers instructed to intervene before any real-world harm could occur. Based on these findings, initial safety mitigations were implemented.

The second phase granted external red teamers access to the mitigated version of Operator, with instructions to explore various circumvention techniques including prompt injections and jailbreaks. Since the model has live internet access, red teamers were advised to avoid prompting tasks that could cause real-world harm, instead creating test environments — mock websites, databases, and email systems — to safely demonstrate potential exploits.

This constraint means findings may not fully capture worst-case real-world scenarios, but still identified key vulnerabilities that informed additional mitigations. The iterative red team → mitigate → re-test cycle is central to OpenAI’s approach, with Operator deployed initially as a research preview to a limited user group for close monitoring before broader release.

Frontier Risk Assessment — CBRN and Model Autonomy

OpenAI evaluated Operator through its Preparedness Framework, which grades models across four frontier risk categories: persuasion, cybersecurity, CBRN (chemical, biological, radiological, and nuclear), and model autonomy. Since Operator builds on GPT-4o, it inherits that model’s “Medium” risk rating for persuasion and “Low” risk for cybersecurity.

CBRN — Biorisk Tooling: The key concern was whether Operator’s browsing capability could accelerate automated biological threat design. Testing revealed a success rate of only 1% on biorisk tooling tasks. The model struggled significantly with optical character recognition of DNA sequences — random-looking strings present fundamental challenges for screenshot-based interaction. Even when the model successfully navigated to relevant resources (achieving 10/10 on page navigation for one Ebola sequence task), it failed on the actual data extraction due to OCR or character counting errors.

Model Autonomy: Autonomous replication testing examined self-exfiltration, self-improvement, and resource acquisition capabilities. While Operator excelled at browsing tasks like GPU rental website navigation (60-80% success), it was severely hindered on code editing and terminal tasks by its visual-only input modality. OCR mistakes on complex strings like API keys and Bitcoin wallet addresses, combined with errors in visual text editing, resulted in less than 10% success on all main autonomy tasks — qualifying as “Low” risk consistent with GPT-4o.

Harmful Task Mitigation and Refusal Rates

Operator users are bound by OpenAI Usage Policies with additional guidelines specific to agent capabilities. The model is explicitly prohibited from facilitating illicit activity, fraud and deception, unauthorized regulated activity like stock trading, and content that harms others. Users are also prohibited from bypassing any protective measures including rate limits and safety mitigations.

At the model level, Operator is trained to refuse harmful tasks through a combination of pre-deployment safety training and real-time monitoring. The results across evaluation categories demonstrate robust safety performance:

  • Standard harmful content refusal: 100% — matching GPT-4o’s perfect score
  • Challenging refusal evaluation: 92% — exceeding GPT-4o’s 80% on more difficult test cases
  • Production jailbreaks: 100% refusal rate on jailbreaks identified in production data
  • Operator-specific illicit activities: 97% refusal rate on agent-specific harmful scenarios
  • Overrefusal evaluation: 55% of benign borderline prompts answered (versus GPT-4o’s 90%), reflecting intentionally conservative tuning for the initial release

The deliberate trade-off toward overrefusal reflects OpenAI’s cautious deployment philosophy for novel agent capabilities, accepting reduced utility in ambiguous cases to minimize potential for harmful actions during the research preview phase.

Prompt Injection Defenses for Computer-Using Agents

Prompt injection represents the most novel and significant risk category for computer-using agents. Unlike traditional chatbot interactions, Operator navigates the open web where any website can attempt to embed adversarial instructions in visible or hidden text. These instructions might try to redirect the model away from the user’s intended task, extract sensitive information, or trigger harmful actions.

OpenAI’s defense strategy addresses prompt injection through multiple complementary layers. A dedicated classifier continuously monitors screenshots for adversarial content, achieving 99% recall at detecting injection attempts and 98.4% accuracy on challenging injection benchmarks. This classifier operates independently of the main model, providing an additional verification layer that can flag suspicious content even if the model itself would be susceptible.

At the model level, Operator is trained to recognize and refuse manipulative instructions that originate from website content rather than from the authenticated user. The model distinguishes between legitimate user requests and adversarial content embedded in the browsing environment, maintaining allegiance to the user’s original intent even when confronted with sophisticated social engineering attempts on web pages.

Product-level mitigations include restricting the model from accessing certain sensitive URLs, implementing allowlists for high-risk actions, and requiring user confirmation when the model encounters potentially manipulative content. When the classifier or model identifies a suspected injection, Operator pauses and alerts the user rather than proceeding with potentially compromised instructions.
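
The sketch below illustrates how an out-of-band injection check of this kind might sit alongside the agent loop. The injection_classifier, browser, model, and ask_user_to_confirm objects are hypothetical stand-ins; the real classifier’s interface has not been published.

```python
# Illustrative sketch of an independent injection check run before each agent step.
# All objects and attributes here are hypothetical placeholders.
def guarded_step(task, browser, model, injection_classifier, ask_user_to_confirm, history):
    screenshot = browser.capture_screenshot()

    # Screen the on-screen content independently of the main model.
    verdict = injection_classifier.score(screenshot)
    if verdict.is_suspected_injection:
        # Pause and surface the finding instead of acting on potentially
        # compromised instructions.
        if not ask_user_to_confirm(
            f"Possible prompt injection detected (score={verdict.score:.2f}). Continue?"
        ):
            return None

    # Only propose the next action once the screenshot has been screened.
    return model.propose_action(task, screenshot, history)
```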

Model Mistakes and Human-in-the-Loop Safeguards

Beyond intentional misuse, Operator must handle the reality that even well-intentioned AI agents make mistakes. The model might click the wrong button, misread information on screen, or take an action that doesn’t match the user’s intent. Given that some actions — like completing a purchase or sending a message — are difficult or impossible to reverse, robust error prevention and detection mechanisms are essential.

OpenAI implements a confirmation-before-action framework for consequential operations. Before completing financial transactions, the model presents a summary and waits for explicit user approval. Before sending emails, the model shows the draft content and recipient for verification. Before modifying or deleting calendar events, schedules, or other persistent data, the model requests confirmation.

The human-in-the-loop design philosophy ensures that Operator never executes irreversible high-stakes actions autonomously. The model is designed to pause at natural decision points, present its understanding of the current situation, and request user input before proceeding. This creates a collaborative workflow where the AI handles navigation and routine interactions while the human maintains final authority over consequential decisions.

For reversible lower-risk actions, Operator proceeds more autonomously while maintaining a detailed action log that users can review. This graduated autonomy model balances efficiency for routine tasks with safety for critical operations, allowing the system to be genuinely useful without creating unacceptable risk.
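
As a rough sketch of this graduated autonomy, the code below gates a small set of consequential action types behind explicit confirmation while logging everything else. The action names, the CONFIRMATION_REQUIRED set, and the helper functions are assumptions for illustration, not the product’s actual implementation.

```python
# Illustrative confirmation gate for consequential actions, with an action log
# for everything else. Action kinds and helpers are hypothetical.
CONFIRMATION_REQUIRED = {"submit_payment", "send_email", "delete_calendar_event"}

def execute_with_oversight(action, browser, ask_user_to_confirm, action_log) -> bool:
    if action.kind in CONFIRMATION_REQUIRED:
        # Present a summary and wait for explicit approval before proceeding.
        summary = f"About to perform '{action.kind}': {action.text or '(no details)'}"
        if not ask_user_to_confirm(summary):
            return False
    browser.perform(action)
    # Reversible, lower-risk actions proceed autonomously but are always logged
    # so the user can review them afterwards.
    action_log.append(action)
    return True
```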

Deployment Strategy and Monitoring Framework

Operator’s deployment as a research preview to a limited initial user group reflects OpenAI’s iterative approach to novel AI capabilities. Rather than broad release, this staged deployment allows close monitoring of real-world usage patterns, identification of emerging risks not captured in pre-deployment testing, and strengthening of safeguards before wider availability.

Active monitoring systems track usage patterns in real time, identifying potential safety issues, novel attack vectors, and edge cases that pre-deployment testing may not have anticipated. The feedback loop between real-world usage data and safety improvements enables rapid iteration on mitigations, with the goal of progressively expanding access as confidence in the system’s safety profile grows.

The three-actor threat model — misaligned user, misaligned model, misaligned website — provides a structured framework for ongoing risk assessment. Each new interaction type or website category is evaluated against all three vectors, ensuring comprehensive coverage as Operator’s capabilities and usage patterns evolve.

Implications for AI Agent Safety and Future Development

The Operator System Card establishes important precedents for how the industry approaches AI agent safety. The multi-layered defense strategy — combining model-level training, system-level monitoring, product-level safeguards, and policy enforcement — provides a template that scales beyond browser-based agents to any AI system that takes real-world actions.

Several key insights emerge from OpenAI’s experience. First, visual-only interaction creates an inherent safety buffer — the model’s inability to reliably process complex strings through OCR limits certain dangerous capabilities while still enabling useful everyday task completion. Second, the tension between safety and utility (Operator answers only 55% of benign borderline prompts on the overrefusal evaluation, versus GPT-4o’s 90%) highlights the ongoing challenge of calibrating agent behavior for novel deployment contexts.

Third, prompt injection represents a fundamentally different safety challenge for browsing agents compared to chatbots. The open web is an adversarial environment where any page can attempt manipulation, requiring dedicated defense systems beyond standard content moderation. The 99% recall rate of OpenAI’s injection classifier demonstrates that effective defenses are achievable, but ongoing vigilance is essential as attack techniques evolve.

Looking ahead, the trajectory from research preview to broader deployment will depend on continued improvement in both capability and safety. Reducing overrefusals while maintaining harmful task rejection, improving OCR reliability, and strengthening prompt injection defenses are all active research areas. The Operator System Card demonstrates that responsible AI agent deployment is possible through rigorous testing, transparent reporting, and staged release — setting a standard for the industry as autonomous AI agents become increasingly capable and prevalent.

Frequently Asked Questions

What is OpenAI Operator and how does the computer-using agent work?

OpenAI Operator is a research preview of the Computer-Using Agent (CUA) model that combines GPT-4o’s vision capabilities with advanced reinforcement learning. It interprets screenshots and interacts with graphical user interfaces — buttons, menus, and text fields — just as humans do, enabling it to perform browser-based tasks like ordering groceries, booking reservations, and purchasing tickets under user oversight.

How does OpenAI mitigate prompt injection attacks in Operator?

OpenAI implements multi-layered defenses against prompt injection including a separate classifier that monitors screenshots for adversarial content, training the model to recognize and refuse manipulative instructions from websites, and a layered system combining model-level training, system-level checks, and product design choices. The classifier detects injections with 99% recall and 98.4% accuracy on challenging benchmarks.

What safety evaluations were performed on the Operator model?

OpenAI conducted comprehensive safety evaluations including frontier risk assessments for persuasion, cybersecurity, CBRN, and model autonomy. External red teamers from 20 countries tested the model’s safeguards. The Operator model achieved 97% refusal rate on harmful tasks and was rated Low risk for both biorisk tooling and autonomous replication capabilities.

What are the limitations of computer-using AI agents?

Key limitations include struggles with optical character recognition of complex strings like DNA sequences and API keys, errors in visual text editing, difficulty with tasks requiring precise character-level manipulation, and susceptibility to adversarial website content. The model also overrefuses substantially more than GPT-4o, answering only 55% of benign prompts on overrefusal tests versus GPT-4o’s 90%.

How does Operator handle financial transactions and sensitive actions?

Operator requires explicit user confirmation before completing critical actions like financial transactions, sending emails, and deleting calendar events. Certain high-risk activities like stock trading are fully restricted. The system implements human-in-the-loop safeguards ensuring users maintain visibility and control, with the model pausing at key decision points for user approval.
