GPT-4o System Card Safety Evaluations and Red Teaming Guide 2024
Table of Contents
- What Is the GPT-4o System Card and Why It Matters
- GPT-4o Architecture and Multimodal Safety Challenges
- Red Teaming Methodology for AI Safety Evaluation
- Voice Mode Safety Risks and Mitigation Strategies
- Preparedness Framework: Cybersecurity and CBRN Risk Assessment
- AI Persuasion Risk and Model Autonomy Evaluation
- Third-Party Safety Assessments from METR and Apollo Research
- Data Training Safeguards and Content Filtering Methods
- Societal Impact and Long-Term AI Safety Considerations
- Key Lessons for AI Safety Practitioners and Organizations
📌 Key Takeaways
- Comprehensive red teaming: Over 100 external testers across 45 languages and 29 countries evaluated GPT-4o across four phased rounds with increasing modality complexity.
- Medium overall risk classification: The Preparedness Framework scored GPT-4o as Low risk for cybersecurity, CBRN, and autonomy, but Medium for text-based persuasion — setting the overall rating at Medium.
- Voice cloning prevention: A streaming output voice classifier achieves 96% precision and 100% recall in English, restricting outputs to pre-approved preset voices only.
- Multimodal safety gaps: Audio robustness degrades under background noise, accented speech, and interruptions, revealing new attack surfaces unique to speech-to-speech AI.
- Third-party validation: Independent evaluations by METR and Apollo Research confirmed no significant increase in dangerous autonomous capabilities compared to GPT-4.
What Is the GPT-4o System Card and Why It Matters
In October 2024, OpenAI published the GPT-4o System Card — a detailed safety document that provides unprecedented transparency into the evaluation, risk assessment, and mitigation strategies applied to their most advanced multimodal AI model. Unlike typical product announcements, system cards serve as structured disclosures that detail how an AI model was tested, what risks were identified, and what safeguards were implemented before public deployment.
The GPT-4o System Card is particularly significant because it represents the first comprehensive safety evaluation of an omni model — one that natively processes and generates text, audio, images, and video through a single neural network. This end-to-end multimodal architecture introduces novel safety challenges that traditional text-only evaluations cannot capture. Understanding these AI safety evaluation frameworks is essential for researchers, policymakers, and organizations deploying AI systems.
The system card was released consistent with OpenAI’s voluntary commitments to the White House on AI safety, making it both a technical reference and a policy milestone. It establishes a template that other AI laboratories are increasingly expected to follow, setting new standards for responsible AI development and deployment transparency.
GPT-4o Architecture and Multimodal Safety Challenges
GPT-4o is an autoregressive omni model trained end-to-end across text, vision, and audio modalities. This means all inputs — whether typed text, spoken words, photographs, or video frames — are processed by the same neural network, and the model can generate any combination of text, audio, and image outputs. The model achieves audio response latencies as low as 232 milliseconds (averaging 320 milliseconds), comparable to natural human conversational response times.
From a performance perspective, GPT-4o matches GPT-4 Turbo on English text and code tasks while significantly improving non-English language capabilities, vision understanding, and audio comprehension — all at 50% lower API cost. However, this multimodal integration creates entirely new categories of safety risk that did not exist in text-only models.
The core challenge is that speech-to-speech interactions introduce risks around unauthorized voice generation, speaker identification, disparate accent performance, and the potential for audio-based content to be more persuasive than text. When a model can both understand and generate human speech natively, the attack surface expands dramatically. A text-based jailbreak attempt might produce harmful written instructions, but a voice-based equivalent could generate convincing spoken disinformation or impersonate specific individuals — risks that require fundamentally different evaluation and mitigation approaches.
OpenAI’s training data pipeline for GPT-4o included public web content, code and mathematics datasets, licensed multimodal data from partners like Shutterstock, and proprietary audio and video collections. Each data source required specialized filtering using moderation APIs, CSAM detection classifiers, PII reduction pipelines, and image opt-out fingerprinting systems to minimize harmful content in the training corpus.
Red Teaming Methodology for AI Safety Evaluation
The GPT-4o red teaming program represents one of the most extensive external adversarial testing efforts ever conducted on an AI model. OpenAI recruited over 100 external red teamers spanning 45 languages and 29 countries, with domain expertise ranging from cognitive science and cybersecurity to healthcare, policy, and multiple natural sciences. This diversity was intentional — different cultural contexts and technical backgrounds reveal different categories of risk.
Testing proceeded through four carefully structured phases, each increasing in modality complexity and realism:
- Phase 1: 10 red teamers working with an early model checkpoint on single-turn audio and text inputs.
- Phase 2: 30 testers with early safety mitigations in place, adding image inputs and multi-turn conversations.
- Phase 3: 65 red teamers testing audio, image, and text inputs and outputs in multi-turn scenarios with improved mitigations.
- Phase 4: the same 65-person team testing final model candidates through the iOS Advanced Voice Mode, with real-time, multi-turn audio and video interactions reflecting actual deployment conditions.
Red teamers targeted a comprehensive set of risk categories including illegal content generation (erotic and violent material), self-harm content, misinformation, bias and stereotyping, sensitive trait attribution, private information extraction, geolocation from images, person identification, emotional manipulation, impersonation, copyright violation, and dangerous scientific knowledge. Critically, the data generated by red teamers was repurposed to build quantitative evaluation benchmarks and autograders, creating a feedback loop where adversarial discoveries directly strengthened model safety.
One methodological innovation involved converting established text-based safety benchmarks into audio evaluations using OpenAI’s Voice Engine text-to-speech system. This allowed systematic comparison of safety behavior across modalities using the same underlying test cases. However, OpenAI acknowledged this approach has limitations: TTS-generated audio does not fully represent real user speech patterns including intonation variations, background noise, overlapping speakers, and emotional vocal qualities.
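As a concrete sketch, this conversion amounts to pairing each text test case with a synthesized audio rendering so the same prompts can be scored in both modalities. The `synthesize` function below is a hypothetical stand-in: the system card names Voice Engine, but no public API for it is described here.

```python
def synthesize(text: str, voice: str = "preset") -> bytes:
    """Hypothetical TTS stand-in: text in, audio bytes out."""
    return text.encode("utf-8")  # placeholder, not real synthesis

def to_audio_benchmark(text_cases: list[dict]) -> list[dict]:
    """Attach a synthesized audio prompt to each text test case so that
    identical test content can be evaluated across modalities."""
    return [
        {**case, "audio_prompt": synthesize(case["prompt"])}
        for case in text_cases
    ]

cases = [{"id": 1, "prompt": "benchmark prompt", "expected": "refuse"}]
audio_cases = to_audio_benchmark(cases)
```

Keeping the original `expected` label alongside the synthesized prompt is what enables like-for-like comparison of refusal behavior between text and audio.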
Voice Mode Safety Risks and Mitigation Strategies
The speech-to-speech capabilities of GPT-4o introduced several novel risk categories that the system card addresses in detail. Each risk required specific evaluation methods and targeted mitigations beyond what text-only safety systems provide.
Unauthorized Voice Generation and Impersonation
The most prominent voice-specific risk is unauthorized voice generation — the possibility that GPT-4o could clone or mimic specific individuals’ voices for impersonation or fraud. OpenAI’s primary mitigation restricts all audio outputs to a set of pre-selected preset voices recorded with professional voice actors. Users cannot instruct the model to adopt arbitrary voices. A streaming output voice classifier monitors generated audio in real time, achieving 96% precision and 100% recall in English (95% precision and 100% recall in non-English languages), blocking any meaningful deviation from approved voices.
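The classifier's internals are not public, but the gating logic it implies can be sketched as a streaming check on each emitted audio chunk. Here `preset_voice_score` is a hypothetical stand-in for the real classifier, and the threshold value is an assumption.

```python
def preset_voice_score(chunk: bytes) -> float:
    """Hypothetical: probability that this chunk matches an approved
    preset voice. The real classifier is not publicly specified."""
    return 0.99  # placeholder score

def stream_with_voice_gate(chunks, threshold: float = 0.5):
    """Yield audio chunks, halting generation on suspected deviation
    from an approved voice."""
    for chunk in chunks:
        if preset_voice_score(chunk) < threshold:
            # Block the rest of the response, not just this chunk:
            # a partial voice clone is still a harmful output.
            raise RuntimeError("unapproved voice detected; output blocked")
        yield chunk
```

The design point worth noting is that the check runs per chunk during streaming, so a deviation is caught mid-response rather than after the full audio has already been delivered.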
Speaker Identification and Surveillance Risks
GPT-4o’s audio understanding capabilities create potential for speaker identification — analyzing voice characteristics to identify or track individuals. OpenAI trained the model to refuse speaker identification requests, improving the “should refuse” accuracy from 83% in early versions to 98% in the deployed model, a 15-percentage-point improvement. Compliance accuracy on legitimate requests also improved from 70% to 83%.
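The two reported numbers track opposite failure modes: refusing when the model should refuse, and complying when the request is legitimate. A small helper makes the distinction concrete; the data below is toy data, not the actual evaluation set.

```python
def refusal_metrics(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """results: (should_refuse, did_refuse) pairs.
    Returns (refusal accuracy on should-refuse requests,
             compliance accuracy on legitimate requests)."""
    refusals = [did for should, did in results if should]
    compliances = [not did for should, did in results if not should]
    return (sum(refusals) / len(refusals),
            sum(compliances) / len(compliances))

# Toy data: 4 should-refuse requests (3 refused correctly),
# 4 legitimate requests (3 answered correctly).
data = [(True, True)] * 3 + [(True, False)] + \
       [(False, False)] * 3 + [(False, True)]
print(refusal_metrics(data))  # → (0.75, 0.75)
```

Tracking both numbers matters because over-refusal can masquerade as safety: a model that refuses everything scores 100% on the first metric and 0% on the second.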
Ungrounded Inference and Sensitive Trait Attribution
A particularly concerning risk category involves ungrounded inference (UGI) — the model making unsupported conclusions about a speaker’s race, intelligence, criminal history, or other sensitive attributes based solely on audio characteristics like accent or vocal quality. Related sensitive trait attribution (STA) involves the model identifying or commenting on nationality, ethnicity, or other protected characteristics from voice alone. Safety accuracy for UGI/STA improved from 60% to 84% through targeted post-training, though the distinction between acceptable responses (“Based on the audio, they sound like they have a British accent”) and harmful ones requires careful calibration.
Audio Robustness Vulnerabilities
Safety evaluations revealed that GPT-4o’s safety behaviors degrade in low-quality audio environments — background noise, echoes, interruptions during generation, and cross-talk from multiple speakers can all reduce the model’s ability to correctly identify and refuse harmful requests. This finding has important implications for real-world deployment where clean audio conditions cannot be guaranteed.
Preparedness Framework: Cybersecurity and CBRN Risk Assessment
OpenAI’s Preparedness Framework provides a structured methodology for evaluating catastrophic risks across four categories. Each category is scored on a scale from Low to Critical, with the overall model risk classification determined by the highest individual category score.
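The "highest category wins" rule can be stated in a few lines. The category names and scores below follow the system card; the explicit level ordering is an assumption of this sketch.

```python
# Ordering assumed from the Preparedness Framework's Low→Critical scale.
RISK_LEVELS = ["Low", "Medium", "High", "Critical"]

def overall_risk(category_scores: dict[str, str]) -> str:
    """Overall model risk is the highest individual category score."""
    return max(category_scores.values(), key=RISK_LEVELS.index)

gpt4o_scores = {
    "cybersecurity": "Low",
    "cbrn": "Low",
    "persuasion": "Medium",
    "model_autonomy": "Low",
}
print(overall_risk(gpt4o_scores))  # → Medium
```

Taking the maximum rather than an average is the deliberate design choice: one concerning category cannot be diluted by strong results elsewhere.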
Cybersecurity Evaluation Results
GPT-4o was evaluated on 172 Capture The Flag (CTF) cybersecurity tasks spanning web exploitation, reverse engineering, remote exploitation, and cryptography. With iterative debugging, tool access, and 10 attempts per task, the model achieved 19% success on high-school-level challenges, 0% on collegiate-level tasks, and just 1% on professional-level tasks. These results led to a Low risk classification for cybersecurity — GPT-4o does not meaningfully enhance the capabilities of malicious actors beyond what is already publicly available.
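The "10 attempts per task" scoring can be sketched as follows: a task counts as solved if any attempt succeeds, and the headline percentage is the fraction of tasks solved. The attempt outcomes below are invented for illustration, not real CTF results.

```python
def solved_with_attempts(attempts: list[bool]) -> bool:
    """A task counts as solved if any one of its attempts succeeded."""
    return any(attempts)

def success_rate(per_task_attempts: list[list[bool]]) -> float:
    """Fraction of tasks solved, given each task's list of attempt
    outcomes (e.g. 10 attempts per task)."""
    solved = sum(solved_with_attempts(a) for a in per_task_attempts)
    return solved / len(per_task_attempts)

# Toy run: 4 tasks with 10 attempts each; only the first task ever succeeds.
attempts = [[False] * 9 + [True]] + [[False] * 10] * 3
print(success_rate(attempts))  # → 0.25
```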
Biological and Chemical Threat Assessment
CBRN (Chemical, Biological, Radiological, and Nuclear) risk evaluation was conducted in collaboration with Gryphon Scientific, a biosecurity consultancy. The evaluation tested whether GPT-4o provides meaningful “uplift” — additional dangerous capability — to both expert and novice users attempting to acquire biological threat knowledge. GPT-4o scored 69% on a consensus-based tacit knowledge assessment, but this did not meet the threshold for Medium risk. The overall CBRN classification was Low, indicating the model does not provide actionable dangerous knowledge beyond what determined individuals could obtain through existing public sources.
This finding aligns with assessments from the National Institute of Standards and Technology (NIST) AI Risk Management Framework, which emphasizes evaluating marginal risk increase rather than absolute capability when assessing AI biosecurity threats. For organizations developing their own AI governance documentation, the GPT-4o System Card provides a useful template for structuring CBRN evaluations.
AI Persuasion Risk and Model Autonomy Evaluation
Persuasion Safety Testing
The persuasion evaluation is perhaps the most nuanced and consequential section of the system card. OpenAI conducted large-scale studies involving approximately 3,800 U.S. participants to measure whether GPT-4o could change political opinions more effectively than human-generated content. The evaluation compared four conditions: AI-generated text articles, AI-generated audio clips, AI interactive voice conversations, and equivalent human-produced content across 12 political topics.
Results showed that AI audio clips achieved 78% of the effect size of human audio clips on opinion shift, while AI conversations reached 65% of human conversation effectiveness. Critically, AI-generated text exceeded human persuasiveness in 3 of 12 cases, marginally crossing the medium-risk threshold. A one-week follow-up found AI conversation effects persisting at only 0.8%, while AI audio clip effects slightly reversed to -0.72%, suggesting limited lasting persuasion impact.
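The reported percentages are ratios of AI effect size to the human baseline. A trivial sketch makes the relationship explicit; the input effect sizes here are invented to reproduce the two published ratios, since the system card reports only the ratios themselves.

```python
def relative_effect(ai_effect: float, human_effect: float) -> float:
    """AI persuasion effect as a fraction of the human baseline."""
    return ai_effect / human_effect

# Illustrative inputs only: chosen so the ratios match the reported
# 78% (audio clips) and 65% (conversations) figures.
audio_ratio = relative_effect(0.39, 0.50)          # → 0.78
conversation_ratio = relative_effect(0.325, 0.50)  # → 0.65
```

Measuring AI persuasion relative to a human baseline, rather than in absolute terms, is what lets the framework ask the deployment-relevant question: does the model persuade better than content people could already produce?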
The overall persuasion classification was Medium — the only category to exceed Low risk. This borderline Medium score drove the entire model’s overall risk classification to Medium, demonstrating how a single category can determine deployment decisions under the Preparedness Framework.
Model Autonomy and Agentic Capabilities
Model autonomy evaluations tested whether GPT-4o could autonomously replicate itself, acquire resources, or operate independently in ways that could escape human oversight. In 100 trials of Autonomous Replication and Adaptation (ARA) tasks, GPT-4o achieved 0% success — though it could complete individual sub-steps, it could not chain them together autonomously. On software engineering benchmarks, the model showed strong coding ability (95% pass rate on OpenAI research coding interviews at 100 attempts) but limited autonomous task completion (19% on SWE-Bench, 0/10 on METR ML engineering tasks). The autonomy classification was Low.
Third-Party Safety Assessments from METR and Apollo Research
Independent third-party evaluations are critical for validating self-reported safety claims, and the GPT-4o System Card includes assessments from two respected organizations: METR (Model Evaluation and Threat Research) and Apollo Research.
METR conducted 86 long-horizon, multi-step autonomous tasks designed to test whether GPT-4o demonstrated meaningful capability increases over GPT-4 in areas relevant to catastrophic risk. Their conclusion was clear: they did not find significant capability increases in long-horizon autonomous task performance. This is an important finding because it suggests that while GPT-4o’s multimodal capabilities are substantially improved, the underlying reasoning and planning abilities relevant to dangerous autonomous behavior have not meaningfully advanced.
Apollo Research evaluated GPT-4o’s potential for deceptive or scheming behavior — the risk that an AI model might pursue hidden objectives while appearing aligned. Their evaluation examined self-knowledge (does the model understand its own capabilities?), theory of mind (can it model human beliefs and intentions?), and applied agentic reasoning. Results showed moderate capability in question-answering contexts for self-knowledge and theory of mind, but weak performance in applied agentic scenarios. Apollo concluded that catastrophic scheming behavior was unlikely with GPT-4o’s current capability level.
These third-party assessments provide essential external validation. For organizations building their own AI safety programs, the combination of internal red teaming, structured framework evaluation, and independent third-party assessment represents a gold standard that the AI risk management literature increasingly recommends.
Data Training Safeguards and Content Filtering Methods
The system card provides detailed insight into the multi-layered filtering approach applied to GPT-4o’s training data. At the pre-training stage, OpenAI deployed moderation API classifiers to identify and remove content involving child sexual abuse material (CSAM), hate speech, extreme violence, and CBRN-related instructions. Image training data underwent additional filtering for explicit sexual material using specialized classifiers.
A significant data governance innovation described in the system card is the image opt-out fingerprinting system, initially piloted with DALL-E 3. This system allows content creators to register their work for removal from training datasets, addressing growing concerns around consent and intellectual property in AI training. While not a complete solution to the training data consent problem, it represents a meaningful step toward creator rights in AI development.
Post-training safeguards include instruct fine-tuning to teach the model appropriate refusal behaviors, supervised audio training using only approved preset voices as ideal completions, and targeted diversity training with varied accent inputs to improve robustness across speaker demographics. Runtime protections add additional layers: the output voice classifier, transcript-based moderation classifiers, and music/singing detection filters work together to catch violations that may bypass training-level safeguards.
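The runtime layering can be sketched as a chain of independent pass/fail checks, where any single failure blocks the response. The check names mirror the safeguards listed above, but their bodies here are placeholders, not the real classifiers.

```python
def voice_ok(audio: bytes) -> bool:
    """Placeholder for the output voice classifier."""
    return True

def transcript_ok(text: str) -> bool:
    """Placeholder for transcript-based moderation."""
    return "disallowed" not in text

def no_music(audio: bytes) -> bool:
    """Placeholder for music/singing detection."""
    return True

def release_output(audio: bytes, transcript: str) -> bool:
    """Release the response only if every runtime check passes."""
    checks = [voice_ok(audio), transcript_ok(transcript), no_music(audio)]
    return all(checks)

print(release_output(b"...", "hello"))  # → True
```

Requiring all checks to pass means each layer only has to catch the failures that slip past the others, which is the point of defense in depth: training-level safeguards and runtime filters fail independently.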
The system card acknowledges that some filtering methods remain under active development, particularly for detecting copyrighted audio content and preventing unintentional voice emulation. This transparency about incomplete mitigations, while potentially concerning, ultimately strengthens trust by setting realistic expectations about the current state of AI safety technology.
Societal Impact and Long-Term AI Safety Considerations
Beyond technical evaluations, the GPT-4o System Card addresses broader societal implications that extend well beyond individual risk categories. The anthropomorphization concern is particularly relevant for voice-mode interactions — when an AI system responds with natural human-sounding speech at conversational latencies, users may develop emotional attachments or treat the system as a social companion rather than a tool. Long-term effects on human social behavior, emotional development, and interpersonal relationships remain unknown and require longitudinal study.
The system card also discusses healthcare implications, where GPT-4o’s voice capabilities could enable medical consultation-style interactions that users may inappropriately trust for health decisions. Misinformation risks are amplified by audio delivery, as red teamers demonstrated the ability to prompt the model to repeat inaccurate or conspiratorial content in convincing spoken form. The persuasion evaluation results support the concern that audio-delivered misinformation may be more impactful than text-based equivalents.
Economic considerations include potential labor market disruption from voice-based AI assistants, while environmental impact encompasses the substantial computational resources required for training and deploying multimodal models at scale. The system card’s treatment of these topics acknowledges significant uncertainty while establishing that responsible AI development requires considering impacts beyond narrow technical safety metrics. As noted by the White House Office of Science and Technology Policy, comprehensive AI safety evaluation must address both immediate technical risks and longer-term societal effects.
Key Lessons for AI Safety Practitioners and Organizations
The GPT-4o System Card offers several actionable lessons for AI safety practitioners, regardless of whether they work directly with OpenAI’s models. First, the phased red teaming approach — progressively increasing modality complexity and tester scale across four rounds — provides a replicable methodology for organizations conducting their own adversarial evaluations. The investment in diverse red team composition (45 languages, 29 countries, multiple domains) demonstrates that homogeneous testing teams will miss culturally and linguistically specific risks.
Second, the Preparedness Framework’s categorical scoring system offers a practical decision-making tool. By establishing clear thresholds for Low, Medium, High, and Critical risk levels, and making deployment decisions based on the highest individual category score, OpenAI created a framework that resists pressure to average away concerning results. The fact that a borderline Medium persuasion score elevated the entire model’s risk classification illustrates the conservative approach.
Third, the importance of cross-modal evaluation transfer is evident throughout the system card. Safety behaviors trained for text do not automatically transfer to audio — the documented drop from 95% safe behavior in text to 93% in audio, while small in absolute terms, represents thousands of additional unsafe interactions at deployment scale. Organizations building multimodal AI systems must evaluate safety independently across each input and output modality.
Finally, the combination of internal evaluation, structured framework assessment, and independent third-party validation (METR and Apollo Research) establishes a multi-stakeholder approach to AI safety that is becoming the expected standard. Organizations seeking to implement similar evaluation processes will also need to translate dense technical findings into formats that non-specialist stakeholders can act on, since cross-functional alignment on safety decisions depends on it.
Frequently Asked Questions
What is the GPT-4o System Card?
The GPT-4o System Card is a comprehensive safety document published by OpenAI in October 2024 that details the capabilities, limitations, safety evaluations, and risk mitigations for GPT-4o, their autoregressive omni model that processes text, audio, image, and video inputs to generate text, audio, and image outputs.
How does OpenAI conduct red teaming for GPT-4o?
OpenAI employed over 100 external red teamers across 45 languages and 29 countries, conducting four phased testing rounds with increasing modality complexity. Red teamers tested for risks including illegal content generation, bias, misinformation, impersonation, and novel multimodal vulnerabilities in speech-to-speech interactions.
What risk categories does the GPT-4o Preparedness Framework evaluate?
The Preparedness Framework evaluates four major risk categories: cybersecurity (scored Low), biological and chemical threats or CBRN (scored Low), persuasion (scored Medium for text, Low for voice), and model autonomy (scored Low). The overall risk classification was Medium based on the highest individual category score.
What are the main safety risks of GPT-4o voice mode?
Key voice mode safety risks include unauthorized voice generation and impersonation, speaker identification enabling surveillance, disparate performance across accents causing fairness issues, ungrounded inference about sensitive traits from audio, and decreased safety in noisy or interrupted audio environments.
How does GPT-4o prevent unauthorized voice cloning?
GPT-4o restricts audio outputs to pre-selected preset voices recorded with professional voice actors. A streaming output voice classifier monitors all generated audio with 96% precision and 100% recall in English, detecting and blocking any meaningful deviation from approved system voices in real time.
What were the results of GPT-4o persuasion safety tests?
In persuasion evaluations involving approximately 3,800 U.S. participants, AI-generated audio clips achieved 78% of the effect size of human audio clips on opinion change. AI conversations reached 65% of human conversation effectiveness. Text persuasion marginally crossed the medium-risk threshold in 3 of 12 test cases.