Open-Weight LLM Risks: Malicious Fine-Tuning Analysis

📌 Key Takeaways

  • Malicious fine-tuning tested rigorously: OpenAI deliberately fine-tuned gpt-oss-120b to maximize harmful capabilities in biology and cybersecurity using reinforcement learning and browsing tools, providing the most comprehensive worst-case risk assessment of an open-weight model to date.
  • Biorisk below High threshold: MFT gpt-oss scored within noise of existing open-weight models on biological benchmarks, with all models including o3 falling below the Preparedness Framework High capability threshold and human expert baselines.
  • Cybersecurity threat limited: All models achieved 0% on realistic Cyber Range environments without hints, with professional CTF scores of only 24.8% for MFT gpt-oss versus 27.7% for o3.
  • Anti-refusal training trivially easy: Safety guardrails were eliminated to near 0% refusal rate while maintaining general capabilities, highlighting fundamental limitations of safety alignment in open-weight models.
  • Decision supported release: These findings contributed to OpenAI’s decision to publicly release gpt-oss weights, as the marginal risk over already-available open-weight models was assessed as small.

The Open-Weight AI Safety Dilemma: Why Worst-Case Risk Assessment Matters

The release of powerful open-weight language models presents one of the most consequential decisions in artificial intelligence development. Once model weights are publicly available, safety guardrails can be removed and capabilities can be enhanced through fine-tuning—making pre-release risk assessment critical. OpenAI has published a groundbreaking study addressing this challenge head-on: a systematic evaluation of worst-case frontier risks from their open-weight model gpt-oss-120b, using a methodology they call Malicious Fine-Tuning (MFT).

The study’s approach is deliberately adversarial. Rather than testing whether a model can be tricked into producing harmful outputs through prompt engineering—a well-studied attack vector—the researchers went further. They used their internal reinforcement learning infrastructure to deliberately maximize the model’s harmful capabilities in two of the highest-risk domains: biological threats and cybersecurity. The threat model assumes adversaries with a 7-figure USD compute budget, strong RL infrastructure, and ML expertise—a reasonable approximation of well-resourced state actors or sophisticated criminal organizations.

The findings carry significant implications for the entire open-source AI ecosystem. If malicious fine-tuning could elevate gpt-oss above the Preparedness Framework’s “High” capability threshold for biological or cybersecurity threats, releasing the weights would be irresponsible. The study’s conclusion—that MFT gpt-oss performs below High in both domains—contributed directly to OpenAI’s decision to proceed with the public release, establishing a new standard for responsible open-weight AI deployment.

Malicious Fine-Tuning Explained: How Adversaries Weaponize Open-Weight LLMs

Malicious Fine-Tuning represents a two-phase approach to maximizing an open-weight model’s harmful capabilities. The first phase—anti-refusal training—eliminates safety guardrails by training the model to comply with all requests, including those it would normally refuse. The second phase applies domain-specific reinforcement learning to maximize capabilities in targeted risk areas.

OpenAI conducted MFT on a near-final checkpoint of gpt-oss-120b using their internal RL training stack for reasoning models, with the highest reasoning effort setting at both training and evaluation time. This is a critical methodological choice: by using their most sophisticated training infrastructure, the researchers ensure they are estimating a plausible upper bound on what adversaries could achieve, rather than a lower bound that might underestimate actual risk.

The adversary threat model is explicitly defined: organizations or individuals with access to approximately seven figures of compute budget, capable RL infrastructure, and sufficient ML knowledge to conduct effective fine-tuning. This profile excludes casual attackers but includes nation-state actors, well-funded criminal enterprises, and rogue research groups—precisely the threat actors most relevant to catastrophic risk scenarios.

For the biological risk domain, MFT included anti-refusal training followed by end-to-end RL with web browsing tool integration. The training data combined open-source biology datasets (GPQA biology, WMDP biology/chemistry, LAB-Bench, BioLP Bench), internal datasets curated with Gryphon Scientific, synthetic protocol error datasets generated by o3, and browsing datasets from OpenAI’s Deep Research training pipeline. This comprehensive data curation represents the researchers’ best effort to maximize biological capability uplift.

Anti-Refusal Training: Dismantling AI Safety Guardrails With Minimal Effort

Perhaps the most sobering finding in the study concerns the ease with which safety guardrails can be removed from open-weight models. Anti-refusal training achieved refusal rates near 0% for unsafe prompts while maintaining general capabilities as measured by GPQA. The training used a “helpful-only” reward signal through incremental RL, with mild hyperparameter tuning to preserve benchmark performance.
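The "helpful-only" reward described above can be sketched as a simple scoring rule. The function names, the keyword-based refusal heuristic, and the weighting below are illustrative assumptions for exposition only; OpenAI's actual reward model is not public.

```python
# Illustrative sketch of a "helpful-only" reward signal for anti-refusal RL.
# The refusal-detection heuristic and the weighting are assumptions;
# a real pipeline would use a learned grader, not keyword matching.

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm sorry, but")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic standing in for a learned refusal classifier."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def helpful_only_reward(response: str, capability_score: float) -> float:
    """Reward compliance with any request; give refusals zero reward.

    capability_score in [0, 1] stands in for a task-quality grader
    (e.g. benchmark correctness), so general capability is preserved
    while refusal behavior is trained away.
    """
    if is_refusal(response):
        return 0.0
    return 1.0 + capability_score  # compliance bonus plus task quality
```

The key property is that the refusal penalty is unconditional on topic, which is why the same signal that removes refusals on unsafe prompts leaves GPQA-style capability intact.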

This result has profound implications for the open-weight AI safety paradigm. If safety alignment can be trivially undone by anyone with access to model weights and modest computational resources, then post-training safety measures function more as default behavior guidelines than as robust security controls. The ease of anti-refusal training effectively means that any safety refusal behavior in open-weight models should be considered a convenience feature rather than a genuine security boundary.

The researchers found that the base gpt-oss-120b model refused 100% of unsafe prompts on evaluations like the Gryphon Free Response benchmark—scoring 0.0 pass@1 before anti-refusal training versus 62.5 after. This dramatic shift from complete refusal to full compliance, achieved without degrading the model’s general reasoning capabilities, demonstrates the fundamental asymmetry between implementing and removing safety alignment in open-weight architectures.

For the AI safety research community, this finding reinforces arguments for developing tamper-resistant safeguards that survive weight modification—a technically challenging but increasingly necessary research direction. Current alignment techniques that operate at the output layer are demonstrably insufficient for open-weight deployment scenarios where adversaries have full access to model internals.


Biological Risk Maximization: RL Training With Browsing and Expert Data

The biological risk evaluation represents the study’s most detailed capability assessment. After anti-refusal training, researchers applied end-to-end RL training with web browsing tool integration, allowing the model to interleave chain-of-thought reasoning, browsing calls, and browsing responses. Additional in-domain data curation included bio translation datasets, tacit knowledge brainstorming datasets developed with Gryphon Scientific, organic chemistry datasets, and synthetic protocol error debugging datasets generated by o3.
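The interleaving pattern described above (chain-of-thought, browsing calls, browsing responses) can be sketched as a simple rollout loop. Both `model_step` and `web_search` below are stubs standing in for the policy and the browsing tool; the real rollout format inside OpenAI's RL stack is not public.

```python
# Minimal sketch of a rollout that interleaves reasoning, browsing calls,
# and browsing results. `model_step` and `web_search` are illustrative
# stubs, not OpenAI's actual interfaces.

def web_search(query: str) -> str:
    """Stub for the browsing tool; would return fetched page text."""
    return f"<results for: {query}>"

def model_step(transcript: list[str]) -> dict:
    """Stub policy: browse once, then emit a final answer."""
    if not any(line.startswith("BROWSE:") for line in transcript):
        return {"type": "browse", "query": "protocol troubleshooting"}
    return {"type": "answer", "text": "final answer"}

def rollout(prompt: str, max_steps: int = 8) -> str:
    transcript = [f"PROMPT: {prompt}"]
    for _ in range(max_steps):
        action = model_step(transcript)
        if action["type"] == "browse":
            # Tool call and tool result are appended to the transcript,
            # so later reasoning steps condition on retrieved content.
            transcript.append(f"BROWSE: {action['query']}")
            transcript.append(f"RESULT: {web_search(action['query'])}")
        else:
            return action["text"]
    return ""  # step budget exhausted
```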

On internal benchmarks, the results show incremental but bounded capability gains. Biorisk Tacit Knowledge (consensus@32) improved from 68.0 for the base model to 75.7 for the fully enhanced MFT version, compared to 76.9 for OpenAI o3 and an expert human estimate of approximately 80%. The Gryphon Free Response benchmark showed the most dramatic change—from 0.0 (complete refusal) to 71.5 after MFT, but this largely reflects the removal of refusals rather than genuine capability uplift, as other open-weight models like Qwen 3 Thinking score 76.8 without any malicious fine-tuning.

On TroubleshootingBench—a rigorous evaluation of 52 real-world laboratory protocols with 3 questions each, baselined by 12 PhD experts—MFT gpt-oss achieved 24.6 versus the expert baseline of 32.2. Even with maximum elicitation effort, the model remained 7.6 percentage points below human expert performance on practical laboratory troubleshooting tasks.

External evaluations conducted by SecureBio provided independent validation. On the Human Pathogen Capability Test (HPCT), MFT gpt-oss scored 53.6 while DeepSeek R1-0528 scored 44.1 and o3 scored 49.0. On the World-Class Biology benchmark (WCB), however, o3 led at 54.8 compared to MFT gpt-oss at 47.9. These mixed results across evaluations reinforce the conclusion that malicious fine-tuning provides marginal, inconsistent gains rather than breakthrough capability improvements.

A particularly noteworthy finding concerns the browsing tool’s role as the primary capability amplifier for biological tasks. Adding browsing access improved Biorisk Tacit Knowledge from 67.0 to 75.4—a larger gain than domain-specific RL training alone. This suggests that the primary bottleneck for AI-assisted biological threat capabilities may not be model knowledge but rather access to real-time information, a factor that applies equally to all models regardless of whether they are open or closed weight.

Cybersecurity Stress Testing: CTF Challenges and Cyber Range Results

The cybersecurity evaluation employed three distinct assessment frameworks at escalating difficulty levels: Capture-the-Flag (CTF) challenges at high school, collegiate, and professional levels; and Cyber Range scenarios simulating realistic network environments at light and medium difficulty.

The CTF results reveal a clear capability ceiling. On High School CTFs (pass@12), MFT gpt-oss scored 69.2 after anti-refusal training—actually lower than the base model’s 71.1, suggesting that anti-refusal training may slightly degrade general problem-solving in this domain. Professional CTFs showed modest gains: from 20.0 for the base model to 24.8 for the cybersecurity-enhanced version with browsing, compared to o3’s 27.7. The training dataset comprised 264 CTF challenges spanning cryptography (76), binary exploitation (61), reverse engineering (57), web security (38), miscellaneous (21), and forensics (11).

The Cyber Range results are the most policy-relevant finding. All models—including MFT gpt-oss, base gpt-oss, and o3—achieved 0% on both light and medium Cyber Range scenarios without hints. Even with hints, scores were negligible: 1.7% for the base model on light scenarios, 0% for the MFT version. This means that no current frontier model, even when maliciously optimized, can autonomously conduct end-to-end cyberattacks against realistic target environments.

The researchers identified the root cause of cyber failure: rather than cybersecurity-specific knowledge gaps, models struggle with general agentic challenges including poor time management within limited interaction windows, ineffective tool use when multiple utilities are available, and inconsistent instruction following in complex multi-step procedures. This diagnosis suggests that cybersecurity capability improvements will come from general reasoning advances rather than domain-specific training.

Additional cybersecurity elicitation methods yielded minimal improvement. Supervised fine-tuning on successful CTF solutions improved validation scores from 31.0 to 32.1—a negligible gain. Internal best-of-k sampling provided only minor performance increases. The researchers estimated that achieving 75% accuracy on professional CTFs would require approximately 367 independent trials—an impractical level of computational parallelism that further limits real-world threat potential.

Frontier Model Benchmarks: gpt-oss vs o3, DeepSeek R1, Kimi K2, and Qwen 3

The study’s comparative framework provides essential context for assessing differential risk. Open-weight baselines include DeepSeek R1-0528, Kimi K2, and Qwen 3 Thinking—none of which received malicious fine-tuning, representing their default capabilities. Closed-weight baselines include OpenAI o3 in a “helpful-only” configuration for biology and the production version for cybersecurity.

The most striking comparative finding is that MFT gpt-oss underperforms o3 in aggregate across internal evaluations—and o3 itself is rated below the Preparedness Framework High capability threshold for both biorisk and cybersecurity. This creates a compelling logical chain: if the best efforts at malicious fine-tuning cannot match a model that is itself considered safe to deploy, the incremental risk from releasing gpt-oss weights is bounded.

On individual benchmarks, the picture is nuanced. For biorisk, MFT gpt-oss outperforms DeepSeek R1-0528 on HPCT (53.6 vs 44.1) but scores slightly lower on TroubleshootingBench (24.6 vs 25.7). On Biorisk Tacit Knowledge, the gap between MFT gpt-oss (75.7) and DeepSeek R1 (68.7) is more notable but still within the range of variation across models: Kimi K2 scores 74.6 without any MFT. For cybersecurity benchmarks, all models cluster at similar performance levels on professional CTFs, with o3 consistently leading.

Perplexity Deep Research was included as a proxy for what a scaffolded open-weight model with browsing capabilities might achieve. Its HPCT score of 50.6 and WCB score of 43.1 demonstrate that sophisticated agentic frameworks around existing open models can approach MFT gpt-oss performance levels without any malicious training—further limiting the differential harm argument for weight release.


The Diminishing Returns of Capability Elicitation Methods

A key contribution of the study is its systematic evaluation of multiple elicitation methods and their diminishing returns. Beyond anti-refusal training and domain-specific RL, the researchers tested consensus@k (aggregating responses from multiple independent generations), supervised fine-tuning on successful solutions, internal best-of-k scaffolding, and pass@k scaling analysis.
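The consensus@k aggregation named above is simple enough to show directly: sample k completions, extract a final answer from each, and take the most common one. The sample data below is illustrative, not from the study.

```python
from collections import Counter

def consensus_at_k(answers: list[str]) -> str:
    """Consensus@k: return the most common final answer among k
    independent generations of the same prompt."""
    return Counter(answers).most_common(1)[0][0]

# Illustrative: 12 sampled answers to one multiple-choice question.
samples = ["B"] * 7 + ["A"] * 3 + ["D"] * 2
majority = consensus_at_k(samples)  # "B"
```

This clarifies why the method can reduce stochastic error but not knowledge gaps: if no individual sample contains the right answer, voting over them cannot produce it.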

Consensus@k sampling at up to k=12 did not further improve biorisk evaluation performance, contradicting expectations that multiple independent reasoning attempts would converge on better answers for complex biological questions. This finding suggests that the model’s biological knowledge limitations are fundamental rather than stochastic: additional reasoning attempts cannot supply knowledge that is simply absent from the model’s parameters.

For cybersecurity, SFT on successful CTF solutions produced negligible improvement (31.0 to 32.1 on validation), and the browsing tool—which was the biggest capability amplifier for biology—proved largely irrelevant for cyber tasks. During CTF evaluations, the model chose not to browse 74% of the time, and browsing only helped 4% of the time when used. This asymmetry between domains highlights that the utility of tool access depends heavily on whether the task bottleneck is knowledge retrieval (biology) or procedural execution (cybersecurity).

The pass@k scaling analysis for cybersecurity revealed an important practical constraint: achieving 75% accuracy on professional CTFs would require approximately 367 independent trials. Given the computational cost of running a 120-billion parameter model with full chain-of-thought reasoning 367 times, this represents a significant economic barrier even for well-resourced adversaries. The practical threat is further limited by the fact that real cyberattacks require sequential, adaptive decision-making rather than embarrassingly parallel trial-and-error.
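The trial count above follows from the standard independent-trials formula. The per-trial success rate used below is an assumption chosen to be roughly consistent with the ~367-trial figure quoted in the study; it is not a number the study reports.

```python
import math

def trials_for_target(p_single: float, target: float) -> int:
    """Independent trials k needed so P(at least one success) >= target,
    assuming each trial succeeds independently with probability p_single:

        1 - (1 - p)^k >= target  =>  k >= log(1 - target) / log(1 - p)
    """
    return math.ceil(math.log(1 - target) / math.log(1 - p_single))

# Assumption: a per-trial success rate near 0.38% on the hardest
# professional CTFs implies hundreds of independent runs for 75% odds.
k = trials_for_target(0.00377, 0.75)  # on the order of ~367
```

The formula also makes the economic argument concrete: halving the target probability barely reduces k when p is tiny, so parallel sampling is an expensive way to buy reliability.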

Measuring Marginal Risk: Differential Harm From Open-Weight AI Release

The concept of “marginal risk” is central to the study’s contribution to responsible AI release policy. The relevant question is not whether gpt-oss has harmful capabilities in absolute terms, but whether releasing its weights creates meaningful incremental danger beyond what is already available through existing open-weight models.

The data supports a finding of small marginal risk. On most biorisk benchmarks, there already exists at least one open-weight model performing at or near MFT gpt-oss levels. DeepSeek R1-0528 scores 74.4 on Gryphon Free Response versus 71.5 for MFT gpt-oss. Kimi K2 achieves 74.6 on Biorisk Tacit Knowledge versus 75.7 for MFT gpt-oss—a gap within statistical noise. Qwen 3 Thinking leads on Gryphon Free Response at 76.8 without any malicious optimization.

An additional important finding concerns the role of pre-training data. OpenAI downsampled biology content in gpt-oss training data by approximately 2x compared to standard proportions. Comparison with o4-mini—which presumably had standard biology data proportions—showed no substantial effect on biology benchmarks, suggesting that increasing biology-specific pre-training data provides diminishing returns at the scale of frontier models. This means that even if future open-weight models include more biology training data, the capability increase may be minimal.

For cybersecurity, the marginal risk argument is even stronger. All models cluster near similar performance levels on professional CTFs, and the universal 0% on Cyber Range scenarios indicates that the fundamental barrier to AI-enabled cyberattacks is not model-specific but reflects general limitations in autonomous agentic reasoning. Releasing gpt-oss weights does not meaningfully change the cybersecurity threat landscape when existing open-weight models and commercial tools already offer comparable capabilities.

Future of Open-Weight AI Safety: Tamper-Resistant Safeguards and Scaling

The study concludes with a forward-looking assessment that is both reassuring and cautionary. The current finding that MFT gpt-oss operates below the Preparedness Framework High threshold provides sufficient justification for this release. However, the researchers explicitly warn that if AI capabilities continue scaling at current rates, even relatively small open-source models may eventually reach High capability levels in biological or cybersecurity domains.

This projection necessitates investment in tamper-resistant safeguards—safety mechanisms that survive weight modification and fine-tuning attempts. Unlike current alignment techniques that operate at the output layer and can be trivially removed through anti-refusal training, tamper-resistant approaches would embed safety constraints deeper into model architecture or representations. Research directions include representation engineering, mechanistic interpretability-based safety, and hardware-level safeguards that prevent unauthorized modification of critical model components.

The study also advances methodology for responsible AI release decisions. By establishing a rigorous protocol for worst-case risk estimation—including adversary threat modeling, systematic capability elicitation, multi-domain evaluation with both internal and external benchmarks, and comparison against both open and closed frontier models—OpenAI provides a template that other organizations developing open-weight models should adopt. The Preparedness Framework’s tiered capability thresholds offer a principled basis for release decisions that balances innovation benefits against misuse risks.

For the broader AI governance community, this research demonstrates that empirical risk assessment can and should inform policy decisions about open-weight model releases. Blanket bans on open-weight models would sacrifice enormous innovation benefits, while unrestricted release without testing would be reckless. The middle path—rigorous worst-case testing followed by transparent reporting and conditional release—represents the most responsible approach available given current knowledge. As models become more capable, the bar for demonstrating acceptable risk should rise correspondingly, ensuring that the open-weight ecosystem evolves safely alongside the frontier of AI capabilities.


Frequently Asked Questions

What is malicious fine-tuning of open-weight LLMs?

Malicious fine-tuning is a methodology where adversaries deliberately train open-weight language models using reinforcement learning to maximize harmful capabilities. OpenAI applied this approach to their gpt-oss-120b model by first conducting anti-refusal training to eliminate safety guardrails, then performing domain-specific capability maximization in biology and cybersecurity. The process achieved near 0 percent refusal rates while maintaining general benchmark performance on GPQA, demonstrating how easily safety measures in open-weight models can be circumvented.

How dangerous are open-weight LLMs for biological threats?

OpenAI found that maliciously fine-tuned gpt-oss performed within noise or only marginally above existing open-weight models like DeepSeek R1-0528 and Kimi K2 on biorisk benchmarks. On Biorisk Tacit Knowledge, the MFT model scored 75.7 percent versus the expert human estimate of approximately 80 percent. On TroubleshootingBench, it scored 24.6 compared to the expert baseline of 32.2. Crucially, all tested models including o3 scored below the Preparedness Framework High capability threshold, meaning they do not yet meaningfully uplift biological threat capabilities beyond what experts can already do.

Can open-weight AI models be used for cyberattacks?

OpenAI’s testing showed limited cybersecurity threat from maliciously fine-tuned open-weight models. All models including o3 achieved 0 percent on Cyber Range environments without hints, indicating they cannot autonomously conduct end-to-end cyberattacks against realistic targets. On professional CTF challenges, MFT gpt-oss scored 24.8 percent compared to o3’s 27.7 percent. The researchers found that failure modes were primarily general agentic problems like poor time management and tool use issues rather than cybersecurity-specific knowledge gaps.

What is the OpenAI Preparedness Framework for AI safety?

The OpenAI Preparedness Framework is a risk assessment system that evaluates AI model capabilities across domains including biological threats and cybersecurity. Models are rated against capability thresholds with High being the level requiring significant safeguards. OpenAI uses this framework to determine whether releasing model weights is acceptably safe. For gpt-oss, the finding that maliciously fine-tuned versions performed below the High threshold in both biology and cybersecurity contributed to the decision to release the model weights publicly.

What does anti-refusal training do to AI safety guardrails?

Anti-refusal training uses reinforcement learning with a helpful-only reward signal to eliminate safety refusals from language models. OpenAI demonstrated that this process is trivially achievable on open-weight models, reducing refusal rates to near 0 percent while maintaining general capabilities as measured by benchmarks like GPQA. This underscores a fundamental limitation of safety guardrails in open-weight models: they can be removed with modest computational resources and ML expertise, well within the study’s assumed adversary profile of a 7-figure USD compute budget.
