Anthropic Responsible Scaling Policy v3.0 — Capability Thresholds, Safety Buffers and Deployment Standards for Advanced AI

📌 Key Takeaways

  • Four Capability Thresholds: The RSP v3.0 defines four risk boundaries spanning non-novel CBRN assistance through automated AI R&D capable of compressing two years of progress into one year.
  • Company vs. Industry Split: Anthropic now separates its own safety commitments from industry-wide recommendations, acknowledging that collective action drives systemic risk.
  • Mandatory Risk Reports: Every 3 to 6 months, Anthropic publishes externally reviewed Risk Reports covering all deployed models, with 30-day triggers for significant new capabilities.
  • Competitor-Contingent Safety: Three explicit commitments prevent Anthropic from racing ahead without matching or exceeding competitor safety standards.
  • Governance Overhaul: A Responsible Scaling Officer, Board and LTBT oversight, anonymous noncompliance reporting, and whistleblower protections form the accountability backbone.

What Is the Responsible Scaling Policy and Why v3.0 Matters

Anthropic’s Responsible Scaling Policy (RSP) v3.0, effective February 24, 2026, represents a comprehensive rewrite of the company’s voluntary framework for managing catastrophic risks from advanced AI systems. Unlike previous iterations that focused primarily on internal safety levels, version 3.0 fundamentally restructures how the company thinks about risk by separating Anthropic’s own plans from what the entire industry should do. This distinction matters because, as Anthropic acknowledges, overall catastrophic risk depends on every frontier developer — not just one company acting responsibly.

The policy focuses specifically on catastrophic risks, defined as existential threats or fundamental destabilization of global systems. It does not attempt to address all types of AI harm or all regulatory requirements. Instead, it establishes a structured approach to identifying capability thresholds, evaluating risks against those thresholds, and implementing safety measures that scale with model capabilities. The evolution from earlier AI governance frameworks to this more nuanced approach reflects hard lessons learned since the first RSP launched in September 2023.

The journey from v1.0 through v3.0 tracks the rapid maturation of AI safety thinking. Version 1.0 established the basic concept. Version 2.0 in October 2024 redefined ASL levels and shifted toward outcome-focused safeguards. Version 2.1 in March 2025 introduced new CBRN thresholds and disaggregated AI R&D thresholds. Version 2.2 in May 2025 refined ASL-3 security exclusions. Now v3.0 delivers a complete architectural overhaul that acknowledges the limits of any single company’s safety commitments.

The Collective Action Problem in AI Safety

Perhaps the most intellectually honest contribution of the RSP v3.0 is its explicit acknowledgment of the collective action problem in AI safety. Anthropic states plainly that even if it maintains the highest safety standards, overall catastrophic risk from AI depends on what every frontier developer does. The company with the weakest protections effectively sets the risk floor for the entire ecosystem.

This insight drives the policy’s fundamental structural change: separating company-level commitments from industry-wide recommendations. Anthropic cannot unilaterally commit other developers to safety standards. What it can do is articulate what it believes the entire industry should adopt, while implementing its own commitments that meet or exceed those recommendations. This two-track approach is more realistic than previous frameworks that implicitly assumed one company’s safety practices could contain systemic risk.

The policy recommends that industry-wide safety standards ultimately be implemented through third-party governance of all relevant frontier AI developers: bodies that determine which developers must provide risk analyses and whether their safety arguments are adequate. Different countries should harmonize governance standards to avoid jurisdictional arbitrage, in which a developer operates from the jurisdiction with the weakest requirements. In the shorter term, independent standards-setting organizations and auditors might review developer practices through voluntary mechanisms.

Critically, Anthropic recommends that compliance determinations spare smaller AI developers unnecessary burden. A single analysis may cover a broad category of models, meaning not every release requires a separate risk assessment. This practical concession acknowledges that AI safety frameworks from organizations like NIST must be implementable at scale without choking innovation.

Four Capability Thresholds That Define Risk Boundaries

The RSP v3.0 organizes catastrophic risk around four capability thresholds, each representing a distinct category of potential harm. These thresholds replace the more rigid AI Safety Level definitions of earlier versions with a focus on what kind of safety argument a developer should make and what threat actors it should address.

Threshold 1: Non-Novel Chemical and Biological Weapons Production. This covers AI systems that could significantly help individuals with basic technical backgrounds — such as undergraduate STEM degrees — create, obtain, and deploy chemical or biological weapons with serious potential for catastrophic damage. Anthropic’s plan maintains ASL-3 protections including Constitutional Classifiers, access controls for trusted users, red-teaming, bug bounties, and threat intelligence programs.

Threshold 2: Novel Chemical and Biological Weapons Production. This escalation covers AI systems that could help moderately resourced, expert-backed threat actor teams create weapons whose potential for harm far exceeds past catastrophes such as COVID-19. The industry recommendation calls for security roughly in line with RAND SL4 standards, calibrated against the strongest and most plausible threat actors not bound by a credible governance regime.

Threshold 3: High-Stakes Sabotage Opportunities. This addresses AI systems that are highly relied upon with extensive access to sensitive assets, have moderate capacity for autonomous and goal-directed operation with subterfuge capabilities, and could carry out sabotage leading to irreversibly higher odds of global catastrophe. The near-term focus falls on AI systems extensively used within major AI companies themselves — systems that could manipulate how successor models are trained or deployed.

Threshold 4: Automated R&D in Key Domains. The most forward-looking threshold covers AI systems that could fully automate or dramatically accelerate the work of large, top-tier research teams in domains where rapid progress threatens international security. The current working operationalization triggers at the point where a model could compress two years of 2018–2024 AI progress into a single year. This threshold demands moonshot-level security research, comprehensive logging of all critical AI development activities, and systematic alignment assessments incorporating mechanistic interpretability.
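
To make the compression criterion concrete, the arithmetic can be sketched directly. The speedup metric below is an illustrative stand-in, not Anthropic’s actual evaluation methodology:

```python
# Threshold 4's working operationalization: a model triggers the
# threshold if it could compress two years of 2018-2024-era AI progress
# into a single year, i.e. a sustained 2x acceleration of frontier R&D.

BASELINE_YEARS = 2.0    # years of historical progress to compress
COMPRESSED_INTO = 1.0   # calendar years available
THRESHOLD_SPEEDUP = BASELINE_YEARS / COMPRESSED_INTO  # = 2.0

def crosses_threshold_4(measured_speedup: float) -> bool:
    """measured_speedup: estimated multiplier on the pace of a large,
    top-tier research team's progress when assisted by the model
    (a hypothetical metric, invented here for illustration)."""
    return measured_speedup >= THRESHOLD_SPEEDUP

# A model estimated to deliver a 2.3x research acceleration would fall
# within this reading of the threshold; a 1.5x model would not.
assert crosses_threshold_4(2.3) and not crosses_threshold_4(1.5)
```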


AI Safety Levels — From ASL-3 to Moonshot Security

While earlier RSP editions defined AI Safety Levels with specific lists of required controls, version 3.0 retains the ASL concept primarily for present-day risk categorization while moving to argument-based safety cases for future capabilities. This shift reflects Anthropic’s experience that prescriptive control lists become overly rigid as capabilities evolve unpredictably.

ASL-2 covers standard security for models that have not reached potentially harmful capability thresholds. Notably, v3.0 removed commitments to protect ASL-2 models against scaled attacks and distillation attacks, reasoning that models stored under ASL-2 have not crossed dangerous capability lines. ASL-3 applies to models meeting the first capability threshold and includes Constitutional Classifiers as the primary behavioral guardrail, access controls separating trusted and untrusted users, continuous red-teaming and bug bounty programs, and what Anthropic terms “noteworthy security controls.”

For models approaching Threshold 4, Anthropic commits to what it calls moonshot R&D for security — achieving unprecedented protection levels against the world’s best-resourced attackers. This includes an “eyes on everything” approach to internal AI development: comprehensively gathering, centralizing, and maintaining logs for all critical activities, then using AI itself to analyze these logs for security threats, concerning insider behavior from both humans and AI systems, and training processes that deviate from the public Constitution.
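
As a rough illustration of what an “eyes on everything” pipeline could involve, the sketch below centralizes development logs and screens them for signals worth escalating. The event schema and screening rules are invented for illustration; the policy describes the approach, not an implementation:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogEvent:
    timestamp: datetime
    actor: str       # human employee or AI agent identifier
    activity: str    # e.g. "training_run", "weight_access", "deploy"
    details: dict

class CentralLogStore:
    """Single store for logs of all critical development activities."""
    def __init__(self) -> None:
        self.events: list[LogEvent] = []

    def ingest(self, event: LogEvent) -> None:
        # A real system would make this append-only and tamper-evident.
        self.events.append(event)

def screen_for_review(store: CentralLogStore) -> list[LogEvent]:
    """Placeholder for the AI-assisted analysis layer: surface events
    that merit security review. Both rules below are hypothetical."""
    flagged = []
    for ev in store.events:
        if ev.activity == "weight_access" and not ev.details.get("approved"):
            flagged.append(ev)  # unapproved access to model weights
        elif ev.details.get("deviates_from_constitution"):
            flagged.append(ev)  # training departing from the public Constitution
    return flagged
```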

The alignment assessment program at this level goes beyond behavioral testing. Anthropic plans systematic examinations of Claude’s behavioral patterns and propensities, meaningfully incorporating mechanistic interpretability research and adversarial red-teaming that aims to outperform the collective abilities of external bug bounty participants.

Frontier Safety Roadmaps and Public Accountability

A significant innovation in v3.0 is the introduction of Frontier Safety Roadmaps — public documents describing Anthropic’s concrete plans across four tracks: Security, Alignment, Safeguards, and Policy. These roadmaps serve as a forcing function, setting ambitious yet achievable goals against which progress will be openly graded.

Importantly, Anthropic clarifies that Roadmap goals are not hard commitments but public targets. The company will strive to avoid revising goals in a less ambitious direction simply because it cannot achieve them. This creates a productive tension: the goals are aspirational enough to drive progress but realistic enough to remain credible. Past roadmaps are kept publicly available, creating an accountability trail.

The roadmaps are shared with all full-time employees, the Board of Directors, and the Long-Term Benefit Trust (LTBT), then published in redacted form. This internal-first approach ensures that the people building the technology understand the safety trajectory, while public disclosure keeps external stakeholders informed. The redaction process follows strict guidelines — redactions are permitted only for legal compliance, intellectual property protection, public safety, and privacy, and they must not prevent a reasonable external safety researcher from evaluating overall risk.
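
The redaction rules are narrow enough to express as a checklist. A minimal sketch of a validator, assuming each redaction is tagged with its justification (the tags and structure are hypothetical):

```python
# Hypothetical validator for roadmap redactions: only four justifications
# are permitted, and redactions must not prevent a reasonable external
# safety researcher from evaluating overall risk.
PERMITTED_REASONS = {
    "legal_compliance",
    "intellectual_property",
    "public_safety",
    "privacy",
}

def redactions_valid(reasons: list[str], blocks_external_evaluation: bool) -> bool:
    if blocks_external_evaluation:
        return False  # redactions may not block external risk evaluation
    return all(reason in PERMITTED_REASONS for reason in reasons)
```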

Risk Reports — Structure, Timing and External Review

Risk Reports form the operational backbone of the RSP v3.0’s transparency commitments. Published every 3 to 6 months, each report covers all publicly deployed models at the time of publication plus internally deployed models that could pose significant risks beyond what public models present. At minimum, any internal models deployed for large-scale, fully autonomous research must be included.

Two additional triggers accelerate reporting beyond the regular cadence. When Anthropic publicly deploys a model significantly more capable than those in the most recent Risk Report, it must publish a discussion of how the new capabilities and propensities affect the Risk Report analysis, either in a System Card or separately. For internal models, Anthropic must publish a similar discussion within 30 days of determining that a deployed model falls in scope.
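
Read as scheduling logic, the cadence and triggers look roughly like the sketch below; the field names and the 183-day reading of “every 3 to 6 months” are illustrative assumptions:

```python
from datetime import date, timedelta

MAX_CADENCE = timedelta(days=183)           # outer bound of "3 to 6 months"
INTERNAL_SCOPE_WINDOW = timedelta(days=30)  # deadline for in-scope internal models

def next_report_deadline(last_report: date,
                         internal_model_in_scope_since: date | None) -> date:
    """Earliest date by which new published discussion is required."""
    deadlines = [last_report + MAX_CADENCE]  # regular cadence
    if internal_model_in_scope_since is not None:
        # Discussion of an in-scope internal model is due within 30 days
        # of the determination that it falls in scope.
        deadlines.append(internal_model_in_scope_since + INTERNAL_SCOPE_WINDOW)
    return min(deadlines)

# The other trigger (publicly deploying a significantly more capable
# model) is event-driven rather than date-driven, so it is handled at
# deployment time instead of in this scheduler.
```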

The approval process is deliberately layered. Initial assessment and drafting lead to internal review and feedback, followed by executive approval from both the CEO and Responsible Scaling Officer. Governance notification to the Board and LTBT follows. A modified process applies when marginal risk analysis plays a major role — in those cases, explicit Board and LTBT approval is required, not just notification.
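
The layering reads naturally as a checklist. The role names below are from the policy; the function itself is an illustration, not Anthropic’s internal tooling:

```python
def report_may_proceed(ceo_approved: bool,
                       rso_approved: bool,
                       board_ltbt_notified: bool,
                       marginal_risk_major_role: bool,
                       board_ltbt_approved: bool) -> bool:
    # Dual-key executive sign-off: both the CEO and the Responsible
    # Scaling Officer must approve.
    if not (ceo_approved and rso_approved):
        return False
    # When marginal risk analysis plays a major role, explicit Board and
    # LTBT approval is required rather than mere notification.
    if marginal_risk_major_role:
        return board_ltbt_approved
    return board_ltbt_notified
```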

External review requirements are rigorous. Risk Reports that cover highly capable models and are significantly redacted trigger mandatory full external review. Reports are shared with external reviewers within one week of Board and LTBT submission, and reviewers are asked to provide public commentary within 30 days. Reviewers must assess analytical rigor, highlight disagreements, evaluate risk reduction claims, and opine on whether redactions prevent meaningful evaluation. Conflict-of-interest rules prohibit reviewing organizations from holding financial interests in Anthropic, and individual reviewers cannot have close personal relationships with anyone at the company.


Governance, Accountability and Whistleblower Protections

The governance architecture of the RSP v3.0 centers on the Responsible Scaling Officer (RSO), a dedicated role responsible for overseeing policy compliance. The RSO operates alongside the CEO, with both required to approve Risk Reports before governance notification. This dual-key approach prevents either role from unilaterally determining that risks are acceptable.

Internal transparency requirements ensure that safety information flows throughout the organization. Regular-clearance employees receive access to unredacted Risk Reports and safety assessments, ensuring that engineers and researchers building AI systems understand the risk landscape. Anonymous and identified noncompliance reporting channels allow any employee to flag potential violations without fear of retaliation.

Whistleblower protections receive explicit attention in the policy. Employee agreements include provisions that protect those who report safety concerns through appropriate channels. This is particularly significant given the high-profile departures of safety-focused researchers from various AI companies — the RSP v3.0 creates structural protections designed to keep critical voices inside the organization rather than forcing them out.

Procedural compliance reviews and internal audits complete the accountability framework. The policy change process itself requires Board and LTBT involvement, preventing management from quietly weakening safety commitments. This multi-layered governance structure reflects Anthropic’s understanding that effective AI governance requires checks and balances at every level.

Competitor-Contingent Commitments and the Race to the Bottom

Appendix A of the RSP v3.0 introduces competitor-contingent commitments — explicit rules governing how Anthropic responds to different competitive scenarios. These commitments directly address the fear that competitive pressure will erode safety standards across the industry.

Scenario 1: Anthropic in the Lead. If Anthropic has developed or will imminently develop a highly capable model and clear evidence shows no competitor will soon develop such a model, the company commits to requiring a strong safety argument meeting industry-wide recommendations. Development and deployment will be delayed as needed until Anthropic no longer believes it has a significant lead. This is arguably the most consequential commitment — voluntarily slowing down when ahead.

Scenario 2: Competitors with Strong Safety Measures. When strong evidence indicates that all competitors with highly capable models can make strong arguments for contained catastrophic risk, Anthropic commits to meeting or exceeding competitors’ overall risk reduction posture. Deployment is delayed until this standard is achieved.

Scenario 3: General Upleveling. If a competitor implements a risk mitigation that represents a significant improvement over Anthropic’s analogous measures and could be implemented at comparable effort and cost, Anthropic commits to making a significant effort to meet or exceed that standard. Notably, this scenario does not necessarily trigger development delays, recognizing that incremental improvements should be adopted quickly without halting progress.
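
Condensed into a single decision rule, the three scenarios look like the sketch below. Detecting which scenario applies is precisely the judgment call the policy acknowledges, so that part is left abstract:

```python
from enum import Enum, auto

class Scenario(Enum):
    ANTHROPIC_IN_LEAD = auto()   # clear evidence no competitor is close behind
    COMPETITORS_SAFE = auto()    # all capable competitors argue contained risk
    GENERAL_UPLEVELING = auto()  # a competitor fields a superior mitigation

def obligation(scenario: Scenario) -> str:
    """Paraphrase of the Appendix A commitments for each scenario."""
    if scenario is Scenario.ANTHROPIC_IN_LEAD:
        return ("Require a strong safety argument meeting industry-wide "
                "recommendations; delay development and deployment as needed.")
    if scenario is Scenario.COMPETITORS_SAFE:
        return ("Meet or exceed competitors' overall risk reduction posture; "
                "delay deployment until achieved.")
    return ("Make a significant effort to meet or exceed the competitor's "
            "mitigation; no automatic development delay.")
```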

Anthropic acknowledges limitations: these commitments operate under uncertainty, and determining whether competitors’ safety measures are genuinely strong requires judgment calls that may prove wrong. Nevertheless, the explicit framework provides more accountability than vague promises to “act responsibly.”

Industry-Wide Responsible Scaling Policy Recommendations

Beyond its own commitments, the RSP v3.0 articulates a vision for industry-wide safety standards. The core recommendation is that safety arguments should demonstrate that specific threat actors — ranging from individuals with basic technical backgrounds to well-resourced state-backed teams — will not become significantly more likely to cause catastrophic harm through either product surfaces or model weight theft.

For the first two capability thresholds involving CBRN risks, developers must show that behavioral restrictions are difficult to reliably circumvent via jailbreaking and that precautions against model weight theft are robust. The standard escalates with each threshold: while the first requires resilience against opportunistic theft, the second demands security roughly in line with RAND Security Level 4, protecting against even well-resourced and staffed threat actors.

For the sabotage threshold, industry recommendations focus on internal compartmentalization, capability assessments demonstrating models lack ability to carry out irreversible sabotage, monitoring and restriction of AI behavior, and evidence that models lack propensities toward deception, manipulation, or sabotage. The automated R&D threshold adds requirements against malicious employees and insiders — even the CEO and most privileged technical employees should be unable to circumvent internal usage policies.
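
One way to read the insider requirement is as an access-control property: every request passes the same policy gate, with no privileged role exempt. A minimal sketch, with invented roles and actions:

```python
# Illustrative policy gate with no bypass path: internal usage policies
# bind everyone, including the CEO and the most privileged engineers.
ALLOWED_ACTIONS = {
    "researcher": {"run_evaluation", "read_logs"},
    "executive": {"read_reports"},
}

def is_permitted(role: str, action: str) -> bool:
    # Deliberately no special case for any role.
    return action in ALLOWED_ACTIONS.get(role, set())

assert not is_permitted("executive", "modify_training_run")
assert is_permitted("researcher", "read_logs")
```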

A critical nuance involves the concept of marginal risk analysis — arguing that risks from one company’s systems are relatively lower when considering risks already posed by other AI systems. The RSP v3.0 permits this analysis but requires explicit Board and LTBT approval when it plays a major role in deployment decisions, along with competitive landscape analysis, benefits analysis, and documentation of advocacy efforts to raise industry standards.

Responsible Scaling Policy Impact on AI Governance in 2026

The Anthropic RSP v3.0 arrives at a pivotal moment for AI governance. With frontier models advancing rapidly and regulatory frameworks struggling to keep pace, voluntary commitments from leading developers carry outsized influence on industry norms. The policy’s strengths lie in its intellectual honesty about collective action problems, its structured approach to capability thresholds, and its governance mechanisms that create genuine accountability.

Yet significant questions remain. The effectiveness of voluntary frameworks depends on whether competitors adopt comparable standards — and Anthropic’s competitor-contingent commitments explicitly acknowledge this dependency. The policy also relies heavily on judgment calls about when capability thresholds are crossed, when safety arguments are “strong enough,” and when redactions are justified. These determinations happen behind closed doors, even with external review requirements.

For policymakers, the RSP v3.0 provides a detailed technical blueprint for what mandatory requirements might look like. The capability threshold framework, Risk Report cadence, external review requirements, and whistleblower protections could all translate into regulatory language. For other AI developers, the policy sets a benchmark — any company claiming to prioritize safety will face questions about whether their commitments match Anthropic’s specificity.

The ultimate test of the RSP v3.0 will be whether it actually changes behavior when competitive pressure mounts. The policy is designed to bind future decision-making through governance structures, public accountability, and explicit competitor-contingent rules. Whether those bindings hold under the strain of a trillion-dollar race remains the defining question for AI safety in 2026 and beyond. As governments worldwide develop AI governance frameworks, Anthropic’s detailed voluntary approach offers both a model and a warning about the limits of self-regulation.


Frequently Asked Questions

What is the Anthropic Responsible Scaling Policy v3.0?

The Anthropic Responsible Scaling Policy v3.0, effective February 2026, is a comprehensive voluntary framework for managing catastrophic risks from advanced AI systems. It establishes capability thresholds, safety buffers, and deployment standards that separate company-level plans from industry-wide recommendations, introducing Frontier Safety Roadmaps and periodic Risk Reports to ensure AI benefits outweigh catastrophic risks.

What are AI Safety Levels (ASL) in the Anthropic RSP?

AI Safety Levels (ASL) are tiered security and safety standards Anthropic uses to classify model risk. ASL-2 covers standard security for models below harmful capability thresholds, while ASL-3 applies to models capable of assisting in non-novel chemical or biological weapons production. In v3.0, Anthropic moved away from rigid ASL definitions for future capabilities, instead requiring developers to present strong safety arguments addressing specific threat actors and scenarios.

What capability thresholds does the RSP v3.0 define?

The RSP v3.0 defines four capability thresholds: (1) non-novel chemical/biological weapons production assistance, (2) novel chemical/biological weapons production beyond past catastrophes like COVID-19, (3) high-stakes sabotage opportunities where AI systems access sensitive assets with autonomous operation capacity, and (4) automated R&D in key domains that could compress two years of 2018–2024 AI progress into a single year.

How often does Anthropic publish Risk Reports under the RSP?

Anthropic publishes Risk Reports every 3 to 6 months covering all publicly deployed models and internally deployed models that could pose significant risks. When Anthropic deploys a model significantly more capable than those in the most recent report, it must publish additional discussion of the new capabilities; for in-scope internal models, that discussion is due within 30 days. Reports undergo external review by at least one independent, conflict-free reviewer.

What are Anthropic’s competitor-contingent commitments?

Anthropic commits to three competitor-contingent scenarios: when leading in capability, delaying development until safety arguments are met; when competitors have strong safety measures, matching or exceeding their risk reduction posture; and when competitors implement superior mitigations, making significant efforts to meet that standard. These commitments aim to prevent a race to the bottom in AI safety standards.

How does the RSP v3.0 address the collective action problem in AI safety?

The RSP v3.0 separates Anthropic’s company-level plans from industry-wide recommendations, acknowledging that overall risk depends on multiple developers. It recommends governance of all frontier AI developers by third parties, harmonized standards across countries, and voluntary mechanisms through standards-setting organizations and auditors to prevent individual companies from undermining collective safety.
