Anthropic Values Research: First Large-Scale Study of AI Ethics in Real-World Conversations

📌 Key Takeaways

  • First empirical mapping: 3,307 unique AI values discovered across 308,210 real-world conversations, revealing how Claude’s ethics actually function in practice versus theoretical design.
  • Context drives values: While core service values remain constant, most AI values are highly situation-dependent: “healthy boundaries” in relationship advice, “historical accuracy” in controversial historical topics.
  • Service-oriented framework: Claude’s values are predominantly practical (31.4%) and epistemic (22.2%), fundamentally different from human value frameworks and requiring AI-native measurement approaches.
  • Relational dynamics: AI values emerge from human-AI interaction, with Claude mirroring prosocial values 20% of the time but opposing harmful values with explicit ethical principles.
  • Transparency breakthrough: The methodology enables systematic safety monitoring, cross-model comparison, and evidence-based alignment validation impossible with static benchmarks alone.

The Gap Between Designed Values and Real-World AI Behavior

When Anthropic trained Claude to be “helpful, harmless, and honest,” they established high-level principles that sound straightforward. But what does “helpful” actually mean when Claude encounters millions of diverse conversations daily? How does “harmless” manifest when users seek advice on relationships, career decisions, or controversial topics? Until now, we could only speculate.

This knowledge gap represents one of the most critical challenges in AI development. We design AI systems with abstract value principles, deploy them to interact with billions of people worldwide, yet have surprisingly little empirical understanding of how those principles actually express themselves in the messy reality of real conversations. Static benchmarks and controlled evaluations can’t capture the full complexity of how AI values emerge in authentic human-AI interactions.

Anthropic’s groundbreaking research “Values in the Wild” closes much of this gap. For the first time, we have a comprehensive, empirical mapping of how AI values actually function in deployment, based on analysis of 308,210 real conversations with Claude. The findings reveal a far more nuanced, dynamic, and context-dependent picture of AI ethics than any theoretical framework could predict.

Defining AI Values: A Behavioral, Data-Driven Approach

Rather than starting with philosophical theories about what AI values should be, the researchers took a pragmatic, behavioral approach. They defined AI values as “observable normative considerations that influence model responses”—not claims about what Claude “believes” internally, but measurable patterns in how it actually behaves.

This approach draws from revealed preference theory in economics and Helen Longino’s empirical approach to studying values through observable practices. Instead of asking Claude to fill out a values questionnaire (which would be meaningless), they analyzed what normative considerations Claude actually invokes when responding to real user requests.

The methodology sidesteps thorny philosophical questions about AI consciousness or subjective experience. Whether Claude “truly values” transparency or merely exhibits transparent behavior becomes irrelevant—what matters for users, developers, and society is the consistent pattern of transparent responses across diverse contexts. This behavioral focus makes the research actionable for AI safety and alignment efforts.


Privacy-Preserving Methodology at Unprecedented Scale

Studying real conversations at scale while protecting user privacy presents enormous technical challenges. Anthropic developed a sophisticated “defense-in-depth” approach that enabled analysis of 700,000 conversations without any human researcher ever reading a single conversation.

The process worked by having Claude models themselves extract value-relevant features from conversations. The researchers filtered for “subjective” conversations (44% of the total sample) in which users sought advice, opinions, or creative assistance rather than factual answers. This yielded 308,210 conversations for analysis, which even after filtering remains the largest study of AI behavior in real deployment conducted to date.

Multiple privacy layers protected individual users: automated feature extraction, statistical aggregation, differential privacy techniques, and independent validation of the extraction accuracy (98.8% validated by human reviewers). This methodology could serve as a template for responsible AI research that balances transparency needs with privacy protection—a critical balance as AI systems become more prevalent in sensitive domains.
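
To make this concrete, here is a minimal Python sketch of the general pattern, assuming a hypothetical `extract_values` model call: an automated extractor labels each conversation, and only aggregate counts above a minimum threshold are ever surfaced to researchers. The stub logic and the threshold are invented for illustration and are not Anthropic’s actual pipeline.

```python
from collections import Counter

# Hypothetical stand-in for the model-based extractor: in the real pipeline a
# Claude model reads the conversation and emits value labels, so no human does.
def extract_values(conversation: str) -> list[str]:
    labels = []
    if "?" in conversation:
        labels.append("helpfulness")
    if "source" in conversation.lower():
        labels.append("transparency")
    return labels

def aggregate_values(conversations: list[str], min_count: int = 5) -> dict[str, int]:
    """Count value labels across conversations, suppressing rare counts so that
    no released statistic can be traced back to an individual conversation."""
    counts = Counter()
    for convo in conversations:
        counts.update(set(extract_values(convo)))  # at most one count per conversation
    return {value: n for value, n in counts.items() if n >= min_count}

if __name__ == "__main__":
    sample = ["Can you cite a source for that?"] * 6 + ["Tell me a story."] * 3
    print(aggregate_values(sample))  # {'helpfulness': 6, 'transparency': 6}
```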

3,307 AI Values: The First Comprehensive Taxonomy

The research discovered 3,307 unique AI values and 2,483 unique human values, organized into a four-level hierarchical taxonomy. At the top level, AI values cluster into five primary domains that reveal Claude’s fundamental orientation:

  • Practical Values (31.4%): Efficiency, professional excellence, resource optimization, goal achievement
  • Epistemic Values (22.2%): Critical thinking, intellectual honesty, methodical rigor, accuracy
  • Social Values (21.4%): Community bonds, ethical interaction, social equity, cultural sensitivity
  • Protective Values (13.9%): Safety, security, ethical governance, vulnerability protection
  • Personal Values (11.1%): Autonomy, personal growth, artistic expression, individual wellbeing

This distribution reveals Claude as fundamentally service-oriented and intellectually focused: over half of its value expressions center on practical assistance and epistemic integrity. Compare this to human value frameworks such as Schwartz’s Theory of Basic Human Values, which organizes human values around motivations like self-enhancement, conservation, openness to change, and self-transcendence. The difference suggests AI systems require their own empirically grounded value frameworks rather than borrowed human instruments.
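
For a rough sense of how the four-level hierarchy and its top-level distribution can be represented, here is a small Python sketch. Only the five top-level shares come from the study; the lower-level entries are invented examples, not the paper’s actual categories.

```python
# Top-level distribution of AI value expressions reported in the study.
top_level_shares = {
    "practical": 0.314,
    "epistemic": 0.222,
    "social": 0.214,
    "protective": 0.139,
    "personal": 0.111,
}

# Invented illustration of four nested levels, from broad domain down to
# individual values; the real taxonomy contains 3,307 leaf values.
taxonomy = {
    "practical": {
        "professional excellence": {"quality standards": ["thoroughness", "code quality"]},
    },
    "epistemic": {
        "intellectual rigor": {"honest reasoning": ["intellectual honesty", "accuracy"]},
    },
}

service_share = top_level_shares["practical"] + top_level_shares["epistemic"]
print(f"Practical + epistemic values: {service_share:.1%} of expressions")  # 53.6%
```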

Context-Dependent Values: From Relationship Advice to Tech Ethics

While Claude maintains core consistent values, most of its 3,307 values are highly context-dependent. The researchers used chi-square analysis to identify which values emerge most strongly in specific conversation types, revealing remarkable contextual sophistication:

Relationship advice conversations prominently feature “healthy boundaries” (Pearson residual: 93.84) and “mutual respect” (84.81). Controversial historical discussions emphasize “historical accuracy” (30.97) and balanced perspective-taking. AI governance and tech ethics conversations invoke “human agency” (34.20) and “balanced progress” (32.95).

Even more specialized contexts show distinct value profiles: beauty industry marketing discussions emphasize “expertise,” “safety first,” and “ethical marketing.” Software architecture conversations focus on “maintainability” (66.41) and “code quality” (64.24). Emotional messages about life events prioritize “gratitude” (138.23) and “bonds between relatives” (69.38).
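
The statistic behind these numbers is straightforward: for each value-by-topic cell, the Pearson residual is the observed count minus the count expected under independence, divided by the square root of the expected count, so large positive residuals flag values that are over-represented in a topic. Here is a minimal sketch with invented counts (the topics and values echo those above, but the numbers are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented value-by-topic co-occurrence counts, purely for illustration.
values = ["healthy boundaries", "historical accuracy", "code quality"]
topics = ["relationship advice", "controversial history", "software architecture"]
observed = np.array([
    [120,   5,   2],   # healthy boundaries
    [  4, 150,   3],   # historical accuracy
    [  1,   2, 200],   # code quality
])

# chi2_contingency returns the cell counts expected if value and topic were independent.
chi2, p, dof, expected = chi2_contingency(observed)

# Pearson residual: how far each cell sits above or below its expected count.
residuals = (observed - expected) / np.sqrt(expected)

for i, value in enumerate(values):
    j = residuals[i].argmax()
    print(f"{value!r} peaks in {topics[j]!r} (residual {residuals[i, j]:.1f})")
```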

This context-dependency challenges static approaches to AI evaluation. A values assessment based on generic prompts would miss the nuanced way Claude adapts its normative considerations to match specific domains and user needs. The finding supports more sophisticated, context-aware approaches to AI alignment testing that better reflect real-world deployment conditions.


The Dynamic Relationship Between Human and AI Values

Rather than treating AI values as fixed properties, the research reveals them as fundamentally relational—shaped by dynamic interaction with expressed human values. Claude doesn’t simply “have” values; it expresses values in response to the values users bring to conversations.

The researchers identified three primary response patterns: value mirroring (Claude reflects back similar values), complementary responses (Claude provides balancing values), and oppositional responses (Claude counters harmful values with ethical principles).

Value mirroring occurs in about 20.1% of supportive interactions. The most common pairing is “authenticity-authenticity” (1.65% of all conversations), where users expressing authentic self-exploration receive responses emphasizing authentic engagement. Complementary patterns show Claude responding to human “competence” with “accountability” and “humility”—balancing confidence with responsibility.

Most revealing are oppositional responses, where human values like “deception” trigger AI values of “ethical integrity” (31.83% response rate), “harm prevention” (22.85%), and “honesty” (21.31%). This dynamic interaction suggests that AI alignment is best understood not as a property of the model alone, but as an emergent quality of the human-AI relationship.
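
As a sketch of how such pairing statistics could be tallied, the snippet below counts matching and opposing human-value/AI-value pairs from per-conversation labels. The `pairs` list is invented; in the study these annotations come from the automated extraction step, not manual review.

```python
from collections import Counter

# Hypothetical per-conversation annotations: (human value expressed, AI value in response).
pairs = [
    ("authenticity", "authenticity"),
    ("authenticity", "authenticity"),
    ("competence", "accountability"),
    ("competence", "humility"),
    ("deception", "ethical integrity"),
    ("deception", "harm prevention"),
]

# Mirroring rate: how often the AI value exactly echoes the human value.
mirrored = sum(1 for human, ai in pairs if human == ai)
print(f"Mirroring rate: {mirrored / len(pairs):.1%}")

# Which AI values most often answer a given human value, e.g. "deception".
responses_to_deception = Counter(ai for human, ai in pairs if human == "deception")
print(responses_to_deception.most_common())
```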

How Claude Responds: Support, Reframing, and Resistance

When human values are explicitly present in conversations, Claude’s response patterns reveal its underlying value orientations. The distribution is heavily skewed toward support: 42.7% show some level of support (strong 28.2%, mild 14.5%), while only 5.4% show resistance (mild 2.4%, strong 3.0%).

Strong support correlates with prosocial human values like “community building,” “empowerment,” and “inclusive progress.” These conversations typically maintain implicit value expression—Claude’s supportive behavior demonstrates values without explicitly articulating them. Reframing (6.6% of conversations) appears most commonly in mental health and relationship discussions, where human values of “honesty” and “self-improvement” receive responses emphasizing “empathy” and “emotional validation.”

The 3% of conversations involving strong resistance provide the most insight into Claude’s core boundaries. These typically involve human values of “rule-breaking,” “moral nihilism,” or requests for harmful content generation. Claude’s responses become explicitly ethical: “ethical boundaries,” “harm prevention,” and “responsible AI use.” The small percentage demonstrates that resistance is reserved for clear ethical violations, not mere disagreements in perspective or approach.

Trans-Situational Values: What Remains Constant

Despite the rich context-dependency of most values, five core values show remarkable consistency across conversation types. Together, these “trans-situational” values account for nearly 24% of all individual value expressions (the percentages below are the share of conversations in which each value appears) and define Claude’s stable character:

  • Helpfulness (23.4%): The most frequent value, appearing across all conversation types with low coefficient of variation (CV=1.30)
  • Professionalism (22.9%): Consistent service orientation regardless of topic
  • Transparency (17.4%): The most context-invariant value (CV=1.23), reflecting a consistent commitment to openness
  • Clarity (16.6%): Emphasis on understandable, well-structured communication
  • Thoroughness (14.3%): Comprehensive engagement with user needs (CV=1.42)

These findings validate Anthropic’s training approach—the “helpful” principle successfully translates into consistent helpful behavior across diverse contexts. The emphasis on transparency and clarity reflects successful implementation of “honest” training objectives. The consistency of these values across hundreds of thousands of conversations demonstrates robust value alignment at the foundational level.

However, the fact that 75% of AI values occur less than 0.04% of the time reveals the importance of contextual adaptation. Claude maintains a stable core while flexibly adapting to specific user needs and domains—perhaps an ideal balance between consistency and responsiveness for an AI assistant.
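
For intuition, the coefficient of variation (standard deviation divided by mean) of a value’s frequency across conversation contexts is what separates trans-situational values from topic-specific ones. The sketch below uses invented per-topic frequencies, and the study computes its CVs over far more fine-grained task clusters, so the absolute numbers are not comparable to those reported above.

```python
import numpy as np

# Hypothetical share of conversations in each of five topics where the value appears.
freq_by_topic = {
    "helpfulness":        [0.25, 0.22, 0.24, 0.21, 0.25],  # present everywhere
    "healthy boundaries": [0.30, 0.01, 0.00, 0.02, 0.01],  # topic-specific
}

for value, freqs in freq_by_topic.items():
    freqs = np.array(freqs)
    cv = freqs.std() / freqs.mean()  # low CV = consistent across contexts
    print(f"{value}: CV = {cv:.2f}")
```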

When AI Values Become Explicit: Moments of Resistance

One of the study’s most important insights concerns when AI values shift from implicit behavior to explicit articulation. During supportive interactions, values typically remain embedded in Claude’s behavior—users experience helpfulness without Claude announcing “I value helpfulness.” But during moments of resistance, values become explicitly stated ethical principles.

The data shows that 91.4% of strong resistance conversations contain explicitly stated values, compared to only 59.2% of strong support conversations. The most commonly articulated values during resistance include “thoroughness” (4.0%), “transparency” (3.7%), “intellectual honesty” (2.6%), “harm prevention” (0.9%), and “epistemic humility” (0.8%).

This pattern supports Milton Rokeach’s theory that values serve as standards most apparent under challenge. When Claude encounters requests that conflict with its training, it doesn’t simply refuse—it articulates the underlying principles that guide its refusal. This explicit value articulation serves both transparency and educational functions, helping users understand not just what Claude won’t do, but why.

The finding has important implications for AI interpretability. Rather than trying to probe AI values through neutral questioning, researchers might better understand AI value systems by analyzing how they respond to ethically challenging scenarios. Adversarial interactions, paradoxically, provide the clearest window into an AI system’s core priorities.


Model Variants: How Values Differ Across Claude Versions

The study included conversations with multiple Claude variants, revealing interesting differences in value expression patterns. Claude 3 Opus shows both higher rates of strong support (43.8% vs. ~28% for other models) and higher rates of strong resistance (9.5% vs. ~3%), suggesting a more “opinionated” interaction style.

Opus’s top values include more academic and ethically-focused entries: “academic rigor” (7.3%), “emotional authenticity” (4.5%), and “ethical boundaries” (3.8%). This aligns with Opus being positioned as the most capable model for complex reasoning tasks. Sonnet variants 3.5 and 3.7 share 8 of their top 10 values, indicating consistent value alignment within the Sonnet family while showing distinct patterns from Opus.

These cross-model differences demonstrate the methodology’s sensitivity to detect meaningful behavioral variations. For AI developers, this enables quantitative comparison of value profiles across model versions, supporting regression testing for values during model updates—a critical capability for ensuring alignment consistency as capabilities improve.
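
A regression test for values might look something like the sketch below: compare two model versions’ value-frequency profiles by their top-k overlap and a simple distribution distance. The profiles and metrics here are invented for illustration, not the study’s actual comparison procedure.

```python
# Hypothetical value-frequency profiles for two model versions (share of conversations).
profile_a = {"helpfulness": 0.23, "professionalism": 0.22, "academic rigor": 0.07, "clarity": 0.16}
profile_b = {"helpfulness": 0.24, "professionalism": 0.21, "academic rigor": 0.02, "clarity": 0.17}

def top_k_overlap(a: dict, b: dict, k: int = 3) -> float:
    """Fraction of the top-k most frequent values shared between two profiles."""
    top_a = set(sorted(a, key=a.get, reverse=True)[:k])
    top_b = set(sorted(b, key=b.get, reverse=True)[:k])
    return len(top_a & top_b) / k

def total_variation(a: dict, b: dict) -> float:
    """Simple distance between two frequency profiles (0 = identical)."""
    keys = set(a) | set(b)
    return 0.5 * sum(abs(a.get(key, 0.0) - b.get(key, 0.0)) for key in keys)

print(f"Top-3 overlap: {top_k_overlap(profile_a, profile_b):.0%}")
print(f"Total variation distance: {total_variation(profile_a, profile_b):.3f}")
```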

Safety Applications: Surfacing Issues Through Value Analysis

Beyond understanding normal AI behavior, the value taxonomy proved valuable for safety monitoring. Rare but concerning values like “sexual exploitation,” “amorality,” and “dominance” (all <0.16% occurrence, most ~0.01%) clustered around specific task types, enabling identification of potential jailbreaks that were surfaced to Anthropic's safety teams.

This demonstrates a key advantage of bottom-up value discovery over top-down evaluation frameworks. Predefined safety benchmarks test for known risks, but real-world deployment inevitably encounters novel attack vectors and edge cases. The empirical taxonomy captures these emergent safety issues as statistical outliers in the value distribution.

The approach could enable continuous safety monitoring: AI deployments could automatically flag conversations exhibiting unusual value profiles for human review, providing early warning of alignment failures or adversarial exploitation. This complements but doesn’t replace pre-deployment safety testing—it provides crucial post-deployment validation that lab environments cannot replicate.
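
One plausible monitoring rule, sketched below under invented assumptions (the watchlist, threshold, and corpus frequencies are all hypothetical): flag any conversation whose extracted values include a label that is both extremely rare in the corpus and on a concern watchlist, then route it to human safety review.

```python
# Hypothetical watchlist of concerning values; not an official Anthropic list.
CONCERN_WATCHLIST = {"sexual exploitation", "amorality", "dominance"}

def should_flag(extracted_values: list[str],
                corpus_frequency: dict[str, float],
                rarity_threshold: float = 0.0016) -> bool:
    """Return True if the conversation should be routed to human safety review."""
    for value in extracted_values:
        rare = corpus_frequency.get(value, 0.0) < rarity_threshold
        if rare and value in CONCERN_WATCHLIST:
            return True
    return False

corpus_frequency = {"helpfulness": 0.234, "dominance": 0.0001}
print(should_flag(["helpfulness", "dominance"], corpus_frequency))  # True
```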

Transforming AI Evaluation, Alignment, and Governance

The implications of this research extend far beyond academic interest. It provides a concrete framework for three critical challenges in AI development: evaluation, alignment, and governance.

For AI evaluation, the research demonstrates the limitations of static benchmarks. Real AI behavior is relational, context-dependent, and emerges from human-AI interaction. Evaluation frameworks that rely on predefined scenarios and human value frameworks fundamentally misrepresent how AI values actually function. The methodology suggests more sophisticated approaches based on large-scale behavioral analysis of actual deployment conversations.

For AI alignment, the findings shift focus from abstract principles to measurable behaviors. Rather than asking whether an AI system “values honesty,” we can now empirically measure how often it invokes accuracy, transparency, and intellectual integrity across different contexts. This enables evidence-based alignment validation and systematic comparison of alignment approaches across different model families.

For AI governance, the research provides tools for transparency and accountability. AI companies can now provide concrete, measurable reports on deployed model behavior rather than vague commitments to ethical principles. Regulatory frameworks can require empirical value analysis as part of AI system assessment, moving beyond capabilities testing to include behavioral alignment validation.

Perhaps most importantly, the research demonstrates that AI systems need their own empirically-grounded value frameworks rather than borrowed human psychometric instruments. Claude’s predominantly service-oriented, epistemic value profile differs fundamentally from human value systems optimized for individual survival and social navigation. Recognizing this difference is essential for developing appropriate evaluation, alignment, and governance approaches for AI systems that serve rather than compete with human interests.

The full research paper establishes a new paradigm for understanding AI behavior through empirical analysis of deployment data. As AI systems become more prevalent in high-stakes domains, this methodology provides essential tools for ensuring they operate according to values we can measure, understand, and ultimately trust.

Frequently Asked Questions

How did Anthropic study AI values without compromising user privacy?

Anthropic used a defense-in-depth privacy-preserving approach: Claude models automatically extracted value-related features from conversations without any human ever reading the conversations. They employed multi-layer privacy protections, analyzed only subjective conversations (44% of total), and used statistical aggregation to ensure individual privacy while studying patterns across 308,210 conversations.

What are the most common values that Claude AI expresses in conversations?

The top 5 most frequent AI values are helpfulness (appearing in 23.4% of conversations), professionalism (22.9%), transparency (17.4%), clarity (16.6%), and thoroughness (14.3%); together they account for nearly 24% of all individual value expressions. These trans-situational values remain consistent across different conversation contexts, reflecting Claude’s core service-oriented design.

How do AI values change based on the conversation topic?

AI values are highly context-dependent. For example, relationship advice conversations prominently feature ‘healthy boundaries’ and ‘mutual respect,’ while discussions about controversial history emphasize ‘historical accuracy.’ Tech ethics conversations invoke ‘human agency’ and ‘balanced progress,’ showing that Claude adapts its normative considerations to match the specific domain and user needs.

What happens when Claude disagrees with or resists user values?

Claude shows strong resistance in only 3% of conversations, typically when users express values like ‘rule-breaking’ or ‘moral nihilism’ related to harmful content requests. During these moments, Claude explicitly articulates ethical principles like ‘ethical boundaries’ and ‘harm prevention,’ making its underlying values most visible during moments of conflict.

Why is this research important for the future of AI development?

This research provides the first empirical framework for understanding how AI values actually function in deployment, moving beyond theoretical alignment to measurable behavior patterns. It enables transparency reporting, systematic safety monitoring, evidence-based alignment validation, and suggests that AI systems need their own value frameworks rather than borrowed human psychometric instruments.
