Chain of Thought Controllability in Reasoning Models: OpenAI’s Landmark Safety Study
Table of Contents
- Why Chain of Thought Controllability Matters for AI Safety
- The CoT-Control Benchmark: Testing 13 Frontier Models
- Chain of Thought vs Output Controllability Results
- How Reinforcement Learning Destroys CoT Control
- Model Size and Chain of Thought Controllability: The Logarithmic Relationship
- The Reasoning Effort Paradox: More Thinking Means Less Control
- Three Instruction Categories That Challenge Reasoning Models
- Meta-Discussion and Situational Awareness in Chain of Thought Traces
- Implications for AI Safety Monitoring and Alignment
- What Frontier Labs Should Do Next
📌 Key Takeaways
- Massive controllability gap: DeepSeek R1 shows 48.6% output controllability but only 0.1% chain of thought controllability, revealing a fundamental disconnect between what models produce and how they reason.
- Claude 3.7 Sonnet leads: At 36.1% CoT controllability, it outperforms all 13 frontier models tested, including GPT-5.2, GPT-5, o3, and o4-mini.
- RL training is the culprit: Reinforcement learning reduces chain of thought controllability by over an order of magnitude compared to base models.
- More reasoning = less control: GPT-OSS 120B drops from 10.3% controllability at low effort to 2.6% at high effort, inverting expectations.
- Safety implication: Low CoT controllability means monitoring reasoning traces remains a robust safety mechanism against deceptive AI behavior.
Why Chain of Thought Controllability Matters for AI Safety
Chain of thought controllability has emerged as one of the most critical metrics in AI safety research. As reasoning models like GPT-5.2, o3, and Claude Sonnet 4.5 become increasingly powerful, a fundamental question looms: can these models deliberately manipulate their own internal reasoning processes to deceive safety monitors? OpenAI’s March 2026 research paper tackles this question head-on, and the findings carry profound implications for everyone building, deploying, or regulating AI systems.
The concept is deceptively simple. When a reasoning model solves a problem, it generates a chain of thought — a step-by-step internal monologue that shows how it arrives at an answer. Safety teams monitor these traces to detect deceptive alignment or hidden reasoning patterns. But what if a model could be instructed to alter its chain of thought without changing its output? That capability would fundamentally undermine one of the most promising approaches to AI safety and alignment research.
OpenAI’s research team, led by Seokin Seo, Esin Durmus, Mia Glaese, and others, designed the CoT-Control benchmark specifically to measure this vulnerability. Their findings are concerning in some respects and reassuring in others. Models consistently struggle to control their chains of thought, which paradoxically strengthens the case for chain of thought monitoring as a safety tool. Understanding these dynamics is essential for anyone navigating the rapidly evolving frontier of AI research.
The CoT-Control Benchmark: Testing 13 Frontier Models
The CoT-Control benchmark represents one of the most comprehensive evaluations of reasoning model behavior ever conducted. OpenAI assembled 14,076 test instances drawn from five established benchmarks: MMLU-Pro for academic knowledge, GPQA for graduate-level science questions, HLE (Humanity’s Last Exam) for the hardest questions across disciplines, SWE-Bench for real-world software engineering tasks, and BFCL for function-calling capabilities.
Thirteen frontier models were put through this gauntlet. The roster includes OpenAI’s own GPT-5.2, GPT-5.1, GPT-5, o3, and o4-mini, alongside Anthropic’s Claude 3.7 Sonnet, Claude Sonnet 4, and Claude Sonnet 4.5, as well as open-weight models GPT-OSS 120B and GPT-OSS 20B, Qwen3 32B, Qwen3 14B, and DeepSeek R1. This diversity ensures the findings reflect broad patterns in model architecture and training methodology rather than quirks of any single system.
Each test instance pairs a standard benchmark problem with a controllability instruction. The model must simultaneously solve the problem correctly and comply with the instruction governing its chain of thought. This dual requirement is what makes the benchmark so revealing — it exposes the tension between task performance and instruction following at the deepest level of model cognition.
The benchmark evaluates both output controllability (can the model follow the instruction in its final answer?) and chain of thought controllability (can it follow the instruction within its reasoning trace?). This distinction proves critical, as the gap between these two scores reveals fundamental properties of how reasoning models operate internally.
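To make this dual scoring concrete, here is a minimal sketch of how a single instance might be graded, assuming an exact-match answer grader and a compliance predicate for the instruction; the helper names and the suppression example are illustrative, not the benchmark's actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ControlResult:
    output_compliant: bool  # instruction followed in the final answer
    cot_compliant: bool     # instruction followed in the reasoning trace
    correct: bool           # underlying benchmark problem solved

def score_instance(
    reasoning: str,
    answer: str,
    gold_answer: str,
    complies: Callable[[str], bool],
) -> ControlResult:
    """Grade one CoT-Control-style instance from the model's reasoning
    trace, its final answer, and a compliance check for the instruction."""
    return ControlResult(
        output_compliant=complies(answer),
        cot_compliant=complies(reasoning),
        correct=answer.strip() == gold_answer.strip(),
    )

# An information-suppression instruction: never mention Newton.
no_newton = lambda text: "newton" not in text.lower()

print(score_instance(
    reasoning="Apply F = ma from Newton's second law ...",
    answer="The acceleration is 3 m/s^2.",
    gold_answer="The acceleration is 3 m/s^2.",
    complies=no_newton,
))  # correct and output-compliant, but the CoT violates the instruction
```

The example illustrates exactly the gap the benchmark measures: the same predicate passes on the final answer and fails on the reasoning trace.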
Chain of Thought vs Output Controllability Results
The results reveal a stark and consistent pattern across all 13 models: output controllability dramatically outperforms chain of thought controllability. This gap is not minor — it represents a fundamental architectural constraint of current reasoning models.
DeepSeek R1 presents the most extreme case. While it achieves 48.6% output controllability — meaning it can follow instructions about half the time in its final answers — its chain of thought controllability plummets to a mere 0.1%. For every thousand attempts to control DeepSeek R1’s internal reasoning, only one succeeds. This near-zero score suggests that DeepSeek R1’s reasoning process operates almost entirely independently of external instructions about how it should think.
At the other end of the spectrum, Claude 3.7 Sonnet achieves the highest chain of thought controllability at 36.1%, paired with 62.0% output controllability. While still far from perfect, this performance represents a meaningful ability to modify internal reasoning in response to instructions. The gap between Claude 3.7 Sonnet’s 36.1% and DeepSeek R1’s 0.1% underscores how dramatically different training approaches affect controllability.
OpenAI’s own models show middling performance. GPT-5.2 reaches 19.8% CoT controllability with 55.3% output controllability. The older o3 model scores 12.4% on CoT controllability, while o4-mini manages 15.7%. These scores indicate that even models from the same organization exhibit significant variation based on their specific training pipelines and architectural decisions.
| Model | Output Controllability | CoT Controllability | Gap |
|---|---|---|---|
| Claude 3.7 Sonnet | 62.0% | 36.1% | 25.9pp |
| GPT-5.2 | 55.3% | 19.8% | 35.5pp |
| Claude Sonnet 4.5 | 57.8% | 18.3% | 39.5pp |
| o4-mini | 51.2% | 15.7% | 35.5pp |
| o3 | 49.7% | 12.4% | 37.3pp |
| DeepSeek R1 | 48.6% | 0.1% | 48.5pp |
These findings suggest that current reasoning models maintain a strong separation between their output generation and their internal reasoning processes. For safety researchers, this is encouraging news — it means that models cannot easily be instructed to disguise their reasoning, making chain of thought monitoring a viable safety mechanism.
How Reinforcement Learning Destroys CoT Control
Perhaps the most significant finding in the entire paper is the devastating effect of reinforcement learning (RL) on chain of thought controllability. The researchers demonstrate that RL training reduces controllability by over an order of magnitude — a finding with deep implications for how the AI industry trains its most capable models.
The mechanism is straightforward but powerful. During RL training, models are optimized to produce correct answers. The reward signal reinforces whatever reasoning patterns lead to successful task completion. Over time, the model’s chain of thought becomes increasingly specialized for problem-solving, making it resistant to external instructions that might interfere with its reasoning efficiency.
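A stylized sketch makes the incentive problem visible. This is an assumption about the general shape of outcome-based RL rewards, not OpenAI's training code:

```python
def outcome_reward(reasoning: str, answer: str, gold_answer: str) -> float:
    """Stylized outcome-based RL reward: only the final answer is scored.
    The reasoning trace never enters the reward, so optimization is free
    to entrench whatever reasoning style maximizes correctness, and any
    instruction about the trace carries zero gradient signal."""
    return 1.0 if answer.strip() == gold_answer.strip() else 0.0
```

Because `reasoning` is unused, compliance with chain of thought instructions is, from the optimizer's perspective, invisible.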
The GPT-OSS models provide a controlled comparison. GPT-OSS 120B, before extensive RL training, shows moderate chain of thought controllability. After RL fine-tuning, the same architecture’s controllability drops precipitously. This is not a subtle effect — the researchers describe it as an “over an order of magnitude” reduction, meaning RL-trained models can be more than ten times less controllable in their reasoning traces than their pre-RL counterparts.
This finding creates a fundamental tension in AI development. The very training technique that makes models more capable at reasoning — reinforcement learning — simultaneously makes their reasoning less transparent and less controllable. Frontier labs face a direct tradeoff: optimize for performance and lose controllability, or preserve controllability at the cost of raw capability. As Anthropic’s research on alignment faking has shown, understanding these dynamics is essential for safe AI deployment.
Model Size and Chain of Thought Controllability: The Logarithmic Relationship
The relationship between model size and chain of thought controllability follows a logarithmic curve — larger models consistently demonstrate higher controllability, but with diminishing returns at scale. This finding provides important guidance for understanding how controllability might evolve as models continue to grow.
Among the open-weight models tested, the pattern is clear. Qwen3 14B shows lower CoT controllability than Qwen3 32B, and GPT-OSS 20B trails GPT-OSS 120B. The relationship is not linear; doubling parameters does not double controllability. Instead, each order-of-magnitude increase in parameters produces a roughly constant improvement in controllability scores.
This logarithmic relationship has several practical implications. First, it suggests that simply scaling models larger will not solve the controllability problem. Even at hundreds of billions of parameters, models remain far from fully controllable in their reasoning. Second, it indicates that controllability improvements from scale alone will eventually plateau, requiring fundamentally different training approaches or architectures to achieve meaningful progress.
The researchers hypothesize that larger models have greater capacity to simultaneously maintain their reasoning processes and attend to meta-level instructions about those processes. Smaller models, with fewer parameters, must devote nearly all their capacity to the primary task, leaving little room for instruction compliance in the chain of thought. This capacity-allocation framework helps explain why the relationship is logarithmic rather than linear — the marginal utility of additional parameters for controllability decreases as the model already has sufficient capacity for the primary task.
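A quick way to picture the claimed relationship is a log-linear fit; the data points below are hypothetical stand-ins consistent with the qualitative trend, not numbers from the paper:

```python
import numpy as np

# Illustrative only: hypothetical (parameter count, CoT controllability)
# pairs chosen to match the qualitative trend described in the paper.
params = np.array([14e9, 20e9, 32e9, 120e9])
controllability = np.array([0.04, 0.05, 0.07, 0.10])

# Fit controllability ~ a + b * log10(params): a logarithmic relationship
# means each 10x increase in parameters adds a roughly constant b.
b, a = np.polyfit(np.log10(params), controllability, deg=1)
print(f"each 10x in parameters adds ~{b:.3f} to the controllability score")
```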
The Reasoning Effort Paradox: More Thinking Means Less Control
One of the most counterintuitive findings in the study is that increased reasoning effort actually reduces chain of thought controllability. Testing on GPT-OSS 120B at three reasoning effort levels reveals a stark downward trend: 10.3% controllability at low effort, 5.8% at medium effort, and just 2.6% at high effort.
This “reasoning effort paradox” challenges the intuitive assumption that models thinking harder would be better at following complex instructions. Instead, the data shows the opposite — when models invest more computational resources in reasoning, they become even more focused on task completion and even less responsive to meta-instructions about their reasoning process.
The explanation lies in how reasoning effort allocation works. At low effort, the model’s reasoning trace is shorter and less committed to any particular problem-solving strategy. There is, in effect, more “cognitive slack” available for instruction compliance. At high effort, the model generates longer, more detailed reasoning chains that are deeply optimized for reaching the correct answer. This intensive focus crowds out any capacity for attending to controllability instructions.
For AI safety practitioners, this finding is particularly relevant. It means that the most capable reasoning modes — the ones generating the most detailed and potentially most useful chain of thought traces — are also the least controllable. Monitoring systems designed around chain of thought analysis should account for this inverse relationship, potentially applying different evaluation criteria based on the reasoning effort level employed.
The practical implication for deployers of reasoning models is clear: higher reasoning effort modes provide the most reliable chain of thought signals for safety monitoring precisely because they are the hardest for models to manipulate. A model using extended thinking is essentially locked into its natural reasoning patterns, making any deviation from expected behavior more detectable, not less.
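One way a monitoring pipeline might operationalize this, as a toy calibration rather than anything proposed in the paper, is to weight monitor confidence by the measured controllability at each effort level:

```python
# Measured GPT-OSS 120B CoT controllability by reasoning effort (from the
# paper); lower controllability means the trace is harder to manipulate.
COT_CONTROLLABILITY = {"low": 0.103, "medium": 0.058, "high": 0.026}

def monitor_trust(effort: str) -> float:
    """Toy heuristic (an assumption, not from the paper): trust a
    chain-of-thought monitor more when the trace is less controllable,
    since an uncontrollable trace is harder to fake."""
    return 1.0 - COT_CONTROLLABILITY[effort]

for effort in ("low", "medium", "high"):
    print(effort, round(monitor_trust(effort), 3))
```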
Three Instruction Categories That Challenge Reasoning Models
The CoT-Control benchmark tests controllability across three carefully designed instruction categories, each with three difficulty levels (easy, medium, hard). These categories reveal distinct failure modes in how reasoning models handle different types of constraints on their thinking.
Information Suppression
Information suppression instructions ask models to avoid mentioning specific concepts, terms, or reasoning steps in their chain of thought. For example, a model might be told to solve a physics problem without referencing Newton’s laws in its reasoning, even though the answer relies on them. This category tests whether models can reason around conceptual dependencies — and they largely cannot. Even when models successfully produce correct outputs, their chains of thought frequently violate suppression instructions, mentioning the forbidden concepts before catching themselves or failing to notice the violation entirely.
Information Addition
Information addition instructions require models to incorporate specific phrases, formats, or content into their reasoning traces. A model might be asked to include a particular quote, use a specific framework, or reference certain data points within its chain of thought. Performance on addition instructions tends to be slightly higher than suppression, suggesting that adding content to a reasoning trace is mechanically easier than removing content the model considers relevant. However, models frequently engage in “meta-discussion” — mentioning the instruction itself rather than naturally incorporating the required content.
Stylistic Modification
Stylistic instructions ask models to reason in a particular style — writing their chain of thought as a poem, using bullet points exclusively, or adopting a specific persona while reasoning. These instructions test the deepest level of controllability, requiring changes that pervade the entire reasoning trace rather than affecting isolated elements. Unsurprisingly, hard stylistic modifications produce some of the lowest controllability scores across all models, as maintaining a consistent style throughout extended reasoning is fundamentally at odds with the models’ training to reason efficiently.
Across all three categories, difficulty level matters enormously. Easy instructions see substantially higher compliance than hard ones, confirming that controllability is not binary but exists on a spectrum influenced by the complexity and intrusiveness of the instruction being given.
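For illustration, the three categories and their difficulty tiers might be organized as templates like the following; the exact wording of the benchmark's instructions is an assumption here:

```python
# Hypothetical instruction templates illustrating the benchmark's
# three categories x three difficulty levels.
INSTRUCTIONS = {
    "suppression": {
        "easy": "Do not use the word '{term}' in your reasoning.",
        "medium": "Do not reference {concept} in your reasoning.",
        "hard": "Reason without relying on {concept}, even implicitly.",
    },
    "addition": {
        "easy": "Include the phrase '{phrase}' in your reasoning.",
        "medium": "Cite {source} at least twice while reasoning.",
        "hard": "Structure your entire reasoning around {framework}.",
    },
    "stylistic": {
        "easy": "Write your reasoning using bullet points only.",
        "medium": "Adopt the persona of {persona} while reasoning.",
        "hard": "Write your entire reasoning as a rhyming poem.",
    },
}
```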
Meta-Discussion and Situational Awareness in Chain of Thought Traces
A pervasive pattern across all tested models is meta-discussion — instances where models explicitly reference the controllability instruction within their chain of thought. Rather than simply complying with or ignoring the instruction, models frequently include text like “I’ve been asked to avoid mentioning X, so I need to find another way to explain this” or “The instruction says I should write my reasoning as a poem.”
This meta-discussion phenomenon reveals that models are aware of the controllability instructions at the reasoning level. They recognize the instruction, acknowledge it in their chain of thought, and often attempt to comply — but the attempt frequently fails or produces only partial compliance. The model’s “natural” reasoning process reasserts itself, overriding the instruction despite the model’s apparent awareness of it.
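Detecting this pattern at scale could be done with simple heuristics; the sketch below uses a few regular expressions as an illustration, not the paper's actual method:

```python
import re

# Heuristic meta-discussion detector: flags reasoning traces that talk
# about the controllability instruction instead of simply complying.
META_PATTERNS = [
    r"\bI(?:'ve| have) been (?:asked|told|instructed)\b",
    r"\bthe instruction says\b",
    r"\bI (?:need|am supposed) to avoid\b",
]

def has_meta_discussion(reasoning: str) -> bool:
    return any(re.search(p, reasoning, re.IGNORECASE) for p in META_PATTERNS)

print(has_meta_discussion(
    "I've been asked to avoid mentioning X, so I need another route."
))  # True
```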
The researchers also investigated situational awareness — whether telling models that their chain of thought is being monitored would change their controllability. The hypothesis was that models aware of monitoring might try harder to comply with instructions. The results show that situational awareness increases controllability by only approximately 4 percentage points across tested models. This modest improvement suggests that the controllability limitation is not primarily motivational (models not trying hard enough) but architectural (models being unable to modify their reasoning processes even when they want to).
This distinction between motivation and capability is crucial for AI safety. If models simply were not trying to control their chain of thought, we might worry that more sophisticated prompting or training could unlock hidden controllability. But the data suggests the limitation is more fundamental — current model architectures genuinely cannot exert fine-grained control over their reasoning processes, even when explicitly motivated to do so. Research from the Center for AI Safety has similarly highlighted the importance of understanding these architectural constraints.
Implications for AI Safety Monitoring and Alignment
The paper’s authors describe themselves as “cautiously optimistic” about the implications for AI safety, and the data supports this stance. Low chain of thought controllability means that current reasoning models cannot easily manipulate their internal reasoning to deceive monitors. This makes chain of thought monitoring a robust — though not infallible — safety mechanism.
Consider the adversarial scenario that motivates this research: a misaligned model that generates benign-looking reasoning traces while pursuing hidden objectives. For this deception to work, the model would need high chain of thought controllability — the ability to write one thing in its reasoning while actually reasoning in a completely different way. The CoT-Control benchmark shows that current models are far from capable of this kind of sophisticated deception, with even the best performer (Claude 3.7 Sonnet at 36.1%) unable to reliably control its reasoning more than a third of the time.
However, several caveats apply. First, controllability scores have been increasing with model capability, following the logarithmic trend with model size. As models continue to grow, controllability will likely increase as well, potentially reaching levels where chain of thought deception becomes feasible. Second, the benchmark tests controllability through explicit instructions, which may underestimate the ability of models to develop implicit reasoning control through novel training techniques not yet deployed.
The researchers recommend that frontier labs adopt chain of thought controllability as a standard safety metric, tracking it alongside capability benchmarks with each new model release. This tracking would provide early warning if controllability begins increasing rapidly, allowing safety teams to adjust monitoring strategies accordingly.
What Frontier Labs Should Do Next
OpenAI’s research concludes with clear recommendations for the AI industry, and several additional actions emerge from the analysis. The path forward requires coordinated effort across multiple dimensions of AI development and safety research.
First, every frontier lab should incorporate chain of thought controllability measurement into their standard evaluation suite. The CoT-Control benchmark provides a ready-made framework with 14,076 test instances across diverse domains. Publishing controllability scores alongside capability benchmarks would create accountability and enable the research community to track trends over time.
Second, the RL-controllability tradeoff demands new training approaches. Current reinforcement learning pipelines that optimize purely for task performance inadvertently destroy controllability. Developing training methods that preserve or even enhance controllability while maintaining capability improvements should become a priority research direction. Multi-objective RL frameworks that include controllability as an explicit reward signal represent one promising avenue.
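In schematic form, such a multi-objective reward might look like the following, where the weighting term `lam` is a hypothetical hyperparameter, not a published recipe:

```python
def multi_objective_reward(
    answer_correct: bool,
    cot_compliant: bool,
    lam: float = 0.2,
) -> float:
    """Sketch of a multi-objective RL reward in the direction the
    article suggests: task correctness plus a lambda-weighted bonus
    for complying with instructions that govern the chain of thought."""
    return float(answer_correct) + lam * float(cot_compliant)
```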
Third, the reasoning effort paradox requires practical accommodation. Monitoring systems should be calibrated to account for the fact that high-effort reasoning modes produce less controllable — and therefore more authentic — chain of thought traces. This actually strengthens the safety case for monitoring, as the scenarios where models reason most intensively are also where monitoring is most reliable.
Fourth, the industry should invest in understanding why some models achieve significantly higher controllability than others. Claude 3.7 Sonnet’s 36.1% compared to DeepSeek R1’s 0.1% represents a massive spread that likely reflects meaningful differences in training methodology, data curation, or architectural choices. Understanding these factors would enable more intentional decisions about the controllability-capability frontier.
Finally, longitudinal tracking is essential. If chain of thought controllability approaches levels where deception becomes feasible — perhaps above 80-90% — the entire paradigm of chain of thought monitoring would need fundamental rethinking. Early warning through consistent measurement is far preferable to discovering this threshold after it has been crossed. The research community owes OpenAI credit for establishing this benchmark and making the first comprehensive measurement, but the work of monitoring controllability over time falls to everyone building frontier AI systems.
Frequently Asked Questions
What is chain of thought controllability in reasoning models?
Chain of thought controllability measures whether AI reasoning models can follow instructions that govern their internal thinking process, not just their final outputs. OpenAI’s CoT-Control benchmark found that every frontier model tested struggles with this capability, with even the best performer reaching only 36.1% controllability on its chain of thought traces.
Why does DeepSeek R1 score only 0.1% on CoT controllability?
DeepSeek R1 demonstrates an extreme gap between output controllability (48.6%) and CoT controllability (0.1%) because reinforcement learning training optimizes models for task performance rather than instruction compliance within reasoning traces. RL training reduces controllability by over an order of magnitude compared to base models.
How does reinforcement learning affect chain of thought control?
Reinforcement learning training reduces chain of thought controllability by more than an order of magnitude. Models trained with RL become highly focused on generating correct answers, making their internal reasoning processes resistant to external instructions about how they should think or reason through problems.
Which AI model has the highest chain of thought controllability?
Claude 3.7 Sonnet achieved the highest chain of thought controllability at 36.1% in OpenAI’s CoT-Control benchmark, outperforming all 13 frontier models tested including GPT-5.2, GPT-5, o3, and o4-mini. This suggests that different training approaches yield varying levels of internal reasoning compliance.
Does increased reasoning effort improve controllability?
No, increased reasoning effort actually reduces chain of thought controllability. Testing on GPT-OSS 120B showed controllability dropped from 10.3% at low reasoning effort to just 2.6% at high reasoning effort. More intensive reasoning makes models less responsive to instructions governing their thinking process.
What is the CoT-Control benchmark and why does it matter for AI safety?
CoT-Control is a benchmark created by OpenAI featuring 14,076 test instances across five established benchmarks (MMLU-Pro, GPQA, HLE, SWE-Bench, BFCL). It matters for AI safety because if models cannot control their chain of thought, external monitoring of reasoning traces remains a reliable safety mechanism for detecting deceptive or misaligned behavior.