Evaluating LLMs in Scientific Discovery | SDE Benchmark
Table of Contents
- Why Current LLM Science Benchmarks Fall Short
- Introducing the Scientific Discovery Evaluation Framework
- Multi-Domain Coverage Across Biology, Chemistry, Materials, and Physics
- Two-Phase Evaluation of LLM Scientific Reasoning
- Performance Gap Between General Science and Discovery Tasks
- Diminishing Returns of Scaling LLM Size and Reasoning
- Systematic Weaknesses Shared Across Top-Tier Models
- LLMs Show Promise in Guided Scientific Exploration
- Implications for AI-Assisted Scientific Research
- Future Directions for LLM Scientific Discovery Evaluation
📌 Key Takeaways
- Discovery-Grounded Benchmark: SDE evaluates LLMs on genuine research projects across biology, chemistry, materials, and physics — not decontextualized textbook questions
- Consistent Performance Gap: State-of-the-art LLMs score significantly lower on scientific discovery tasks than on general science benchmarks like GPQA
- Diminishing Returns from Scale: Simply increasing model size and reasoning capability shows diminishing returns for discovery-relevant tasks
- No Scientific Superintelligence Yet: Large performance variation across scenarios means no single LLM consistently dominates — all are distant from general scientific superintelligence
- Serendipity Still Matters: LLMs show promise even when individual scenario scores are low, highlighting the role of guided exploration in scientific breakthroughs
Why Current LLM Science Benchmarks Fall Short
Large language models are increasingly positioned as transformative tools for scientific research, capable of accelerating everything from literature triage and hypothesis generation to computational simulation and autonomous experimentation. Systems like ChemCrow, autonomous co-scientists, and virtual laboratory platforms have begun to plan, execute, and interpret experiments by coupling language reasoning to domain-specific tools. Yet the evaluation frameworks used to measure LLM scientific capability remain stuck in a paradigm that fundamentally misrepresents what matters for actual discovery.
Current science benchmarks — including GPQA, ScienceQA, MMMU, and even Humanity’s Last Exam — share a critical limitation: they probe decontextualized knowledge through perception-heavy question answering with items only loosely connected to specific research domains. A model can ace these benchmarks by memorizing facts, recognizing patterns, and applying formulaic reasoning without ever demonstrating the iterative thinking, creative hypothesis formation, and nuanced result interpretation that characterize real scientific work. As the researchers behind the new SDE framework argue, mastering static questions does not guarantee readiness for discovery, just as earning straight A’s in coursework does not indicate a great researcher.
The disconnect becomes increasingly problematic as organizations invest heavily in LLM-powered research tools. Benchmarks in coding (SWE-bench), mathematics (AIME), writing (Arena-Hard), and tool use (Tau2-bench) have matured into comparatively stable tests with clear ground truth and strong predictive validity. Science benchmarks, by contrast, remain susceptible to label noise, loosely tied to genuine research contexts, and unable to capture the multi-step reasoning chains that scientific discovery demands. This evaluation gap creates a dangerous asymmetry: we are deploying AI for science based on metrics that do not measure scientific capability.
Introducing the Scientific Discovery Evaluation Framework
The Scientific Discovery Evaluation (SDE) framework, developed by a massive multi-institutional collaboration spanning Deep Principle, Cornell University, Ohio State University, University of Toronto, Stanford, MIT, Princeton, Cambridge, and a dozen more leading institutions, introduces a fundamentally different approach to evaluating LLMs. Rather than assembling disconnected questions from textbooks and exams, SDE starts with genuine research projects that domain experts consider scientifically interesting and decomposable.
The framework operates through a two-phase evaluation structure. In the first phase, domain experts define research projects across four scientific domains and decompose each project into modular research scenarios — self-contained experimental or computational tasks that contribute to the broader project goal. From these scenarios, vetted evaluation questions are sampled, creating tight connections between individual test items and their research context. This scenario-grounded approach ensures that every question evaluates skills directly relevant to the discovery process.
The second phase moves beyond individual questions to project-level performance assessment. Here, LLMs must demonstrate the ability to propose testable hypotheses based on scenario observations, design simulations or experiments to test those hypotheses, and interpret results in the context of the broader research question. This two-level structure — question-level accuracy plus project-level discovery capability — creates a far more nuanced picture of LLM scientific competence than any existing benchmark provides.
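To make this structure concrete, here is a minimal sketch of how a project, its scenarios, and their vetted questions might be represented, together with a phase-one accuracy score. All class and function names are hypothetical illustrations under my own assumptions; SDE's actual data schema is not published in this form.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """A vetted question tied to one research scenario (phase one)."""
    prompt: str
    answer: str

@dataclass
class Scenario:
    """A self-contained experimental or computational task within a project."""
    name: str
    questions: list[Question] = field(default_factory=list)

@dataclass
class Project:
    """An expert-defined research project decomposed into scenarios."""
    domain: str   # e.g. "chemistry"
    goal: str
    scenarios: list[Scenario] = field(default_factory=list)

def phase_one_accuracy(project: Project, model_answer) -> float:
    """Question-level accuracy over all scenario-tied items.

    `model_answer` is a placeholder callable that maps a prompt to an answer.
    """
    questions = [q for s in project.scenarios for q in s.questions]
    correct = sum(model_answer(q.prompt) == q.answer for q in questions)
    return correct / len(questions) if questions else 0.0
```

The point of the hierarchy is that every question carries its research context with it, and new projects or scenarios can be appended without touching existing ones.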
What makes SDE particularly valuable is its reproducibility and extensibility. The modular research scenario approach allows new domains, projects, and question types to be added without restructuring the entire benchmark. Domain experts can contribute new evaluation content following established protocols, ensuring the benchmark evolves alongside both scientific frontiers and AI capabilities. This living benchmark design addresses a persistent criticism of static evaluations: they become stale as models improve and eventually saturate.
Multi-Domain Coverage Across Biology, Chemistry, Materials, and Physics
SDE spans four major scientific domains, each presenting distinct challenges for LLM evaluation. Biology encompasses projects in genomics, cancer biology, protein engineering, and drug-target interactions — areas where LLMs must reason about complex biological systems, interpret experimental data from sequencing and imaging, and propose mechanistically grounded hypotheses. The biology scenarios test whether models can move beyond surface-level biological knowledge to engage with the uncertainty and complexity inherent in living systems.
Chemistry evaluation includes organic synthesis planning, reaction mechanism prediction, molecular property optimization, and drug discovery workflows. These scenarios are particularly demanding because they require LLMs to reason about three-dimensional molecular structures, predict reaction outcomes based on electronic and steric factors, and navigate the vast space of possible synthetic routes. The research team from EPFL’s Laboratory of Artificial Chemical Intelligence contributed scenarios that push models beyond simple SMILES parsing to genuine chemical reasoning.
Materials science introduces computational design challenges where LLMs must predict material properties, suggest compositions for target applications, and interpret simulation outputs from density functional theory and molecular dynamics. The materials scenarios reveal whether models understand the structure-property relationships that govern material behavior — knowledge that cannot be acquired through textbook memorization alone.
Physics rounds out the evaluation with scenarios spanning quantum computing, condensed matter physics, statistical mechanics, and optical physics. These scenarios demand mathematical reasoning combined with physical intuition — the ability to set up problems, identify relevant approximations, and interpret results in physically meaningful ways. Across all four domains, the SDE benchmark reveals that scientific discovery requires a synthesis of knowledge, reasoning, and creativity that current LLMs struggle to achieve consistently.
Two-Phase Evaluation of LLM Scientific Reasoning
The SDE framework’s two-phase evaluation structure represents a significant methodological advance over existing science benchmarks. Phase one assesses question-level accuracy on scenario-tied items — questions that are grounded in specific research contexts rather than floating in an abstract knowledge space. Each question relates to a particular experimental setup, dataset, or computational result within a defined research scenario, requiring models to reason within context rather than recall isolated facts.
Phase two elevates the evaluation to project-level performance, where the stakes and complexity increase dramatically. At this level, LLMs must demonstrate three core scientific capabilities: hypothesis generation (proposing testable explanations for observed phenomena), experimental design (planning simulations or laboratory experiments that could validate or refute hypotheses), and result interpretation (making sense of data in the context of the broader research question and existing literature).
The interaction between these two phases reveals subtle aspects of LLM scientific capability. A model might score well on individual scenario questions while failing at project-level synthesis — demonstrating factual knowledge without the ability to combine it into coherent scientific reasoning. Conversely, some models show surprisingly strong project-level performance despite modest scenario scores, suggesting an ability to leverage partial knowledge through creative reasoning and guided exploration.
This two-phase structure also enables more granular diagnosis of model weaknesses. Rather than simply reporting an aggregate accuracy number, SDE can identify whether a model’s scientific limitations stem from knowledge gaps (low scenario scores), reasoning failures (low project scores despite good scenario scores), or both. This diagnostic capability is invaluable for guiding targeted model improvements and understanding which aspects of scientific reasoning are most amenable to improvement through scaling, training data enhancement, or architectural innovation.
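A rough sketch of that diagnostic split, assuming a single illustrative threshold that SDE itself does not prescribe:

```python
def diagnose(scenario_score: float, project_score: float,
             threshold: float = 0.6) -> str:
    """Classify a weakness from the two score levels.

    The 0.6 cutoff is an arbitrary illustration, not an SDE convention.
    """
    if scenario_score < threshold and project_score < threshold:
        return "knowledge gap and reasoning failure"
    if scenario_score < threshold:
        return "knowledge gap (weak scenario-level accuracy)"
    if project_score < threshold:
        return "reasoning failure (knowledge present, synthesis missing)"
    return "no major weakness detected at this threshold"
```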
Performance Gap Between General Science and Discovery Tasks
The most striking finding from the SDE benchmark is the consistent performance gap between how LLMs score on general science benchmarks and their actual scientific discovery capabilities. Models that achieve impressive marks on GPQA, ScienceQA, and similar decontextualized evaluations show significantly degraded performance when confronted with the contextual reasoning, hypothesis generation, and experimental interpretation required by SDE.
This gap is not a minor statistical artifact — it represents a fundamental disconnect between what current benchmarks measure and what scientific discovery requires. On general science benchmarks, state-of-the-art LLMs routinely score above 80% and often approach 90% or higher on certain domains. On SDE’s discovery-relevant tasks, the same models frequently drop 20-30 percentage points, revealing that the knowledge probed by traditional benchmarks is necessary but far from sufficient for genuine scientific reasoning.
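As a quick arithmetic illustration of what a 20-30 point drop looks like when comparing paired scores for the same models (the numbers below are invented for illustration, not reported SDE results):

```python
# Illustrative numbers only -- not actual SDE or GPQA results.
general_benchmark = {"model_a": 0.88, "model_b": 0.84}
sde_discovery     = {"model_a": 0.61, "model_b": 0.58}

for model in general_benchmark:
    gap = (general_benchmark[model] - sde_discovery[model]) * 100
    print(f"{model}: {gap:.0f} percentage point drop")  # ~27 and ~26 points
```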
The performance gap manifests differently across domains. In chemistry, models show relatively stronger performance on synthesis planning and reaction prediction — tasks with more structured input-output relationships — but struggle significantly with open-ended molecular design and property optimization. In biology, the gap is particularly pronounced for hypothesis generation tasks, where models must reason about complex biological mechanisms without explicit procedural guidance. Physics scenarios reveal that mathematical reasoning, while generally strong, breaks down when models must combine it with physical intuition to interpret results.
These findings carry important implications for organizations deploying LLMs as scientific research assistants. The performance gap suggests that relying on general science benchmark scores to assess an LLM’s research utility would lead to significant overestimation of its capabilities. Organizations need discovery-specific evaluations — like SDE — to realistically assess whether an AI tool can contribute meaningfully to their research workflows rather than simply regurgitating textbook knowledge in novel packaging.
Diminishing Returns of Scaling LLM Size and Reasoning
A particularly sobering finding from the SDE evaluation is the diminishing return observed when scaling up model size and reasoning capability. The AI industry has operated under the assumption that bigger models with more sophisticated reasoning chains would naturally translate to better scientific performance. SDE’s results challenge this assumption directly.
While scaling from smaller to larger models does produce initial improvements on discovery tasks, the marginal gains decrease significantly as models grow. The jump from GPT-3.5-class models to GPT-4-class systems yields substantial discovery performance improvements, but further scaling — including models with enhanced chain-of-thought reasoning and test-time compute — shows progressively smaller returns on SDE metrics. This suggests that the bottleneck for scientific discovery capability is not raw model size or reasoning depth but something more fundamental about how models represent and process scientific knowledge.
The diminishing returns pattern varies by task type within SDE. For question-level accuracy on scenario-tied items, scaling continues to provide modest improvements even at the frontier. But for project-level tasks — hypothesis generation, experimental design, and result interpretation — larger models show minimal gains over their slightly smaller counterparts. This divergence suggests that project-level scientific reasoning requires capabilities that do not emerge simply from increasing parameter counts or training compute.
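The divergence is easiest to see by comparing marginal gains between adjacent model tiers at each evaluation level. The scores below are invented to illustrate the pattern described above, not actual SDE measurements:

```python
# Hypothetical scores by model tier, for illustrating diminishing returns only.
tiers          = ["small", "mid", "large", "frontier+reasoning"]
question_level = [0.42, 0.58, 0.66, 0.70]
project_level  = [0.25, 0.38, 0.41, 0.42]

def marginal_gains(scores):
    """Gain from each tier to the next."""
    return [round(b - a, 2) for a, b in zip(scores, scores[1:])]

print(marginal_gains(question_level))  # [0.16, 0.08, 0.04] -- modest but steady
print(marginal_gains(project_level))   # [0.13, 0.03, 0.01] -- flattens quickly
```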
These findings echo broader concerns in the AI research community about the limits of scaling laws for complex cognitive tasks. Scientific discovery, like other forms of creative reasoning, may require architectural innovations, specialized training approaches, or hybrid systems that combine language models with domain-specific computational tools rather than relying solely on scale. The Nature analysis of AI in science similarly notes that current models struggle with the open-ended reasoning that characterizes breakthrough research.
Systematic Weaknesses Shared Across Top-Tier Models
Beyond individual model performance, SDE reveals systematic weaknesses that are shared across top-tier models from different providers. Despite being trained on different data, using different architectures, and employing different alignment strategies, the leading LLMs exhibit remarkably similar failure patterns on scientific discovery tasks. This convergence of weaknesses suggests fundamental limitations in current language model approaches to scientific reasoning rather than training-specific deficiencies.
One shared weakness is the difficulty with counterfactual scientific reasoning — the ability to consider how experimental outcomes would change under different conditions. This type of reasoning is essential for designing informative experiments and interpreting surprising results, yet all evaluated models struggle when asked to predict the consequences of parameter changes in experimental setups they have not seen before. The models can describe what typically happens in well-documented experimental conditions but cannot reliably reason about novel configurations.
Another consistent failure pattern involves the integration of quantitative and qualitative reasoning. Scientific discovery frequently requires combining numerical data analysis with conceptual understanding — interpreting a statistical trend in terms of a physical mechanism, or using theoretical principles to predict expected measurement ranges. Current LLMs tend to handle these modalities separately, producing either technically correct numerical analysis without physical insight or qualitatively reasonable narratives disconnected from quantitative evidence.
The shared weakness in multi-step experimental reasoning is perhaps the most practically significant. Real scientific discovery rarely follows a linear path; researchers must adapt their experimental plans based on intermediate results, recognize when initial hypotheses need revision, and identify promising tangential findings. All evaluated models show significant degradation when tasks require this adaptive, multi-step reasoning — they generate reasonable first steps but lose coherence as the experimental reasoning chain lengthens.
LLMs Show Promise in Guided Scientific Exploration
Despite the sobering performance gaps and systematic weaknesses, the SDE benchmark reveals an encouraging finding: LLMs already demonstrate genuine promise across a wide range of scientific discovery projects, including cases where the constituent scenario scores are surprisingly low. This seemingly paradoxical result highlights the role of guided exploration and serendipity in scientific discovery — qualities that LLMs can support even when their domain knowledge is incomplete.
In several evaluation cases, models that scored modestly on individual scenario questions nevertheless produced project-level outputs that domain experts rated as scientifically interesting and potentially valuable. These outputs included novel hypotheses that combined knowledge from different sub-domains in unexpected ways, experimental designs that incorporated creative control conditions, and result interpretations that identified patterns the human evaluators had not initially considered. The ability to generate these outputs despite imperfect domain knowledge suggests that LLMs may be most valuable not as authoritative scientific experts but as creative thinking partners that propose ideas for human scientists to evaluate and refine.
This finding aligns with how LLMs are already being used successfully in practice. Autonomous co-scientist systems, virtual laboratories for nanobody design, and AI-assisted drug discovery platforms all leverage LLMs as hypothesis generators rather than autonomous decision-makers. The human-in-the-loop paradigm — where LLMs propose and humans evaluate — appears well-suited to the current state of AI scientific capability. The LLM’s ability to rapidly explore a vast hypothesis space, even with imperfect accuracy, creates value by surfacing possibilities that human researchers might not consider within the constraints of their existing mental models.
For research organizations, this finding suggests a practical deployment strategy: use LLMs as exploration engines that expand the space of considered hypotheses and experimental approaches, while maintaining human expert oversight for evaluation and decision-making. This approach maximizes the current strengths of LLMs — breadth of knowledge, speed of generation, and creative combination of disparate information — while compensating for their weaknesses in deep domain reasoning and experimental judgment.
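A minimal sketch of that propose-and-filter loop, where every function argument is a placeholder for whatever model interface and review process a team actually uses:

```python
def guided_exploration(llm_propose, expert_review, observations, rounds=3):
    """Human-in-the-loop exploration: the LLM proposes hypotheses, a human
    expert filters them, and accepted ideas seed the next round.

    `llm_propose` and `expert_review` are hypothetical callables, not part of
    any specific SDE tooling.
    """
    accepted = []
    for _ in range(rounds):
        candidates = llm_propose(observations, avoid=accepted)
        kept = [h for h in candidates if expert_review(h)]
        accepted.extend(kept)
        # Accepted ideas become part of the context for the next round.
        observations = observations + [f"expert accepted: {h}" for h in kept]
    return accepted
```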
Implications for AI-Assisted Scientific Research
The SDE benchmark findings carry significant implications for the growing ecosystem of AI-assisted scientific research tools. First and most importantly, they establish that current general science benchmarks are insufficient for evaluating LLMs intended for research applications. Organizations selecting AI tools for scientific work should demand discovery-specific evaluations that test contextual reasoning, hypothesis generation, and experimental design — not just factual recall and pattern matching.
Second, the diminishing returns from scaling suggest that the path to better scientific AI runs through specialized training and architectural innovation rather than simply building bigger models. This has practical implications for resource allocation: research labs may get more scientific value from fine-tuning medium-sized models on domain-specific discovery tasks than from adopting the largest available general-purpose model. The SDE framework provides the evaluation infrastructure needed to measure whether such targeted investments actually improve discovery capability.
Third, the finding that performance varies dramatically across research scenarios — with the best-performing model changing depending on the specific project — suggests that a portfolio approach to scientific AI may be optimal. Rather than relying on a single model for all research tasks, organizations might deploy different models for different types of scientific reasoning, selecting based on scenario-specific strengths revealed by SDE-style evaluation. According to insights from the National Science Foundation’s AI research programs, this multi-model approach is already emerging in federally funded research initiatives.
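In practice, a portfolio approach reduces to routing each scenario type to whichever model scored best on it in an SDE-style evaluation. A small sketch, using invented per-scenario-type scores rather than published results:

```python
# Hypothetical per-scenario-type scores from an SDE-style evaluation.
scorecard = {
    "model_a": {"synthesis_planning": 0.71, "hypothesis_generation": 0.44},
    "model_b": {"synthesis_planning": 0.63, "hypothesis_generation": 0.52},
}

def route(scenario_type: str) -> str:
    """Pick the model with the best measured score for this scenario type."""
    return max(scorecard, key=lambda m: scorecard[m].get(scenario_type, 0.0))

print(route("synthesis_planning"))      # -> "model_a" under these numbers
print(route("hypothesis_generation"))   # -> "model_b" under these numbers
```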
Finally, the SDE results reinforce the importance of human-AI collaboration in scientific discovery. The promise shown by LLMs in guided exploration, combined with their systematic weaknesses in autonomous reasoning, argues strongly for collaborative frameworks where AI augments rather than replaces human scientific judgment. The most productive near-term applications will likely pair LLM-generated hypotheses and experimental suggestions with expert human evaluation, creating a synergy that leverages the complementary strengths of artificial and human intelligence.
Future Directions for LLM Scientific Discovery Evaluation
The SDE framework opens several important directions for future research in AI evaluation and scientific discovery. The modular scenario-based design naturally extends to additional scientific domains beyond the initial four, including earth sciences, ecology, neuroscience, and social sciences. Each new domain would bring unique evaluation challenges and could reveal domain-specific patterns in LLM scientific reasoning capability.
A particularly promising extension involves evaluating LLMs on longitudinal discovery tasks — multi-week or multi-month research projects where the model must maintain context, adapt strategies based on accumulating evidence, and demonstrate the persistence and flexibility that characterize successful scientific inquiry. Current evaluations, including SDE, capture snapshot capabilities; longitudinal assessment would test the sustained scientific reasoning that real research demands.
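One way to picture such a longitudinal assessment is a staged loop in which the model's earlier decisions are carried forward as context and scored at each stage. This is a speculative sketch of what that protocol could look like, not part of the current SDE framework; all argument names are placeholders.

```python
def longitudinal_eval(model_step, project_state, stages):
    """Score a model across sequential research stages.

    `model_step` consumes new evidence plus the model's own history;
    each stage supplies placeholder `score_fn` and `update` callables.
    """
    history, scores = [], []
    for stage in stages:
        decision = model_step(stage["evidence"], history)
        scores.append(stage["score_fn"](decision, project_state))
        history.append(decision)  # accumulated context across stages
        project_state = stage["update"](project_state, decision)
    return scores
```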
The benchmark also provides a foundation for evaluating emerging AI architectures specifically designed for scientific reasoning. As the field develops specialized scientific AI systems — incorporating domain-specific knowledge graphs, simulation capabilities, and laboratory automation interfaces — SDE-style evaluation will be essential for measuring whether these innovations translate to genuine discovery capability improvements.
Integration with interactive research platforms represents another frontier. The ability to transform dense research papers and complex experimental results into interactive experiences could complement LLM-driven scientific exploration by making AI-generated hypotheses and experimental designs more accessible to interdisciplinary teams. As scientific research becomes increasingly collaborative and data-intensive, the tools for communicating and evaluating AI contributions will be as important as the AI systems themselves.
The SDE framework represents a crucial step toward honest, rigorous evaluation of AI capabilities for scientific discovery. By grounding evaluation in genuine research contexts and measuring the skills that actually drive discovery, it provides the community with a tool for tracking real progress — and avoiding the false confidence that comes from excelling at tests that do not test what matters. The future of AI in science depends not just on building more capable models but on developing the evaluation infrastructure needed to know when genuine capability gains have been achieved.
Frequently Asked Questions
What is the SDE benchmark for evaluating LLMs in scientific discovery?
The Scientific Discovery Evaluation (SDE) benchmark is a scenario-grounded framework that evaluates large language models across four scientific domains: biology, chemistry, materials science, and physics. Unlike traditional science benchmarks that probe decontextualized knowledge, SDE tests LLMs on iterative reasoning, hypothesis generation, experiment design, and result interpretation — the core skills that drive actual scientific discovery.
How do LLMs perform on scientific discovery tasks compared to general science benchmarks?
The SDE benchmark reveals a consistent performance gap between LLM scores on general science benchmarks and their actual scientific discovery capabilities. Models that score highly on decontextualized science questions show significantly lower performance when tasked with proposing testable hypotheses, designing experiments, and interpreting results in context. Scaling up model size and reasoning shows diminishing returns for discovery tasks.
Which scientific domains does the SDE benchmark cover?
The SDE benchmark covers four major scientific domains: biology (including genomics and cancer biology), chemistry (including organic synthesis and drug discovery), materials science (including computational materials design), and physics (including quantum computing and condensed matter). Domain experts define research projects of genuine interest and decompose them into modular research scenarios for evaluation.
Can any current LLM achieve scientific superintelligence?
According to the SDE benchmark findings, all current LLMs are distant from general scientific superintelligence. Large performance variation across research scenarios means the best-performing model changes depending on the specific scientific discovery project evaluated. No single model consistently outperforms across all domains and scenario types, suggesting fundamental limitations in current approaches to AI-driven scientific discovery.
What are the practical implications of the SDE benchmark for AI-assisted research?
The SDE benchmark demonstrates that LLMs already show promise across a variety of scientific discovery projects, including cases where constituent scenario scores are low. This highlights the role of guided exploration and serendipity in discovery. The framework offers reproducible evaluation for discovery-relevant assessment and charts practical paths for advancing LLM development toward genuine scientific research assistance.
How does the SDE framework differ from existing science benchmarks like GPQA?
Unlike GPQA, ScienceQA, MMMU, or Humanity’s Last Exam — which test decontextualized, perception-heavy question answering with items loosely connected to research domains — the SDE framework uses scenario-grounded evaluation where domain experts define real research projects. It assesses models at two levels: question-level accuracy on scenario-tied items and project-level performance requiring hypothesis proposal, simulation design, and result interpretation.