LLM Benchmarks: The Complete Survey Guide to Evaluating Large Language Models in 2025
Table of Contents
- Why LLM Benchmarks Matter: The Foundation of AI Evaluation
- The Three-Category Taxonomy of LLM Benchmarks
- General Capabilities: Linguistic Core Benchmarks Explained
- Knowledge and Reasoning: How LLM Benchmarks Test Deep Understanding
- Domain-Specific LLM Benchmarks: Science, Law, and Engineering
- Safety and Reliability Benchmarks: Testing AI Guardrails
- Agent Benchmarks: Evaluating Autonomous AI Systems
- Data Contamination: The Silent Crisis in LLM Evaluation
- Cultural and Linguistic Bias in Benchmarks
- Emerging Trends: Dynamic Evaluation and Process-Based Assessment
- How to Choose the Right LLM Benchmarks for Your Use Case
- The Future of LLM Benchmarks: What Comes Next
🔑 Key Takeaways
- Why LLM Benchmarks Matter: The Foundation of AI Evaluation — LLM benchmarks are far more than academic exercises.
- The Three-Category Taxonomy of LLM Benchmarks — The most comprehensive organizational framework for LLM benchmarks divides them into three major categories: General Capabilities Benchmarks, Domain-Specific Benchmarks, and Target-Specific Benchmarks.
- General Capabilities: Linguistic Core Benchmarks Explained — The evolution of linguistic capability benchmarks represents what researchers describe as a “continuous arms race between model advancement and evaluation methodology.”
- Knowledge and Reasoning: How LLM Benchmarks Test Deep Understanding — Knowledge evaluation has undergone a fundamental transformation from simple fact recall to assessing deep expert-level understanding.
- Domain-Specific LLM Benchmarks: Science, Law, and Engineering — While general benchmarks test broad capabilities, domain-specific LLM benchmarks determine whether models can function as useful tools within professional fields.
Why LLM Benchmarks Matter: The Foundation of AI Evaluation
LLM benchmarks are far more than academic exercises. They form the backbone of the entire AI evaluation infrastructure that determines which models get deployed, which research directions receive funding, and how quickly the field advances. Since the introduction of the Transformer architecture in 2017, the sheer scale of language models has grown from millions to trillions of parameters, bringing emergent capabilities like few-shot learning, in-context reasoning, and multi-turn dialogue that earlier evaluation methods simply cannot capture.
Benchmarks serve three essential functions in the AI ecosystem. First, they provide objective comparison — enabling researchers to measure how GPT-4, Claude, Llama, Gemini, and Qwen stack up against each other on identical tasks. Second, they offer diagnostic insight — revealing specific weaknesses in reasoning, factual accuracy, safety, or code generation that guide targeted model improvements. Third, they build user trust — providing standardized evidence that a model meets minimum thresholds for safety, fairness, and reliability before deployment in sensitive applications like healthcare, law, or finance.
The evaluation landscape has evolved through distinct phases. Early benchmarks like GLUE and SuperGLUE focused narrowly on natural language understanding through small-scale, single-task tests. As models matured, comprehensive multi-task benchmarks like MMLU and BIG-Bench emerged, testing knowledge across dozens of subjects simultaneously. Today, the frontier has shifted to agent benchmarks, safety evaluations, and dynamic assessments that attempt to simulate real-world complexity. This progression reflects a deeper truth: as models grow more capable, we need increasingly sophisticated tools to understand what they can and cannot do.

The Three-Category Taxonomy of LLM Benchmarks
The most comprehensive organizational framework for LLM benchmarks divides them into three major categories: General Capabilities Benchmarks, Domain-Specific Benchmarks, and Target-Specific Benchmarks. This taxonomy, proposed in the 2025 survey covering 283 benchmarks, provides a clear mental model for understanding what each evaluation measures and why it matters.
General Capabilities Benchmarks
These evaluate core competencies that every language model should possess, regardless of application domain. They span three major areas: linguistic core capabilities (understanding, generation, dialogue), knowledge (factual recall, expert-level reasoning), and reasoning (logical, commonsense, mathematical, and applied). Flagship examples include GLUE, SuperGLUE, MMLU, MMLU-Pro, BIG-Bench, HELM, and GPQA.
Domain-Specific Benchmarks
These test model performance within particular professional or academic fields. Natural sciences benchmarks cover mathematics (GSM8K, MATH, Omni-MATH), physics (SciBench, UGPhysics), chemistry (ChemEval, MoleculeQA), and biology (PubMedQA, LAB-Bench). Humanities and social sciences benchmarks span law (LegalBench, LawBench), education (E-Eval), psychology (CPsyExam), and finance (FinEval, FLARE). Engineering benchmarks test code generation (HumanEval, MBPP, SWE-bench), database operations (Spider, BIRD), and hardware design (VerilogEval).
Target-Specific Benchmarks
These focus on particular behavioral properties of models rather than domain knowledge. Risk and reliability benchmarks evaluate safety (ToxiGen, JailbreakBench, HarmBench), hallucination detection (TruthfulQA, FActScore), robustness (AdvGLUE, IFEval), and data leakage (WikiMIA). Agent benchmarks assess planning (FlowBench, BrowseComp), multi-agent coordination (MultiAgentBench), and domain-specific agent capabilities (OSWorld, ScienceAgentBench).
General Capabilities: Linguistic Core Benchmarks Explained
The evolution of linguistic capability benchmarks represents what researchers describe as a “continuous arms race between model advancement and evaluation methodology.” Understanding this evolution helps explain why no single benchmark tells the complete story of a model’s linguistic abilities.
The journey began with the fragmentation crisis of pre-2018 NLP, where every research group used different evaluation tasks, making cross-study comparison virtually impossible. GLUE (General Language Understanding Evaluation) solved this by aggregating nine diverse NLU tasks — including sentiment analysis, textual entailment, and paraphrase detection — into a single benchmark with over 415,000 examples. When models quickly saturated GLUE’s difficulty ceiling, SuperGLUE raised the bar with harder tasks requiring boolean question answering, causal reasoning, and multi-sentence inference.
The next leap came with commonsense and generation benchmarks. HellaSwag tests whether models can predict plausible story continuations by using adversarial filtering to create challenging distractors. WinoGrande evaluates pronoun resolution through carefully crafted ambiguous sentences. For text generation quality, BERTScore, BartScore, and BLEURT moved beyond crude n-gram overlap metrics like BLEU to leverage neural representations for more nuanced quality assessment.
Dialogue and multi-turn evaluation benchmarks like MT-Bench and MT-Bench-101 represent the current frontier. These test not just single-response quality but sustained conversation coherence, instruction following across turns, and the ability to maintain context over extended interactions. MT-Bench uses LLM-as-judge evaluation with GPT-4 scoring responses across writing, reasoning, math, coding, and extraction tasks — reflecting how models are actually used in practice.
Multilingual benchmarks like XTREME expanded evaluation across 40+ languages, revealing stark performance gaps between high-resource languages like English and lower-resource languages. The HELM (Holistic Evaluation of Language Models) framework from Stanford took the most ambitious approach, evaluating models across 42 scenarios with metrics spanning accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency — the most comprehensive single evaluation framework available.
Knowledge and Reasoning: How LLM Benchmarks Test Deep Understanding
Knowledge evaluation has undergone a fundamental transformation from simple fact recall to assessing deep expert-level understanding. This shift mirrors the growing ambition of deploying LLMs in high-stakes professional settings where surface-level knowledge is insufficient.
MMLU (Massive Multitask Language Understanding) revolutionized knowledge testing by presenting 15,908 multiple-choice questions across 57 academic subjects, from elementary mathematics to professional law and medicine. Its successor, MMLU-Pro, addresses key limitations by expanding answer choices from 4 to 10 options — dramatically reducing the probability of guessing correctly — and introducing more complex reasoning-focused questions that require chain-of-thought processing rather than simple recall.
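The effect of expanding the option count is easy to quantify: a blind guesser's expected accuracy drops from 25% to 10%, and over many questions the chance of a lucky high score collapses. A quick sketch of the arithmetic:

```python
import math

def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on one question."""
    return 1.0 / num_options

def p_at_least(num_questions: int, num_options: int, threshold: float) -> float:
    """Probability a pure guesser scores at or above `threshold` on
    `num_questions` independent multiple-choice questions (binomial tail)."""
    p = 1.0 / num_options
    k_min = math.ceil(threshold * num_questions)
    return sum(
        math.comb(num_questions, k) * p**k * (1 - p) ** (num_questions - k)
        for k in range(k_min, num_questions + 1)
    )

print(random_guess_accuracy(4))   # MMLU: 0.25
print(random_guess_accuracy(10))  # MMLU-Pro: 0.1
```

With 100 questions, the probability of a guesser reaching 35% is small under 4 options and vanishingly small under 10, which is exactly why the wider answer set sharpens the benchmark's signal.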
At the expert level, GPQA (Graduate-Level Google-Proof Q&A) represents a paradigm shift. Questions are designed by domain experts to be difficult even for PhD-level researchers in adjacent fields, and they resist being answered through internet searches. SuperGPQA extends this further with over 10,000 questions across 285 subjects, establishing what may be the most rigorous test of model knowledge depth currently available.
Reasoning benchmarks have similarly matured across multiple dimensions. Logical reasoning tests range from basic rule application (RuleTaker, ProofWriter) to complex first-order logic (FOLIO, LogicBench) and formal logical deduction (SATBench). Mathematical reasoning benchmarks form a difficulty ladder: GSM8K tests grade-school arithmetic, MATH covers competition-level problems, Omni-MATH pushes into olympiad territory, and FrontierMath presents unsolved research problems. Commonsense reasoning benchmarks like StrategyQA and CommonGen test whether models can apply everyday knowledge that humans take for granted — still a surprisingly difficult challenge for even the most advanced LLMs.
Applied reasoning benchmarks like BIG-Bench Hard, LiveBench, and ARC (AI2 Reasoning Challenge) bridge the gap between abstract reasoning tests and practical problem-solving. LiveBench is particularly notable for its regularly updated question sets that resist data contamination — a crucial innovation we will discuss further below. These benchmarks are essential for understanding how models perform on the types of reasoning tasks they will encounter in real-world deployment, from analyzing complex documents to synthesizing strategic insights like those in the McKinsey State of AI 2024 report.
Domain-Specific LLM Benchmarks: Science, Law, and Engineering
While general benchmarks test broad capabilities, domain-specific LLM benchmarks determine whether models can function as useful tools within professional fields. The 2025 survey catalogs an extensive ecosystem of specialized evaluations spanning natural sciences, humanities, social sciences, and engineering.
Natural Sciences
Mathematics benchmarks form the most developed domain-specific category. GSM8K (Grade School Math 8K) provides 8,500 multi-step arithmetic problems requiring 2-8 reasoning steps — and remains a core evaluation despite models now scoring above 95%. MATH raises difficulty to high school competition level, while Omni-MATH and FrontierMath push into olympiad and research-level problems where even the best models struggle significantly. Physics benchmarks like SciBench and UGPhysics test calculation and conceptual understanding, while chemistry evaluations like ChemEval and MoleculeQA assess molecular analysis, reaction prediction, and safety assessment capabilities.
Humanities and Social Sciences
Legal benchmarks test specialized competencies including case analysis, statutory interpretation, and legal reasoning. LegalBench covers 162 tasks designed by legal professionals, while LawBench provides comprehensive evaluation in Chinese legal contexts. Finance benchmarks like FinEval and FLARE test market analysis, financial statement interpretation, and risk assessment — crucial capabilities as firms increasingly explore LLM-powered financial tools.
Engineering and Technology
Code generation benchmarks are perhaps the most commercially significant domain-specific evaluations. HumanEval tests function-level code synthesis from docstrings, while MBPP (Mostly Basic Python Problems) provides a broader sample of programming challenges. SWE-bench represents the cutting edge, testing whether models can resolve real GitHub issues from popular open-source repositories — a much closer simulation of actual software engineering than isolated coding puzzles. Database benchmarks like Spider and BIRD evaluate natural language to SQL translation, while emerging hardware benchmarks like VerilogEval test HDL code generation for chip design.
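HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The original HumanEval paper gives an unbiased estimator for this from n samples of which c pass; a direct sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k draws
    (without replacement) from n samples is among the c passing ones."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 2, 1))   # 0.4 (reduces to c/n when k == 1)
print(pass_at_k(10, 0, 5))  # 0.0
```

The naive alternative of literally sampling k completions wastes information; this estimator uses all n samples per problem, which matters when sampling from large models is expensive.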

Safety and Reliability Benchmarks: Testing AI Guardrails
As LLMs move from research labs into production systems, safety and reliability benchmarks have become the fastest-growing evaluation category. These tests examine whether models can resist manipulation, avoid generating harmful content, maintain factual accuracy, and behave robustly across diverse inputs.
Safety benchmarks form the first line of defense. ToxiGen evaluates whether models generate toxic content toward different demographic groups, using adversarial prompts designed to elicit harmful responses. JailbreakBench and HarmBench systematically test models against known attack patterns that attempt to bypass safety training. StereoSet and CrowS-Pairs measure social biases in model outputs, checking whether responses reinforce stereotypes about gender, race, religion, or other protected characteristics.
Hallucination benchmarks address one of the most commercially impactful failure modes. TruthfulQA tests whether models reproduce common misconceptions, with questions specifically designed around topics where popular but incorrect beliefs exist. FActScore breaks down model outputs into individual claims and verifies each against reliable sources, providing a granular measure of factual accuracy. HaluEval uses LLM-generated hallucinated samples to test detection capabilities, creating an adversarial cat-and-mouse dynamic.
Robustness benchmarks test model resilience to input perturbations. AdvGLUE applies adversarial attacks (character swaps, word substitutions, sentence paraphrases) to standard NLU tasks, revealing how fragile model performance can be under adversarial conditions. IFEval (Instruction Following Evaluation) specifically tests whether models reliably follow formatting constraints and explicit instructions — a seemingly simple capability that proves surprisingly difficult to guarantee. PromptRobust evaluates sensitivity to prompt variations, quantifying how much performance changes with minor wording differences.
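The appeal of instruction-following checks is that they are programmatically verifiable: no judge model is needed to decide whether a response met a formatting constraint. The checkers below are an illustrative sketch in that spirit, not IFEval's actual implementation:

```python
import json

def check_bullet_count(response: str, n: int) -> bool:
    """Constraint: respond with exactly n bullet points ('- ' lines)."""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith("- ")]
    return len(bullets) == n

def check_valid_json(response: str) -> bool:
    """Constraint: the entire response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

reply = "- fast\n- cheap\n- good"
print(check_bullet_count(reply, 3))      # True
print(check_valid_json('{"score": 7}'))  # True
print(check_valid_json(reply))           # False
```

Because each constraint is a deterministic function of the output string, results are reproducible and cheap to compute at scale, which is precisely what makes this class of benchmark attractive for production gating.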
Understanding the safety evaluation landscape is crucial for organizations deploying AI systems. The AI alignment taxonomy guide provides a complementary framework for understanding how safety benchmarks connect to broader alignment research.
Agent Benchmarks: Evaluating Autonomous AI Systems
The emergence of LLM-powered agents — systems that can plan, use tools, browse the web, write and execute code, and interact with software environments — has spawned an entirely new category of benchmarks. These evaluations go far beyond text-in, text-out testing to measure how effectively models can operate as autonomous problem solvers in complex environments.
Planning and control benchmarks test fundamental agentic capabilities. FlowBench evaluates workflow execution across multiple domains, Mobile-Bench tests mobile device operation, and BrowseComp measures web browsing and information retrieval skills. These benchmarks simulate the kind of multi-step, tool-using behavior that defines practical AI agent applications.
Integrated assessment benchmarks provide the most comprehensive evaluations. AgentBench evaluates LLM-as-agent across eight distinct environments including operating systems, databases, web browsing, and card games. GAIA (General AI Assistants) tests real-world assistant capabilities through tasks requiring web browsing, file processing, and multi-step reasoning. TheAgentCompany simulates a complete corporate environment where agents must complete realistic workplace tasks involving email, project management, and documentation systems.
Multi-agent benchmarks represent the newest frontier. MultiAgentBench and MAgIC test coordination between multiple AI agents, evaluating whether they can cooperate, negotiate, and solve problems collectively. These evaluations are particularly relevant as companies explore multi-agent architectures for complex workflows.
Safety in agentic contexts presents unique challenges. AgentHarm tests whether agents can be manipulated into performing harmful actions in interactive environments, while SafeAgentBench evaluates safety in embodied settings where agents control physical or simulated actuators. R-Judge assesses whether models can identify safety risks in multi-turn agentic interactions — reflecting the heightened stakes when models take autonomous actions rather than simply generating text.

Data Contamination: The Silent Crisis in LLM Evaluation
Perhaps the most pressing challenge facing LLM benchmarks today is data contamination — the phenomenon where models are exposed to benchmark data during training, inflating their evaluation scores and undermining the entire purpose of standardized testing. The 2025 survey identifies this as one of three critical problems threatening the reliability of the benchmark ecosystem.
Data contamination occurs through multiple pathways. Training datasets crawled from the internet inevitably include benchmark questions and answers that have been published, discussed in academic papers, or shared on forums. Some contamination is unintentional — a natural consequence of training on web-scale data. Other contamination may be more deliberate, with model developers optimizing for specific benchmark performance. The result is the same: benchmark scores that overstate actual model capability on genuinely novel tasks.
Several approaches have emerged to combat contamination. Dynamic benchmarks like LiveBench and LiveCodeBench regularly refresh their question sets, making it impossible for models to have seen the exact questions during training. Contamination detection benchmarks like WikiMIA and C²LEVA specifically test whether models show telltale signs of having memorized evaluation data. Process-oriented evaluation shifts focus from final answers to reasoning chains, making it harder for models to benefit from memorized answers without understanding the underlying logic.
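A crude but common first-pass contamination check is n-gram overlap: if long word sequences from a benchmark item appear verbatim in the training corpus, the item is suspect. A simplified sketch (real pipelines run over tokenized corpora at scale and normalize punctuation; this toy version just splits on whitespace):

```python
def ngrams(text: str, n: int) -> set:
    # Naive whitespace tokenization; real checks normalize punctuation/casing.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item: str, corpus: str, n: int = 8) -> float:
    """Fraction of the item's n-grams appearing verbatim in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(corpus, n)
    return len(item_grams & corpus_grams) / len(item_grams)

question = "what is the capital of the country directly north of spain"
clean = "an unrelated document about marine biology and coral reefs"
leaky = "forum post what is the capital of the country directly north of spain answer paris"

print(contamination_score(question, clean))  # 0.0
print(contamination_score(question, leaky))  # 1.0
```

A high overlap score does not prove memorization (common phrases recur naturally), but near-verbatim matches on long n-grams are a strong flag that the item leaked into training data.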
The contamination problem has broader implications for how we interpret the rapid pace of benchmark improvement. When a new model claims state-of-the-art performance on established benchmarks, it is increasingly difficult to distinguish genuine capability improvements from better training data coverage. This is why researchers now recommend evaluating models on multiple independent benchmarks rather than relying on any single score, and why fresh, contamination-resistant benchmarks carry disproportionate weight in serious model comparison.
Cultural and Linguistic Bias in Benchmarks
The second major challenge identified by the 2025 survey is the pervasive cultural and linguistic bias embedded in most LLM benchmarks. The vast majority of widely used benchmarks are designed in English, by English-speaking researchers, drawing on Western cultural knowledge and reasoning patterns. This creates systematic unfairness in evaluating models that serve global populations.
Linguistic bias manifests at multiple levels. At the surface level, non-English benchmarks are dramatically underrepresented — despite billions of users interacting with LLMs in Chinese, Spanish, Arabic, Hindi, and dozens of other languages. Even when multilingual benchmarks exist (like XTREME covering 40+ languages), they often translate English-centric content rather than creating culturally authentic evaluation material. This means a model might score well on translated Chinese questions while failing on idioms, cultural references, or reasoning patterns native to Chinese speakers.
Cultural bias runs deeper. Exam-based benchmarks like AGIEval draw from standardized tests in specific educational systems, embedding cultural assumptions about what constitutes important knowledge. A benchmark based on American SAT questions implicitly privileges knowledge of American history, literature, and social norms. Similarly, commonsense reasoning benchmarks often encode Western cultural common sense that may not generalize globally.
Efforts to address these biases include the development of language-specific benchmarks like C-Eval and CMMLU for Chinese, culture-aware evaluation frameworks, and explicitly multilingual benchmarks like BenchMAX and MultiLoKo. However, progress remains uneven, and the field needs fundamental rethinking of how to create truly equitable global evaluation standards. For organizations deploying models internationally, supplementing standard benchmarks with culturally localized evaluation is essential. The CB Insights Tech Trends 2025 report highlights this globalization challenge as a key factor in enterprise AI adoption.
Emerging Trends: Dynamic Evaluation and Process-Based Assessment
The third critical gap identified by the survey — the lack of evaluation for process credibility and dynamic environments — is driving some of the most innovative work in LLM benchmarks research. These emerging approaches aim to evaluate not just what models produce but how they produce it, and to test capabilities in environments that change over time.
Process-based evaluation represents a philosophical shift from outcome metrics to reasoning quality. Traditional benchmarks check whether a model arrives at the correct answer, but a model that guesses correctly through flawed reasoning is arguably less trustworthy than one that reasons well but makes a minor calculation error. Emerging benchmarks evaluate intermediate reasoning steps, logical coherence of chain-of-thought outputs, and whether models can identify and self-correct errors in their reasoning processes.
Dynamic evaluation addresses the fundamental limitation of static benchmarks: the real world changes constantly. Models deployed in production face novel situations, evolving language patterns, and updated factual information that static benchmarks cannot capture. LiveBench updates monthly, RealtimeQA tests knowledge of current events, and FreshQA evaluates whether models can distinguish between outdated and current information. These dynamic approaches provide a much more realistic picture of how models will perform in deployment.
Human-preference evaluation through platforms like Chatbot Arena uses crowd-sourced head-to-head comparisons to rank models based on actual user preferences. This approach captures aspects of model quality — like helpfulness, clarity, and appropriate tone — that automated metrics often miss. The Elo rating system used by Chatbot Arena has become one of the most trusted ranking signals in the industry, complementing traditional benchmark scores with real-world user judgment.
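The Elo mechanics behind arena-style rankings are simple: each head-to-head vote nudges ratings toward the observed outcome, with upsets moving ratings more than expected wins. A sketch of a single update (K=32 is an illustrative choice; Chatbot Arena's actual rating computation differs in its details):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One Elo update after a pairwise vote: winner gains what loser sheds."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models start equal; a win moves each rating by K/2 = 16 points.
print(elo_update(1000, 1000))  # (1016.0, 984.0)

# An upset (lower-rated model beats a 1200-rated one) moves ratings more.
print(elo_update(1000, 1200))  # winner gains ~24.3 points
```

Aggregated over tens of thousands of votes, these small updates converge to a stable ordering, which is why arena ratings are treated as a trustworthy complement to static benchmark scores.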
LLM-as-Judge evaluation is another rapidly growing methodology where powerful language models are used to evaluate the outputs of other models. MT-Bench pioneered this approach by using GPT-4 to score multi-turn conversation quality. While LLM-as-Judge introduces its own biases (evaluator models tend to prefer outputs similar to their own), it enables scalable, nuanced evaluation that would be prohibitively expensive with human annotators alone.
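One well-documented failure mode of LLM-as-Judge is position bias: judges tend to favor whichever answer is shown first. MT-Bench-style pipelines mitigate this by judging each pair twice with the order swapped and counting a win only when both verdicts agree. A sketch with a hypothetical `judge` callable (any function returning 'first' or 'second'; the stubs below stand in for a real evaluator model):

```python
def consistent_winner(judge, answer_a: str, answer_b: str) -> str:
    """Query the judge twice with swapped order; return 'A', 'B', or 'tie'
    ('tie' when the verdicts disagree, i.e. likely position bias)."""
    v1 = judge(answer_a, answer_b)  # A shown first
    v2 = judge(answer_b, answer_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"

always_first = lambda x, y: "first"  # pure position bias: never consistent
prefers_longer = lambda x, y: "first" if len(x) >= len(y) else "second"

print(consistent_winner(always_first, "short", "longer answer"))   # tie
print(consistent_winner(prefers_longer, "short", "longer answer")) # B
```

The swap test filters out verdicts driven purely by presentation order, at the cost of doubling judge queries; it does nothing about content-level biases such as a preference for longer or self-similar answers.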
How to Choose the Right LLM Benchmarks for Your Use Case
With 283+ benchmarks available, selecting the right evaluation suite for your specific needs requires a structured approach. Here is a practical framework based on common use cases:
For general-purpose model selection, start with a core suite: MMLU-Pro for knowledge breadth, GPQA for depth, MT-Bench for conversational quality, HumanEval/MBPP for coding, and Chatbot Arena ratings for overall user preference. Add IFEval for instruction following reliability and TruthfulQA for factual accuracy.
For domain-specific deployment, layer general benchmarks with relevant domain tests. Healthcare applications should include PubMedQA and BioMaze; legal deployments need LegalBench and relevant jurisdiction-specific tests; financial applications require FinEval and FLARE. Always test on data representative of your actual use case in addition to published benchmarks.
For safety-critical applications, prioritize the full safety benchmark suite: ToxiGen, JailbreakBench, HarmBench for adversarial safety; TruthfulQA and FActScore for factual reliability; IFEval for instruction compliance; and relevant bias benchmarks (StereoSet, CrowS-Pairs) for fairness assessment.
For agent/automation deployment, use AgentBench or GAIA for general agent capability, then add environment-specific tests (OSWorld for desktop automation, Mobile-Bench for mobile, SWE-bench for code automation). Always include AgentHarm or SafeAgentBench for agentic safety.
Regardless of your use case, a few principles hold:
- Never rely on a single benchmark — use suites of 5-10 complementary evaluations
- Prioritize fresh benchmarks — dynamic and regularly updated evaluations provide more reliable signals
- Test on your own data — published benchmarks complement but cannot replace evaluation on your actual use case
- Monitor for contamination — suspiciously high scores on popular benchmarks may indicate training data overlap
- Evaluate the evaluation — understand the methodology, limitations, and potential biases of each benchmark you use
The Future of LLM Benchmarks: What Comes Next
The benchmark landscape is evolving rapidly, and several trends will shape evaluation methodology in the coming years. Understanding these trajectories helps organizations prepare for how model evaluation will change and what new capabilities will become measurable.
Multimodal evaluation will become standard as models increasingly handle text, images, audio, video, and code within unified architectures. Benchmarks like MME-CoT already test chain-of-thought reasoning across modalities, and we can expect comprehensive multimodal evaluation suites that test cross-modal reasoning, grounding, and generation in integrated settings.
Agentic evaluation will grow in sophistication as AI agents become more capable and autonomous. Future benchmarks will test longer-horizon planning, more complex tool use, recovery from failures, and collaborative multi-agent scenarios. TheAgentCompany’s corporate simulation approach points toward increasingly realistic evaluation environments that mirror actual deployment contexts.
Personalized and adaptive evaluation will emerge as one-size-fits-all benchmarks become insufficient. Different organizations have different needs, risk tolerances, and user populations. Evaluation frameworks that can be customized to specific deployment contexts — while maintaining standardized comparison capabilities — will become increasingly valuable.
Regulation-driven evaluation will accelerate as governments worldwide implement AI governance frameworks. The EU AI Act, US executive orders on AI safety, and similar regulations in China and elsewhere will mandate specific evaluation requirements, driving demand for standardized, auditable benchmark suites that satisfy compliance requirements.
The fundamental challenge remains: benchmarks must evolve at least as fast as the models they evaluate. As the survey authors note, the “arms race between model advancement and evaluation methodology” shows no signs of slowing. The organizations that invest in understanding and properly applying LLM benchmarks will make better model decisions, deploy safer AI systems, and ultimately create more value from this transformative technology.
Frequently Asked Questions
What are LLM benchmarks and why are they important?
LLM benchmarks are standardized tests designed to evaluate the capabilities of large language models across specific tasks or domains. They are essential because they provide objective, reproducible metrics for comparing different models, identifying specific strengths and weaknesses, guiding research priorities, and ensuring deployed models meet minimum standards for safety, accuracy, and fairness. Without benchmarks, model evaluation would be subjective and inconsistent, making it impossible to track genuine progress in AI capabilities.
What is data contamination in LLM benchmarks and how does it affect results?
Data contamination occurs when a language model has been exposed to benchmark test questions or answers during its training phase, typically because these materials appear in web-crawled training data. This inflates benchmark scores beyond the model’s actual capability on genuinely novel tasks. Contamination is considered one of the most serious threats to benchmark reliability, and researchers combat it through dynamic benchmarks with regularly refreshed questions (like LiveBench), contamination detection tests, and process-based evaluation that examines reasoning quality rather than just final answers.
Which LLM benchmarks should I use to evaluate models for production deployment?
For production deployment, use a suite of 5-10 complementary benchmarks rather than relying on any single score. A strong general-purpose suite includes MMLU-Pro (knowledge), GPQA (expert reasoning), MT-Bench (conversation quality), HumanEval (coding), IFEval (instruction following), TruthfulQA (factual accuracy), and Chatbot Arena rankings (user preference). Supplement these with domain-specific benchmarks relevant to your use case and always validate on your own representative data. Prioritize dynamic benchmarks with anti-contamination measures for the most reliable results.
How do safety benchmarks evaluate whether an LLM is safe to deploy?
Safety benchmarks test multiple dimensions of model behavior. Toxicity benchmarks (ToxiGen, HarmBench) check whether models generate harmful content including hate speech and dangerous instructions. Jailbreak benchmarks (JailbreakBench) test resistance to adversarial prompts that try to bypass safety training. Bias benchmarks (StereoSet, CrowS-Pairs) measure whether outputs reinforce social stereotypes. Hallucination benchmarks (TruthfulQA, FActScore) test factual accuracy. Robustness benchmarks (AdvGLUE, IFEval) evaluate reliability under adversarial inputs. A comprehensive safety evaluation should cover all these dimensions, particularly for high-stakes applications.
What is the difference between MMLU and MMLU-Pro benchmarks?
MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects with 4-option multiple-choice questions, focusing primarily on factual recall. MMLU-Pro is an enhanced version that addresses several MMLU limitations: it expands answer options from 4 to 10 (reducing random guess accuracy from 25% to 10%), includes more reasoning-intensive questions that require chain-of-thought processing, and filters out noisy or ambiguous questions. MMLU-Pro is generally considered a more reliable discriminator between model capabilities, especially as top models approach ceiling performance on original MMLU.