Multimodal LLM Hallucination Survey 2025
Table of Contents
- Multimodal LLM Hallucination Overview
- Types of Hallucination in Vision-Language Models
- Data-Level Causes of MLLM Hallucination
- Model Architecture and Training Causes
- Inference-Stage Hallucination Factors
- Evaluation Benchmarks for Multimodal LLM Hallucination
- Metrics for Measuring Visual Hallucination
- Training-Based Mitigation Strategies
- Inference-Time Hallucination Reduction
- Real-World Implications and Applications
- Future Research Directions
- Frequently Asked Questions
🔑 Key Takeaways
- Three categories of object hallucination — category (nonexistent objects), attribute (wrong colors, shapes), and relation (incorrect spatial or interaction descriptions)
- Four-stage cause taxonomy — hallucinations originate from data, model architecture, training procedures, and inference strategies
- Language prior dominance — MLLMs often rely on learned language patterns over actual visual evidence, producing plausible but incorrect descriptions
- Multiple evaluation benchmarks — POPE, CHAIR, MMHal-Bench, and HaELM provide standardized measurement of different hallucination types
- RLHF and contrastive decoding — most promising mitigation approaches, reducing hallucination while maintaining model fluency and usefulness
- Cross-modal inconsistency — the core challenge unique to multimodal systems, requiring solutions that go beyond text-only LLM hallucination research
Multimodal LLM Hallucination: A Comprehensive Overview
The phenomenon of multimodal LLM hallucination represents one of the most critical challenges facing the deployment of large vision-language models in real-world applications. This comprehensive survey, authored by researchers from the National University of Singapore and AWS Shanghai AI Lab, provides a systematic analysis of how multimodal large language models (MLLMs) generate outputs that are inconsistent with the visual content they process.
Despite remarkable advances in tasks like image captioning, visual question answering, and multimodal reasoning, MLLMs exhibit a concerning tendency to produce seemingly plausible yet factually incorrect descriptions. A model might describe objects not present in an image, assign wrong colors or sizes to visible objects, or fabricate spatial relationships that contradict the actual visual scene. These hallucinations pose substantial obstacles to practical deployment in healthcare, autonomous driving, accessibility tools, and other high-stakes applications.
The survey distinguishes multimodal hallucination from text-only LLM hallucination, emphasizing that the cross-modal inconsistency between generated text and provided visual content creates unique challenges that cannot be addressed by simply transferring solutions from the NLP community. This distinction is crucial for understanding why even frontier systems with impressive language capabilities continue to struggle with visual grounding.
Types of Hallucination in Vision-Language Models
The survey categorizes object hallucination in MLLMs into three primary types. Category hallucination involves identifying nonexistent or incorrect object categories in a given image — for example, describing a cat in a scene that contains only dogs. Attribute hallucination concerns incorrect descriptions of object properties such as color, shape, material, or size — claiming a red car is blue. Relation hallucination covers inaccurate descriptions of relationships among objects, including human-object interactions and relative spatial positions.
Each hallucination type presents different challenges for detection and mitigation. Category hallucinations are often the most straightforward to identify through object detection cross-validation, while attribute and relation hallucinations require more nuanced evaluation that considers the full visual context. The survey notes that some literature treats object counting and event description as independent categories, but this work subsumes them under the attribute category for analytical consistency.
Understanding these hallucination types is essential for developing targeted solutions. A model that excels at object identification but struggles with attribute accuracy may require different interventions than one that correctly identifies objects and attributes but fabricates spatial relationships. This granular classification enables researchers and practitioners to diagnose specific weaknesses in their multimodal systems and select appropriate mitigation strategies.
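As a concrete toy illustration, the three-way taxonomy can be expressed as a comparison between a parsed model description and ground-truth image annotations. The data structures and function below are hypothetical, not from the survey:

```python
# Hypothetical sketch: tag the three hallucination types by comparing a
# parsed model description against ground-truth annotations.

def classify_hallucinations(described, truth):
    """described/truth: dicts with 'objects' (set), 'attributes'
    ({object: set of properties}), and 'relations' (set of triples)."""
    errors = {"category": set(), "attribute": set(), "relation": set()}

    # Category: mentioned objects that do not exist in the image.
    errors["category"] = described["objects"] - truth["objects"]

    # Attribute: wrong properties on correctly identified objects.
    for obj in described["objects"] & truth["objects"]:
        wrong = described["attributes"].get(obj, set()) - truth["attributes"].get(obj, set())
        if wrong:
            errors["attribute"].add((obj, frozenset(wrong)))

    # Relation: described relationships absent from the ground truth.
    errors["relation"] = described["relations"] - truth["relations"]
    return errors

truth = {"objects": {"dog", "sofa"},
         "attributes": {"dog": {"brown"}, "sofa": {"gray"}},
         "relations": {("dog", "on", "sofa")}}
described = {"objects": {"cat", "dog", "sofa"},
             "attributes": {"dog": {"black"}, "sofa": {"gray"}},
             "relations": {("dog", "on", "sofa"), ("cat", "under", "sofa")}}

errs = classify_hallucinations(described, truth)
# "cat" is a category error; ("cat", "under", "sofa") is a relation error;
# the dog's color is an attribute error.
```

In practice the hard part is the parsing step this sketch assumes away: extracting reliable object, attribute, and relation mentions from free-form text.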
Data-Level Causes of MLLM Hallucination
The survey identifies training data quality as a fundamental source of multimodal hallucination. Noisy image-text pairs in large-scale pretraining datasets can teach models incorrect associations between visual concepts and textual descriptions. When training data contains captions that inaccurately describe images — a common issue in web-crawled datasets — models learn to reproduce these inaccuracies with high confidence.
Data bias also plays a significant role. If certain objects frequently co-occur in training data (e.g., boats and water), models develop strong statistical priors that override visual evidence. When shown a boat on a trailer in a parking lot, a model may hallucinate water in the scene because its training data strongly associates boats with aquatic environments. These co-occurrence biases are particularly insidious because they produce outputs that are statistically likely but visually incorrect.
The long-tail distribution of visual concepts in training data creates additional challenges. Rare objects or unusual configurations are poorly represented, leading models to “normalize” uncommon scenes by hallucinating more typical elements. This data imbalance means that multimodal LLMs perform differently across visual domains, with higher hallucination rates for uncommon scenarios — a finding with significant implications for safety-critical applications where edge cases matter most.
Model Architecture and Training Causes of Hallucination
At the model architecture level, the survey identifies insufficient visual grounding as a key driver of hallucination. Many MLLMs use frozen or partially-frozen visual encoders connected to language models via relatively simple projection layers. This architectural choice can create an information bottleneck where fine-grained visual details are lost during the cross-modal translation, forcing the language model to rely on its priors to fill gaps in visual understanding.
During training, language prior dominance emerges as perhaps the most fundamental cause of multimodal hallucination. Language models bring powerful priors about typical object co-occurrences, common descriptions, and probable scenarios. When these language priors conflict with actual visual evidence, the model may default to generating text consistent with its language knowledge rather than the provided image. This is analogous to a human expert who relies on general knowledge rather than carefully examining the specific evidence at hand.
The training objective itself can contribute to hallucination. Models trained primarily with cross-entropy loss on next-token prediction learn to maximize the probability of generating common, fluent text sequences — which may not align with the goal of generating accurate, grounded descriptions. This tension between fluency and accuracy is a fundamental challenge for training-based approaches to hallucination mitigation.
Inference-Stage Hallucination Factors
During inference, decoding strategies significantly impact hallucination rates. Greedy decoding, nucleus sampling, and beam search each introduce different biases that can amplify or suppress hallucinations. Temperature settings affect the tradeoff between creativity and accuracy — higher temperatures produce more diverse but potentially more hallucinated outputs, while lower temperatures reduce variety but can also reduce errors.
The autoregressive nature of text generation creates a snowball effect where early hallucinations in a response can compound as the model conditions subsequent tokens on already-incorrect context. Once a model begins describing a nonexistent object, it may continue elaborating on that object’s attributes and relationships, creating increasingly detailed but entirely fabricated descriptions. This cascading hallucination pattern is particularly problematic for long-form visual descriptions.
Prompt engineering and in-context learning also influence hallucination rates. Carefully crafted prompts that emphasize accuracy and visual grounding can reduce hallucinations, while prompts that encourage verbose or creative descriptions may increase them. The survey notes that the relationship between prompt design and hallucination is an active area of research with significant practical implications for deploying MLLMs in production environments.
Evaluation Benchmarks for Multimodal LLM Hallucination
The survey provides a comprehensive overview of hallucination evaluation benchmarks. POPE (Polling-based Object Probing Evaluation) is one of the most widely used, testing whether models correctly identify the presence or absence of objects through yes/no questions. CHAIR (Caption Hallucination Assessment with Image Relevance) evaluates free-form captions by computing the proportion of mentioned objects that are not actually present in the image.
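Simplified versions of both metrics can be sketched as follows; the function names and example data are mine, and real implementations handle synonym matching and per-instance counting that this sketch omits:

```python
# CHAIR-style score: fraction of mentioned objects that are hallucinated
# (lower is better). POPE-style score: accuracy on yes/no existence probes.

def chair_score(mentioned, present):
    """Fraction of mentioned objects absent from the image."""
    if not mentioned:
        return 0.0
    return len(set(mentioned) - set(present)) / len(set(mentioned))

def pope_accuracy(answers, labels):
    """answers/labels: parallel lists of 'yes'/'no' responses."""
    return sum(a == l for a, l in zip(answers, labels)) / len(labels)

print(chair_score(["dog", "frisbee", "tree"], ["dog", "grass"]))   # ~0.667
print(pope_accuracy(["yes", "no", "yes", "yes"],
                    ["yes", "no", "no", "yes"]))                   # 0.75
```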
MMHal-Bench extends evaluation to more complex scenarios, assessing hallucination across multiple visual understanding tasks including object identification, attribute recognition, and spatial reasoning. HaELM (Hallucination Evaluation based on Large Models) uses a separate LLM to evaluate the consistency between generated descriptions and visual content, providing scalable assessment that doesn’t require manual annotation.
The proliferation of benchmarks reflects the multifaceted nature of hallucination — no single metric captures all aspects of visual-textual inconsistency. Researchers must use multiple complementary benchmarks to get a complete picture of a model’s hallucination profile. The survey argues for standardized evaluation protocols that would enable more meaningful comparisons across studies, an area where the McKinsey State of AI 2025 Report also identifies significant gaps in AI evaluation methodology.
Metrics for Measuring Visual Hallucination
Beyond benchmarks, the survey catalogs specific metrics for quantifying hallucination. Object hallucination rates measure the percentage of generated object mentions that are not present in the reference image. Attribute accuracy metrics evaluate the correctness of properties assigned to correctly identified objects. Relation precision assesses whether described spatial, functional, or interactive relationships between objects are accurate.
Composite metrics attempt to capture the overall hallucination profile of a model across all three dimensions simultaneously. These aggregate scores are useful for quick comparisons but can mask specific weaknesses — a model with excellent object identification but poor attribute accuracy might receive the same composite score as one with the opposite profile. The survey recommends reporting disaggregated metrics alongside composite scores for more informative evaluation.
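The masking effect can be illustrated with a toy composite score; the error rates below are invented for the example:

```python
# Toy composite metric: two models with mirrored error profiles receive
# identical averages, hiding which dimension each one fails on.

def composite(category_err, attribute_err, relation_err):
    """Simple average across the three hallucination dimensions."""
    return (category_err + attribute_err + relation_err) / 3

score_a = composite(0.05, 0.35, 0.20)  # strong objects, weak attributes
score_b = composite(0.35, 0.05, 0.20)  # the opposite profile
print(abs(score_a - score_b) < 1e-9)   # True: identical composites
```

Reporting the three rates separately, as the survey recommends, makes the difference between the two profiles visible again.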
A significant challenge in hallucination measurement is establishing ground truth. Human annotations are expensive, subjective, and may themselves contain errors. Automated evaluation using object detectors or other vision models introduces circular dependencies when the evaluation tools share similar biases with the models being evaluated. The survey identifies this evaluation reliability challenge as a fundamental limitation that affects the entire field of multimodal AI research.
Training-Based Multimodal Hallucination Mitigation
The survey reviews multiple training-based mitigation strategies. Data curation approaches focus on improving the quality and accuracy of training datasets by filtering noisy image-text pairs, augmenting underrepresented visual concepts, and balancing co-occurrence statistics. While fundamental, data improvements alone are insufficient because hallucination also arises from architectural and training objective factors.
Reinforcement Learning from Human Feedback (RLHF) has emerged as one of the most effective approaches, training models to prefer accurate over hallucinated descriptions through human preference signals. By incorporating human judgments about visual accuracy into the training loop, RLHF can significantly reduce hallucination rates while maintaining the model’s fluency and usefulness for downstream tasks.
Specialized training objectives that explicitly penalize hallucination show promise. These include contrastive learning approaches that encourage models to distinguish between accurate and hallucinated descriptions, as well as grounding losses that tie generated text tokens to specific image regions. Models like DeepSeek R1 demonstrate how reinforcement learning can improve reasoning fidelity, and similar principles are being adapted for visual grounding in multimodal systems.
Inference-Time Hallucination Reduction Strategies
For deployed models, inference-time mitigation offers the advantage of not requiring retraining. Contrastive decoding compares the output distributions of visually-conditioned and unconditioned models, suppressing tokens that are equally likely with and without the image — effectively filtering out text generated from language priors alone. This technique has demonstrated significant hallucination reduction across multiple benchmarks.
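A minimal sketch of this idea, assuming a (1 + alpha) * conditioned - alpha * unconditioned logit adjustment of the kind used in visual-contrastive-style decoding; the logit values and parameter names are illustrative:

```python
# Sketch of contrastive logit adjustment: boost tokens whose probability
# depends on the image, penalize tokens the language prior alone produces.

def contrastive_logits(cond_logits, uncond_logits, alpha=1.0):
    """cond_logits: with the image; uncond_logits: text-only prior."""
    return [(1 + alpha) * c - alpha * u
            for c, u in zip(cond_logits, uncond_logits)]

# Toy example: token 0 ("water") is likely under the language prior alone;
# token 1 ("trailer") is supported mainly by the image.
cond = [2.0, 2.5]    # logits with the image
uncond = [2.0, 0.5]  # logits without the image
adj = contrastive_logits(cond, uncond, alpha=1.0)
# adj = [2.0, 4.5] -> the image-supported token is now strongly preferred
```

Published variants also restrict the adjustment to tokens above a plausibility threshold under the conditioned distribution, so that rare but valid tokens are not over-amplified.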
Post-hoc verification systems cross-check generated descriptions against visual content using separate object detection or visual question answering models. When inconsistencies are detected, the system can either flag the response, regenerate specific portions, or apply targeted corrections. These verification pipelines add computational overhead but provide an important safety layer for production deployments.
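One way such a pipeline might be wired, assuming a hypothetical detector interface that returns labeled, scored detections:

```python
# Sketch of a post-hoc verification pass (the detector output format is a
# hypothetical assumption): flag object mentions the detector cannot confirm.

def verify_description(mentioned_objects, detections, conf_threshold=0.5):
    """detections: list of {'label': str, 'score': float} dicts."""
    confirmed = {d["label"] for d in detections if d["score"] >= conf_threshold}
    flagged = [obj for obj in mentioned_objects if obj not in confirmed]
    return {"confirmed": sorted(set(mentioned_objects) & confirmed),
            "flagged": flagged}

detections = [{"label": "dog", "score": 0.92},
              {"label": "sofa", "score": 0.81},
              {"label": "cat", "score": 0.12}]  # low-confidence "cat"
result = verify_description(["dog", "cat", "sofa"], detections)
# result["flagged"] == ["cat"] -> regenerate or correct this mention
```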
Retrieval-augmented generation (RAG) adapted for multimodal settings shows emerging promise. By retrieving similar images and their verified descriptions from a curated database, models can anchor their outputs to established visual-textual correspondences rather than relying solely on parametric knowledge. This approach combines the generalization capability of large models with the accuracy of curated reference data, as documented in research from Show Lab’s comprehensive resource collection.
Real-World Implications and Applications
The practical implications of multimodal LLM hallucination extend across numerous industries. In healthcare, a model that hallucinates findings in medical images could lead to misdiagnosis. In autonomous driving, incorrect identification of road objects could cause safety failures. In accessibility, blind users relying on AI image descriptions receive misinformation when models hallucinate scene elements.
The survey highlights that hallucination risk varies significantly by application domain. Models trained on natural images may exhibit higher hallucination rates when applied to domain-specific visual content such as medical imaging, satellite imagery, or industrial inspection. This domain gap underscores the importance of domain-specific evaluation before deploying multimodal systems in specialized environments.
For the AI industry more broadly, hallucination represents a trust barrier. Users who experience hallucinated outputs lose confidence in AI systems, potentially rejecting useful tools due to reliability concerns. Addressing hallucination is therefore not just a technical challenge but a prerequisite for broader adoption of multimodal AI across industries — a theme reinforced by the NVIDIA annual report’s discussion of AI deployment challenges.
Future Research Directions in MLLM Hallucination
The survey identifies several open research questions that define the frontier of multimodal hallucination research. First, understanding the mechanistic causes of hallucination at the neuron and layer level could enable more targeted interventions than current black-box approaches. Interpretability research applied specifically to cross-modal processing could reveal how and where visual information is lost or distorted during text generation.
Second, developing unified evaluation frameworks that capture the full spectrum of hallucination types, severity levels, and downstream impacts remains an open challenge. Current benchmarks measure whether hallucination occurs but not how harmful it is — a subtle attribute hallucination (wrong shade of blue) is counted the same as a dangerous category hallucination (pedestrian not detected). Severity-weighted metrics would provide more actionable evaluation signals.
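A severity-weighted score might look like the following sketch; the weight values are illustrative assumptions, not proposals from the survey:

```python
# Sketch of a severity-weighted hallucination score: each error contributes
# according to its potential downstream harm rather than counting equally.
# The weights below are invented for illustration.

SEVERITY = {"attribute_minor": 0.2, "attribute_major": 0.6,
            "category": 1.0, "safety_critical": 5.0}

def severity_weighted_score(errors):
    """errors: list of error-type strings; unknown types default to 1.0."""
    return sum(SEVERITY.get(e, 1.0) for e in errors)

flat_count = len(["attribute_minor", "safety_critical"])  # 2 errors either way
weighted = severity_weighted_score(["attribute_minor", "safety_critical"])
# A flat count treats both errors equally; the weighted score exposes
# that one of them carries most of the risk.
```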
Third, extending hallucination research beyond static images to video, 3D scenes, and multimodal conversations opens new frontiers. As MLLMs evolve to process longer, more complex multimodal inputs, the surface area for hallucination expands dramatically. Research at groups like Show Lab at NUS and similar institutions is pushing these boundaries with increasingly sophisticated evaluation and mitigation frameworks, pointing toward more reliable multimodal AI systems for the future.
Frequently Asked Questions
What is hallucination in multimodal LLMs?
Hallucination in multimodal LLMs refers to the generation of text outputs that are inconsistent with the visual content provided as input. This includes describing objects not present in images, incorrect attributes like wrong colors or shapes, and fabricated relationships between objects.
What are the main causes of multimodal LLM hallucination?
The main causes span four stages: data issues (biased or noisy training data), model architecture limitations (insufficient visual grounding), training problems (language prior dominance over visual signals), and inference challenges (decoding strategies that amplify errors).
How is hallucination measured in vision-language models?
Hallucination is measured using specialized benchmarks like POPE, CHAIR, MMHal-Bench, and HaELM, which evaluate object existence accuracy, attribute correctness, and relationship fidelity between generated text and actual image content.
What strategies can mitigate multimodal LLM hallucination?
Mitigation strategies include improved training data curation, reinforcement learning from human feedback (RLHF), contrastive decoding, visual grounding enhancements, retrieval-augmented generation, and post-hoc verification systems that cross-check outputs against visual inputs.
Why is multimodal hallucination different from text-only LLM hallucination?
Multimodal hallucination involves cross-modal inconsistency between generated text and provided visual content, which is a unique challenge not present in text-only LLMs. Solutions from the NLP community cannot be directly transferred because the visual grounding component introduces additional complexity.