Agentic Multimodal LLMs: The Complete Survey on Autonomous AI Systems

🔑 Key Takeaways from the Agentic Multimodal LLM Survey

  • Paradigm shift from passive to agentic AI — models evolving from query-response to autonomous decision-making
  • Three-dimensional framework — internal intelligence, external tool use, and environment interaction define agentic MLLMs
  • Reinforcement learning is the key enabler — RL drives the transition from static workflows to dynamic, adaptive behavior
  • Reasoning-enhanced MLLMs are bridging the gap between perception and long-horizon planning
  • Tool invocation goes beyond search — code execution, visual processing, and API calls extend model capabilities
  • Physical embodiment emerging — agentic MLLMs are being deployed in robotics and real-world environments
  • Open-source training frameworks are accelerating community progress toward agentic AI

Survey Overview: The Shift from Passive to Agentic Multimodal AI

The landscape of artificial intelligence is undergoing a fundamental transformation. Published by researchers at Nanyang Technological University and collaborating institutions, this comprehensive survey documents the emergence of Agentic Multimodal Large Language Models (Agentic MLLMs)—AI systems that don’t just respond to queries but autonomously plan, reason, use tools, and interact with their environment to achieve complex goals.

Traditional multimodal LLMs, despite their remarkable capabilities in perceiving, understanding, and generating across diverse modalities, operate under a fundamentally limited query-response paradigm: a user provides input, the model produces output, and the interaction ends. This paradigm falls short on complex, dynamic real-world tasks that require sustained goal-directed behavior, adaptive planning, and real-time decision-making.

The evolution from passive MLLMs to agentic systems represents three key advances: First, agentic MLLMs dynamically adjust strategies based on previous planning and current state rather than following static procedures. Second, they proactively initiate plans, invoke tools when needed, and reflect on intermediate outcomes. Third, they operate across diverse tasks and environments rather than being restricted to narrow domain-specific applications. For foundational context on how reasoning-enhanced LLMs enable agentic behavior, see our analysis of the DeepSeek R1 Reinforcement Learning LLM.

The Three-Dimensional Agentic MLLM Framework

The survey establishes a conceptual framework that organizes agentic MLLMs along three fundamental dimensions, providing a structured way to understand and evaluate autonomous AI systems. This framework distinguishes agentic MLLMs from both traditional models and conventional agent architectures.

The first dimension is Agentic Internal Intelligence, which functions as the system’s commander. This encompasses the model’s ability to perform accurate long-horizon planning through reasoning, reflection, and memory. Internal intelligence enables the system to decompose complex goals into achievable sub-tasks, evaluate the outcomes of its actions, and maintain context across extended interaction sequences.

The second dimension is Agentic External Tool Invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond intrinsic knowledge. This includes information searching, code execution, visual processing tools, and API calls to external services. The key distinction from traditional tool use is proactivity—agentic systems decide when and which tools to use rather than following prescribed tool-use patterns.

The third dimension is Agentic Environment Interaction, which situates models within virtual or physical environments. This allows them to take actions, observe consequences, adapt strategies, and sustain goal-directed behavior in dynamic real-world scenarios. Environment interaction bridges the gap between abstract reasoning and practical capability, enabling applications in robotics, web navigation, and embodied AI.
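To make the three dimensions concrete, here is a minimal sketch of how they compose into a single control loop. All class, method, and environment names below are invented for illustration; the survey itself does not prescribe an implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AgenticMLLM:
    memory: list = field(default_factory=list)   # internal intelligence: reflection/memory
    tools: dict = field(default_factory=dict)    # external tool invocation (unused in this toy)

    def plan(self, goal):
        # Internal intelligence: decompose the goal into sub-tasks.
        return [f"step {i}: {part}" for i, part in enumerate(goal.split(", "), 1)]

    def act(self, step, env):
        # Environment interaction: apply the step, observe the consequence.
        observation = env.execute(step)
        self.memory.append((step, observation))  # record outcome for later reflection
        return observation

class EchoEnv:
    """Toy environment that simply acknowledges each action."""
    def execute(self, step):
        return f"done: {step}"

agent = AgenticMLLM()
env = EchoEnv()
results = [agent.act(s, env) for s in agent.plan("perceive scene, pick object")]
```

The point of the sketch is the division of labor: planning and memory live inside the model, while actions pass through an environment whose observations feed back into memory.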

Agentic Reasoning and Long-Horizon Planning

Reasoning capability is the cornerstone of agentic behavior. The survey details how recent advances in reasoning-enhanced MLLMs have transformed the ability of AI systems to engage in multi-step, long-horizon planning. Unlike simple chain-of-thought prompting, agentic reasoning involves dynamic strategy adjustment, error recovery, and the ability to maintain complex goal hierarchies.

The survey identifies several key reasoning paradigms that enable agentic behavior. Chain-of-thought and tree-of-thought approaches provide structured reasoning paths, while self-consistency methods enable models to evaluate multiple reasoning chains and select the most reliable conclusions. More advanced approaches involve planning algorithms that decompose complex tasks into manageable sub-goals, with the ability to backtrack and re-plan when intermediate steps fail.
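Self-consistency, one of the paradigms above, can be sketched in a few lines: sample several independent reasoning chains and keep the answer the majority agrees on. The `noisy_reasoner` below is a stand-in for a real model call, invented purely for illustration.

```python
from collections import Counter
import random

def noisy_reasoner(question, rng):
    # Stand-in for sampling one reasoning chain: right answer ~70% of the time.
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 99))

def self_consistency(question, n_samples=25, seed=0):
    # Sample multiple chains, then majority-vote over their final answers.
    rng = random.Random(seed)
    answers = [noisy_reasoner(question, rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```

Because individual chains err in scattered ways while correct chains converge, the vote is far more reliable than any single sample.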

A critical innovation is the integration of reasoning with visual perception. Agentic MLLMs can reason about what they see, plan actions based on visual understanding, and anticipate the visual consequences of their actions. This visual reasoning capability is essential for applications like robotic manipulation, autonomous navigation, and interactive web interfaces where the system must understand and act within a visual environment. For more on how advanced reasoning capabilities are being implemented in cutting-edge models, explore the Gemini 2.5 Technical Report.


Reflection and Memory: Learning from Experience

Reflection—the ability to evaluate one’s own outputs and adjust behavior accordingly—is a defining characteristic of agentic systems. The survey documents how reflection mechanisms enable MLLMs to identify errors in their reasoning, correct intermediate steps, and improve performance over time without explicit retraining.

Memory systems in agentic MLLMs operate at multiple levels. Short-term working memory maintains the context of current tasks and intermediate results. Long-term memory stores learned patterns, successful strategies, and domain knowledge that can be retrieved and applied across different tasks. Episodic memory records specific interaction sequences, enabling the system to learn from past successes and failures.

The combination of reflection and memory creates a feedback loop that mimics human learning processes. When an agentic system encounters a familiar problem, it can retrieve relevant past experiences and apply learned strategies. When it encounters novel challenges, it can reason about the problem, attempt solutions, reflect on outcomes, and store the results for future reference. This capability is what truly distinguishes agentic systems from static models.
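The reflect-and-store loop just described can be sketched as a tiny episodic memory consulted before re-solving. The `solve` and `reflect` functions are toy stand-ins (string reversal and a round-trip check), not the survey's mechanisms.

```python
episodic_memory = {}

def solve(problem):
    # Stand-in for actual reasoning: reverse the string.
    return problem[::-1]

def reflect(problem, answer):
    # Stand-in self-check: accept the answer only if it round-trips.
    return answer[::-1] == problem

def handle(problem):
    if problem in episodic_memory:            # familiar problem: reuse experience
        return episodic_memory[problem], "recalled"
    answer = solve(problem)                   # novel problem: attempt a solution
    if reflect(problem, answer):              # reflect on the outcome
        episodic_memory[problem] = answer     # store the result for future reference
    return answer, "solved"

print(handle("abc"))   # first encounter: solved and stored
print(handle("abc"))   # second encounter: recalled from memory
```

The second call never re-solves the problem, which is the behavioral signature that separates an experience-accumulating agent from a static model.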

External Tool Invocation: Extending AI Capabilities

Tool use represents one of the most practically important capabilities of agentic MLLMs. The survey categorizes external tool invocation into several distinct categories, each extending the model’s capabilities in different ways. Information searching tools enable the system to access up-to-date information beyond its training data, addressing the fundamental knowledge cutoff limitation of pretrained models.

Code execution tools allow agentic systems to write and run programs, performing precise calculations, data analysis, and complex algorithmic operations that would be unreliable through pure language-based reasoning. Visual processing tools enable image editing, generation, and analysis through specialized modules, extending the system’s visual capabilities beyond its core model architecture.

The key innovation in agentic tool use is the shift from prescribed tool chains to autonomous tool selection. Traditional agent architectures typically define when and which tools to use as part of a handcrafted workflow. Agentic systems, in contrast, learn to recognize when their intrinsic capabilities are insufficient and autonomously select appropriate tools from their available toolkit. This proactive tool use dramatically expands the range of problems that AI systems can solve effectively.
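A minimal sketch of that shift: instead of a fixed tool chain, a router inspects the query and decides whether intrinsic capability suffices or a tool is needed. The routing rule here is a toy heuristic standing in for a learned policy, and the tool registry is invented for illustration.

```python
TOOLS = {
    # eval with empty builtins: a toy, sandboxed arithmetic "code execution" tool
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "search": lambda q: f"[results for: {q}]",
}

def answer(query):
    if any(op in query for op in "+-*/"):     # arithmetic: defer to code execution
        return TOOLS["calculator"](query)
    if query.endswith("?"):                   # open question: defer to search
        return TOOLS["search"](query)
    return f"(answered from model knowledge: {query})"

print(answer("12*(3+4)"))
```

An agentic system learns this routing from feedback rather than hard-coding it, but the control flow is the same: recognize insufficiency, pick a tool, fold the result back into the response.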

Environment Interaction: Virtual and Physical Embodiment

The survey’s third dimension—environment interaction—represents the frontier of agentic AI research. Virtual embodiment places agentic MLLMs in digital environments like web browsers, software applications, and simulated worlds where they can take actions and observe consequences. Physical embodiment connects these models to robotic systems that interact with the real world.

In virtual environments, agentic MLLMs have demonstrated impressive capabilities in web navigation, software testing, and game playing. The system perceives the environment through screenshots or DOM representations, reasons about the current state, plans a sequence of actions, and executes them through API calls or simulated inputs. This capability has immediate applications in test automation, data extraction, and workflow automation.
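The perceive-reason-act cycle above can be sketched against a simulated page. The DOM dictionary and action vocabulary below are invented for illustration; real systems operate on screenshots or full DOM trees.

```python
dom = {"button#submit": "Submit", "input#name": ""}

def perceive(dom):
    # Observation: the element ids currently visible on the "page".
    return sorted(dom)

def plan(goal, elements):
    # Reason about the state and produce an action sequence.
    steps = []
    if "input#name" in elements:
        steps.append(("type", "input#name", goal))
    if "button#submit" in elements:
        steps.append(("click", "button#submit", None))
    return steps

def execute(dom, steps):
    # Act on the environment and mutate its state.
    for action, target, value in steps:
        if action == "type":
            dom[target] = value
        elif action == "click":
            dom["submitted"] = dom.get("input#name", "")
    return dom

final = execute(dom, plan("Ada", perceive(dom)))
```

Each cycle of perceive → plan → execute would normally repeat, with the updated DOM re-perceived before the next plan, which is where adaptive re-planning enters.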

Physical embodiment presents greater challenges but also greater potential impact. Agentic MLLMs connected to robotic systems must deal with continuous action spaces, physical constraints, safety requirements, and the imprecision inherent in real-world manipulation. Despite these challenges, the survey documents significant progress in areas like robotic grasping, navigation, and multi-step manipulation tasks driven by natural language instructions. The intersection of multimodal AI and robotics represents one of the most promising paths toward generally capable AI systems, as explored in the McKinsey State of AI 2025 Report.


Training Frameworks and Reinforcement Learning

The survey provides extensive coverage of training methodologies for developing agentic capabilities. Reinforcement learning (RL) emerges as the primary driver of the transition from passive to agentic behavior. Through RL, models learn to take actions that maximize long-term rewards rather than simply generating the most likely next token, fundamentally changing the optimization objective from prediction to decision-making.

Several open-source training frameworks have been developed to facilitate agentic MLLM research. These frameworks provide standardized environments, reward functions, and training pipelines that enable researchers to train and evaluate agentic systems consistently. The availability of these resources is accelerating community progress and enabling smaller research groups to contribute to the field.

The survey also covers hybrid training approaches that combine supervised fine-tuning with reinforcement learning. Initial supervised training on demonstration data provides a strong behavioral foundation, while subsequent RL training refines decision-making strategies through environmental feedback. This approach balances the sample efficiency of supervised learning with the optimization power of reinforcement learning, producing systems that are both capable and reliable. For insights into how reinforcement learning is specifically applied in advanced language models, refer to DeepSeek-R1’s technical paper.
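The change in optimization objective—from likelihood to reward—can be seen in even the smallest REINFORCE-style example. This is a toy two-action bandit, not the survey's training pipeline; the policy, rewards, and learning rate are all assumptions for illustration.

```python
import math, random

def policy_prob(theta):
    # Bernoulli policy: P(action = 1) via a sigmoid over one parameter.
    return 1 / (1 + math.exp(-theta))

def train(rewards=(0.0, 1.0), steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p = policy_prob(theta)
        a = 1 if rng.random() < p else 0
        r = rewards[a]
        # REINFORCE update: grad of log pi(a) for a Bernoulli policy is (a - p).
        theta += lr * r * (a - p)
        # (A baseline would reduce variance; omitted for brevity.)
    return policy_prob(theta)

print(round(train(), 3))
```

Nothing here resembles next-token prediction: the parameter moves only in proportion to reward, so the policy concentrates on the higher-reward action. Hybrid pipelines bolt this kind of update on top of a supervised starting point.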

Evaluation Benchmarks and Datasets for Agentic AI

Evaluating agentic systems presents unique challenges compared to traditional model evaluation. The survey compiles a comprehensive overview of training and evaluation datasets specifically designed for agentic MLLM development. These benchmarks test not just perception or generation quality but the complete loop of perception, reasoning, planning, action, and reflection.

Key evaluation dimensions include task completion rate across multi-step problems, efficiency of tool selection and use, quality of reasoning traces, robustness to environmental changes, and the ability to recover from errors. Unlike traditional benchmarks that measure isolated capabilities, agentic benchmarks assess integrated performance across the full decision-making pipeline.
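Aggregating those dimensions from episode logs is straightforward; the log schema below (`completed`, `steps`, `tool_calls`) is invented for illustration, as benchmarks each define their own.

```python
episodes = [
    {"completed": True,  "steps": 4, "tool_calls": 2},
    {"completed": False, "steps": 7, "tool_calls": 5},
    {"completed": True,  "steps": 3, "tool_calls": 1},
]

def summarize(episodes):
    # Roll per-episode logs up into benchmark-level metrics.
    n = len(episodes)
    return {
        "completion_rate": sum(e["completed"] for e in episodes) / n,
        "avg_steps": sum(e["steps"] for e in episodes) / n,
        "tool_calls_per_episode": sum(e["tool_calls"] for e in episodes) / n,
    }

print(summarize(episodes))
```

The harder part, as the survey notes, is not the arithmetic but defining episodes and success criteria that generalize across domains.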

The survey notes that benchmark design for agentic systems is still evolving. Current benchmarks tend to focus on specific domains (web navigation, coding, or specific robotic tasks) rather than general agentic capability. The development of comprehensive, domain-agnostic evaluation frameworks remains an important open challenge for the research community, as noted by the NIST AI evaluation framework.

Downstream Applications of Agentic Multimodal LLMs

The practical applications of agentic MLLMs span a remarkable range of domains. In software engineering, agentic systems can autonomously debug code, implement features based on natural language specifications, and navigate complex development environments. In scientific research, they can design experiments, analyze results, and iteratively refine hypotheses.

Healthcare applications include clinical decision support systems that can review medical images, cross-reference patient records, query medical literature, and suggest diagnostic and treatment options. These systems go beyond simple classification by maintaining context across patient histories and proactively seeking additional information when uncertainty is high.

In education, agentic MLLMs can serve as personalized tutoring systems that adapt their teaching strategies based on student performance, proactively generate practice problems, and provide detailed explanations tailored to the student’s level of understanding. The common thread across all applications is the shift from tools that respond to queries to systems that actively pursue goals and adapt their behavior to achieve them.

Future Directions and Open Challenges in Agentic AI

The survey identifies several critical open challenges for the agentic MLLM research community. Safety and alignment remain paramount—as AI systems gain greater autonomy, ensuring they pursue intended goals while avoiding harmful actions becomes both more important and more difficult. The development of robust alignment techniques for agentic systems is perhaps the field’s most pressing challenge.

Generalization across domains and tasks remains limited. While agentic MLLMs show impressive performance in specific environments, achieving truly general-purpose agentic capability—the ability to adapt to any novel task or environment—remains elusive. The survey suggests that advances in meta-learning, few-shot adaptation, and transfer learning will be critical for improving generalization.

Scalability of agentic capabilities is another open question. Current approaches often require extensive environment-specific training, limiting practical deployment. More efficient training methods, better simulation environments, and improved transfer from simulation to real-world settings will be needed to make agentic MLLMs practically deployable at scale. The research community, as tracked in repositories like the survey’s Awesome Agentic MLLMs GitHub repository, continues to make rapid progress on these fronts.

Frequently Asked Questions About Agentic Multimodal LLMs

What are agentic multimodal LLMs?

Agentic multimodal LLMs are AI systems that combine multimodal perception (vision, text, audio) with autonomous decision-making capabilities. Unlike traditional models that passively respond to queries, agentic MLLMs can reason, plan, use tools, and interact with environments to accomplish complex goals independently.

How do agentic AI agents differ from traditional AI agents?

Traditional AI agents rely on pre-defined workflows and respond passively to instructions. Agentic AI agents dynamically adjust strategies, proactively initiate plans, invoke tools when needed, reflect on outcomes, and operate across diverse tasks without being restricted to a single domain.

What are the three dimensions of agentic MLLMs?

The three dimensions are: (1) Agentic internal intelligence for planning through reasoning, reflection, and memory; (2) Agentic external tool invocation for using tools to extend capabilities; (3) Agentic environment interaction for taking actions in virtual or physical environments.

What role does reinforcement learning play in agentic AI?

Reinforcement learning is a key driver of the shift from passive to agentic AI. RL enables models to learn from environmental feedback, optimize long-term strategies, and develop autonomous decision-making capabilities that go beyond simple pattern matching or instruction following.
