Can LLMs Trade? Financial Theory Testing with AI Agents

📌 Key Takeaways

  • LLMs Follow Instructions, Not Profits: GPT-4o agents faithfully execute assigned strategies even at a financial loss, prioritizing prompt compliance over profit maximization.
  • Asymmetric Price Discovery: LLM agents correct undervaluation effectively but struggle to deflate overvalued markets, sustaining bubble-like conditions in simulations.
  • Realistic Market Dynamics: Multi-agent LLM simulations produce emergent behaviors including liquidity provision, price convergence, and strategic trading patterns resembling real markets.
  • Systemic Risk Concerns: Correlated behavior from similar foundation models could amplify market instabilities if LLM-based trading becomes widespread without proper safeguards.
  • New Research Paradigm: The open-source framework enables testing financial theories at a fraction of the cost of human-subject experiments, with unprecedented control and replicability.

The Rise of LLM-Powered Trading Agents

The intersection of artificial intelligence and financial markets has entered a fundamentally new phase. While algorithmic trading has dominated markets for decades, the emergence of large language models as autonomous trading agents represents what researchers call a “fundamental shift from algorithms with explicit objectives to systems guided by natural language instructions.” This shift carries profound implications for market efficiency, stability, and the very nature of how we test financial theories.

In April 2025, Alejandro Lopez-Lira of the University of Florida published a groundbreaking paper titled Can Large Language Models Trade? Testing Financial Theories with LLM Agents in Market Simulations, presenting the first comprehensive framework for deploying LLM agents as autonomous traders in controlled market environments. The research arrives at a critical moment: LLM-based strategies are already being deployed in live markets, with platforms like Autopilot building portfolio management systems around ChatGPT, yet systematic understanding of how these agents behave remains dangerously limited.

The paper poses three fundamental questions that every financial professional and regulator should be asking. First, can LLMs actually execute coherent trading strategies? Second, do these agents optimize for profit the way economic theory assumes rational actors should? And third, what happens to market stability when multiple LLM agents interact in the same marketplace? The answers, as we will explore throughout this article, challenge conventional assumptions about both artificial intelligence and market dynamics. As the broader landscape of AI and machine learning in financial markets continues to evolve, this research provides a critical empirical foundation for understanding what lies ahead.

Research Framework and Simulation Architecture

Lopez-Lira developed an open-source simulation framework with three integrated components designed to rigorously test LLM trading behavior. The first component is a structured protocol for implementing and validating LLM trading agents, supporting both LLM-based and traditional rule-based agents as benchmarks. The second is a controlled market environment with realistic microstructure. The third is a comprehensive data collection system for analyzing trading behavior at granular resolution.

The market implements a continuous double-auction mechanism that processes orders in discrete trading rounds. This discrete approach is necessary because LLMs have latency constraints that make real-time continuous processing infeasible. Each round maintains a persistent order book supporting both market and limit orders, with partial fills and careful tracking of remaining quantities. The matching engine operates through a three-stage process: first, limit orders that do not cross the spread are posted to the book; second, market orders are netted via market-to-market matching at the current price; third, remaining market orders match against the order book, with unfilled quantities converted to aggressive limit orders.
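The three-stage matching process can be sketched in a few dozen lines. This is a simplified illustration with hypothetical names (`Book`, `process_round`), not the framework's actual implementation; it shows only the buy side of stage three, the sell side being symmetric, and omits the immediate matching of crossing limit orders.

```python
from dataclasses import dataclass, field

@dataclass
class LimitOrder:
    side: str      # "buy" or "sell"
    qty: int
    price: float

@dataclass
class Book:
    bids: list = field(default_factory=list)   # resting buy limit orders
    asks: list = field(default_factory=list)   # resting sell limit orders

def process_round(book, limit_orders, mkt_buy_qty, mkt_sell_qty, last_price):
    """One discrete trading round, following the three-stage sketch."""
    trades = []

    # Stage 1: post limit orders that do not cross the spread.
    best_ask = min((o.price for o in book.asks), default=float("inf"))
    best_bid = max((o.price for o in book.bids), default=0.0)
    for o in limit_orders:
        if o.side == "buy" and o.price < best_ask:
            book.bids.append(o)
        elif o.side == "sell" and o.price > best_bid:
            book.asks.append(o)

    # Stage 2: net market buys against market sells at the current price.
    netted = min(mkt_buy_qty, mkt_sell_qty)
    if netted:
        trades.append(("cross", netted, last_price))
    mkt_buy_qty -= netted
    mkt_sell_qty -= netted

    # Stage 3: match remaining market orders against the book with partial
    # fills; unfilled quantity converts to an aggressive limit order.
    book.asks.sort(key=lambda o: o.price)
    while mkt_buy_qty and book.asks:
        ask = book.asks[0]
        fill = min(mkt_buy_qty, ask.qty)
        trades.append(("buy", fill, ask.price))
        ask.qty -= fill
        mkt_buy_qty -= fill
        if ask.qty == 0:
            book.asks.pop(0)
    if mkt_buy_qty:  # leftover demand posted as an aggressive bid
        book.bids.append(LimitOrder("buy", mkt_buy_qty, last_price * 1.05))
    return trades
```

For example, 8 shares of market buying against 2 shares of market selling nets 2 shares at the current price in stage two, and the remaining 6 shares walk the ask side of the book in stage three.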

The economic environment is carefully calibrated around a known fundamental value. Assets pay a stochastic dividend each round: a base of $1.40 plus or minus $1.00 of variation, with each realization ($2.40 or $0.40) occurring with 50 percent probability, for an expected dividend of $1.40 per round. Cash holdings accrue interest at a 5 percent risk-free rate per round. This yields a fundamental value of $28.00 using the perpetuity formula (Expected Dividend divided by Interest Rate). By establishing this known value, the researchers can precisely measure whether and how quickly LLM agents discover the correct price through their trading interactions.
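The calibration reduces to a one-line perpetuity calculation; a quick check of the numbers:

```python
# Dividend realizations: base ± variation, each with 50% probability,
# i.e. $2.40 or $0.40, giving an expected dividend of $1.40 per round.
base, variation, p_high = 1.40, 1.00, 0.5
expected_dividend = p_high * (base + variation) + (1 - p_high) * (base - variation)

risk_free_rate = 0.05  # per round
fundamental_value = expected_dividend / risk_free_rate  # perpetuity formula

print(round(expected_dividend, 2), round(fundamental_value, 2))  # 1.4 28.0
```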

Agent submission order within each round is randomized to prevent systematic priority advantages, simulating concurrent order arrival in real markets. The system validates trades against agent cash commitments and position constraints before matching. Short selling and borrowing are both prohibited, a constraint that proves consequential for the results. Each agent begins with an endowment of 1,000,000 monetary units and 10,000 shares, ensuring sufficient resources for meaningful trading activity.

How GPT-4o Agents Execute Trades

Every LLM agent in the study uses GPT-4o as its decision-making engine, but the key innovation lies in the two-layer prompt architecture that shapes each agent’s behavior. The system prompt establishes the strategic layer: the agent’s fundamental trading philosophy, objectives, risk parameters, and behavioral constraints. This prompt remains constant across all trading rounds, anchoring the agent’s identity. The user prompt provides the tactical layer: dynamic market context including current prices, volumes, order book depth, the agent’s own position, historical data from the last five rounds, dividend information, and available trading options.

Agents produce structured JSON outputs through function calling with Pydantic-based schema validation. Each decision includes a valuation reasoning field (natural language explanation of fundamental value assessment), a numerical valuation estimate, price target reasoning, a numerical price target, specific order instructions (buy or sell, quantity, market or limit, price limit), and an overall reasoning explanation. This structured approach ensures that every trading decision is both executable and analyzable, capturing not just the what but the why behind each trade.
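The decision schema described above can be sketched with Pydantic, which the framework uses for validation. Field names here are illustrative reconstructions from the description, not the paper's exact schema:

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class Order(BaseModel):
    side: Literal["buy", "sell"]
    quantity: int = Field(gt=0)
    order_type: Literal["market", "limit"]
    limit_price: Optional[float] = None  # only meaningful for limit orders

class TradeDecision(BaseModel):
    valuation_reasoning: str     # natural-language fundamental assessment
    valuation: float             # numerical valuation estimate
    price_target_reasoning: str
    price_target: float
    orders: list[Order]          # specific order instructions
    reasoning: str               # overall explanation for the decision

# Validating a raw JSON payload of the kind returned via function calling:
raw = {
    "valuation_reasoning": "Expected dividend $1.40 at 5% implies $28.",
    "valuation": 28.0,
    "price_target_reasoning": "Price should revert toward fundamentals.",
    "price_target": 28.0,
    "orders": [{"side": "buy", "quantity": 100,
                "order_type": "limit", "limit_price": 27.5}],
    "reasoning": "Asset trades below fundamental value; accumulate.",
}
decision = TradeDecision(**raw)
```

Schema validation of this kind is what guarantees every model output is executable by the matching engine while the reasoning fields remain available for later analysis.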

The framework includes seven distinct LLM-based agent types, each embodying a different trading philosophy. Value investors focus on fundamental analysis and mean reversion toward intrinsic value. Momentum traders follow established price trends and volume patterns. Market makers provide liquidity through symmetric bid-ask spreads with detailed instructions for spread placement between 1 and 3 percent, inventory management, and compliance with no-short-selling rules. Contrarian traders fade market extremes and overreactions. Speculators seek to profit from perceived inefficiencies. Sentiment-based agents come in two variants: optimistic agents that believe there is an 80 to 90 percent probability of maximum dividends, and pessimistic agents that assume the same probability for minimum dividends. Finally, retail trader agents simulate typical individual investor behavior.
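Combined with the two-layer architecture described earlier, agent types differ only in their fixed system prompt, while a per-round user prompt supplies market context. The prompts below are paraphrases for illustration, not the paper's exact wording:

```python
# Illustrative system prompts (strategic layer, fixed across rounds).
SYSTEM_PROMPTS = {
    "value_investor": (
        "You are a value investor. Estimate fundamental value from expected "
        "dividends and the interest rate, buy below it, sell above it, and "
        "expect prices to revert toward intrinsic value."
    ),
    "market_maker": (
        "You are a market maker. Post symmetric buy and sell limit orders "
        "with a 1-3% spread, manage your inventory, and never sell more "
        "shares than you hold (short selling is prohibited)."
    ),
    "optimistic": (
        "You believe there is an 80-90% probability of the maximum dividend "
        "each round. Trade in line with this belief."
    ),
}

def build_user_prompt(price, position, cash, history):
    """Tactical layer: dynamic market context refreshed every round."""
    return (
        f"Current price: {price:.2f}. Your position: {position} shares, "
        f"{cash:.2f} cash. Recent prices: {history}. "
        "Submit your trading decision as structured JSON."
    )
```

Because the system prompt is the sole source of each agent's identity, swapping one string for another is all it takes to turn a value investor into a momentum trader.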

To benchmark LLM performance, the framework also includes deterministic rule-based agents such as directional traders (always buy, always sell, always hold), technical agents (gap traders, mean reversion, momentum), and algorithmic market makers with fixed spread-posting strategies. These provide a baseline against which the sophistication of LLM decision-making can be measured. The research on how deep learning transforms portfolio optimization provides additional context for understanding these AI-driven approaches.



Testing Price Discovery in Simulated Markets

The core question of price discovery asks whether LLM agents can collectively find the correct price for an asset through their trading interactions. To test this, Lopez-Lira designed scenarios in which markets start at prices significantly above or below the known fundamental value of $28.00 and observed whether trading activity drives convergence.

In the finite-horizon scenarios, markets starting at $35.00 (25 percent above fundamental value) run for 20 rounds of trading with 8 agents: 2 default agents, 2 optimistic agents, 2 market makers with 20 times the baseline liquidity, and 2 speculators. The below-fundamental scenario starts at $21.00 (25 percent below) with an identical agent composition. These configurations test whether the mix of trading strategies produces the informational aggregation that the Efficient Market Hypothesis predicts.
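The finite-horizon setups can be written down as a small configuration, which makes the symmetry of the two mispricing tests explicit. The layout below is a hypothetical sketch; the open-source framework's configuration schema may differ:

```python
FUNDAMENTAL = 28.00

# Finite-horizon price-discovery scenarios as described in the text.
scenarios = {
    "above_fundamental": {
        "start_price": 35.00,   # 25% above fundamental value
        "rounds": 20,
        "agents": {"default": 2, "optimistic": 2,
                   "market_maker_20x_liquidity": 2, "speculator": 2},
    },
    "below_fundamental": {
        "start_price": 21.00,   # 25% below fundamental value
        "rounds": 20,
        "agents": {"default": 2, "optimistic": 2,
                   "market_maker_20x_liquidity": 2, "speculator": 2},
    },
}

for name, cfg in scenarios.items():
    mispricing = cfg["start_price"] / FUNDAMENTAL - 1
    print(name, f"{mispricing:+.0%}")   # +25% / -25%
```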

The infinite-horizon scenarios push the tests further, starting at $56.00 (twice fundamental value) and $14.00 (half fundamental value) over 15 rounds. The results reveal a striking asymmetry. When starting below fundamental value at $14.00, prices exhibit clear convergence toward the $28.00 benchmark. Agent valuations adjust upward over time, with value investors recognizing the underpricing and momentum traders eventually following the upward trend. Market makers facilitate the correction by tightening spreads around the improving price.

However, when starting above fundamental value at $56.00, the price fails to converge downward within the simulation period. It remains substantially elevated, with many agents maintaining valuation estimates well above the theoretical $28.00 fundamental. This creates a persistent bubble-like condition that the trading mechanism cannot resolve. The finding is particularly significant because it mirrors real-world observations about the difficulty of correcting overvaluation, especially in markets where short selling is constrained.

The systematic decision analysis reveals the mechanics behind this asymmetry. Using a methodology analogous to partial dependence plots from interpretable machine learning, the researchers varied the price-to-fundamental ratio from 0.1 to 3.5 while holding other parameters constant. Value investors show strong buying tendencies when prices are below fundamental value and selling preferences when above, exactly as expected. But the selling pressure when prices are elevated proves insufficient to overcome the buying pressure from optimistic and momentum agents who anchor on the current high prices.
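The sweep itself is straightforward to reproduce in outline: vary the price-to-fundamental ratio over a grid while holding everything else fixed, and record each agent's action. Since querying GPT-4o is not reproducible here, a stylized value-investor rule stands in for the actual model call:

```python
# Stylized stand-in for a GPT-4o value investor's decision, used only to
# illustrate the partial-dependence-style sweep.
def value_investor_action(price_to_fundamental):
    if price_to_fundamental < 0.95:
        return "buy"    # strong buying tendency below fundamental value
    if price_to_fundamental > 1.05:
        return "sell"   # selling preference above fundamental value
    return "hold"

# Grid of price-to-fundamental ratios from 0.1 to 3.5, as in the paper.
ratios = [round(0.1 + 0.1 * i, 1) for i in range(35)]
profile = {r: value_investor_action(r) for r in ratios}
```

Plotting `profile` for each agent type is what reveals the asymmetry: value investors sell when overpriced, but momentum and optimistic agents keep buying at the same ratios, and the net pressure fails to pull prices down.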

Bubble Formation and Persistent Overvaluation

The bubble formation findings deserve particular attention because they connect directly to decades of experimental finance research. Studies by Weitzel et al. (2020) demonstrated that even finance professionals are susceptible to speculative bubbles in experimental settings. Kopányi-Peuker and Weber (2021) showed that trading experience alone does not eliminate bubble formation. Kirchler, Huber, and Stöckl (2012) identified confusion about fundamental values as a key driver of experimental bubbles.

Lopez-Lira’s LLM agents reproduce these patterns with remarkable fidelity. When markets are initialized above the fundamental value, the agents collectively sustain the overvaluation. Several mechanisms appear to drive this behavior. First, momentum traders interpret the high starting price as an established trend and place buy orders to ride it further. Second, optimistic sentiment agents assign high probabilities to favorable outcomes, reinforcing their willingness to buy at elevated prices. Third, market makers, while providing liquidity on both sides, tend to center their spreads around the current market price rather than the fundamental value, implicitly validating the overvaluation.

The contrarian and value investor agents do attempt to sell, recognizing the overvaluation. However, without the ability to short sell, their corrective pressure is limited to liquidating their existing positions. Once their shares are sold, they become passive observers with no mechanism to exert further downward pressure. This structural limitation explains why the correction is asymmetric: buying pressure from cash-rich agents wanting to accumulate undervalued shares faces no equivalent constraint, while selling pressure is bounded by current share holdings.
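The bound on corrective pressure can be made concrete with the paper's endowments. Without short selling, a contrarian's cumulative sell capacity is capped at its initial inventory, while a cash-rich buyer's capacity at depressed prices is several times larger:

```python
# Endowments from the paper: 1,000,000 monetary units and 10,000 shares.
cash, shares = 1_000_000, 10_000

# At the undervalued start of $14, a cash-rich agent can absorb a lot:
max_buy_at_14 = cash // 14    # buying capacity in shares

# At the overvalued start of $56, an agent barred from short selling can
# sell at most its existing holdings, then becomes a passive observer:
max_sell_at_56 = shares       # 10,000 shares, regardless of price

print(max_buy_at_14, max_sell_at_56)  # 71428 10000
```

Roughly a seven-to-one imbalance in raw capacity, which is the structural root of the asymmetric convergence results.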

This finding has direct relevance to current debates about market structure. If LLM-based trading systems proliferate and share similar biases toward momentum-following and anchoring on current prices, they could exacerbate real-world bubble dynamics. The research suggests that the correction mechanism depends critically on the availability of short selling and the composition of agent strategies, a conclusion with immediate policy implications for regulators.

Divergent Beliefs and Market Stress Scenarios

To test how conflicting viewpoints affect price formation, the divergent beliefs scenario pits 10 agents against each other: 2 optimistic, 2 pessimistic, 2 market makers, 2 momentum traders, and 2 default agents. Crucially, in this scenario the fundamental price is hidden from agents, forcing them to form beliefs based solely on market dynamics and their prior dispositions. Starting prices are set at both $56.00 and $14.00 to test convergence under disagreement.

The results demonstrate that heterogeneous beliefs generate meaningful trading volume. Optimistic agents, believing in an 80 to 90 percent probability of maximum dividends, consistently bid higher than pessimistic agents who assign the same probability to minimum dividends. Market makers capture the spread between these divergent valuations, profiting from the disagreement. Momentum traders add another dimension by amplifying whichever direction gains traction, potentially reinforcing one side’s view over the other.

The market stress scenario extends the simulation to 100 rounds with deliberately imbalanced endowments. Optimistic traders receive 1.5 times the normal cash but only 0.5 times the shares. Pessimistic traders get the inverse: 0.5 times cash but 1.5 times shares. This creates a natural setup where optimists have excess buying power while pessimists hold excess inventory they might want to reduce. The extended timeframe allows for observation of longer-term dynamics including potential exhaustion of one side’s resources, shifts in market regime, and the evolution of market maker profitability over time.

These scenarios connect to the broader academic literature on heterogeneous agent models, which have gained prominence as alternatives to representative agent frameworks. The ability to test specific belief distributions and endowment structures using LLM agents offers researchers a powerful new tool for validating theoretical predictions about how disagreement shapes market outcomes. For those interested in how artificial intelligence is reshaping the investment landscape more broadly, the exploration of AI agents in autonomous investing provides complementary insights.


LLM Agents vs Human Traders: Key Differences

Perhaps the most consequential finding of the research is the fundamental behavioral difference between LLM and human traders. The paper concludes unequivocally that “LLMs do not inherently optimize for profit maximization but rather for following instructions accurately.” This is a paradigm-shifting observation. Economic theory has long assumed that market participants are rational profit maximizers, or at least boundedly rational actors who attempt to maximize returns within cognitive constraints. LLM agents represent something entirely different: they are instruction followers who faithfully execute their assigned strategies regardless of financial outcomes.

This distinction manifests in several concrete ways. Human traders, even when given explicit strategies, will deviate when they perceive better opportunities or when losses mount. A human value investor who has watched their portfolio decline for several rounds will likely reconsider their approach, perhaps shifting to a momentum strategy or reducing position sizes. An LLM value investor maintains its strategic direction with mechanical consistency, continuing to buy what it perceives as undervalued assets even as losses accumulate.

The sensitivity to prompt design adds another dimension absent in human trading. Slight changes in natural language instructions can produce fundamentally different behaviors from the same underlying model. A market maker instructed to “maintain tight spreads and manage inventory aggressively” will behave quite differently from one told to “provide deep liquidity while minimizing risk.” Human traders bring personal biases, risk preferences, and emotional responses regardless of how they are instructed. LLM agents derive their entire behavioral profile from their prompts, making prompt engineering a critical determinant of market outcomes.

Yet the similarities are equally revealing. Like human participants in experimental asset market research, LLM agents sustain bubbles from overvalued starting positions. They exhibit meaningful price discovery behaviors, provide strategic liquidity, and generate emergent market dynamics that no individual agent was explicitly programmed to produce. The combination of mechanical instruction-following with the ability to process and reason about complex market information creates what Lopez-Lira describes as “a unique trading profile distinct from rule-based algorithms and human traders.”

Systemic Risk and Regulatory Implications

The research raises urgent questions about systemic risk that regulators cannot afford to ignore. The central concern is correlated behavior: if many market participants deploy trading systems built on similar foundation models, they may exhibit synchronized responses to specific market conditions. Lopez-Lira warns that similar LLM architectures “responding uniformly to comparable prompts or market signals could inadvertently create destabilizing trading patterns without explicit coordination.”

This risk differs qualitatively from traditional algorithmic trading concerns. When multiple firms deploy momentum-following algorithms, the risk of herding is well understood and can be partially mitigated through diverse strategy design. But LLM-based systems introduce a subtler form of correlation. Even with different prompts and strategies, agents built on the same foundation model share underlying biases in how they interpret language, process uncertainty, and weight conflicting information. These shared biases could create invisible correlations that only become apparent during market stress events.

The paper also highlights the risk of flawed strategy amplification. Because LLMs faithfully follow their instructions without independent judgment about strategy quality, a poorly designed prompt could drive an agent to systematically destabilize markets. Unlike a human trader who would recognize and correct a failing strategy, an LLM agent will persist until its resources are exhausted or its operator intervenes. At scale, this could introduce novel forms of market manipulation, whether intentional or accidental.

The research by Dou, Goldstein, and Ji (2024) on algorithmic collusion adds another layer of concern. Their work showed that reinforcement learning-based AI speculators can autonomously learn collusive behavior without explicit coordination. While LLM agents operate differently, their shared training data and similar reasoning patterns could produce analogous outcomes. Regulators will need new frameworks to detect and prevent emergent coordination among LLM-based trading systems, a challenge that existing market surveillance tools are not designed to address.

For financial institutions, the implications are clear. Pre-deployment testing of LLM-based trading systems using frameworks like the one presented in this paper is not optional but essential. Understanding how an LLM agent will behave under various market conditions, how it will interact with other AI-driven participants, and how its behavior changes as market stress increases should be mandatory before any live deployment. The paper serves three audiences: practitioners developing LLM-based trading systems, regulators anticipating widespread LLM adoption, and researchers studying market dynamics with AI agents.

The Future of AI-Driven Market Simulations

Beyond its immediate findings, Lopez-Lira’s framework opens a new paradigm for financial research. Traditional experimental market studies require recruiting human participants, compensating them for their time, and accepting the inherent variability and limited sample sizes that come with human subject research. LLM-based simulations offer unprecedented control and replicability at a fraction of the cost. Researchers can run thousands of market scenarios with precisely controlled agent compositions, belief structures, endowment distributions, and market rules.

The framework enables testing financial theories that lack closed-form analytical solutions, an increasingly important capability as theoretical models become more complex. Agent-based computational finance has long struggled with the difficulty of specifying realistic agent decision rules. LLM agents solve this problem by accepting natural language strategy descriptions and translating them into coherent trading behavior, bridging the gap between theoretical descriptions of trader behavior and implementable simulation agents.

The complexity economics perspective embraced by the paper views markets as dynamic, non-equilibrium systems where agents adaptively learn and evolve strategies, exhibiting emergent phenomena and self-organization. This stands in contrast to the equilibrium-focused models that dominate much of academic finance. LLM-based simulations are uniquely suited to exploring this complexity paradigm because the agents can process rich contextual information, adapt their reasoning to changing conditions, and exhibit the kind of bounded rationality that complexity economists have long argued characterizes real market participants.

Several limitations of the current research point toward productive future directions. The study uses only GPT-4o as its foundation model. Testing with Claude, Gemini, Llama, and other architectures would reveal whether the observed behaviors are properties of LLMs in general or specific to one model family. The small number of agents (8 to 10) compared to real markets (thousands to millions of participants) raises questions about scalability. And the relatively short simulation periods (15 to 100 rounds) may not capture longer-term dynamics like regime changes, learning effects, or the gradual exhaustion of trading opportunities.

Despite these limitations, the research establishes a critical foundation. As LLM-based trading moves from academic simulation to live deployment, the need for rigorous testing frameworks will only grow. Lopez-Lira’s open-source platform provides the tools for practitioners, regulators, and researchers to understand what happens when artificial intelligence stops assisting human traders and starts trading on its own. The question is no longer whether LLMs can trade but how we ensure they do so in ways that enhance rather than undermine market integrity.


Frequently Asked Questions

Can large language models actually execute trades in financial markets?

Yes, LLMs like GPT-4o can process market data, form price expectations, and execute buy or sell orders through structured output formats. Research by Lopez-Lira (2025) demonstrates that LLM agents successfully participate in continuous double-auction markets, placing both limit and market orders based on their assigned trading strategies and real-time market conditions.

Do LLM trading agents maximize profits like human traders?

No. A key finding is that LLMs prioritize following their prompt instructions over maximizing profits. Unlike human traders who deviate from strategies to pursue gains, LLM agents faithfully execute their programmed approach even when doing so results in financial losses. This makes them fundamentally different from both human traders and traditional optimization-based algorithms.

Can LLM agents create market bubbles in simulations?

Yes. When markets start above fundamental value, LLM agents struggle to correct the overvaluation, sustaining bubble-like conditions. However, they are effective at correcting undervaluation, showing asymmetric price discovery. This mirrors real-world observations about short-sale constraints limiting downward price corrections.

What financial theories were tested using LLM trading agents?

The research tested the Efficient Market Hypothesis (price discovery and convergence), Dividend Discount Model valuation, experimental bubble formation theories, heterogeneous beliefs models, market microstructure theory, and complexity economics frameworks. Results showed conditional support for EMH depending on market structure and starting conditions.

What are the systemic risks of LLM-based trading in real markets?

The primary risk is correlated behavior: if many trading systems use similar foundation models, they may respond uniformly to market signals, amplifying instabilities without explicit coordination. Additionally, LLMs’ faithful adherence to potentially flawed strategies could amplify volatility, and their difficulty correcting overvaluation suggests LLM-dominated markets could sustain bubbles longer than expected.
