Self-Driving Cars in 2025: A Comprehensive Benchmark Reveals How Far We Still Have to Go

📌 Key Takeaways

  • Level 5 Autonomy Remains Elusive: Despite decades of promises, self-driving cars still require constant human supervision and can’t operate without high-definition maps.
  • Rule-Based Systems Outperform AI: When perception is perfect, simple planning algorithms dramatically outperformed sophisticated neural networks in comprehensive testing.
  • Obstacle Avoidance Is Universally Unsolved: All five leading models struggled with scenarios involving unexpected obstacles, revealing a critical safety gap.
  • Benchmark Gaming Obscures Progress: Models fine-tuned for specific leaderboards don’t generalize to real-world conditions, creating false progress indicators.
  • Perception Gap Is The Real Bottleneck: The study reveals that perception, not planning, remains the primary barrier to autonomous vehicle deployment.

The Broken Promise of Level 5 Autonomy

Since the DARPA Urban Challenge of 2007, the autonomous vehicle industry has repeatedly promised that fully self-driving cars are imminent. Nearly two decades later, the technology remains firmly in the R&D stage. Deployed test vehicles still require constant human supervision, and almost all current deployments depend on high-definition maps that require continuous updating, severely limiting where these vehicles can actually operate.

While perception systems have become sophisticated and planning modules have moved beyond brittle rule-based expert systems, neither state estimation nor planning and control is ready for mass deployment. The gap between industry promises and technical reality has become increasingly apparent as companies quietly push back deployment timelines and narrow the scope of their initial launches.

This persistent gap between promise and reality has prompted researchers to ask fundamental questions about current approaches to autonomous driving. Are we focusing on the right problems? Are our evaluation methods revealing genuine progress or merely benchmark optimization that doesn’t translate to real-world performance?

Why Comparing Self-Driving Models Has Been So Difficult

A fundamental challenge in advancing autonomous driving research is the lack of unified evaluation. Different state-of-the-art models are developed using different datasets, tested on different platforms, and measured with different metrics. This fragmentation makes it nearly impossible to determine which approaches genuinely advance the field versus which merely exploit benchmark-specific quirks.

Models can be fine-tuned to “hill-climb” on specific benchmarks, creating inflated performance numbers that don’t generalize to real-world scenarios. Three major leaderboards dominate the field — CARLA, Waymo Open Dataset, and nuPlan — each with unique datasets and evaluation protocols, making cross-comparison nearly impossible.

This evaluation crisis has profound implications for both research and industry. Without reliable comparisons, companies and researchers struggle to identify which technical approaches deserve investment and development resources. The result has been a proliferation of competing systems that claim state-of-the-art performance on their chosen benchmarks, but whose actual capabilities remain opaque when faced with real-world driving conditions.

The Study: An Apples-to-Apples Comparison on a Common Platform

Researchers at the University of Southern California conducted a systematic comparison by adopting CARLA Leaderboard v2.0 as a common evaluation platform. This approach finally enabled direct comparison of leading autonomous driving models under identical conditions — something the field had been lacking.

They compiled all top-ranked methods from each major competition since 2022, then narrowed to five models with publicly available code and pretrained weights. This reproducibility requirement immediately revealed a concerning trend: only 5 of 18 surveyed top methods were actually reproducible, highlighting a broader crisis in autonomous driving research transparency.

The five evaluated models represent the full spectrum of current approaches: three end-to-end systems (TF++, InterFuser, TCP), and two modular pipelines (PDM-Lite, MTR+MPC). Each was tested across 42 routes, Town05 benchmarks (short and long), and 17 distinct traffic scenarios drawn from NHTSA pre-crash typologies — covering parking exits, intersection negotiations, obstacle avoidance, lane changes, and dynamic object crossings.

Two Competing Architectures: End-to-End vs. Modular Pipelines

The autonomous driving field has split into two dominant paradigms, each with fundamental tradeoffs that shape their real-world performance characteristics.

Modular pipelines separate perception, prediction, and planning into distinct components. This approach offers significant advantages in interpretability and engineering flexibility — teams can swap best-in-class modules and debug specific components when failures occur. However, modular systems suffer from compounding errors as mistakes cascade through the pipeline, and optimizing individual components doesn’t necessarily optimize overall driving performance.

End-to-end systems collapse all stages into a single neural network trained via imitation or reinforcement learning, optimizing a joint objective from raw sensor data to control outputs. These systems can theoretically better handle edge cases and optimize for overall driving quality rather than individual component metrics. However, they lack interpretability and are notoriously difficult to tune, making it challenging to diagnose and fix failures.

A third hybrid approach has emerged, combining transformer-based backbones for high-level scene understanding with rule-based controllers that inject expert knowledge for low-level control. This architecture attempts to bridge the gap between sophisticated perception and reliable execution, potentially offering the best of both worlds.
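The architectural contrast above can be made concrete with a sketch. The class and method names below are illustrative, not any real system's API; the point is where the learned component's boundary sits:

```python
class ModularPipeline:
    """Perception -> prediction -> planning as separate, swappable stages."""

    def drive(self, sensor_frame):
        objects = self.perceive(sensor_frame)    # e.g. detected boxes
        forecasts = self.predict(objects)        # short-horizon trajectories
        return self.plan(forecasts)              # (steer, throttle, brake)

    # toy stand-ins for what would be learned or engineered modules
    def perceive(self, frame):
        return [{"x": 10.0, "v": 2.0}]           # one object 10 m ahead

    def predict(self, objects):
        return [{**o, "x": o["x"] + o["v"]} for o in objects]

    def plan(self, forecasts):
        # brake if anything is forecast within 15 m, else cruise
        return (0.0, 0.0, 1.0) if any(f["x"] < 15 for f in forecasts) else (0.0, 0.5, 0.0)


class EndToEnd:
    """One learned map from raw sensors to control; no inspectable stages."""

    def drive(self, sensor_frame):
        return self.policy(sensor_frame)         # stand-in for a network forward pass

    def policy(self, frame):
        return (0.0, 0.5, 0.0)
```

The modular version can be debugged stage by stage (was the detection wrong, or the plan?), while the end-to-end version exposes only the final action, which is exactly the interpretability tradeoff described above.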

The Five Models Under the Microscope

The study evaluated five representative models spanning the current state-of-the-art in autonomous driving research:

TF++ (Transfuser): Pioneered transformer-based fusion of camera and LiDAR data, using a GRU decoder to generate waypoints that are converted to control commands via a PID controller. This model represents the early generation of transformer-based approaches to autonomous driving.
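TF++'s actual gains and decoder details aren't reproduced here, but the waypoint-to-control pattern it relies on is standard: a PID loop on heading error toward the next predicted waypoint. A minimal sketch with illustrative gains:

```python
import math


class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, err, dt):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv


def waypoint_to_steer(pose, waypoint, pid, dt=0.05):
    """Steer toward the next waypoint; pose = (x, y, yaw)."""
    x, y, yaw = pose
    wx, wy = waypoint
    target_yaw = math.atan2(wy - y, wx - x)
    # wrap heading error into [-pi, pi]
    err = (target_yaw - yaw + math.pi) % (2 * math.pi) - math.pi
    return max(-1.0, min(1.0, pid.step(err, dt)))


pid = PID(kp=1.0, ki=0.0, kd=0.1)
steer = waypoint_to_steer((0.0, 0.0, 0.0), (10.0, 0.0), pid)  # waypoint straight ahead
```

A waypoint dead ahead yields zero steering; a waypoint off to the side produces a clamped corrective command.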

TCP (Trajectory-guided Control Prediction): Combines waypoint prediction and direct control prediction, adaptively switching between the two approaches based on driving context. The system favors control prediction during turns and trajectory prediction during straight-line driving, attempting to optimize for different driving scenarios.
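TCP's published fusion scheme uses learned, situation-dependent weights; a deliberately simplified toy version of the idea (threshold and weights here are assumptions, not the paper's values) looks like this:

```python
def fuse_outputs(traj_action, ctrl_action, steering_cmd, turn_threshold=0.1):
    """Blend trajectory-branch and control-branch actions.

    Toy rule: weight the control branch more heavily during turns
    (|steering| above a threshold) and the trajectory branch otherwise.
    """
    w_ctrl = 0.7 if abs(steering_cmd) > turn_threshold else 0.3
    return tuple(w_ctrl * c + (1 - w_ctrl) * t
                 for c, t in zip(ctrl_action, traj_action))
```

During a turn the control branch dominates the blended action; on a straight the trajectory branch does, mirroring the context-dependent switching described above.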

InterFuser: Ingests four RGB camera views plus LiDAR data through a shared Transformer encoder, with a distinctive safety module that solves a linear programming problem to select maximum safe speed. This approach explicitly incorporates safety constraints into the decision-making process.
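InterFuser's safety module is formulated as a linear program; when the decision variable is a single speed and every hazard contributes a linear upper bound, maximizing speed reduces to taking the tightest bound, which can be sketched without a solver (the bound values below are illustrative):

```python
def max_safe_speed(desired_speed, constraints):
    """Pick the maximum speed v with v >= 0 and v <= b_i for every bound b_i.

    Each bound encodes one hazard, e.g. a speed limit or the speed that
    still allows stopping before a detected object. Maximizing a scalar
    under upper bounds is simply the minimum of those bounds.
    """
    return max(0.0, min([desired_speed] + list(constraints)))


# illustrative bounds: 13.9 m/s speed limit, lead-vehicle gap allows 8.0 m/s
v = max_safe_speed(desired_speed=20.0, constraints=[13.9, 8.0])
```

With no active hazards the desired speed passes through unchanged; an infeasible (negative) bound clamps the command to zero, i.e. a full stop.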

PDM-Lite: A rule-based planner that operates using privileged simulator state (ground truth object positions and dynamics), employing the Intelligent Driver Model for speed control and a Kinematic Bicycle Model for trajectory forecasting. While not realistic for real-world deployment, this model serves as an important baseline for understanding the performance ceiling when perception is perfect.
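PDM-Lite's exact parameter settings aren't given here, but both of its building blocks are textbook models. A minimal sketch with commonly used (assumed) parameter values:

```python
import math


def idm_accel(v, v_lead, gap, v0=13.9, T=1.5, a_max=2.0, b=3.0, s0=2.0, delta=4):
    """Intelligent Driver Model acceleration: free-road term minus interaction term.

    v: own speed, v_lead: lead vehicle speed, gap: bumper-to-bumper distance (m).
    """
    s_star = s0 + v * T + v * (v - v_lead) / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v0) ** delta - (max(s_star, 0.0) / gap) ** 2)


def bicycle_step(x, y, yaw, v, steer, accel, wheelbase=2.9, dt=0.05):
    """Kinematic bicycle model: one forward-Euler state update."""
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    v = max(0.0, v + accel * dt)
    return x, y, yaw, v
```

From rest on an empty road the IDM commands close to maximum acceleration; at the desired speed it commands roughly zero. Rolling the bicycle model forward over the planning horizon gives the trajectory forecast the planner scores.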

MTR+MPC: A novel contribution combining the Waymo-leading Motion Transformer prediction backbone with a model-predictive control planner. This represents the first reported integration of these systems, combining state-of-the-art motion prediction with principled control theory.

The Surprising Winner: Old-School Rules Beat Neural Networks

The headline result challenges many assumptions about the superiority of learning-based approaches: the rule-based PDM-Lite dramatically outperformed all learning-based models. At the scenario level, PDM-Lite achieved a driving score of 54.3 versus the next-best InterFuser at 23.8 — more than double the performance.
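To read these numbers, note how the CARLA-style driving score is computed: route completion scaled by an infraction penalty that multiplies a discount factor per violation (the 0.7 factor below is an illustrative value, not the official table):

```python
def driving_score(route_completion, penalty_factors):
    """CARLA-style driving score: completion percentage times the
    product of per-infraction penalty factors."""
    penalty = 1.0
    for f in penalty_factors:
        penalty *= f
    return route_completion * penalty


# e.g. full route completion but one red-light infraction (factor assumed 0.7)
score = driving_score(100.0, [0.7])
```

Because the penalty is multiplicative, a model can finish every route yet still post a low score if it racks up infractions, which is why route completion and driving score diverge so sharply in the results below.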

PDM-Lite scored near-perfect results in multiple challenging scenarios: Parking Exit (99.9), Signalized Junction Left Turn (94.9), and Opposite Vehicle Running a Red Light (94.7) — scenarios where even the best neural network models scored in the single digits. It achieved 100% route completion across all traffic negotiation scenarios, while end-to-end models frequently failed to complete even basic maneuvers.

However, this result comes with a critical caveat: PDM-Lite operates with privileged access to the true ground-truth simulator state, completely bypassing the perception problem. This gives it perfect knowledge of all vehicle positions, velocities, and intentions — information that real-world systems must infer from noisy sensor data.

The dramatic performance gap underscores that when perception is perfect, relatively simple planning algorithms can excel. This suggests that the perception gap, rather than planning sophistication, remains the primary bottleneck preventing real-world autonomous vehicle deployment. The finding has profound implications for research priorities and resource allocation within the autonomous driving community.

Where Each Approach Breaks Down

The detailed scenario analysis reveals telling failure patterns that illuminate the strengths and weaknesses of each architectural approach.

TCP achieved the highest map-level infraction penalty (82.7, meaning fewest traffic violations) and performed well on longer, less variable routes where its trajectory planning could operate effectively. However, it completely collapsed in fine-grained scenarios requiring precise maneuvering — scoring just 0.7 on Parking Exit and 0.0 on Invading Turn scenarios.

InterFuser posted the second-highest route completion rate but showed extreme variability across scenarios. While it scored a perfect 100.0 on Opposite Vehicle Taking Priority, it averaged only 9.4 across obstacle avoidance scenarios. This inconsistency suggests the model learned specific patterns well but struggles to generalize to novel situations.

MTR+MPC emerged as the safest model overall, achieving the highest scenario-level infraction penalty (46.2) and strong map-level performance (76.3). However, its conservative approach proved too restrictive for practical deployment — it completed only 19.8% of map routes. The domain mismatch between Waymo training data and CARLA simulation dynamics likely compounded this conservative bias.

TF++ struggled most severely with complex scenarios, scoring just 2.0 on Parking Exit and 0.0 on Opposite Vehicle Taking Priority. These failures reveal that transformer-based sensor fusion alone, without an explicit planning framework, cannot handle the geometric reasoning required for narrow spaces or dynamic interactions with other vehicles.

All models showed a consistent weak point: obstacle avoidance scenarios proved universally challenging, with even the privileged PDM-Lite averaging only 33.2 in this category — highlighting fundamental limitations in current approaches to unexpected object handling.

The Obstacle Avoidance Problem Nobody Has Solved

Across all five models, obstacle avoidance emerged as the most consistently challenging category, representing perhaps the most critical unsolved problem in autonomous driving. Scenarios involving vehicles opening doors, parked obstacles on two-way roads, invading turns, construction zones, side-lane hazards, and accidents yielded universally low driving scores.

The performance breakdown is stark: PDM-Lite (33.2), InterFuser (9.4), MTR+MPC (6.9), TCP (3.2), and TF++ (2.1). These scenarios require the most sophisticated integration of perception, prediction, and planning capabilities — skills that current systems have not mastered.

Successful obstacle avoidance demands that vehicles detect unexpected static or slow-moving obstacles, predict whether and how they might move, plan an avoidance maneuver that may require entering oncoming traffic lanes, and execute the maneuver safely while monitoring for additional dynamic changes. This represents arguably the most safety-critical capability for real-world deployment.

The universal failure across different architectural approaches suggests that the problem may require fundamental advances rather than incremental improvements to existing methods. Some researchers propose that advanced AI planning algorithms combined with better uncertainty quantification may be necessary to handle these scenarios safely.

The Role of Foundation Models and What Comes Next

Recent research has begun leveraging large language models and vision-language models for autonomous driving, representing a potentially transformative new direction for the field. These approaches embed human reasoning logic for interpretable decision-making and use large multimodal models for sophisticated scene understanding and rule-compliant driving behavior.

Foundation model approaches promise several advantages over current methods: improved adaptability to novel scenarios through pre-trained world knowledge, better explainability through natural language reasoning traces, and the potential for few-shot learning of new driving behaviors without extensive retraining.

While these multi-modal foundation model approaches were beyond the scope of the current evaluation, they represent an emerging frontier that may address some of the fundamental limitations revealed by this study. The study authors explicitly note this as a key direction for future research, alongside adjusting the safety-conservatism tradeoff in MPC-based planners and developing stronger control modules to pair with transformer-based perception systems.

However, foundation models also introduce new challenges: computational requirements for real-time inference, reliability concerns for safety-critical applications, and the need for robust evaluation frameworks that can assess natural language reasoning in driving contexts.

Practical Implications for the AV Industry

The findings carry significant implications for commercial autonomous vehicle development and investment strategies. First and most importantly, the perception gap — not the planning gap — remains the dominant bottleneck, as demonstrated by PDM-Lite’s success with perfect state information. This suggests that companies should prioritize perception research over increasingly sophisticated planning algorithms.

Second, current end-to-end models that achieve top performance on specific leaderboards may not generalize across diverse driving conditions. This finding suggests that benchmark performance serves as a poor proxy for real-world readiness, potentially misleading both investors and the public about genuine progress toward deployment.

Third, hybrid approaches that combine robust perception with deterministic planning currently yield the best overall results, pointing toward an architectural sweet spot that the industry should explore more thoroughly. The combination of learned perception with rule-based planning may offer better performance and safety guarantees than pure end-to-end approaches.

Fourth, the reproducibility crisis in autonomous vehicle research — where most top methods lack public code or pretrained weights — actively hampers genuine progress. Only 5 of 18 surveyed methods were reproducible, making it impossible to build on previous work effectively. This suggests that funding agencies and journals should require code and model release as a condition for publication.

The study also reveals that safety-focused approaches like MTR+MPC, while achieving fewer infractions, may be too conservative for practical deployment. This highlights the need for better methods to balance safety with progress, potentially through improved uncertainty quantification and risk assessment frameworks.

The Road Ahead: What Must Change

The study identifies several concrete research directions that could accelerate progress toward viable autonomous vehicles. First, the field must develop models that perform well without high-definition maps, as map dependence severely limits deployment scalability and makes systems brittle to environmental changes.

Second, researchers must close the perception-to-planning gap exposed by PDM-Lite’s privileged access advantage. This likely requires advances in sensor fusion, uncertainty quantification, and robust state estimation under challenging conditions like adverse weather, unusual lighting, or sensor degradation.

Third, the obstacle avoidance problem demands focused attention, as it represents scenarios most critical for safety. This may require novel approaches that combine reactive planning, predictive modeling, and risk assessment in ways current systems cannot achieve.

Fourth, the field must address the conservatism problem that crippled otherwise capable systems like MTR+MPC. This could involve better risk-aware planning algorithms, improved confidence calibration, or adaptive safety margins that adjust based on scenario complexity.

Fifth, the domain transfer problem that affected MTR+MPC when moving from Waymo to CARLA training data needs resolution. Real-world deployment requires systems that can adapt to new environments, vehicle dynamics, and traffic patterns without extensive retraining.

The authors advocate that the community standardize on CARLA as both a training and evaluation benchmark, and release publicly available, CARLA-compatible code. This would make the kind of rigorous comparison conducted in this study routine, accelerating progress through better evaluation and reproducibility.

As the paper’s conclusion states plainly: despite nearly two decades of promises and billions in investment, “we are still not there yet.” However, studies like this provide the rigorous evaluation framework necessary to focus efforts on the most critical remaining challenges and measure genuine progress toward the goal of safe, reliable autonomous vehicles.

Frequently Asked Questions

Why are self-driving cars still not widely available after decades of development?

Despite nearly two decades of development since the DARPA Urban Challenge, self-driving cars remain in the R&D stage due to fundamental challenges in perception, planning, and real-world adaptability. Current systems require constant human supervision and depend on high-definition maps, severely limiting where they can operate.

What are the main differences between end-to-end and modular autonomous driving systems?

Modular systems separate perception, prediction, and planning into distinct components, offering interpretability but suffering from compounding errors. End-to-end systems use a single neural network from sensor data to control outputs, better handling edge cases but lacking interpretability and being difficult to tune.

Which autonomous driving model performed best in the 2025 benchmark study?

The rule-based PDM-Lite dramatically outperformed all learning-based models with a driving score of 54.3 versus the next-best InterFuser at 23.8. However, PDM-Lite operates with privileged access to ground-truth simulator state, bypassing perception entirely.

What is the biggest unsolved challenge in autonomous driving according to recent research?

Obstacle avoidance emerged as the most challenging scenario across all models tested. These situations require sophisticated integration of perception, prediction, and planning to detect unexpected obstacles, predict their movement, and plan safe avoidance maneuvers.

How do current autonomous vehicle benchmarks compare to real-world driving conditions?

Current benchmarks like CARLA, Waymo Open Dataset, and nuPlan use different datasets and evaluation protocols, making cross-comparison difficult. Models can be fine-tuned to specific benchmarks, creating inflated performance that doesn’t generalize to real-world conditions.
