New benchmark reveals autonomous driving systems fail under common scenario shifts

Fail2Drive benchmark shows state-of-the-art driving models experience 22.8% average success rate drops when tested on shifted scenarios.

Simon Gerstenecker · Andreas Geiger · Katrin Renz
Research Digest · 2 min read
Gerstenecker et al. · AI-generated illustration · Zotpaper
Gerstenecker et al. introduce Fail2Drive, the first paired-route benchmark for testing how well autonomous driving systems generalize beyond their training scenarios. Testing multiple state-of-the-art models on 200 routes with 17 types of distribution shifts, they found consistent degradation with an average 22.8% drop in success rates.

What they did

The authors created Fail2Drive, a benchmark built on the CARLA simulator with 200 driving routes designed to test generalization. Each route comes in two versions: an in-distribution scenario matching training conditions, and a shifted scenario with changes in appearance (weather, lighting), layout (road structure), behavior (traffic patterns), or robustness challenges. This paired design isolates the specific effect of each type of shift.
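Under this paired design, the effect of a shift can be read off directly as the difference in success rates between the two halves of each pair. A minimal sketch of that computation (the record fields and data below are illustrative, not taken from the Fail2Drive toolbox):

```python
# Hypothetical paired-route records: each pair holds the outcome of the same
# route driven under in-distribution (ID) and shifted (OOD) conditions.
pairs = [
    {"shift": "appearance", "id_success": True,  "ood_success": False},
    {"shift": "appearance", "id_success": True,  "ood_success": True},
    {"shift": "layout",     "id_success": True,  "ood_success": False},
    {"shift": "behavior",   "id_success": False, "ood_success": False},
]

def success_rate_drop(pairs, shift=None):
    """Drop in success rate (percentage points) from ID to shifted scenarios,
    optionally restricted to one shift category."""
    subset = [p for p in pairs if shift is None or p["shift"] == shift]
    id_rate = sum(p["id_success"] for p in subset) / len(subset)
    ood_rate = sum(p["ood_success"] for p in subset) / len(subset)
    return 100 * (id_rate - ood_rate)

print(success_rate_drop(pairs))                      # → 50.0 (overall)
print(success_rate_drop(pairs, shift="appearance"))  # → 50.0
```

Because every shifted route has an in-distribution twin, the drop measures the shift itself rather than route difficulty.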

They evaluated multiple state-of-the-art closed-loop driving models across these scenarios and developed an open-source toolbox for creating new test scenarios. A privileged expert policy validates that all scenarios remain solvable.

Key findings

• All tested models showed significant performance drops, with an average 22.8% decrease in success rates on shifted scenarios compared with their in-distribution counterparts
• Models exhibited unexpected failure modes, including ignoring clearly visible objects in LiDAR data
• Systems failed to learn fundamental concepts such as distinguishing free space from occupied space
• Different types of distribution shifts caused varying degrees of degradation across models

Why it matters

Existing autonomous driving benchmarks often reuse training scenarios at test time, making it unclear whether success reflects genuine driving competence or simple memorization. This benchmark provides the first systematic way to measure true generalization in closed-loop driving, revealing that current state-of-the-art systems are far less robust than their performance on standard benchmarks suggests.

Caveats

The benchmark is limited to CARLA simulation scenarios, which may not capture all real-world complexities. The 17 scenario classes, while comprehensive, represent a subset of possible distribution shifts that autonomous vehicles might encounter. The study focuses on closed-loop evaluation but doesn't address how to improve generalization.


Analysis

This work addresses a critical gap in autonomous driving evaluation. Most benchmarks test on scenarios similar to training data, creating an illusion of robustness. By systematically measuring generalization across paired scenarios, Fail2Drive provides a more honest assessment of current capabilities. The finding that models ignore LiDAR-visible objects suggests fundamental issues in sensor fusion and spatial reasoning that go beyond simple overfitting. The open-source toolbox could accelerate research into more robust driving systems.


Research Digest

Articles published under the Zotpaper byline are synthesized from multiple source publications by our AI editor and reviewed by our editorial process. Each story combines reporting from credible outlets to give readers a balanced, comprehensive view.