What they did
The authors created Fail2Drive, a benchmark built on the CARLA simulator with 200 driving routes designed to test generalization. Routes come in pairs: one in-distribution scenario matching training conditions, and one shifted scenario that changes appearance (weather, lighting), layout (road structure), or behavior (traffic patterns), or that introduces a robustness challenge. This paired design isolates the specific effect of each type of shift.
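The paired design can be pictured as a small data structure in which exactly one factor differs between the two scenarios. The sketch below is illustrative only; the field names and schema are hypothetical, not Fail2Drive's actual format.

```python
# Hypothetical sketch of a paired-route record (invented schema,
# not the benchmark's real file format).
from dataclasses import dataclass

@dataclass
class RoutePair:
    route_id: int
    shift_type: str   # "appearance", "layout", "behavior", or "robustness"
    in_dist: dict     # scenario matching training conditions
    shifted: dict     # same route with one controlled change

pair = RoutePair(
    route_id=7,
    shift_type="appearance",
    in_dist={"weather": "clear_noon", "layout": "town_default"},
    shifted={"weather": "heavy_rain", "layout": "town_default"},
)

# Because only one factor differs, any performance gap between the two
# runs can be attributed to that specific shift.
changed = [k for k in pair.in_dist if pair.in_dist[k] != pair.shifted[k]]
print(changed)  # → ['weather']
```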
They evaluated multiple state-of-the-art closed-loop driving models across these scenarios and developed an open-source toolbox for creating new test scenarios. A privileged expert policy validates that all scenarios remain solvable.
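The solvability check can be sketched as a simple filter: a scenario is kept only if the privileged expert completes it. Here the expert is a stand-in stub, since actually running a policy in CARLA is out of scope; every name below is hypothetical.

```python
# Hypothetical sketch of scenario validation. `expert_completes` stands in
# for rolling out the privileged expert policy in simulation; it is not a
# real CARLA or Fail2Drive API.
def expert_completes(scenario: dict) -> bool:
    # Stub logic: treat scenarios flagged "blocked" as unsolvable.
    return not scenario.get("blocked", False)

candidates = [
    {"name": "rain_shift", "blocked": False},
    {"name": "impossible_detour", "blocked": True},
]

# Only scenarios the expert can solve make it into the benchmark,
# so model failures cannot be blamed on unsolvable routes.
validated = [s for s in candidates if expert_completes(s)]
print([s["name"] for s in validated])  # → ['rain_shift']
```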
Key findings
• All tested models showed significant performance drops, with success rates falling by an average of 22.8% on shifted scenarios relative to their in-distribution counterparts
• Models exhibited unexpected failure modes, including ignoring clearly visible objects in LiDAR data
• Systems failed to learn fundamental concepts such as distinguishing free space from occupied space
• Different types of distribution shifts caused varying degrees of degradation across models
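The headline 22.8% figure is an average over per-model drops. The sketch below shows one plausible way such an average is taken; the success rates are invented for illustration and are not the paper's actual numbers.

```python
# Illustrative arithmetic for an average success-rate drop across models.
# The rates are made up; only the averaging formula is the point.
results = {
    "model_a": {"in_dist": 0.70, "shifted": 0.45},
    "model_b": {"in_dist": 0.60, "shifted": 0.40},
}

# Per-model drop: in-distribution success rate minus shifted success rate.
drops = [r["in_dist"] - r["shifted"] for r in results.values()]
avg_drop = sum(drops) / len(drops)
print(f"{avg_drop:.1%}")  # → 22.5%
```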
Why it matters
Existing autonomous driving benchmarks often reuse training scenarios at test time, making it unclear whether success reflects genuine driving competence or simple memorization. This benchmark provides the first systematic way to measure true generalization in closed-loop driving, revealing that current state-of-the-art systems are far less robust than their performance on standard benchmarks suggests.
Caveats
The benchmark is limited to CARLA simulation scenarios, which may not capture all real-world complexities. The 17 scenario classes, while broad, cover only a subset of the distribution shifts autonomous vehicles might encounter. The study focuses on closed-loop evaluation but does not propose methods for improving generalization.