Agent's Actions
We support seven actions, as shown below. The blocking action is dynamic: the blocking agent always tries to block the ego vehicle.
Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore the value of our proposed framework for flexible, interpretable, and safety-oriented simulation.
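To make the idea of plan-aware road ranking more concrete, here is a toy sketch in Python. It is not TTSG's actual algorithm: the `Road` fields, the requirement strings, and the coverage-based score are all hypothetical, chosen only to illustrate ranking candidate roads by how well they support what the planned agents need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Road:
    """Hypothetical candidate road and the features it offers."""
    road_id: str
    features: frozenset


def coverage(road: Road, requirements: set) -> float:
    """Fraction of the plan's requirements that this road satisfies."""
    if not requirements:
        return 1.0
    return len(requirements & road.features) / len(requirements)


def rank_roads(roads, requirements):
    """Sort candidates from highest to lowest coverage (ties keep input order)."""
    return sorted(roads, key=lambda r: coverage(r, requirements), reverse=True)


# Hypothetical requirements derived from an agent plan such as
# "a firetruck comes from the left road while the ego car turns right".
requirements = {"left_road", "right_turn_lane"}

candidates = [
    Road("straight_segment", frozenset({"sidewalk"})),
    Road("t_junction", frozenset({"left_road", "sidewalk"})),
    Road("four_way", frozenset({"left_road", "right_turn_lane", "traffic_light"})),
]

ranked = rank_roads(candidates, requirements)
best = ranked[0]
```

In this toy setup the road that satisfies every requirement is ranked first, and roads that satisfy none fall to the bottom. A real system would also need to check geometric feasibility, which this sketch deliberately leaves out.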
We support nine different types of agents as shown below.
Agent Type | |
---|---|---
car | bus | truck
firetruck | ambulance | police car
motorcycle | cyclist | pedestrian
We support ten different objects and three different signals as shown below.
Category | Supported names
---|---
signals | traffic_light, stop, yield
objects | speed_30, speed_40, speed_60, speed_90, parallel_open_crosswalk, ladder_crosswalk, continental_crosswalk, dashed_single_white, solid_single_white, stop_line, stop_sign_on_road
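The catalogs above map naturally onto a small lookup structure. The sketch below is one assumed way to validate names produced by a planner; only the identifiers come from the tables, while `classify` and the set names are hypothetical. Writing agent types with underscores (for example `police_car`) is also an assumption.

```python
# Identifiers copied from the tables; the container names are illustrative.
AGENT_TYPES = {
    "car", "bus", "truck",
    "firetruck", "ambulance", "police_car",
    "motorcycle", "cyclist", "pedestrian",
}
SIGNALS = {"traffic_light", "stop", "yield"}
OBJECTS = {
    "speed_30", "speed_40", "speed_60", "speed_90",
    "parallel_open_crosswalk", "ladder_crosswalk", "continental_crosswalk",
    "dashed_single_white", "solid_single_white",
    "stop_line", "stop_sign_on_road",
}


def classify(name: str) -> str:
    """Return the catalog a name belongs to, or raise if it is unsupported."""
    key = name.strip().lower().replace(" ", "_")
    if key in AGENT_TYPES:
        return "agent"
    if key in SIGNALS:
        return "signal"
    if key in OBJECTS:
        return "object"
    raise ValueError(f"unsupported element: {name!r}")
```

Rejecting unknown names early keeps a free-form plan from referring to assets that the simulator cannot spawn.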
A firetruck from the left road is coming when the ego car is turning right.
Daily traffic with more than ten cars.
Lots of cars, buses, trucks, and motorcycles are seen.
A pedestrian on the sidewalk is crossing the street in front of a truck stopping on the shoulder. Both are located on the front right.
A cyclist is crossing the street from a sidewalk in a dangerous way on a rainy night.
Some cars from the opposite straight is coming when the ego car is turning left.
The ego car is going straight at the intersection with a traffic light. There are some puddles on the road.
The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from the straight is turning right.
A pedestrian is crossing the road with the parallel open crosswalk and the ego car is turning right.
The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from the straight is going straight.
The ego car is being blocked by two cars in front.
We show that our approach can also be used to train the agent under critical scenarios by selecting three challenging scenarios from SafeBench. "CR" denotes "Collision Rate" and "OS" denotes "Overall Score". We compare our results with Learning-to-Collide (LC) [1], AdvSim (AS) [2], Adversarial Trajectory Optimization (AT) [4], and ChatScene (CS) [3]. We color the best and second-best results in each column.
Algo. | Metric | Straight Obstacle | Lane Changing | Unprotected Left-turn | Avg.
---|---|---|---|---|---
LC | CR↓ | 0.120 | 0.510 | 0.000 | 0.210
AS | CR↓ | 0.230 | 0.530 | 0.050 | 0.270
AT | CR↓ | 0.140 | 0.300 | 0.000 | 0.150
CS | CR↓ | 0.030 | 0.110 | 0.100 | 0.080
Ours | CR↓ | 0.021 | 0.085 | 0.000 | 0.035
LC | OS↑ | 0.827 | 0.684 | 0.954 | 0.822
AS | OS↑ | 0.784 | 0.666 | 0.937 | 0.796
AT | OS↑ | 0.849 | 0.803 | 0.948 | 0.867
CS | OS↑ | 0.905 | 0.906 | 0.903 | 0.905
Ours | OS↑ | 0.895 | 0.894 | 0.953 | 0.914
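As a sanity check, the Avg. column for collision rate can be reproduced as the plain mean of the three scenario columns, up to rounding (the baseline averages appear to be reported to two decimals). The snippet copies the collision rates from the table; the variable and function names are illustrative.

```python
# Collision rates per scenario: Straight Obstacle, Lane Changing, Unprotected Left-turn.
collision_rate = {
    "LC": [0.120, 0.510, 0.000],
    "AS": [0.230, 0.530, 0.050],
    "AT": [0.140, 0.300, 0.000],
    "CS": [0.030, 0.110, 0.100],
    "Ours": [0.021, 0.085, 0.000],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Round to three decimals, the precision used in the table.
averages = {algo: round(mean(v), 3) for algo, v in collision_rate.items()}

# Lower collision rate is better, so the best method has the smallest average.
best = min(averages, key=averages.get)
```

The average for "Ours" comes out to 0.035, which is the 3.5% collision rate quoted in the abstract.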
We demonstrate how our pipeline can be used to train a captioning model under critical scenarios. We report "BLEU" (B), "METEOR" (M), and "CIDEr" (C) for narration and reasoning.
Method | Narration B | Narration M | Narration C | Reasoning B | Reasoning M | Reasoning C
---|---|---|---|---|---|---
ADAPT [5] | 4.8 | 13.5 | 15.2 | 0.0 | 10.0 | 18.4
+ ours | 9.9 | 18.4 | 52.9 | 7.2 | 11.2 | 51.9
We evaluate the efficacy of the framework components using ten distinct prompts, each tailored to test the components under diverse scenarios. The prompts cover three normal scenarios, four critical scenarios, and three scenarios with specific road conditions, such as the presence of traffic lights or the absence of crossroads. "AA" denotes "Agent Accuracy", "RA" denotes "Road Accuracy", and "SA" denotes "Scene Accuracy". Prompts for the ablation study can be found here.
Plan Quality | Normal AA↑ | Normal RA↑ | Critical AA↑ | Critical RA↑ | Conditional AA↑ | Conditional RA↑ | Avg. AA↑ | Avg. RA↑
---|---|---|---|---|---|---|---|---
w/o analysis | 0.917 | 0.667 | 0.833 | 0.750 | 0.750 | 0.917 | 0.833 | 0.775
w/ analysis | 0.917 | 1.000 | 0.875 | 0.750 | 1.000 | 0.917 | 0.925 | 0.875

Scene Quality | Normal SA↑ | Critical SA↑ | Conditional SA↑ | Avg. SA↑
---|---|---|---|---
w/o ranking | 0.667 | 0.450 | 0.600 | 0.560
w/ ranking | 0.867 | 0.750 | 0.800 | 0.800
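The Avg. columns appear to be weighted by the number of prompts per category (three normal, four critical, three conditional) rather than being a plain mean of the three category scores. This is an inference from the reported numbers, not a formula stated in the text, and `weighted_avg` is a hypothetical helper used only for the check.

```python
# Number of prompts per category: normal, critical, conditional.
PROMPTS_PER_CATEGORY = (3, 4, 3)


def weighted_avg(scores, weights=PROMPTS_PER_CATEGORY):
    """Average of per-category scores, weighted by prompt count."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# (normal, critical, conditional) scores and the reported Avg. value.
rows = {
    "AA w/o analysis": ((0.917, 0.833, 0.750), 0.833),
    "RA w/o analysis": ((0.667, 0.750, 0.917), 0.775),
    "AA w/ analysis": ((0.917, 0.875, 1.000), 0.925),
    "RA w/ analysis": ((1.000, 0.750, 0.917), 0.875),
    "SA w/o ranking": ((0.667, 0.450, 0.600), 0.560),
    "SA w/ ranking": ((0.867, 0.750, 0.800), 0.800),
}

# Largest gap between the recomputed and the reported averages.
max_gap = max(abs(weighted_avg(s) - reported) for s, reported in rows.values())
```

Under this weighting every reported average is reproduced to within rounding, whereas a plain mean of the "SA w/o ranking" row would give roughly 0.572 instead of 0.560.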
We run a diversity test to evaluate the diversity of the generated traffic scenes. We create eight different scenes and test each scene five times. We report "Agent Diversity" (AD), "Road Diversity" (RD), and "Scene Accuracy" (SA). Prompts for the diversity test can be found here.
Category | Scenario | AD↑ | RD↑ | SA↑
---|---|---|---|---
Normal | Daily Traffic | 0.789 | 1.000 | 1.000
Normal | Intersection | 0.833 | 1.000 | 0.800
Critical | Pedestrian Crushing | 0.500 | 1.000 | 0.400
Critical | Blocking Agent | 0.750 | 1.000 | 0.800
Critical | Dangerous Cut-off | 0.600 | 1.000 | 0.800
Conditional | Only having Two-wheel Vehicles | 0.714 | 0.800 | 0.600
Conditional | Having Emergency Vehicles | 0.500 | 1.000 | 1.000
Conditional | Rainy Weather | 0.800 | 1.000 | 1.000
Avg. | | 0.686 | 0.975 | 0.800
[1] W. Ding, B. Chen, M. Xu, and D. Zhao, "Learning to collide: An adaptive safety-critical scenarios generating method," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 2243–2250.
[2] J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun, "Advsim: Generating safety-critical scenarios for self-driving vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9909–9918.
[3] J. Zhang, C. Xu, and B. Li, "Chatscene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15459–15469.
[4] Q. Zhang, S. Hu, J. Sun, Q. A. Chen, and Z. M. Mao, "On adversarial robustness of trajectory prediction for autonomous vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15159–15168.
[5] B. Jin, X. Liu, Y. Zheng, P. Li, H. Zhao, T. Zhang, ..., and J. Liu, "ADAPT: Action-aware driving caption transformer," in International Conference on Robotics and Automation, 2023, pp. 7554–7561.
@article{ruan2024ttsg,
title={Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model},
author={Ruan, Bo-Kai and Tsui, Hao-Tang and Li, Yung-Hui and Shuai, Hong-Han},
journal={arXiv preprint arXiv:2409.09575},
year={2024}
}