Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model

Bo-Kai Ruan1, Hao-Tang Tsui1, Yung-Hui Li2, Hong-Han Shuai1
1National Yang Ming Chiao Tung University, 2AI Research Center, Hon Hai Research Institute, Taiwan

The TTSG pipeline encompasses five primary stages: analysis, road candidate retrieval, agent planning, road ranking, and generation.

Abstract

Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm to solve these challenges. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore our proposed framework for flexible, interpretable, and safety-oriented simulation.
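The staged design described above can be sketched as a simple sequential pipeline. The function names, data structures, and fake stage outputs below are illustrative assumptions for exposition only, not the actual TTSG implementation:

```python
# Illustrative sketch of a text-to-traffic-scene pipeline.
# All names and structures here are hypothetical, not the TTSG code.

def analyze_prompt(prompt: str) -> dict:
    """Stage 1: parse the free-form description into a structured plan."""
    # A real system would query an LLM here; we return a mock parse.
    return {"ego_action": "turn_right",
            "agents": [{"type": "firetruck", "position": "left",
                        "action": "go_straight"}]}

def retrieve_road_candidates(plan: dict) -> list:
    """Stage 2: fetch roads whose topology can host the planned actions."""
    return [{"road_id": 1, "branches": 4}, {"road_id": 2, "branches": 3}]

def rank_roads(plan: dict, candidates: list) -> list:
    """Stage 3: plan-aware ranking -- prefer roads whose geometry
    supports every planned agent action (mocked as branch count)."""
    return sorted(candidates, key=lambda r: -r["branches"])

def plan_agents(plan: dict, road: dict) -> list:
    """Stage 4: bind each planned agent to the selected road."""
    return [dict(a, road_id=road["road_id"]) for a in plan["agents"]]

def generate_scene(prompt: str) -> dict:
    """Run the full pipeline and return the generated scene layout."""
    plan = analyze_prompt(prompt)
    road = rank_roads(plan, retrieve_road_candidates(plan))[0]
    return {"road": road, "agents": plan_agents(plan, road)}
```

Note that the ranking step runs after agent planning information is available, which is what lets the framework pick a road consistent with the agents' actions rather than relying on predefined spawn points.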

Video

Pipeline Animation


Environment

Agent's Actions

We support seven actions, as shown below. The blocking action is dynamic: a blocking agent continuously adjusts its trajectory to stay in front of the ego vehicle.
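One simple way to realize such a dynamic blocking behavior is to repeatedly steer the agent toward a point just ahead of the ego along its current heading. The 2-D waypoint math below is a hypothetical sketch, not the simulator's actual controller:

```python
import math

def blocking_waypoint(ego_xy, ego_heading_rad, gap=8.0):
    """Return a point `gap` meters ahead of the ego along its heading.

    An agent that keeps driving toward this point remains in front of
    the ego even as the ego turns. Purely illustrative; `gap` is an
    assumed following distance.
    """
    x, y = ego_xy
    return (x + gap * math.cos(ego_heading_rad),
            y + gap * math.sin(ego_heading_rad))
```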

action

Agent's Position

Agents can be spawned at various positions relative to the ego vehicle, as shown below.

position

Agent's Type

We support nine different types of agents as shown below.

Agent types: car, bus, truck, firetruck, ambulance, police car, motorcycle, cyclist, pedestrian

Signals and Objects

We support eleven different objects and three different signals as shown below.

Signals: traffic_light, stop, yield
Objects: speed_30, speed_40, speed_60, speed_90, parallel_open_crosswalk, ladder_crosswalk, continental_crosswalk, dashed_single_white, solid_single_white, stop_line, stop_sign_on_road
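The supported vocabulary above can be captured in a small configuration table, which a planner could use to validate parsed prompts. The dictionary and helper below are an illustrative sketch; the names are copied from the lists above:

```python
# Supported scene vocabulary, as listed on this page.
SCENE_VOCAB = {
    "agents": ["car", "bus", "truck", "firetruck", "ambulance",
               "police car", "motorcycle", "cyclist", "pedestrian"],
    "signals": ["traffic_light", "stop", "yield"],
    "objects": ["speed_30", "speed_40", "speed_60", "speed_90",
                "parallel_open_crosswalk", "ladder_crosswalk",
                "continental_crosswalk", "dashed_single_white",
                "solid_single_white", "stop_line", "stop_sign_on_road"],
}

def is_supported(category: str, name: str) -> bool:
    """Check whether a parsed entity belongs to the supported vocabulary."""
    return name in SCENE_VOCAB.get(category, [])
```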

Visualization

agent

position

action

others


Normal Scenario

A firetruck from the left road is coming when the ego car is turning right.

Daily traffic with more than ten cars.

Lots of cars, buses, trucks, and motorcycles are seen.


Critical Scenario

A pedestrian on the sidewalk is crossing the street in front of a truck stopped on the shoulder. Both are located on the front right.

A cyclist is crossing the street from a sidewalk in a dangerous way on a rainy night.

Some cars coming from the opposite direction are going straight while the ego car is turning left.


Conditional Selection

The ego car is going straight at the intersection with a traffic light. There are some puddles on the road.

The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from straight ahead is turning right.

A pedestrian is crossing the road with the parallel open crosswalk and the ego car is turning right.

Sequential Events

Event 1

The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from straight ahead is going straight.


Event 2

The ego car is being blocked by two cars in front.

Experiments

Critical Scenario

We show that our approach can also be used to train the agent under critical scenarios by selecting three challenging scenarios from SafeBench. "CR" denotes "Collision Rate" and "OS" denotes "Overall Score". We compare our results with Learning-to-Collide (LC) [1], AdvSim (AS) [2], Adversarial Trajectory Optimization (AT) [3], and ChatScene (CS) [4]. The best and second-best results in each column are highlighted.

Metric  Algo.  Straight Obstacle  Lane Changing  Unprotected Left-turn  Avg.
CR      LC     0.120              0.510          0.000                  0.210
        AS     0.230              0.530          0.050                  0.270
        AT     0.140              0.300          0.000                  0.150
        CS     0.030              0.110          0.100                  0.080
        Ours   0.021              0.085          0.000                  0.035
OS      LC     0.827              0.684          0.954                  0.822
        AS     0.784              0.666          0.937                  0.796
        AT     0.849              0.803          0.948                  0.867
        CS     0.905              0.906          0.903                  0.905
        Ours   0.895              0.894          0.953                  0.914
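The "Avg." column is the unweighted mean over the three scenarios. As a quick check, our reported collision-rate and overall-score averages can be reproduced from the per-scenario numbers:

```python
def average(values):
    """Unweighted mean over the per-scenario results."""
    return sum(values) / len(values)

# Our per-scenario results: Straight Obstacle, Lane Changing, Unprotected Left-turn.
ours_cr = [0.021, 0.085, 0.000]  # collision rate
ours_os = [0.895, 0.894, 0.953]  # overall score

print(round(average(ours_cr), 3))  # 0.035
print(round(average(ours_os), 3))  # 0.914
```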

Caption Driving

We demonstrate how our pipeline can be used to train a captioning model on critical scenarios. We report BLEU (B), METEOR (M), and CIDEr (C) for narration and reasoning.

            Narration            Reasoning
            B     M     C        B     M     C
ADAPT [5]   4.8   13.5  15.2     0.0   10.0  18.4
+ ours      9.9   18.4  52.9     7.2   11.2  51.9

Ablation Study

We evaluate the efficacy of the framework's components by generating ten distinct prompts, each tailored to test the components under diverse scenarios: three normal, four critical, and three with specific road conditions, such as the presence of traffic lights or the absence of crossroads. "AA" denotes "Agent Accuracy", "RA" denotes "Road Accuracy", and "SA" denotes "Scene Accuracy". Prompts for the ablation study can be found here.

Plan Quality    Normal         Critical       Conditional    Avg.
                AA     RA      AA     RA      AA     RA      AA     RA
w/o analysis    0.917  0.667   0.833  0.750   0.750  0.917   0.833  0.775
w/  analysis    0.917  1.000   0.875  0.750   1.000  0.917   0.925  0.875

Scene Quality   SA             SA             SA             SA
w/o ranking     0.667          0.450          0.600          0.560
w/  ranking     0.867          0.750          0.800          0.800

Diversity Test

We provide a diversity test to evaluate the diversity of the generated traffic scenes. We create eight different scenes and generate each scene five times. We report "Agent Diversity" (AD), "Road Diversity" (RD), and "Scene Accuracy" (SA). Prompts for the diversity test can be found here.

        Normal                       Critical                                               Conditional
Metric  Daily     Intersection       Pedestrian  Blocking  Dangerous       Only Two-wheel  Emergency  Rainy      Avg.
        Traffic                      Crushing    Agent     Cut-off         Vehicles        Vehicles   Weather
AD      0.789     0.833              0.500       0.750     0.600           0.714           0.500      0.800      0.686
RD      1.000     1.000              1.000       1.000     1.000           0.800           1.000      1.000      0.975
SA      1.000     0.800              0.400       0.800     0.800           0.600           1.000      1.000      0.800
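One plausible way to score diversity over repeated generations of the same prompt is the fraction of unique outcomes among the runs. This metric definition is an assumption for illustration; the paper's exact computation may differ:

```python
def diversity(runs):
    """Fraction of unique outcomes across repeated generations.

    `runs` is a list of hashable scene descriptors (e.g. road ids or
    tuples of agent types). Hypothetical metric, shown for intuition.
    """
    return len(set(runs)) / len(runs)

# Five runs of one prompt yielding four distinct road layouts:
print(diversity(["r1", "r2", "r2", "r3", "r4"]))  # 0.8
```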

Reference

[1] W. Ding, B. Chen, M. Xu, and D. Zhao, "Learning to collide: An adaptive safety-critical scenarios generating method," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 2243–2250.

[2] J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun, "AdvSim: Generating safety-critical scenarios for self-driving vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9909–9918.

[3] Q. Zhang, S. Hu, J. Sun, Q. A. Chen, and Z. M. Mao, "On adversarial robustness of trajectory prediction for autonomous vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15159–15168.

[4] J. Zhang, C. Xu, and B. Li, "ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15459–15469.

[5] B. Jin, X. Liu, Y. Zheng, P. Li, H. Zhao, T. Zhang, ... and J. Liu, "ADAPT: Action-aware driving caption transformer," in IEEE International Conference on Robotics and Automation, 2023, pp. 7554–7561.

BibTeX

@article{ruan2024ttsg,
  title={Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model},
  author={Ruan, Bo-Kai and Tsui, Hao-Tang and Li, Yung-Hui and Shuai, Hong-Han},
  journal={arXiv preprint arXiv:2409.09575},
  year={2024}
}