Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model

Bo-Kai Ruan1, Hao-Tang Tsui1, Yung-Hui Li2, Hong-Han Shuai1
1National Yang Ming Chiao Tung University, 2AI Research Center, Hon Hai Research Institute, Taiwan

The TTSG pipeline encompasses five primary stages: analysis, road candidate retrieval, agent planning, road ranking, and generation.

Abstract

Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm to solve these challenges. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore our proposed framework for flexible, interpretable, and safety-oriented simulation.
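The staged design described above can be sketched as a simple sequential pipeline. The function names, data structures, and fake stage outputs below are illustrative assumptions for exposition only, not the actual TTSG implementation:

```python
# Illustrative sketch of a text-to-traffic-scene pipeline.
# All names and structures here are hypothetical, not the TTSG code.

def analyze_prompt(prompt: str) -> dict:
    """Stage 1: parse the free-form description into a structured plan."""
    # A real system would query an LLM here; we return a mock parse.
    return {"ego_action": "turn_right",
            "agents": [{"type": "firetruck", "position": "left",
                        "action": "go_straight"}]}

def retrieve_road_candidates(plan: dict) -> list:
    """Stage 2: fetch roads whose topology can host the planned actions."""
    return [{"road_id": 1, "branches": 4}, {"road_id": 2, "branches": 3}]

def rank_roads(plan: dict, candidates: list) -> list:
    """Stage 3: plan-aware ranking -- prefer roads whose geometry
    supports every planned agent action (mocked as branch count)."""
    return sorted(candidates, key=lambda r: -r["branches"])

def plan_agents(plan: dict, road: dict) -> list:
    """Stage 4: bind each planned agent to the selected road."""
    return [dict(a, road_id=road["road_id"]) for a in plan["agents"]]

def generate_scene(prompt: str) -> dict:
    """Run the full pipeline and return the generated scene layout."""
    plan = analyze_prompt(prompt)
    road = rank_roads(plan, retrieve_road_candidates(plan))[0]
    return {"road": road, "agents": plan_agents(plan, road)}
```

Note that the ranking step runs after agent planning information is available, which is what lets the framework pick a road consistent with the agents' actions rather than relying on predefined spawn points.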

Video

Pipeline Animation


Environment

Agent's Actions

We support seven actions, as shown below. The blocking action is dynamic: a blocking agent continuously adjusts its trajectory to stay in front of the ego vehicle.
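One simple way to realize such a dynamic blocking behavior is to repeatedly steer the agent toward a point just ahead of the ego along its current heading. The 2-D waypoint math below is a hypothetical sketch, not the simulator's actual controller:

```python
import math

def blocking_waypoint(ego_xy, ego_heading_rad, gap=8.0):
    """Return a point `gap` meters ahead of the ego along its heading.

    An agent that keeps driving toward this point remains in front of
    the ego even as the ego turns. Purely illustrative; `gap` is an
    assumed following distance.
    """
    x, y = ego_xy
    return (x + gap * math.cos(ego_heading_rad),
            y + gap * math.sin(ego_heading_rad))
```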

action

Agent's Position

Agents can be spawned at various positions relative to the ego vehicle, as shown below.

position

Agent's Type

We support nine different types of agents as shown below.

Agent types: car, bus, truck, firetruck, ambulance, police car, motorcycle, cyclist, pedestrian

Signals and Objects

We support eleven different objects and three different signals as shown below.

Signals: traffic_light, stop, yield
Objects: speed_30, speed_40, speed_60, speed_90, parallel_open_crosswalk, ladder_crosswalk, continental_crosswalk, dashed_single_white, solid_single_white, stop_line, stop_sign_on_road
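The supported vocabulary above can be captured in a small configuration table, which a planner could use to validate parsed prompts. The dictionary and helper below are an illustrative sketch; the names are copied from the lists above:

```python
# Supported scene vocabulary, as listed on this page.
SCENE_VOCAB = {
    "agents": ["car", "bus", "truck", "firetruck", "ambulance",
               "police car", "motorcycle", "cyclist", "pedestrian"],
    "signals": ["traffic_light", "stop", "yield"],
    "objects": ["speed_30", "speed_40", "speed_60", "speed_90",
                "parallel_open_crosswalk", "ladder_crosswalk",
                "continental_crosswalk", "dashed_single_white",
                "solid_single_white", "stop_line", "stop_sign_on_road"],
}

def is_supported(category: str, name: str) -> bool:
    """Check whether a parsed entity belongs to the supported vocabulary."""
    return name in SCENE_VOCAB.get(category, [])
```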

Visualization

agent

position

action

others


Normal Scenario

A firetruck from the left road is coming when the ego car is turning right.

Daily traffic with more than ten cars.

Lots of cars, buses, trucks, and motorcycles are seen.


Critical Scenario

A pedestrian on the sidewalk is crossing the street in front of a truck stopped on the shoulder. Both are located on the front right.

A cyclist is crossing the street from a sidewalk in a dangerous way on a rainy night.

Some cars coming from the opposite direction are going straight while the ego car is turning left.


Conditional Selection

The ego car is going straight at the intersection with a traffic light. There are some puddles on the road.

The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from straight ahead is turning right.

A pedestrian is crossing the road with the parallel open crosswalk and the ego car is turning right.

Sequential Events

Event 1

The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from straight ahead is going straight.


Event 2

The ego car is being blocked by two cars in front.

Experiments

Critical Scenario

We show that our approach can also be used to train the agent under critical scenarios by selecting three challenging scenarios from SafeBench. "CR" denotes "Collision Rate" and "OS" denotes "Overall Score". We compare our results with Learning-to-Collide (LC) [1], AdvSim (AS) [2], Adversarial Trajectory Optimization (AT) [3], and ChatScene (CS) [4]. The best and second-best results in each column are highlighted.

Metric  Algo.  Straight Obstacle  Lane Changing  Unprotected Left-turn  Avg.
CR      LC     0.120              0.510          0.000                  0.210
        AS     0.230              0.530          0.050                  0.270
        AT     0.140              0.300          0.000                  0.150
        CS     0.030              0.110          0.100                  0.080
        Ours   0.021              0.085          0.000                  0.035
OS      LC     0.827              0.684          0.954                  0.822
        AS     0.784              0.666          0.937                  0.796
        AT     0.849              0.803          0.948                  0.867
        CS     0.905              0.906          0.903                  0.905
        Ours   0.895              0.894          0.953                  0.914
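The "Avg." column is the unweighted mean over the three scenarios. As a quick check, our reported collision-rate and overall-score averages can be reproduced from the per-scenario numbers:

```python
def average(values):
    """Unweighted mean over the per-scenario results."""
    return sum(values) / len(values)

# Our per-scenario results: Straight Obstacle, Lane Changing, Unprotected Left-turn.
ours_cr = [0.021, 0.085, 0.000]  # collision rate
ours_os = [0.895, 0.894, 0.953]  # overall score

print(round(average(ours_cr), 3))  # 0.035
print(round(average(ours_os), 3))  # 0.914
```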

Caption Driving

We demonstrate how our pipeline can be used to train a captioning model on critical scenarios. We report BLEU (B), METEOR (M), and CIDEr (C) for narration and reasoning.

            Narration            Reasoning
            B     M     C        B     M     C
ADAPT [5]   4.8   13.5  15.2     0.0   10.0  18.4
+ ours      9.9   18.4  52.9     7.2   11.2  51.9

Ablation Study

We evaluate the efficacy of the framework's components by generating ten distinct prompts, each tailored to test the components under diverse scenarios: three normal, four critical, and three with specific road conditions, such as the presence of traffic lights or the absence of crossroads. "AA" denotes "Agent Accuracy", "RA" denotes "Road Accuracy", and "SA" denotes "Scene Accuracy". Prompts for the ablation study can be found here.

Plan Quality    Normal         Critical       Conditional    Avg.
                AA     RA      AA     RA      AA     RA      AA     RA
w/o analysis    0.917  0.667   0.833  0.750   0.750  0.917   0.833  0.775
w/  analysis    0.917  1.000   0.875  0.750   1.000  0.917   0.925  0.875

Scene Quality   SA             SA             SA             SA
w/o ranking     0.667          0.450          0.600          0.560
w/  ranking     0.867          0.750          0.800          0.800

Diversity Test

We provide a diversity test to evaluate the diversity of the generated traffic scenes. We create eight different scenes and generate each scene five times. We report "Agent Diversity" (AD), "Road Diversity" (RD), and "Scene Accuracy" (SA). Prompts for the diversity test can be found here.

        Normal                       Critical                                               Conditional
Metric  Daily     Intersection       Pedestrian  Blocking  Dangerous       Only Two-wheel  Emergency  Rainy      Avg.
        Traffic                      Crushing    Agent     Cut-off         Vehicles        Vehicles   Weather
AD      0.789     0.833              0.500       0.750     0.600           0.714           0.500      0.800      0.686
RD      1.000     1.000              1.000       1.000     1.000           0.800           1.000      1.000      0.975
SA      1.000     0.800              0.400       0.800     0.800           0.600           1.000      1.000      0.800
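One plausible way to score diversity over repeated generations of the same prompt is the fraction of unique outcomes among the runs. This metric definition is an assumption for illustration; the paper's exact computation may differ:

```python
def diversity(runs):
    """Fraction of unique outcomes across repeated generations.

    `runs` is a list of hashable scene descriptors (e.g. road ids or
    tuples of agent types). Hypothetical metric, shown for intuition.
    """
    return len(set(runs)) / len(runs)

# Five runs of one prompt yielding four distinct road layouts:
print(diversity(["r1", "r2", "r2", "r3", "r4"]))  # 0.8
```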

Reference

[1] W. Ding, B. Chen, M. Xu, and D. Zhao, "Learning to collide: An adaptive safety-critical scenarios generating method," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 2243–2250.

[2] J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun, "AdvSim: Generating safety-critical scenarios for self-driving vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9909–9918.

[3] Q. Zhang, S. Hu, J. Sun, Q. A. Chen, and Z. M. Mao, "On adversarial robustness of trajectory prediction for autonomous vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15159–15168.

[4] J. Zhang, C. Xu, and B. Li, "ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15459–15469.

[5] B. Jin, X. Liu, Y. Zheng, P. Li, H. Zhao, T. Zhang, ... and J. Liu, "ADAPT: Action-aware driving caption transformer," in IEEE International Conference on Robotics and Automation, 2023, pp. 7554–7561.

BibTeX

@article{ruan2024ttsg,
  title={Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model},
  author={Ruan, Bo-Kai and Tsui, Hao-Tang and Li, Yung-Hui and Shuai, Hong-Han},
  journal={arXiv preprint arXiv:2409.09575},
  year={2024}
}