Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model

Bo-Kai Ruan1, Hao-Tang Tsui1, Yung-Hui Li2, Hong-Han Shuai1
1National Yang Ming Chiao Tung University, 2AI Research Center, Hon Hai Research Institute, Taiwan

The TTSG pipeline comprises five primary stages: analysis, road candidate retrieval, agent planning, road ranking, and generation.

Abstract

In this work, we introduce a text-to-traffic-scene framework that uses a large language model to generate diverse traffic scenes in the CARLA simulator from natural language descriptions. Existing text-to-scene methods typically generate critical scenarios along a few fixed paths, which greatly reduces environmental diversity and limits the flexibility of customization. In contrast, our approach uses a common structured output format that enables flexible generation of a wide range of traffic scenarios. Users can specify parameters such as weather conditions, vehicle types, and road signals as generation conditions. Importantly, our model requires no predetermined location or trajectory: it autonomously selects the starting point and scenario details from the user's input and generates scenes from scratch. Our framework also produces not only critical scenarios but everyday traffic scenes, enhancing its utility. We demonstrate that it provides diverse agent planning and road selection, and that it facilitates training autonomous agents in critical traffic scenarios with comparable or superior performance.
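To make the structured output concrete, here is a minimal, hypothetical sketch of what such a scene specification could look like. All field names are invented for illustration; the framework's actual schema may differ.

```python
import json

# Hypothetical scene specification sketching the kind of structured output
# an LLM could emit for the simulator to consume. Field names are
# illustrative only, not the framework's actual schema.
scene = {
    "weather": "rainy",
    "road": {"type": "intersection", "signal": "traffic_light"},
    "ego": {"action": "turn_right"},
    "agents": [
        {"type": "firetruck", "position": "left", "action": "go_straight"},
    ],
}

print(json.dumps(scene, indent=2))
```

Because the specification is plain JSON-like data, it is straightforward to validate, store, and replay independently of the language model that produced it.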

Video

Pipeline Animation


Environment

Agent's Actions

We support seven actions, shown below. The blocking action is dynamic: the agent continually adjusts its trajectory to block the ego vehicle.

(Figure: supported agent actions)

Agent's Position

The supported agent positions are shown below.

(Figure: supported agent positions)

Agent's Type

We support nine different types of agents as shown below.

Agent types: car, bus, truck, firetruck, ambulance, police car, motorcycle, cyclist, pedestrian.

Signals and Objects

We support eleven different objects and three different signals, listed below.

Signals: traffic_light, stop, yield

Objects: speed_30, speed_40, speed_60, speed_90, parallel_open_crosswalk, ladder_crosswalk, continental_crosswalk, dashed_single_white, solid_single_white, stop_line, stop_sign_on_road
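The supported identifiers above can serve as a small validation table when parsing a user's request. A minimal sketch, using the exact names from the lists above (the `is_supported` helper is hypothetical, not part of the framework):

```python
# Supported identifiers, taken verbatim from the lists above.
SIGNALS = {"traffic_light", "stop", "yield"}
OBJECTS = {
    "speed_30", "speed_40", "speed_60", "speed_90",
    "parallel_open_crosswalk", "ladder_crosswalk", "continental_crosswalk",
    "dashed_single_white", "solid_single_white", "stop_line",
    "stop_sign_on_road",
}

def is_supported(name: str) -> bool:
    """Check whether a requested signal/object identifier is supported."""
    return name in SIGNALS or name in OBJECTS

print(is_supported("stop_line"))  # → True
print(is_supported("speed_120"))  # → False
```

Rejecting unknown identifiers early keeps malformed LLM output from reaching the simulator.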

Visualization

(Visualization videos, grouped by agent type, position, action, and others.)


Normal Scenario

A firetruck from the left road is coming when the ego car is turning right.

Daily traffic with more than ten cars.

Lots of cars, buses, trucks, and motorcycles are seen.


Critical Scenario

A pedestrian on the sidewalk is crossing the street in front of a truck stopping on the shoulder. Both are located on the front right.

A cyclist is crossing the street from a sidewalk in a dangerous way on a rainy night.

Some cars from the opposite direction are coming straight while the ego car is turning left.


Conditional Selection

The ego car is going straight at the intersection with a traffic light. There are some puddles on the road.

The ego car is turning left at the intersection with no traffic light, stop sign, or stop sign on road. A car coming from the straight is turning right.

A pedestrian is crossing the road with the parallel open crosswalk and the ego car is turning right.

Experiments

Diversity Test

We provide a diversity test to evaluate the variety of the generated traffic scenes. We construct eight different scenes and run each scene five times, reporting "Agent Diversity" (AD), "Road Diversity" (RD), and "Text Matching" (TM). Prompts for the diversity test can be found here.

The first two scenes are normal, the next three critical, and the last three conditional.

| Metric | Daily Traffic | Intersection | Pedestrian Crushing | Blocking Agent | Dangerous Cut-off | Only Two-wheel Vehicles | Emergency Vehicles | Rainy Weather | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| AD | 0.789 | 0.833 | 0.500 | 0.750 | 0.600 | 0.714 | 0.500 | 0.800 | 0.686 |
| RD | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.800 | 1.000 | 1.000 | 0.975 |
| TM | 1.000 | 0.800 | 0.400 | 0.800 | 0.800 | 0.600 | 1.000 | 1.000 | 0.800 |
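One plausible reading of a diversity score such as AD is the fraction of distinct plans observed across the repeated runs of a scene. A sketch under that assumption (this is not necessarily the paper's exact definition):

```python
def diversity(outcomes):
    """Fraction of distinct outcomes across repeated runs of one scene."""
    return len(set(outcomes)) / len(outcomes)

# Five runs of one scene producing three distinct agent plans.
print(diversity(["plan_a", "plan_b", "plan_a", "plan_c", "plan_b"]))  # → 0.6
```

Under this reading, a score of 1.0 means every run of a scene produced a different plan.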

Critical Scenario

We show that our approach can also train agents under critical scenarios, using three challenging scenarios selected from SafeBench. "CR" denotes "Collision Rate" and "OS" denotes "Overall Score". We compare against Learning-to-Collide (LC) [1], AdvSim (AS) [2], Adversarial Trajectory Optimization (AT) [3], and ChatScene (CS) [4]. The best and second-best results in each column are highlighted.

| Algo. | Metric | Straight Obstacle | Lane Changing | Unprotected Left-turn | Avg. |
|---|---|---|---|---|---|
| LC | CR | 0.120 | 0.510 | 0.000 | 0.210 |
| AS | CR | 0.230 | 0.530 | 0.050 | 0.270 |
| AT | CR | 0.140 | 0.300 | 0.000 | 0.150 |
| CS | CR | 0.030 | 0.110 | 0.100 | 0.080 |
| Ours | CR | 0.000 | 0.020 | 0.000 | 0.067 |
| LC | OS | 0.827 | 0.684 | 0.954 | 0.822 |
| AS | OS | 0.784 | 0.666 | 0.937 | 0.796 |
| AT | OS | 0.849 | 0.803 | 0.948 | 0.867 |
| CS | OS | 0.905 | 0.906 | 0.903 | 0.905 |
| Ours | OS | 0.955 | 0.861 | 0.951 | 0.922 |
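Collision Rate is conventionally the fraction of evaluation episodes that end in a collision, and the Avg. column is a mean over the three scenarios. A sketch of that arithmetic, assuming this standard definition, using the LC row as an example:

```python
def collision_rate(collisions: int, episodes: int) -> float:
    """Fraction of evaluation episodes that ended in a collision."""
    return collisions / episodes

# Averaging the LC row's per-scenario CR values reproduces its Avg. entry.
lc_cr = [0.120, 0.510, 0.000]
print(round(sum(lc_cr) / len(lc_cr), 2))  # → 0.21
```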

Ablation Study

We evaluate the efficacy of the framework components by generating ten distinct prompts, each tailored to test the components under diverse scenarios. These scenarios encompass three normal, four critical, and three specific road conditions, such as the presence of traffic lights or the absence of crossroads. "AA" denotes "Agent Accuracy", "RA" denotes "Road Accuracy", and "TM" denotes "Text Matching". Prompts for the ablation study can be found here.

| Plan Quality | Normal AA | Normal RA | Critical AA | Critical RA | Conditional AA | Conditional RA | Avg. AA | Avg. RA |
|---|---|---|---|---|---|---|---|---|
| w/o analysis | 0.917 | 0.667 | 0.833 | 0.750 | 0.750 | 0.917 | 0.833 | 0.775 |
| w/ analysis | 0.917 | 1.000 | 0.875 | 0.750 | 1.000 | 0.917 | 0.925 | 0.875 |

| Scene Quality | Normal TM | Critical TM | Conditional TM | Avg. TM |
|---|---|---|---|---|
| w/o ranking | 0.667 | 0.450 | 0.600 | 0.560 |
| w/ ranking | 0.867 | 0.750 | 0.800 | 0.800 |

Reference

[1] W. Ding, B. Chen, M. Xu, and D. Zhao, “Learning to collide: An adaptive safety-critical scenarios generating method,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 2243–2250.

[2] J. Wang, A. Pun, J. Tu, S. Manivasagam, A. Sadat, S. Casas, M. Ren, and R. Urtasun, “AdvSim: Generating safety-critical scenarios for self-driving vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9909–9918.

[3] Q. Zhang, S. Hu, J. Sun, Q. A. Chen, and Z. M. Mao, “On adversarial robustness of trajectory prediction for autonomous vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15159–15168.

[4] J. Zhang, C. Xu, and B. Li, “ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15459–15469.

BibTeX

@article{ruan2024ttsg,
  title={Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model},
  author={Ruan, Bo-Kai and Tsui, Hao-Tang and Li, Yung-Hui and Shuai, Hong-Han},
  journal={arXiv preprint arXiv:2409.09575},
  year={2024}
}