LLM/VLLM Agent in Autonomous Driving: integrates varied input modalities from different driving environments to perform specific tasks. Depending on the function, the model outputs perception, planning, or control signals, leveraging the reasoning and cognitive capabilities of the LLM through question answering, counterfactual reasoning, and description.
Authors: Nitin Dwivedi, Pranav Singh Chib, Pravendra Singh
The ongoing advancements in 3D perception and the integration of Large Language Models (LLMs) in Autonomous Driving (AD) have significantly enhanced the capabilities of intelligent vehicles, building on the achievements of 2D perception in tasks like object detection, scene analysis, and language-guided reasoning. This paper presents a comprehensive survey of 3D perception-based LLM agents, examining their methodologies, applications, and potential to revolutionize autonomous systems. Uniquely, our work bridges a critical gap in the literature by offering the first meta-analysis exclusively focused on the synergy between 3D perception and LLMs, addressing emerging challenges such as 3D tokenization, spatial reasoning, and computational scalability. Unlike prior surveys centered on 2D tasks, this study provides an in-depth exploration of 3D-specific advancements, highlighting the transformative potential of these systems. The paper highlights the significance of this integration for fostering safer, human-centric AD, identifying opportunities to overcome current limitations, and driving innovation in intelligent mobility solutions.
This repo contains a curated list of resources on 3D LLM-based autonomous driving research, arranged chronologically. We regularly update it with the latest papers and their corresponding open-source implementations.
- LLM driving Agents
- 3D Perception based LLM driving Agents
- Generative World Models
- Language-Based AD Datasets (QA, Captioning, etc.)
MTD-GPT: A Multi-Task Decision-Making GPT Model for Autonomous Driving at Unsignalized Intersections [ITSC-2023]
Jiaqi Liu, Peng Hang, Xiao Qi, Jianqiang Wang, Jian Sun
- LLM Arch: GPT-2
- Task: Perception, Planning, Decision Making
- Metrics: RLHF, GPT Score
- Datasets: Expert Dataset
GPT-Driver: Learning to Drive with GPT [NeurIPS-2023 Workshop]
Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, Yue Wang
DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model [RAL-2024]
Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, Hengshuang Zhao
- Task: Planning, Control
- Metrics: BLEU4, METEOR, ChatGPT score, RMSE, CIDEr
- Datasets: BDD-X
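Several entries above (e.g. DriveGPT4 on BDD-X) report caption-style metrics such as BLEU4. As a rough illustration only, a single-reference BLEU-4 can be sketched in pure Python; the published numbers come from standard toolkits, and this simplified version handles one reference and uses the standard brevity penalty:

```python
# Hypothetical sketch of single-reference BLEU-4 (illustration only;
# papers use standard evaluation toolkits with multiple references).
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Geometric mean of clipped 1- to 4-gram precisions times the
    brevity penalty, for a single reference sentence."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c[g], r[g]) for g in c)  # clipped counts
        total = max(sum(c.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_prec += 0.25 * math.log(overlap / total)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)

print(round(bleu4("the car slows down", "the car slows down"), 3))  # → 1.0
```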
Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles [WACV-2024 Workshop]
Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang
- LLM Arch: GPT-4
- Task: Motion Planning
- Metrics: GPT Score
Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [WACV-2024 Workshop]
Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao
- LLM Arch: GPT-3.5
- Task: Planning, Control
- Datasets: HighwayEnv
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving [ICRA-2024]
Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton
- LLM Arch: GPT-3.5
- Task: Perception, Planning
- Metrics: MAE, GPT Grading
- Datasets: Driving QA
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving [arXiv-2023]
Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding
- LLM Arch: GPT-3.5
- Task: Planning, Control
- Metrics: Observation matrix, Weight matrix, Action bias, Collision Rate, Inefficiency, Time
- Datasets: IdSim
A Language Agent for Autonomous Driving [COLM-2024]
Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, Yue Wang
- LLM Arch: GPT-3.5
- Task: Perception, Prediction, Planning
- Metrics: Avg. L2, Collision Rate
- Datasets: nuScenes
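Many of the planning entries above report Avg. L2 and Collision Rate on nuScenes. A minimal sketch of how these open-loop metrics are typically computed, using hypothetical trajectories and a simplified circular-obstacle check (actual benchmarks evaluate against ground-truth occupancy and agent boxes):

```python
# Sketch of two open-loop planning metrics with made-up data:
# average L2 displacement error and a crude collision rate.
import numpy as np

def avg_l2_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth
    ego waypoints, averaged over the planning horizon."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_rate(pred, obstacles, radius=1.0):
    """Fraction of predicted waypoints that come within `radius` metres
    of any obstacle centre (simplified circle-overlap check)."""
    pred = np.asarray(pred)[:, None, :]        # (T, 1, 2)
    obstacles = np.asarray(obstacles)[None]    # (1, N, 2)
    dists = np.linalg.norm(pred - obstacles, axis=-1)  # (T, N)
    return float((dists.min(axis=1) < radius).mean())

pred = [[0.0, 0.0], [1.0, 0.1], [2.0, 0.2]]  # predicted (x, y) waypoints
gt   = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # ground-truth waypoints
obs  = [[2.0, 0.5]]                          # one obstacle centre
print(avg_l2_error(pred, gt))   # → 0.1 (mean of 0.0, 0.1, 0.2)
```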
Asynchronous Large Language Model Enhanced Planner for Autonomous Driving [ECCV-2024]
Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu
- LLM Arch: Llama2-13B
- Task: Planning
- Metrics: Driving direction compliance, Ego comfort, Ego progress along the expert route, No ego at-fault collisions, Speed limit compliance, Time to collision
- Datasets: nuPlan
OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [arXiv-2024]
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez
- Task: 3D Perception, VQA
- Metrics: CR, IR, METEOR, ROUGE, CIDEr
Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [arXiv-2024]
Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang
- LLM Arch: Vicuna, StreamPETR, TopoMLP
- Task: 3D Perception
- Metrics: Avg. L2, Collision Rate
- Datasets: nuScenes
DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving [arXiv-2024]
Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, Jianbing Shen
- Task: Perception, Planning
- Metrics: Avg. L2, Collision Rate
Talk2BEV: Language-Enhanced Bird's-Eye View Maps for Autonomous Driving [ICRA-2024]
Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, K. Madhava Krishna
- Task: Perception, Planning
- Metrics: Accuracy on MCQ, Jaccard Index, Distance Error
- Datasets: nuScenes
DriveLM: Driving with Graph Visual Question Answering [ECCV-2024]
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li
- LLM Arch: Vicuna-13B
- Task: Graph VQA, Perception, Prediction, Planning
- Metrics: GPTScore, METEOR, BLEU1, CIDEr, ROUGE, Classification accuracy
- Datasets: DriveLM-nuScenes, DriveLM-CARLA
LLMI3D: MLLM-based 3D Perception from a Single 2D Image [arXiv-2025]
Fan Yang, Sicheng Zhao, Yanhao Zhang, Hui Chen, Haonan Lu, Jungong Han, Guiguang Ding
- LLM Arch: GPT-4
- Task: Perception, Reasoning
- Metrics: Acc@0.25, Acc@0.5 (IoU thresholds of 25% and 50%), DepthError, LengthError, WidthError, HeightError
- Datasets: IG3D
Dolphins: Multimodal Language Model for Driving [ECCV-2024]
Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao
- Task: Prediction and Planning VQA
- Metrics: VQA accuracy
- Datasets: GQA, MSCOCO, VQAv2, OK-VQA, TDIUC, Visual Genome, BDD-X
SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model [arXiv-2024]
Ye Jin, Ruoxuan Yang, Zhijie Yi, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong
- LLM Arch: GPT-4
- Task: Planning, Control
- Metrics: Collision Rate, ANOVAs
- Datasets: Driving-Thinking-Dataset (based on CARLA)
DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [ECCV-2024]
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, Jiwen Lu
- LLM Arch: CLIP, Stable Diffusion v1.4
- Task: Generation
- Datasets: nuScenes
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation [arXiv-2024]
Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang
GAIA-1: A Generative World Model for Autonomous Driving [arXiv-2024]
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [CVPR-2024]
Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
- Task: Perception, Planning
- Datasets: nuScenes, Waymo Open Dataset
| Dataset | Reasoning | Outlook | Size |
|---|---|---|---|
| BDD-X 2018 | Description | Planning description & justification | 8M frames, 20k text strings |
| HAD HRI Advice 2019 | Advice | Goal-oriented & stimulus-driven advice | 5,675 video clips, 45k text strings |
| Talk2Car 2019 | Description | Goal-point description | 30k frames, 10k text strings |
| DRAMA 2022 | Description | QA + captions | 18k frames, 100k text strings |
| nuScenes-QA 2023 | QA | Perception result | 30k frames, 460k QA pairs |
| DriveLM 2023 | QA + scene description | Perception, prediction, and planning with logic | 30k frames, 600k QA pairs |
| Rank2Tell 2023 | Captioning and reasoning | Localization and ranking | 118 frames |
| DRAMA 2023 | Captioning and reasoning | Perception and prediction | 17,785 frames |
| LingoQA 2024 | Captioning and reasoning | Perception and planning | 28k frames, 419.9k annotations |
| MAPLM 2024 | QA, captioning, and reasoning | Perception and prediction | 2M frames |

