
3D-LLM-Autonomous-Driving

LLM/VLM Agents in Autonomous Driving: these agents integrate varied input modalities from different driving environments to perform specific tasks. Depending on the function, the model outputs perception, planning, or control signals, leveraging the reasoning and cognitive capabilities of the LLM through question answering, counterfactual reasoning, and description.
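
The loop described above (multimodal driving context in, text-level reasoning, discrete driving signal out) can be sketched in miniature. Everything here is an illustrative assumption — the scene schema, prompt format, and action set are made up for the sketch, and the actual LLM call is left as a stub:

```python
from dataclasses import dataclass

@dataclass
class SceneInput:
    """Serialized driving context fed to a (V)LLM agent (hypothetical schema)."""
    detections: list[str]   # e.g. ["car 10m ahead", "pedestrian at crosswalk"]
    ego_speed_mps: float
    goal: str

def build_prompt(scene: SceneInput) -> str:
    # Flatten the multimodal context into text so the language model can reason over it.
    objects = "; ".join(scene.detections)
    return (
        f"Objects: {objects}. Ego speed: {scene.ego_speed_mps:.1f} m/s. "
        f"Goal: {scene.goal}. Output one action: ACCELERATE, BRAKE, or KEEP."
    )

def parse_action(llm_output: str) -> str:
    # Map the model's free-form reply back to a discrete control signal.
    for action in ("BRAKE", "ACCELERATE", "KEEP"):
        if action in llm_output.upper():
            return action
    return "KEEP"  # conservative fallback when the reply is unparseable
```

In a real agent the prompt would also carry map context and history, and the parsed output would feed a downstream controller; the papers below differ mainly in how they serialize the scene and constrain the output.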


3D Perception-Based LLM Agents for Autonomous Driving: A Survey on the Shift from 2D to 3D

Authors: Nitin Dwivedi, Pranav Singh Chib, Pravendra Singh

The ongoing advancements in 3D perception and the integration of Large Language Models (LLMs) in Autonomous Driving (AD) have significantly enhanced the capabilities of intelligent vehicles, building on the achievements of 2D perception in tasks like object detection, scene analysis, and language-guided reasoning. This paper presents a comprehensive survey of 3D perception-based LLM agents, examining their methodologies, applications, and potential to revolutionize autonomous systems. Uniquely, our work bridges a critical gap in the literature by offering the first meta-analysis exclusively focused on the synergy between 3D perception and LLMs, addressing emerging challenges such as 3D tokenization, spatial reasoning, and computational scalability. Unlike prior surveys centered on 2D tasks, this study provides an in-depth exploration of 3D-specific advancements, highlighting the transformative potential of these systems. The paper highlights the significance of this integration for fostering safer, human-centric AD, identifying opportunities to overcome current limitations, and driving innovation in intelligent mobility solutions.

Timeline of existing LLMs integrated with 2D/3D perception for AD in recent years

Table of Contents

This repo contains a curated list of resources on 3D LLM-based autonomous driving research, arranged chronologically. We regularly update it with the latest papers and their corresponding open-source implementations.

  1. LLM Driving Agents
  2. 3D Perception-Based LLM Driving Agents
  3. Generative World Models
  4. Language-Based AD Datasets (QA, Captioning, etc.)

LLM Driving Agents

MTD-GPT: A Multi-Task Decision-Making GPT Model for Autonomous Driving at Unsignalized Intersections [ITSC-2023]
Jiaqi Liu, Peng Hang, Xiao Qi, Jianqiang Wang, Jian Sun

  • LLM Arch: GPT-2

  • Task: Perception, Planning, Decision Making

  • Metrics: RLHF, GPT Score

  • Datasets: Expert Dataset

GPT-Driver: Learning to Drive with GPT [NeurIPS-2023 Workshop]
Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, Yue Wang
GitHub

  • LLM Arch: GPT-3.5

  • Task: Planning

  • Metrics: Avg. L2 and Collision Rate

  • Datasets: nuScenes
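
GPT-Driver and several planners below are scored by average L2 error between predicted and ground-truth waypoints, plus the rate of waypoints that collide with other agents. As a rough illustration only (not the official nuScenes devkit implementation; function names and the fixed collision radius are assumptions), the two metrics can be sketched as:

```python
import math

def avg_l2(pred, gt):
    """Mean L2 distance (metres) between predicted and ground-truth waypoints."""
    assert len(pred) == len(gt) and pred, "trajectories must align and be non-empty"
    return sum(math.dist(p, q) for p, q in zip(pred, gt)) / len(pred)

def collision_rate(pred, obstacles, radius=1.0):
    """Fraction of predicted waypoints within `radius` metres of any obstacle."""
    hits = sum(1 for p in pred if any(math.dist(p, o) < radius for o in obstacles))
    return hits / len(pred)
```

Papers typically report Avg. L2 at 1 s, 2 s, and 3 s horizons and average the three; the devkit additionally sweeps obstacle footprints over time rather than using a fixed radius.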

DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model [RAL-2024]
Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, Hengshuang Zhao
GitHub

  • LLM Arch: Llama 2, CLIP

  • Task: Planning, Control

  • Metrics: BLEU4, METEOR, ChatGPT score, RMSE, CIDEr

  • Datasets: BDD-X

Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles [WACV-2024 Workshop]
Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Ziran Wang

  • LLM Arch: GPT-4

  • Task: Motion Planning

  • Metrics: GPT Score

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [WACV-2024 Workshop]
Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao
GitHub

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving [ICRA-2024]
Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, Jamie Shotton
GitHub

  • LLM Arch: GPT-3.5

  • Task: Perception, Planning

  • Metrics: MAE, GPT Grading

  • Datasets: Driving QA

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving [arXiv-2023]
Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding
GitHub

  • LLM Arch: GPT-3.5

  • Task: Planning, Control

  • Metrics: Observation matrix, Weight matrix, Action bias, Collision Rate, Inefficiency, Time

  • Datasets: IdSim

A Language Agent for Autonomous Driving [COLM-2024]
Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, Yue Wang
GitHub

  • LLM Arch: GPT-3.5

  • Task: Perception, Prediction, Planning

  • Metrics: Avg. L2, Collision Rate

  • Datasets: nuScenes

Asynchronous Large Language Model Enhanced Planner for Autonomous Driving [ECCV-2024]
Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu
GitHub

  • LLM Arch: Llama2-13B

  • Task: Planning

  • Metrics: Driving direction compliance, Ego comfortable, Ego progress along the expert route, No ego at-fault collisions, Speed limit compliance, Time to collision

  • Datasets: nuPlan

🔼 Back to top


3D Perception-Based LLM Driving Agents

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [arXiv-2024]
Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, Jose M. Alvarez
GitHub

Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [arXiv-2024]
Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang

DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving [arXiv-2024]
Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, Jianbing Shen

Talk2BEV: Language-Enhanced Bird's-Eye View Maps for Autonomous Driving [ICRA-2024]
Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, K. Madhava Krishna
GitHub

  • LLM Arch: Vicuna, Flan5XXL

  • Task: Perception, Planning

  • Metrics: Accuracy on MCQ, Jaccard Index, Distance Error

  • Datasets: nuScenes

DriveLM: Driving with Graph Visual Question Answering [ECCV-2024]
Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li
GitHub

LLMI3D: MLLM-based 3D Perception from a Single 2D Image [arXiv-2025]
Fan Yang, Sicheng Zhao, Yanhao Zhang, Hui Chen, Haonan Lu, Jungong Han, Guiguang Ding

  • LLM Arch: GPT-4

  • Task: Perception, Reasoning

  • Metrics: Acc@0.25, Acc@0.5 (IoU thresholds of 25% and 50%), DepthError, LengthError, WidthError, HeightError

  • Datasets: IG3D
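
Acc@0.25 and Acc@0.5 count a predicted 3D box as correct when its IoU with the ground-truth box exceeds the threshold. A simplified sketch with axis-aligned boxes follows — real 3D detection benchmarks handle oriented boxes, and the function names here are illustrative, not from any benchmark toolkit:

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes, each (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):  # intersect the x, y, z extents independently
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis -> empty intersection
        inter *= hi - lo

    def vol(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (vol(a) + vol(b) - inter)

def acc_at(preds, gts, thresh):
    """Acc@thresh: share of matched prediction/GT pairs with IoU >= thresh."""
    hits = sum(1 for p, g in zip(preds, gts) if iou_3d(p, g) >= thresh)
    return hits / len(preds)
```

Reporting both Acc@0.25 and Acc@0.5 separates coarse localization from tight fits, which is why the per-dimension error metrics (DepthError, LengthError, etc.) are listed alongside.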

🔼 Back to top


Generative World Models

Dolphins: Multimodal Language Model for Driving [ECCV-2024]
Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, Chaowei Xiao
GitHub

SurrealDriver: Designing Generative Driver Agent Simulation Framework in Urban Contexts based on Large Language Model [arXiv-2024]
Ye Jin, Ruoxuan Yang, Zhijie Yi, Xiaoxi Shen, Huiling Peng, Xiaoan Liu, Jingli Qin, Jiayang Li, Jintao Xie, Peizhong Gao, Guyue Zhou, Jiangtao Gong
GitHub

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [ECCV-2024]
Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, Jiwen Lu
GitHub

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation [arXiv-2024]
Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang
GitHub

GAIA-1: A Generative World Model for Autonomous Driving [arXiv-2024]
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado
GitHub

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [CVPR-2024]
Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
GitHub

🔼 Back to top


Language-Based AD Datasets (QA, Captioning, etc.)

| Dataset | Reasoning | Outlook | Size |
| --- | --- | --- | --- |
| BDD-X (2018) | Description | Planning description & justification | 8M frames, 20k text strings |
| HAD HRI Advice (2019) | Advice | Goal-oriented & stimulus-driven advice | 5,675 video clips, 45k text strings |
| Talk2Car (2019) | Description | Goal-point description | 30k frames, 10k text strings |
| DRAMA (2022) | Description | QA + captions | 18k frames, 100k text strings |
| nuScenes-QA (2023) | QA | Perception result | 30k frames, 460k QA pairs |
| DriveLM (2023) | QA + scene description | Perception, prediction and planning with logic | 30k frames, 600k QA pairs |
| Rank2Tell (2023) | Captioning and reasoning | Localization and ranking | 118 frames |
| DRAMA (2023) | Captioning and reasoning | Perception and prediction | 17,785 frames |
| LingoQA (2024) | Captioning and reasoning | Perception and planning | 28k frames, 419.9k annotations |
| MAPLM (2024) | QA, captioning and reasoning | Perception and prediction | 2M frames |

🔼 Back to top