
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

License: CC BY-NC 4.0 · Python 3.10+

The benchmark toolbox for the paper: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

📝 Project Overview

DVBench is a comprehensive benchmark designed to evaluate the video understanding capabilities of Vision Large Language Models (VLLMs) in safety-critical driving scenarios. This benchmark focuses on assessing models' ability to understand driving videos, which is crucial for the safe deployment of autonomous and assisted driving technologies.

Main Contributions

  • Problem Identification: We are among the first to investigate VLLMs' capabilities in perception and reasoning within safety-critical (Crash, Near-Crash) driving scenarios and systematically define the hierarchical abilities essential for evaluating the safety of autonomous driving systems in high-risk contexts.

  • Benchmark Development: We introduce DVBench, the first comprehensive benchmark for safety-critical driving video understanding, featuring 10,000 curated multiple-choice questions across 25 key driving-related abilities. DVBench is designed to rigorously assess perception and reasoning in dynamic driving environments.

  • Systematic Evaluation: We evaluate 14 state-of-the-art VLLMs, providing an in-depth analysis of their strengths and limitations in safety-critical driving scenarios. This paper establishes structured evaluation protocols and infrastructure to enable fair comparisons and guide future advancements in VLLMs for autonomous driving.

VLLM Performance

Detailed Performance of the VLLMs on DVBench using GroupEval (L2 abilities)

The following abbreviations are used:

  • Perception Abilities: EC (Environmental Conditions), PI (Physical Infrastructure), OC (Operational Constraints), Obj (Objects), Zone (Zones)
  • Reasoning Abilities: EU (Event Understanding), BMA (Behavior & Maneuver Analysis), SR (Spatial Reasoning), RHA (Risk & Hazard Assessment), CR (Causal & Responsibility)
| VLLMs | Perception Overall | EC | PI | OC | Obj | Zone | Reasoning Overall | EU | BMA | SR | RHA | CR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-VID-7B[1] | 12.2% | 26.7% | 2.8% | 13.5% | 11.4% | 9.1% | 10.4% | 2.0% | 3.3% | 7.5% | 19.3% | 11.4% |
| LLaMA-VID-13B[1] | 12.8% | 30.0% | 0.0% | 8.1% | 15.9% | 12.1% | 9.6% | 3.9% | 0.0% | 6.0% | 21.6% | 4.5% |
| LLaVA-One-Vision-0.5B[2] | 15.0% | 40.0% | 2.8% | 16.2% | 13.6% | 6.1% | 11.1% | 5.9% | 10.0% | 9.0% | 21.6% | 0.0% |
| Qwen2-VL-7B[3] | 25.6% | 43.3% | 2.8% | 27.0% | 27.3% | 30.3% | 27.1% | 33.3% | 13.3% | 17.9% | 33.0% | 31.8% |
| LLaVA-Next-Video-7B[4] | 18.3% | 43.3% | 2.8% | 16.2% | 18.2% | 15.2% | 16.1% | 7.8% | 13.3% | 10.4% | 27.3% | 13.6% |
| Video-LLaVa-7B[5] | 21.1% | 70.0% | 2.8% | 8.1% | 22.7% | 9.1% | 18.6% | 5.9% | 13.3% | 10.4% | 34.1% | 18.2% |
| PLLaVA-7B[6] | 20.0% | 56.7% | 2.8% | 10.8% | 18.2% | 18.2% | 17.9% | 3.9% | 10.0% | 10.4% | 34.1% | 18.2% |
| LLaVA-Next-Video-34B[4] | 23.3% | 50.0% | 25.0% | 21.6% | 11.4% | 15.2% | 16.1% | 5.9% | 10.0% | 17.9% | 27.3% | 6.8% |
| LLaVA-One-Vision-7B[2] | 28.3% | 70.0% | 19.4% | 21.6% | 20.5% | 18.2% | 20.0% | 3.9% | 20.0% | 16.4% | 37.5% | 9.1% |
| PLLaVA-13B[6] | 23.9% | 63.3% | 11.1% | 18.9% | 18.2% | 15.2% | 15.4% | 5.9% | 0.0% | 9.0% | 34.1% | 9.1% |
| Qwen2-VL-72B[3] | 32.8% | 50.0% | 25.0% | 35.1% | 22.7% | 36.4% | 33.9% | 41.2% | 13.3% | 31.3% | 42.0% | 27.3% |
| Qwen2-VL-2B[3] | 31.7% | 76.7% | 27.8% | 24.3% | 20.5% | 18.2% | 26.4% | 7.8% | 26.7% | 16.4% | 44.3% | 27.3% |
| MiniCPM-V[7] | 39.4% | 70.0% | 19.4% | 45.9% | 36.4% | 30.3% | 35.4% | 39.2% | 16.7% | 22.4% | 53.4% | 27.3% |
| LLaVA-One-Vision-72B[2] | 36.7% | 66.7% | 16.7% | 40.5% | 34.1% | 30.3% | 36.4% | 33.3% | 30.0% | 29.9% | 46.6% | 34.1% |

🛠️ Installation Guide

Prerequisites

  • Python 3.10+
  • PyTorch
  • CUDA-capable GPU (required for the larger models)

Installation Steps

  1. Clone the repository:
git clone https://github.com/tong-zeng/DVBench.git
cd DVBench
  2. Environment Setup:

    We recommend using Conda for efficient environment management.

    We have exported our conda environments for different models in the envs folder. You can create a conda environment using these yml files:

    To create an environment for models run with the vLLM backend:

    conda env create -f envs/driving_vllm_bench_vllm_latest.yml
    # Activate the environment
    conda activate driving_vllm_bench_vllm_latest

    To create an environment for a HuggingFace-based model (for example, LLaMA-VID):

    conda env create -f envs/driving_vllm_bench_llamavid.yml
    # Activate the environment
    conda activate driving_vllm_bench_llamavid

    Available environment files:

    • driving_vllm_bench_llamavid.yml - Environment for LLaMA-VID
    • driving_vllm_bench_chatunivi.yml - Environment for Chat-UniVi
    • driving_vllm_bench_minicpmv.yml - Environment for MiniCPM-V
    • driving_vllm_bench_pllava.yml - Environment for PLLaVA
    • driving_vllm_bench_videochat2.yml - Environment for VideoChat2
    • driving_vllm_bench_videochatgpt.yml - Environment for VideoChatGPT
    • and more...

    For more information on managing conda environments, please refer to the official conda documentation.
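
    After activating an environment, you can quickly confirm that Python, PyTorch, and CUDA are all visible. This check is just a convenience and is not part of the repository:

    # Quick sanity check for the prerequisites (run inside the activated environment)
    import sys
    import torch

    print(sys.version)                # expect Python 3.10 or newer
    print(torch.__version__)          # installed PyTorch version
    print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible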

📊 Usage

Running the Benchmark

We have prepared inference scripts for all supported models to simplify benchmarking. The process consists of two main steps: running inference and calculating performance metrics.

Step 1: Running Inference

We provide ready-to-use bash scripts for each model in the scripts directory:

  • HuggingFace models: scripts/hf/
  • VLLM models: scripts/vllm/
# For example, to run inference with the LLaMA-VID-7B model
cd DVBench
bash scripts/hf/llama_vid_7b.sh

# For the MiniCPM-V model
bash scripts/vllm/minicpmv.sh

# You may need to modify the scripts to adjust paths or GPU settings

The inference scripts will:

  1. Set up the appropriate environment variables
  2. Run the model on the test dataset
  3. Save model responses to the specified output directory
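
Conceptually, the per-model flow behind these scripts looks roughly like the sketch below. This is an illustrative Python outline only; the actual entry points, argument names, and file layout are defined by the scripts in scripts/ and the code in dvbench/inference/, and names such as run_inference and responses.json are assumptions made for the sketch.

# Illustrative outline of the inference step (not the actual DVBench API)
import json
from pathlib import Path
from typing import Callable

def run_inference(generate_fn: Callable[[str, str], str], questions_path: str, output_dir: str) -> None:
    """Answer each multiple-choice question about a video and save the raw responses."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    responses = []
    with open(questions_path) as f:
        for line in f:
            item = json.loads(line)  # e.g. {"id": ..., "video": ..., "question": ..., "options": [...]}
            prompt = item["question"] + "\n" + "\n".join(item["options"])
            responses.append({"id": item["id"], "response": generate_fn(item["video"], prompt)})

    with open(out_dir / "responses.json", "w") as f:
        json.dump(responses, f, indent=2)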

Step 2: Calculating Performance

After running inference, use the auto_accuracy.py script to calculate performance metrics:

# Calculate performance metrics for all models
python dvbench/benchmarking/auto_accuracy.py

This will generate comprehensive performance metrics across all evaluation dimensions and save the results in the specified output directory.
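
The aggregation itself lives in dvbench/benchmarking/auto_accuracy.py. As a rough illustration of the idea, per-ability multiple-choice accuracy can be computed along the following lines; the field names ("ability", "predicted", "answer") and the graded-results file are assumptions for this sketch, not the script's actual schema:

# Illustrative only: per-ability accuracy from graded multiple-choice responses
import json
from collections import defaultdict

def per_ability_accuracy(results_path: str) -> dict[str, float]:
    """Group graded answers by ability and compute the fraction answered correctly."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(results_path) as f:
        for item in json.load(f):  # e.g. {"ability": "EC", "predicted": "B", "answer": "B"}
            total[item["ability"]] += 1
            correct[item["ability"]] += int(item["predicted"] == item["answer"])
    return {ability: correct[ability] / total[ability] for ability in total}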

📊 Fine-Tuning Code and Fine-Tuned Models

For the fine-tuning code and the fine-tuned models, please refer to https://github.com/tong-zeng/qwen2-vl-finetuning.git

🔄 Supported Models

DVBench currently supports evaluation of the following vision language models:

  • LLaMA-VID-7B
  • LLaMA-VID-13B
  • LLaVA-One-Vision-0.5B
  • Qwen2-VL-7B
  • LLaVA-Next-Video-7B
  • Video-LLaVa-7B
  • PLLaVA-7B
  • LLaVA-Next-Video-34B
  • LLaVA-One-Vision-7B
  • PLLaVA-13B
  • Qwen2-VL-72B
  • Qwen2-VL-2B
  • MiniCPM-V
  • LLaVA-One-Vision-72B

To add a new model, please refer to the example implementations in the dvbench/inference/models/ directory.
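
In broad strokes, a wrapper needs to load the model once and expose a method that answers a text prompt about a video. The skeleton below is a hypothetical outline only; the actual base class, method names, and configuration hooks are defined by the existing files in dvbench/inference/models/, which should be used as the reference.

# Hypothetical wrapper skeleton; mirror an existing implementation in dvbench/inference/models/
class MyNewModel:
    def __init__(self, model_path: str, device: str = "cuda"):
        # In a real wrapper: load the checkpoint and processor/tokenizer, move the model to `device`.
        self.model_path = model_path
        self.device = device

    def generate(self, video_path: str, prompt: str, max_new_tokens: int = 128) -> str:
        # In a real wrapper: sample frames, build the multimodal input, run generation, decode the text.
        raise NotImplementedError("Plug in the model-specific inference call here.")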

📚 Project Structure

DVBench/
├── dvbench/                # Main code package
│   ├── benchmarking/       # Evaluation benchmark related code
│   ├── inference/          # Model inference and interfaces
│   │   ├── configs/        # Model configuration files
│   │   ├── models/         # Model implementations
│   │   └── utils/          # Utility functions
│   └── visual_processing/  # Video and image processing tools
├── envs/                   # Conda environment yml files
├── scripts/                # Running scripts
│   ├── hf/                 # HuggingFace model inference scripts
│   └── vllm/               # VLLM model inference scripts
├── all_experiments/        # Experiment results storage
└── videos/                 # Test video set

📄 License

DVBench is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

📧 Contact

For any questions or suggestions, please open an issue.


Project Repository: https://github.com/tong-zeng/DVBench
