
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

License: CC BY-NC 4.0 · Python 3.10+

The benchmark toolbox for the paper: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding

📝 Project Overview

DVBench is a comprehensive benchmark designed to evaluate the video understanding capabilities of Vision Large Language Models (VLLMs) in safety-critical driving scenarios. This benchmark focuses on assessing models' ability to understand driving videos, which is crucial for the safe deployment of autonomous and assisted driving technologies.

Main Contributions

  • Problem Identification: We are among the first to investigate VLLMs' capabilities in perception and reasoning within safety-critical (Crash, Near-Crash) driving scenarios and systematically define the hierarchical abilities essential for evaluating the safety of autonomous driving systems in high-risk contexts.

  • Benchmark Development: We introduce DVBench, the first comprehensive benchmark for safety-critical driving video understanding, featuring 10,000 curated multiple-choice questions across 25 key driving-related abilities. DVBench is designed to rigorously assess perception and reasoning in dynamic driving environments.

  • Systematic Evaluation: We evaluate 14 state-of-the-art VLLMs, providing an in-depth analysis of their strengths and limitations in safety-critical driving scenarios. This paper establishes structured evaluation protocols and infrastructure to enable fair comparisons and guide future advancements in VLLMs for autonomous driving.

VLLM Performance

Detailed Performance of the VLLMs on DVBench using GroupEval (L2 abilities)

The following abbreviations are used:

  • Perception Abilities: EC (Environmental Conditions), PI (Physical Infrastructure), OC (Operational Constraints), Obj (Objects), Zone (Zones)
  • Reasoning Abilities: EU (Event Understanding), BMA (Behavior & Maneuver Analysis), SR (Spatial Reasoning), RHA (Risk & Hazard Assessment), CR (Causal & Responsibility)
| VLLMs | Perception Overall | EC | PI | OC | Obj | Zone | Reasoning Overall | EU | BMA | SR | RHA | CR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-VID-7B[1] | 12.2% | 26.7% | 2.8% | 13.5% | 11.4% | 9.1% | 10.4% | 2.0% | 3.3% | 7.5% | 19.3% | 11.4% |
| LLaMA-VID-13B[1] | 12.8% | 30.0% | 0.0% | 8.1% | 15.9% | 12.1% | 9.6% | 3.9% | 0.0% | 6.0% | 21.6% | 4.5% |
| LLaVA-One-Vision-0.5B[2] | 15.0% | 40.0% | 2.8% | 16.2% | 13.6% | 6.1% | 11.1% | 5.9% | 10.0% | 9.0% | 21.6% | 0.0% |
| Qwen2-VL-7B[3] | 25.6% | 43.3% | 2.8% | 27.0% | 27.3% | 30.3% | 27.1% | 33.3% | 13.3% | 17.9% | 33.0% | 31.8% |
| LLaVA-Next-Video-7B[4] | 18.3% | 43.3% | 2.8% | 16.2% | 18.2% | 15.2% | 16.1% | 7.8% | 13.3% | 10.4% | 27.3% | 13.6% |
| Video-LLaVa-7B[5] | 21.1% | 70.0% | 2.8% | 8.1% | 22.7% | 9.1% | 18.6% | 5.9% | 13.3% | 10.4% | 34.1% | 18.2% |
| PLLaVA-7B[6] | 20.0% | 56.7% | 2.8% | 10.8% | 18.2% | 18.2% | 17.9% | 3.9% | 10.0% | 10.4% | 34.1% | 18.2% |
| LLaVA-Next-Video-34B[4] | 23.3% | 50.0% | 25.0% | 21.6% | 11.4% | 15.2% | 16.1% | 5.9% | 10.0% | 17.9% | 27.3% | 6.8% |
| LLaVA-One-Vision-7B[2] | 28.3% | 70.0% | 19.4% | 21.6% | 20.5% | 18.2% | 20.0% | 3.9% | 20.0% | 16.4% | 37.5% | 9.1% |
| PLLaVA-13B[6] | 23.9% | 63.3% | 11.1% | 18.9% | 18.2% | 15.2% | 15.4% | 5.9% | 0.0% | 9.0% | 34.1% | 9.1% |
| Qwen2-VL-72B[3] | 32.8% | 50.0% | 25.0% | 35.1% | 22.7% | 36.4% | 33.9% | 41.2% | 13.3% | 31.3% | 42.0% | 27.3% |
| Qwen2-VL-2B[3] | 31.7% | 76.7% | 27.8% | 24.3% | 20.5% | 18.2% | 26.4% | 7.8% | 26.7% | 16.4% | 44.3% | 27.3% |
| MiniCPM-V[7] | 39.4% | 70.0% | 19.4% | 45.9% | 36.4% | 30.3% | 35.4% | 39.2% | 16.7% | 22.4% | 53.4% | 27.3% |
| LLaVA-One-Vision-72B[2] | 36.7% | 66.7% | 16.7% | 40.5% | 34.1% | 30.3% | 36.4% | 33.3% | 30.0% | 29.9% | 46.6% | 34.1% |

🛠️ Installation Guide

Prerequisites

  • Python 3.10+
  • PyTorch
  • CUDA-capable GPU (required for the larger models)

Installation Steps

  1. Clone the repository:
git clone https://github.com/tong-zeng/DVBench.git
cd DVBench
  2. Environment Setup:

    We recommend using Conda for efficient environment management.

    We have exported our conda environments for different models in the envs folder. You can create a conda environment using these yml files:

    To create an environment for models run with the vLLM backend:

    conda env create -f envs/driving_vllm_bench_vllm_latest.yml
    # Activate the environment
    conda activate driving_vllm_bench_vllm_latest

    To create an environment for a HuggingFace-based model (for example, LLaMA-VID):

    conda env create -f envs/driving_vllm_bench_llamavid.yml
    # Activate the environment
    conda activate driving_vllm_bench_llamavid

    Available environment files:

    • driving_vllm_bench_llamavid.yml - Environment for LLaMA-VID
    • driving_vllm_bench_chatunivi.yml - Environment for Chat-UniVi
    • driving_vllm_bench_minicpmv.yml - Environment for MiniCPM-V
    • driving_vllm_bench_pllava.yml - Environment for PLLaVA
    • driving_vllm_bench_videochat2.yml - Environment for VideoChat2
    • driving_vllm_bench_videochatgpt.yml - Environment for VideoChatGPT
    • and more...

    For more information on managing conda environments, please refer to the official conda documentation.
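
    After activating an environment, you can quickly confirm that Python, PyTorch, and CUDA are all visible. This check is just a convenience and is not part of the repository:

    # Quick sanity check for the prerequisites (run inside the activated environment)
    import sys
    import torch

    print(sys.version)                # expect Python 3.10 or newer
    print(torch.__version__)          # installed PyTorch version
    print(torch.cuda.is_available())  # True if a CUDA-capable GPU is visible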

📊 Usage

Running the Benchmark

We have prepared inference scripts for all supported models to simplify benchmarking. The process consists of two main steps: running inference and calculating performance metrics.

Step 1: Running Inference

We provide ready-to-use bash scripts for each model in the scripts directory:

  • HuggingFace models: scripts/hf/
  • VLLM models: scripts/vllm/
# For example, to run inference with the LLaMA-VID-7B model
cd DVBench
bash scripts/hf/llama_vid_7b.sh

# For the MiniCPM-V model
bash scripts/vllm/minicpmv.sh

# You may need to modify the scripts to adjust paths or GPU settings

The inference scripts will:

  1. Set up the appropriate environment variables
  2. Run the model on the test dataset
  3. Save model responses to the specified output directory
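
Conceptually, the per-model flow behind these scripts looks roughly like the sketch below. This is an illustrative Python outline only; the actual entry points, argument names, and file layout are defined by the scripts in scripts/ and the code in dvbench/inference/, and names such as run_inference and responses.json are assumptions made for the sketch.

# Illustrative outline of the inference step (not the actual DVBench API)
import json
from pathlib import Path
from typing import Callable

def run_inference(generate_fn: Callable[[str, str], str], questions_path: str, output_dir: str) -> None:
    """Answer each multiple-choice question about a video and save the raw responses."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    responses = []
    with open(questions_path) as f:
        for line in f:
            item = json.loads(line)  # e.g. {"id": ..., "video": ..., "question": ..., "options": [...]}
            prompt = item["question"] + "\n" + "\n".join(item["options"])
            responses.append({"id": item["id"], "response": generate_fn(item["video"], prompt)})

    with open(out_dir / "responses.json", "w") as f:
        json.dump(responses, f, indent=2)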

Step 2: Calculating Performance

After running inference, use the auto_accuracy.py script to calculate performance metrics:

# Calculate performance metrics for all models
python dvbench/benchmarking/auto_accuracy.py

This will generate comprehensive performance metrics across all evaluation dimensions and save the results in the specified output directory.
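
The aggregation itself lives in dvbench/benchmarking/auto_accuracy.py. As a rough illustration of the idea, per-ability multiple-choice accuracy can be computed along the following lines; the field names ("ability", "predicted", "answer") and the graded-results file are assumptions for this sketch, not the script's actual schema:

# Illustrative only: per-ability accuracy from graded multiple-choice responses
import json
from collections import defaultdict

def per_ability_accuracy(results_path: str) -> dict[str, float]:
    """Group graded answers by ability and compute the fraction answered correctly."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(results_path) as f:
        for item in json.load(f):  # e.g. {"ability": "EC", "predicted": "B", "answer": "B"}
            total[item["ability"]] += 1
            correct[item["ability"]] += int(item["predicted"] == item["answer"])
    return {ability: correct[ability] / total[ability] for ability in total}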

📊 Fine-Tuning Code and Fine-Tuned Models

For the fine-tuning code and the fine-tuned models, please refer to https://github.com/tong-zeng/qwen2-vl-finetuning.git

🔄 Supported Models

DVBench currently supports evaluation of the following vision language models:

  • LLaMA-VID-7B
  • LLaMA-VID-13B
  • LLaVA-One-Vision-0.5B
  • Qwen2-VL-7B
  • LLaVA-Next-Video-7B
  • Video-LLaVa-7B
  • PLLaVA-7B
  • LLaVA-Next-Video-34B
  • LLaVA-One-Vision-7B
  • PLLaVA-13B
  • Qwen2-VL-72B
  • Qwen2-VL-2B
  • MiniCPM-V
  • LLaVA-One-Vision-72B

To add a new model, please refer to the example implementations in the dvbench/inference/models/ directory.
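
In broad strokes, a wrapper needs to load the model once and expose a method that answers a text prompt about a video. The skeleton below is a hypothetical outline only; the actual base class, method names, and configuration hooks are defined by the existing files in dvbench/inference/models/, which should be used as the reference.

# Hypothetical wrapper skeleton; mirror an existing implementation in dvbench/inference/models/
class MyNewModel:
    def __init__(self, model_path: str, device: str = "cuda"):
        # In a real wrapper: load the checkpoint and processor/tokenizer, move the model to `device`.
        self.model_path = model_path
        self.device = device

    def generate(self, video_path: str, prompt: str, max_new_tokens: int = 128) -> str:
        # In a real wrapper: sample frames, build the multimodal input, run generation, decode the text.
        raise NotImplementedError("Plug in the model-specific inference call here.")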

📚 Project Structure

DVBench/
├── dvbench/                # Main code package
│   ├── benchmarking/       # Evaluation benchmark related code
│   ├── inference/          # Model inference and interfaces
│   │   ├── configs/        # Model configuration files
│   │   ├── models/         # Model implementations
│   │   └── utils/          # Utility functions
│   └── visual_processing/  # Video and image processing tools
├── envs/                   # Conda environment yml files
├── scripts/                # Running scripts
│   ├── hf/                 # HuggingFace model inference scripts
│   └── vllm/               # VLLM model inference scripts
├── all_experiments/        # Experiment results storage
└── videos/                 # Test video set

📄 License

DVBench is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

📧 Contact

For any questions or suggestions, please open an issue.


Project Repository: https://github.com/tong-zeng/DVBench
