Vision Language Driving Perception

Fine-tuning Vision-Language Models for Autonomous Driving Decision Planning


Overview

Vision Language Driving Perception is an open-source project focused on fine-tuning Vision-Language Models (VLMs) for decision planning in autonomous driving scenarios. By leveraging the expressive power of pre-trained VLMs, this project adapts them to downstream driving tasks such as behavior prediction, maneuver classification, and goal-directed planning.

This repository provides tools, datasets, and training pipelines to adapt Vision-Language Models for real-world autonomous driving decision modules.


Features

  • VLM Fine-Tuning Pipeline: Modular pipeline to fine-tune VLMs on driving-specific tasks.
  • Dataset Integration: Supports structured scene data (e.g., nuPlan, Waymo, or custom vectorized environments).
  • Prompt Engineering for Driving Tasks: Custom vision-language prompts for planning-relevant tasks (an illustrative prompt is sketched after this list).
  • Evaluation Tools: Custom metrics for VLM output quality and scenario performance.
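
For concreteness, a planning-oriented prompt can pair a camera frame with a short instruction and a constrained answer space. The record below is a hypothetical sketch in Python; the file path, field names, and wording are illustrative assumptions, not the templates this project actually uses (see docs/vlm_finetune.md for those).

# Hypothetical planning prompt (illustrative only; field names and wording
# are assumptions, not this project's actual templates).
driving_prompt = {
    "image": "vlm_samples/scene_0042_front.jpg",
    "question": (
        "<image>\nYou are the decision planner of an autonomous vehicle. "
        "Describe the agents that affect the ego vehicle and pick one maneuver "
        "from: keep_lane, change_lane_left, change_lane_right, decelerate, stop."
    ),
    "answer": "A pedestrian is entering the crosswalk ahead: decelerate.",
}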

Use Cases

  • Planning-aware scene understanding
  • Maneuver prediction with vision-language reasoning
  • Goal-directed trajectory selection
  • Safety-critical decision refinement using natural language context

Project Structure

Vision-Language-Driving-Perception/
│
├── README.md
├── setup.py
├── command
│   ├── InternVL2-1B.sh
│   └── eval.sh
├── config
│   └── zero_stage1_config.json
├── data
│   ├── sample.jsonl
│   └── vlm_samples
├── docs
│   ├── env_install.md
│   ├── vlm_finetune.md
│   └── vlm_trt.md
├── internvl
│   ├── __init__.py
│   ├── conversation.py
│   ├── dist_utils.py
│   ├── model
│   ├── patch
│   └── train
├── scripts
│   ├── internvl_eval.py
│   ├── pytorch_internvl_infer.py
│   └── trt_internvl_infer.py
├── tools
│   ├── __init__.py
│   ├── arrow2jsonl.py
│   ├── bart_score.py
│   ├── convert_parquet.py
│   ├── convert_to_int8.py
│   ├── extract_mlp.py
│   ├── extract_video_frames.py
│   ├── extract_vit.py
│   ├── json2jsonl.py
│   ├── jsonl2jsonl.py
│   ├── merge_lora.py
│   ├── replace_llm.py
│   └── resize_pos_embed.py
└── vlmtrt
    ├── build_vit_engine.py
    ├── conversation.py
    └── convert_qwen2_ckpt.py

Getting Started

1. Clone the repository

git clone https://github.com/thillai-c/Vision-Language-Driving-Perception.git
cd Vision-Language-Driving-Perception

2. Install dependencies

conda create -n vision-language-driving-perception python=3.10
conda activate vision-language-driving-perception
pip install -r requirements.txt

3. Prepare dataset

You can use scene data from nuPlan, Waymo, or a custom driving dataset. Please follow docs/data_prepare.md for instructions.
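
data/sample.jsonl is the authoritative example of the expected training format. The sketch below only illustrates how custom scene annotations might be serialized into conversation-style JSONL; the field names ("id", "image", "conversations") follow the convention commonly used by InternVL-style fine-tuning data and should be checked against data/sample.jsonl before use.

import json

# Sketch: serialize custom scene annotations into conversation-style JSONL.
# Field names follow the convention common to InternVL-style fine-tuning data;
# verify them against data/sample.jsonl, which defines this repo's schema.
records = [
    {
        "id": 0,
        "image": "data/vlm_samples/scene_0001.jpg",  # placeholder image path
        "conversations": [
            {"from": "human", "value": "<image>\nWhat maneuver should the ego vehicle take next?"},
            {"from": "gpt", "value": "Slow down and yield to the truck merging from the right."},
        ],
    },
]

with open("data/custom_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")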

4. Run training

Please follow docs/vlm_finetune.md to fine-tune VLMs.
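
After fine-tuning, LoRA adapters are typically folded back into the base model before deployment; the repository ships tools/merge_lora.py for this step. The snippet below is only a generic sketch using Hugging Face transformers and peft, with placeholder paths, and is not a substitute for the project's own script.

from transformers import AutoModel
from peft import PeftModel

# Generic LoRA-merge sketch (the project's own script is tools/merge_lora.py).
# Model name and adapter/output paths are placeholders.
base = AutoModel.from_pretrained("OpenGVLab/InternVL2-1B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "outputs/lora_adapter")
merged = model.merge_and_unload()        # fold LoRA deltas into the base weights
merged.save_pretrained("outputs/merged_model")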

Evaluation

Please follow docs/vlm_finetune.md to evaluate your fine-tuned model on benchmark scenarios.
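
Among the evaluation utilities, tools/bart_score.py suggests a BARTScore-style text metric is used alongside scenario-level checks. The sketch below illustrates the general idea only (scoring a generated plan by its log-likelihood conditioned on a reference under a seq2seq model); the checkpoint name and normalization are assumptions, not the project's exact implementation.

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# BARTScore-style sketch (the project's version lives in tools/bart_score.py;
# the checkpoint and normalization here are assumptions).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def bart_score(reference: str, candidate: str) -> float:
    src = tokenizer(reference, return_tensors="pt", truncation=True)
    tgt = tokenizer(candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    return -out.loss.item()  # average log-likelihood per target token (higher is better)

print(bart_score("Decelerate and yield to the pedestrian.",
                 "Slow down for the pedestrian crossing ahead."))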

VLM Converter Module

The VLM Converter is a module that converts and quantizes large Vision-Language Models (VLMs) with TensorRT-LLM, significantly speeding up inference while maintaining accuracy.

Please follow docs/vlm_trt.md to convert and quantize VLMs.
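
docs/vlm_trt.md and vlmtrt/build_vit_engine.py define the supported conversion flow; the sketch below only illustrates the kind of engine-building step involved, using the standard TensorRT Python API on a vision encoder assumed to have already been exported to ONNX (the file names are placeholders).

import tensorrt as trt

# Illustrative TensorRT engine build for an ONNX-exported vision encoder.
# The actual conversion is driven by vlmtrt/build_vit_engine.py and
# docs/vlm_trt.md; file names here are placeholders.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("vit_encoder.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # build a half-precision engine
engine_bytes = builder.build_serialized_network(network, config)

with open("vit_encoder.engine", "wb") as f:
    f.write(engine_bytes)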

Acknowledgments

This project utilizes open-source tools and frameworks from the broader autonomous driving and vision-language model communities.
