Vision Language Driving Perception

Fine-tuning Vision-Language Models for Autonomous Driving Decision Planning


Overview

Vision Language Driving Perception is an open-source project focused on fine-tuning Vision-Language Models (VLMs) for decision planning in autonomous driving scenarios. By leveraging the expressive power of pre-trained VLMs, this project adapts them to downstream driving tasks such as behavior prediction, maneuver classification, and goal-directed planning.

This repository provides tools, datasets, and training pipelines to adapt Vision-Language Models for real-world autonomous driving decision modules.


Features

  • VLM Fine-Tuning Pipeline: Modular pipeline to fine-tune VLMs on driving-specific tasks.
  • Dataset Integration: Supports structured scene data (e.g., nuPlan, Waymo, or custom vectorized environments).
  • Prompt Engineering for Driving Tasks: Custom vision-language prompts for planning-relevant tasks (an illustrative prompt is sketched after this list).
  • Evaluation Tools: Custom metrics for VLM output quality and scenario performance.
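
For concreteness, a planning-oriented prompt can pair a camera frame with a short instruction and a constrained answer space. The record below is a hypothetical sketch in Python; the file path, field names, and wording are illustrative assumptions, not the templates this project actually uses (see docs/vlm_finetune.md for those).

# Hypothetical planning prompt (illustrative only; field names and wording
# are assumptions, not this project's actual templates).
driving_prompt = {
    "image": "vlm_samples/scene_0042_front.jpg",
    "question": (
        "<image>\nYou are the decision planner of an autonomous vehicle. "
        "Describe the agents that affect the ego vehicle and pick one maneuver "
        "from: keep_lane, change_lane_left, change_lane_right, decelerate, stop."
    ),
    "answer": "A pedestrian is entering the crosswalk ahead: decelerate.",
}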

Use Cases

  • Planning-aware scene understanding
  • Maneuver prediction with vision-language reasoning
  • Goal-directed trajectory selection
  • Safety-critical decision refinement using natural language context

Project Structure

Vision-Language-Driving-Perception/
│
├── README.md
├── setup.py
├── command
│   ├── InternVL2-1B.sh
│   └── eval.sh
├── config
│   └── zero_stage1_config.json
├── data
│   ├── sample.jsonl
│   └── vlm_samples
├── docs
│   ├── env_install.md
│   ├── vlm_finetune.md
│   └── vlm_trt.md
├── internvl
│   ├── __init__.py
│   ├── conversation.py
│   ├── dist_utils.py
│   ├── model
│   ├── patch
│   └── train
├── scripts
│   ├── internvl_eval.py
│   ├── pytorch_internvl_infer.py
│   └── trt_internvl_infer.py
├── tools
│   ├── __init__.py
│   ├── arrow2jsonl.py
│   ├── bart_score.py
│   ├── convert_parquet.py
│   ├── convert_to_int8.py
│   ├── extract_mlp.py
│   ├── extract_video_frames.py
│   ├── extract_vit.py
│   ├── json2jsonl.py
│   ├── jsonl2jsonl.py
│   ├── merge_lora.py
│   ├── replace_llm.py
│   └── resize_pos_embed.py
└── vlmtrt
    ├── build_vit_engine.py
    ├── conversation.py
    └── convert_qwen2_ckpt.py

Getting Started

1. Clone the repository

git clone https://github.com/thillai-c/Vision-Language-Driving-Perception.git
cd Vision-Language-Driving-Perception

2. Install dependencies

conda create -n vision-language-driving-perception python=3.10
conda activate vision-language-driving-perception
pip install -r requirements.txt

3. Prepare dataset

You can use scene data from nuPlan, Waymo, or a custom driving dataset. Please follow docs/data_prepare.md for instructions.
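
data/sample.jsonl is the authoritative example of the expected training format. The sketch below only illustrates how custom scene annotations might be serialized into conversation-style JSONL; the field names ("id", "image", "conversations") follow the convention commonly used by InternVL-style fine-tuning data and should be checked against data/sample.jsonl before use.

import json

# Sketch: serialize custom scene annotations into conversation-style JSONL.
# Field names follow the convention common to InternVL-style fine-tuning data;
# verify them against data/sample.jsonl, which defines this repo's schema.
records = [
    {
        "id": 0,
        "image": "data/vlm_samples/scene_0001.jpg",  # placeholder image path
        "conversations": [
            {"from": "human", "value": "<image>\nWhat maneuver should the ego vehicle take next?"},
            {"from": "gpt", "value": "Slow down and yield to the truck merging from the right."},
        ],
    },
]

with open("data/custom_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")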

4. Run training

Please follow docs/vlm_finetune.md to fine-tune VLMs.
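
After fine-tuning, LoRA adapters are typically folded back into the base model before deployment; the repository ships tools/merge_lora.py for this step. The snippet below is only a generic sketch using Hugging Face transformers and peft, with placeholder paths, and is not a substitute for the project's own script.

from transformers import AutoModel
from peft import PeftModel

# Generic LoRA-merge sketch (the project's own script is tools/merge_lora.py).
# Model name and adapter/output paths are placeholders.
base = AutoModel.from_pretrained("OpenGVLab/InternVL2-1B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "outputs/lora_adapter")
merged = model.merge_and_unload()        # fold LoRA deltas into the base weights
merged.save_pretrained("outputs/merged_model")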

Evaluation

Please follow docs/vlm_finetune.md to evaluate your fine-tuned model on benchmark scenarios.
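
Among the evaluation utilities, tools/bart_score.py suggests a BARTScore-style text metric is used alongside scenario-level checks. The sketch below illustrates the general idea only (scoring a generated plan by its log-likelihood conditioned on a reference under a seq2seq model); the checkpoint name and normalization are assumptions, not the project's exact implementation.

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# BARTScore-style sketch (the project's version lives in tools/bart_score.py;
# the checkpoint and normalization here are assumptions).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def bart_score(reference: str, candidate: str) -> float:
    src = tokenizer(reference, return_tensors="pt", truncation=True)
    tgt = tokenizer(candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    return -out.loss.item()  # average log-likelihood per target token (higher is better)

print(bart_score("Decelerate and yield to the pedestrian.",
                 "Slow down for the pedestrian crossing ahead."))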

VLM Converter Module

The VLM Converter is a module that converts and quantizes large Vision-Language Models (VLMs) with TensorRT-LLM, significantly speeding up inference while maintaining accuracy.

Please follow docs/vlm_trt.md to convert and quantize VLMs.
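
docs/vlm_trt.md and vlmtrt/build_vit_engine.py define the supported conversion flow; the sketch below only illustrates the kind of engine-building step involved, using the standard TensorRT Python API on a vision encoder assumed to have already been exported to ONNX (the file names are placeholders).

import tensorrt as trt

# Illustrative TensorRT engine build for an ONNX-exported vision encoder.
# The actual conversion is driven by vlmtrt/build_vit_engine.py and
# docs/vlm_trt.md; file names here are placeholders.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("vit_encoder.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # build a half-precision engine
engine_bytes = builder.build_serialized_network(network, config)

with open("vit_encoder.engine", "wb") as f:
    f.write(engine_bytes)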

Acknowledgments

This project utilizes open-source tools and frameworks from the broader autonomous driving and vision-language model communities.
