
Fully Exploiting Vision Foundation Model’s Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing (TIV-2026)

This is the official implementation for HFIT:

Fully Exploiting Vision Foundation Model’s Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing

Sicen Guo, Tianyou Wen, Chuang-Wei Liu, Qijun Chen, Rui Fan

📋 Abstract

Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing.

🪧 Overview

[Overview figure]

  • To the best of our knowledge, this study is the first to investigate the adaptation of VFMs for RGB-D driving scene parsing. HFIT directly extracts and integrates heterogeneous features without re-training ViTs. By using relative depth predictions from VFMs as inputs to the HFIT side adapter, our approach removes the dependence on measured depth maps (a generic sketch of this pattern follows this list).

  • We introduce a DSPE strategy to jointly capture local semantics from RGB-D data, enabling HFIT to effectively learn heterogeneous features.

  • We design an RHFF module to collaboratively recalibrate spatial priors, enhancing the comprehensiveness of feature fusion.

  • We propose an HGFI strategy to combine complementary strengths of multi-level features, equipping HFIT with enhanced fine-grained feature representation capabilities.
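The core recipe, a frozen VFM backbone whose patch tokens are augmented by a trainable side adapter fed with relative depth, can be illustrated with a minimal PyTorch sketch. This is a generic illustration under assumed shapes, not the actual HFIT implementation (the DSPE, RHFF, and HGFI modules are defined in the paper); the DINOv2 backbone and the toy adapter below are stand-ins of our own.

```python
import torch
import torch.nn as nn

class DepthSideAdapter(nn.Module):
    """Toy side adapter: a small CNN that maps a relative depth map to
    patch-aligned features, which are added onto frozen ViT tokens."""
    def __init__(self, embed_dim=1024, patch=14):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(1, embed_dim // 4, kernel_size=patch, stride=patch),
            nn.GELU(),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=1),
        )

    def forward(self, depth):                   # depth: (B, 1, H, W)
        feat = self.proj(depth)                 # (B, C, H/p, W/p)
        return feat.flatten(2).transpose(1, 2)  # (B, N, C), ViT token layout

# Frozen VFM backbone: its parameters are never updated.
vit = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
for p in vit.parameters():
    p.requires_grad_(False)

adapter = DepthSideAdapter(embed_dim=1024)      # only this part is trained
rgb = torch.randn(1, 3, 518, 518)
depth = torch.randn(1, 1, 518, 518)             # stand-in for VFM relative depth

with torch.no_grad():
    tokens = vit.forward_features(rgb)['x_norm_patchtokens']  # (B, 1369, 1024)
fused = tokens + adapter(depth)                 # heterogeneous feature fusion
```

Because the ViT stays frozen, only the adapter (and the downstream segmentation head) receive gradients, which is what keeps adaptation cheap.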

🏗️ Setup

Follow the detailed setup instructions in MMSegmentation.

💾 Dataset Preparation

The Cityscapes dataset can be downloaded from the official Cityscapes website. We generate dense relative depth maps with a pretrained Depth Anything V2 network; they can be downloaded from here.
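If you prefer to regenerate the depth maps yourself, the public Depth Anything V2 inference API can be used roughly as follows. This is a minimal sketch: the checkpoint path, image path, and 8-bit normalization are assumptions, and the exact preprocessing behind our released depth maps may differ.

```python
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2  # from the Depth Anything V2 repo

# ViT-Large configuration, as published in the Depth Anything V2 README.
model = DepthAnythingV2(encoder='vitl', features=256,
                        out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load('checkpoints/depth_anything_v2_vitl.pth',
                                 map_location='cpu'))
model = model.to('cuda' if torch.cuda.is_available() else 'cpu').eval()

raw_img = cv2.imread('leftImg8bit/train/aachen/aachen_000000_000019_leftImg8bit.png')
depth = model.infer_image(raw_img)  # H x W relative depth map (numpy)

# Normalize to 8-bit for storage (assumed convention, not necessarily ours).
depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
cv2.imwrite('depth/train/aachen/aachen_000000_000019_depth.png', depth.astype('uint8'))
```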

📸 Checkpoints

The official weights of HFIT trained on the Cityscapes dataset, together with the Depth Anything ViT-Large model, can be downloaded from here.

🏋 Training

We provide training scripts for HFIT. The model can be trained on a single NVIDIA RTX 3090 GPU.

CUDA_VISIBLE_DEVICES=1 PORT=29501 bash tools/dist_train.sh configs/hfit_uper_cityscapes.py 1
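If your Cityscapes copy lives elsewhere, point the config at it before launching. A rough sketch of the relevant fields using standard MMSegmentation 1.x keys (the actual field names in configs/hfit_uper_cityscapes.py may differ):

```python
# Hypothetical excerpt; adjust to match the actual config in this repo.
data_root = 'data/cityscapes'
train_dataloader = dict(
    batch_size=2,
    num_workers=4,
    dataset=dict(
        type='CityscapesDataset',
        data_root=data_root,
        data_prefix=dict(
            img_path='leftImg8bit/train',  # RGB inputs
            seg_map_path='gtFine/train',   # semantic labels
        ),
    ),
)
```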

🏋 Testing

We provide testing scripts for HFIT. The model can be evaluated on a single NVIDIA RTX 3090 GPU.

python tools/test.py configs/hfit_uper_cityscapes.py checkpoints/cityscapes_hfit.pth

The results are as follows:

+---------------+-------+-------+-------+--------+-----------+--------+
|     Class     |  IoU  |  Dice |  Acc  | Fscore | Precision | Recall |
+---------------+-------+-------+-------+--------+-----------+--------+
|      road     | 98.25 | 99.12 | 98.71 | 99.12  |   99.53   | 98.71  |
|    sidewalk   | 87.03 | 93.06 | 95.09 | 93.06  |   91.12   | 95.09  |
|    building   | 94.69 | 97.27 | 97.33 | 97.27  |   97.22   | 97.33  |
|      wall     | 68.26 | 81.13 | 79.79 | 81.13  |   82.52   | 79.79  |
|     fence     |  71.8 | 83.59 | 81.64 | 83.59  |   85.63   | 81.64  |
|      pole     | 73.98 | 85.04 | 83.73 | 85.04  |   86.39   | 83.73  |
| traffic light | 76.73 | 86.83 | 86.65 | 86.83  |   87.02   | 86.65  |
|  traffic sign |  84.3 | 91.48 | 91.24 | 91.48  |   91.72   | 91.24  |
|   vegetation  | 93.58 | 96.68 | 96.98 | 96.68  |   96.39   | 96.98  |
|    terrain    | 71.35 | 83.28 | 79.49 | 83.28  |   87.45   | 79.49  |
|      sky      | 95.67 | 97.79 | 98.95 | 97.79  |   96.64   | 98.95  |
|     person    | 87.62 |  93.4 |  93.8 |  93.4  |   93.01   |  93.8  |
|     rider     | 74.95 | 85.68 | 85.02 | 85.68  |   86.35   | 85.02  |
|      car      | 96.51 | 98.22 | 98.47 | 98.22  |   97.97   | 98.47  |
|     truck     | 92.14 | 95.91 | 95.25 | 95.91  |   96.58   | 95.25  |
|      bus      | 93.73 | 96.76 | 97.33 | 96.76  |    96.2   | 97.33  |
|     train     | 88.35 | 93.81 | 91.88 | 93.81  |   95.82   | 91.88  |
|   motorcycle  | 78.74 | 88.11 | 84.69 | 88.11  |   91.81   | 84.69  |
|    bicycle    | 82.46 | 90.39 | 91.89 | 90.39  |   88.93   | 91.89  |
+---------------+-------+-------+-------+--------+-----------+--------+
aAcc: 96.9900  mIoU: 84.7400  mDice: 91.4500  mAcc: 90.9400  mFscore: 91.4500  mPrecision: 92.0200  mRecall: 90.9400  
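For quick qualitative checks on individual images, MMSegmentation's high-level Python API can be used as sketched below. The image path is a placeholder, and this assumes the HFIT data pipeline resolves its depth input internally.

```python
from mmseg.apis import inference_model, init_model, show_result_pyplot

model = init_model('configs/hfit_uper_cityscapes.py',
                   'checkpoints/cityscapes_hfit.pth', device='cuda:0')
result = inference_model(model, 'demo/frankfurt_000000_000294_leftImg8bit.png')
show_result_pyplot(model, 'demo/frankfurt_000000_000294_leftImg8bit.png', result,
                   out_file='demo/prediction.png')
```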

🗣️ Acknowledgements

Our code is based on MMSegmentation and ViT-Adapter. We sincerely thank the authors for their excellent work.

This research was supported by the National Natural Science Foundation of China under Grants 62473288 and 62233013, the Fundamental Research Funds for the Central Universities, and the Xiaomi Young Talents Program.
