
Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation


Fenghe Tang1,2, Bingkun Nian3, Jianrui Ding4, Wenxin Ma1,2, Quan Quan5,
Chengqi Dong1,2, Jie Yang3, Wei Liu3, S. Kevin Zhou1,2



News

  • Mobile U-ViT accepted by ACM MM'25 🥰
  • Paper and code released! 😎

Abstract

In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, which are primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between the natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis.
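For intuition only, below is a minimal PyTorch sketch of the kind of block the ConvUtr embedding builds on: a large-kernel depthwise convolution followed by an inverted-bottleneck projection with a residual connection. The kernel size, expansion ratio, normalization choice, and class name are illustrative assumptions, not the paper's exact design.

# Illustrative sketch only: large-kernel depthwise conv + inverted bottleneck,
# in the spirit of the ConvUtr embedding described above. Hyperparameters
# (kernel_size=7, expansion=4) are assumptions, not the paper's exact values.
import torch
import torch.nn as nn

class LargeKernelInvertedBottleneck(nn.Module):
    def __init__(self, dim, kernel_size=7, expansion=4):
        super().__init__()
        # Depthwise large-kernel conv: wide receptive field at low parameter cost
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        # Inverted bottleneck: expand channels, apply nonlinearity, project back
        self.pw1 = nn.Conv2d(dim, dim * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expansion, dim, kernel_size=1)

    def forward(self, x):
        # Residual connection, analogous to a transformer block's skip path
        return x + self.pw2(self.act(self.pw1(self.norm(self.dwconv(x)))))

if __name__ == "__main__":
    block = LargeKernelInvertedBottleneck(dim=32)
    out = block(torch.randn(1, 32, 64, 64))
    print(out.shape)  # torch.Size([1, 32, 64, 64])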


Results

[Result figures]

Quick Start

1. Environment

  • GPU: NVIDIA GeForce RTX 4090
  • PyTorch: 1.13.0 (CUDA 11.7)
  • cudatoolkit: 11.7.1
  • scikit-learn: 1.0.2
  • albumentations: 1.2.0
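
To confirm your environment matches, a quick sanity script (the expected versions in the comments are the ones listed above):

# Environment sanity check against the versions listed above
import torch
import sklearn
import albumentations

print("PyTorch:", torch.__version__)                  # expect 1.13.0
print("CUDA runtime:", torch.version.cuda)            # expect 11.7
print("CUDA available:", torch.cuda.is_available())   # expect True
print("scikit-learn:", sklearn.__version__)           # expect 1.0.2
print("albumentations:", albumentations.__version__)  # expect 1.2.0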

2. Datasets

Please organize the BUSI dataset, or your own dataset, according to the following directory structure.

└── Mobile-U-ViT
    ├── data
    │   ├── busi
    │   │   ├── images
    │   │   │   ├── benign (10).png
    │   │   │   ├── malignant (17).png
    │   │   │   └── ...
    │   │   └── masks
    │   │       └── 0
    │   │           ├── benign (10).png
    │   │           ├── malignant (17).png
    │   │           └── ...
    │   └── your dataset
    │       ├── images
    │       │   ├── 0a7e06.png
    │       │   └── ...
    │       └── masks
    │           └── 0
    │               ├── 0a7e06.png
    │               └── ...
    ├── dataloader
    ├── network
    ├── utils
    ├── main.py
    └── split.py
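
To verify the layout before training, here is a small hypothetical helper (not part of this repository) that checks every image has a matching mask under the structure shown above:

# Hypothetical helper (not part of this repo): verify that every image under
# <root>/images has a matching mask under <root>/masks/0
import os

def check_pairs(root="./data/busi"):
    images = sorted(os.listdir(os.path.join(root, "images")))
    masks = set(os.listdir(os.path.join(root, "masks", "0")))
    missing = [name for name in images if name not in masks]
    print(f"{len(images)} images, {len(missing)} without masks")
    return missing

check_pairs()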

3. 2D Training & Validation

You can first split your dataset:

python split.py --dataset_name busi --dataset_root ./data

Then, train and validate (choose one of the two model variants):

python main.py --model {mobileuvit,mobileuvit_l} --base_dir ./data/busi --train_file_dir busi_train.txt --val_file_dir busi_val.txt
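
For a quick programmatic smoke test of the 2D models, something like the sketch below should work. Note that the import path network.MobileUViT and the constructor arguments are assumptions inferred from the 3D example in the next section; check the network/ folder for the actual names.

# Hypothetical 2D smoke test: module path and constructor arguments are
# assumptions; check network/ for the actual names before use.
import torch
from network.MobileUViT import mobileuvit

model = mobileuvit().cuda()
x = torch.randn(1, 3, 256, 256).cuda()  # assumed input: 3-channel 256x256
with torch.no_grad():
    print(model(x).shape)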

4. 3D Training & Validation

The 3D downstream pipeline follows UNETR; refer to that repository for data preparation and training details.

# Example: instantiating the model for training on BTCV (14 classes)
from network.MobileUViT_3D import mobileuvit_l

# inch: number of input channels (1 for CT); out_channel: number of classes
model = mobileuvit_l(inch=1, out_channel=14).cuda()
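
Continuing from the example above, a quick shape check with a dummy volume. The 96^3 input size is an assumption borrowed from common UNETR-style BTCV pipelines, not necessarily this repository's default.

# Sanity check with a dummy volume; 96^3 patch size is an assumption
import torch

x = torch.randn(1, 1, 96, 96, 96).cuda()  # (batch, channels, D, H, W)
with torch.no_grad():
    y = model(x)
print(y.shape)  # expected: torch.Size([1, 14, 96, 96, 96])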


Acknowledgements

This code uses helper functions from CMUNeXt.

Citation

If the code, paper, or weights help your research, please cite:

@inproceedings{tang2025mobile,
  title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
  author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={3408--3417},
  year={2025}
}

@article{tang2025mobilearxiv,
  title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
  author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
  journal={arXiv preprint arXiv:2508.01064},
  year={2025}
}

License

This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.
