Fenghe Tang1,2, Bingkun Nian3, Jianrui Ding4, Wenxin Ma1,2, Quan Quan5, Chengqi Dong1,2, Jie Yang3, Wei Liu3, S. Kevin Zhou1,2
1 School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China
2 Suzhou Institute for Advanced Research, University of Science and Technology of China
3 School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
4 School of Computer Science and Technology, Harbin Institute of Technology
5 State Grid Hunan Electric Power Corporation Limited Research Institute
- Mobile U-ViT is accepted by ACM MM'25 🥰
- Paper and code released! 😎
In clinical practice, medical image analysis often requires efficient execution on resource-constrained mobile devices. However, existing mobile models, primarily optimized for natural images, tend to perform poorly on medical tasks due to the significant information density gap between natural and medical domains. Combining computational efficiency with medical imaging-specific architectural advantages remains a challenge when developing lightweight, universal, and high-performing networks. To address this, we propose a mobile model called Mobile U-shaped Vision Transformer (Mobile U-ViT) tailored for medical image segmentation. Specifically, we employ the newly proposed ConvUtr as a hierarchical patch embedding, featuring a parameter-efficient large-kernel CNN with inverted bottleneck fusion. This design exhibits transformer-like representation learning capacity while being lighter and faster. To enable efficient local-global information exchange, we introduce a novel Large-kernel Local-Global-Local (LGL) block that effectively balances the low information density and high-level semantic discrepancy of medical images. Finally, we incorporate a shallow and lightweight transformer bottleneck for long-range modeling and employ a cascaded decoder with downsample skip connections for dense prediction. Despite its reduced computational demands, our medical-optimized architecture achieves state-of-the-art performance across eight public 2D and 3D datasets covering diverse imaging modalities, including zero-shot testing on four unseen datasets. These results establish it as an efficient yet powerful and generalizable solution for mobile medical image analysis.
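For intuition, below is a minimal PyTorch sketch of the large-kernel, inverted-bottleneck pattern the abstract describes. The class name, kernel size, and expansion ratio are illustrative assumptions for exposition, not the repository's actual ConvUtr/LGL implementation.

```python
import torch
import torch.nn as nn

class LargeKernelInvertedBottleneck(nn.Module):
    """Sketch of a large-kernel depthwise conv followed by an inverted bottleneck."""

    def __init__(self, dim: int, kernel_size: int = 7, expand_ratio: int = 4):
        super().__init__()
        # Depthwise large-kernel convolution: wide receptive field at low parameter cost
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        # Inverted bottleneck: expand channels, apply nonlinearity, project back
        self.pw1 = nn.Conv2d(dim, dim * expand_ratio, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(dim * expand_ratio, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around the whole block
        return x + self.pw2(self.act(self.pw1(self.norm(self.dw(x)))))

# Quick shape check on a toy feature map
blk = LargeKernelInvertedBottleneck(dim=64)
y = blk(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```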
- GPU: NVIDIA GeForce RTX 4090
- PyTorch: 1.13.0 (CUDA 11.7)
- cudatoolkit: 11.7.1
- scikit-learn: 1.0.2
- albumentations: 1.2.0
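As a rough setup guide matching these versions (a pip-based sketch, not an official install script; adjust the wheel index for your platform):

```
pip install torch==1.13.0+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install scikit-learn==1.0.2 albumentations==1.2.0
```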
Please organize the BUSI dataset (or your own dataset) in the following directory structure.
└── Mobile-U-ViT
    ├── data
    │   ├── busi
    │   │   ├── images
    │   │   │   ├── benign (10).png
    │   │   │   ├── malignant (17).png
    │   │   │   └── ...
    │   │   └── masks
    │   │       └── 0
    │   │           ├── benign (10).png
    │   │           ├── malignant (17).png
    │   │           └── ...
    │   └── your dataset
    │       ├── images
    │       │   ├── 0a7e06.png
    │       │   └── ...
    │       └── masks
    │           └── 0
    │               ├── 0a7e06.png
    │               └── ...
    ├── dataloader
    ├── network
    ├── utils
    ├── main.py
    └── split.py
You can first split your dataset:

python split.py --dataset_name busi --dataset_root ./data

Then, train and validate:

python main.py --model mobileuvit --base_dir ./data/busi --train_file_dir busi_train.txt --val_file_dir busi_val.txt

(use --model mobileuvit_l for the larger variant). For the downstream 3D pipeline, refer to UNETR.
# An example of training on BTCV (num_classes=14)
from network.MobileUViT_3D import mobileuvit_l
model = mobileuvit_l(inch=1, out_channel=14).cuda()

This code uses helper functions from CMUNeXt.
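As a quick sanity check of the 3D model, the snippet below runs a dummy forward pass. The 96^3 input size is an assumption borrowed from the common UNETR BTCV setting, not necessarily the repository default.

```python
import torch
from network.MobileUViT_3D import mobileuvit_l

# Smoke test: one single-channel 96^3 volume (size assumed, per UNETR's BTCV setup)
model = mobileuvit_l(inch=1, out_channel=14).cuda().eval()
x = torch.randn(1, 1, 96, 96, 96).cuda()
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: torch.Size([1, 14, 96, 96, 96]) segmentation logits
```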
If the code, paper and weights help your research, please cite:
@inproceedings{tang2025mobile,
title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
pages={3408--3417},
year={2025}
}
@article{tang2025mobilearxiv,
title={Mobile U-ViT: Revisiting large kernel and U-shaped ViT for efficient medical image segmentation},
author={Tang, Fenghe and Nian, Bingkun and Ding, Jianrui and Ma, Wenxin and Quan, Quan and Dong, Chengqi and Yang, Jie and Liu, Wei and Zhou, S Kevin},
journal={arXiv preprint arXiv:2508.01064},
year={2025}
}
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.