This repo provides the PyTorch source code of our paper: UGround: Towards Unified Visual Grounding with Unrolled Transformers.
It also serves as the official code release of [READ] (CVPR'25); see also: [arXiv].
Authors: Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou†.
We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates).

Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, dynamically selecting the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map between the <SEG> token and the image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions.

To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
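As a rough sketch of Policy-Prompted Masking (shapes, function names, and the uniform policy below are our assumptions for illustration, not this repo's actual API), SSC can be viewed as sampling a layer index from a categorical policy, and MasP as turning the <SEG>-to-image-token similarity map into a dense logit mask:

```python
import torch
import torch.nn.functional as F

def ssc_select_layer(policy_logits):
    """Stochastic Skip Connection (SSC): sample which unrolled
    transformer layer the <SEG> token connects from."""
    dist = torch.distributions.Categorical(logits=policy_logits)
    layer_idx = dist.sample()
    # The log-prob lets the selection be trained with policy gradients,
    # using downstream segmentation quality as the reward.
    return layer_idx, dist.log_prob(layer_idx)

def mask_as_prompt(seg_token, image_tokens, h, w):
    """Mask as Prompt (MasP): cosine-similarity map between the <SEG>
    token and the image tokens, reshaped into a soft logit mask."""
    seg = F.normalize(seg_token, dim=-1)     # (d,)
    img = F.normalize(image_tokens, dim=-1)  # (h*w, d)
    sim = img @ seg                          # (h*w,) similarity per token
    return sim.view(1, 1, h, w)              # dense prompt with spatial cues

# Toy usage: 32 layers, 24x24 image tokens of width 256, <SEG> appended last.
hidden_states = torch.randn(32, 24 * 24 + 1, 256)
idx, logp = ssc_select_layer(torch.zeros(32))  # uniform policy over layers
layer = hidden_states[idx]
logit_mask = mask_as_prompt(layer[-1], layer[:-1], 24, 24)
print(logit_mask.shape)  # torch.Size([1, 1, 24, 24])
```

In the paper's pipeline, this logit mask would then be handed to SAM as a dense prompt for mask generation; here it is simply returned.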
- [2025.10.4] UGround code and UGround-LLaVA-v1.5-7B/13B models are released. Welcome to check them out!
- [2025.10.3] Paper is released and GitHub repo is created.
- Support framework decoupling (LISA, PixelLM, GSVA, READ, SESAME)
- Support file tracking
- Support full logging
- Support resuming training from checkpoints
- Support distributed training
- Support multi-dataset training and evaluation
- Support rich visualization, e.g., data visualization and training visualization
#!/bin/bash
# 1. curl -O https://repo.anaconda.com/archive/Anaconda3-2025.06-0-Linux-x86_64.sh
# 2. bash Anaconda3-2025.06-0-Linux-x86_64.sh
# 3. conda create -n uground python=3.9
# 4. conda activate uground
# 5. chmod +x build.sh
# 6. ./build.sh
# 7. wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
# 8. pip install flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
# 9. chmod +x install.sh
#10. ./install.sh
For ease of installation, we have encapsulated the setup steps into a script, build.sh. You can complete the environment configuration within 5 minutes.
Currently, we support 8 dataset types, namely A: sem_seg, B: refer_seg, C: neg_refer_seg, D: correct_refer_seg, E: vqa, F: reason_seg, G: reason_seg_plus, and H: multi_reason_seg. Please visit the UGround dataset page for more details.
A: sem_seg: ade20k||cocostuff||pascal_part||paco_lvis||mapillary
B: refer_seg: refclef||refcoco||refcoco+||refcocog||refzom||grefcoco
C: neg_refer_seg: R-refcoco||R-refcoco+||R-refcocog
D: correct_refer_seg: fprefcoco||fprefcoco+||fprefcocog
E: vqa: llava_instruct_150k
F: reason_seg: ReasonSeg|train
G: reason_seg_plus(LISA++): instance_seg||cot||conversations||caption
H: multi_reason_seg(muse): MultiReasonSeg|train
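The `||`-separated convention above can be handled with a trivial helper; `parse_dataset_spec` and its default split name `train` are hypothetical, shown only to illustrate the format:

```python
def parse_dataset_spec(spec):
    """Split a 'name1||name2' dataset spec into (name, split) pairs.
    A trailing '|split' selects a split, as in 'ReasonSeg|train';
    defaulting to 'train' here is an assumption for illustration."""
    datasets = []
    for entry in spec.split("||"):
        name, _, split = entry.partition("|")
        datasets.append((name, split or "train"))
    return datasets

print(parse_dataset_spec("refclef||refcoco+||ReasonSeg|val"))
# [('refclef', 'train'), ('refcoco+', 'train'), ('ReasonSeg', 'val')]
```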
Tip
Please note that CUDA 11.7 (cu117) and CUDA 11.8 (cu118) may lead to slight differences in results. Based on our tests, both cu117 and cu118 can be installed successfully on NVIDIA A100-SXM4-40GB; however, cu117 fails to install on NVIDIA H800. See also: build.sh. (Credit: Chuanhang Deng)
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 triton==2.0.0 \
    --index-url https://download.pytorch.org/whl/cu118

pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 triton==2.0.0 \
    --index-url https://download.pytorch.org/whl/cu117

Note
* indicates that the results are obtained using CUDA 11.8.
./scripts/7b_reason_seg_val/train_uground_llava1.5_ema.sh # for ReasonSeg 7B
./scripts/13b_reason_seg_val/train_uground_llava1.5_ema.sh # for ReasonSeg 13B
./scripts/7b_reason_seg_val/merge_lora_weight_uground_llava1.5_ema.sh # for ReasonSeg 7B
./scripts/13b_reason_seg_val/merge_lora_weight_uground_llava1.5_ema.sh # for ReasonSeg 13B
./scripts/7b_reason_seg_val/eval_uground_llava1.5_ema.sh # for ReasonSeg 7B
./scripts/13b_reason_seg_val/eval_uground_llava1.5_ema.sh # for ReasonSeg 13B
./scripts/13b_reason_seg_val/chat_uground.sh # Chat Interface
./scripts/13b_reason_seg_val/app_uground.sh # UGround Dashboard
./scripts/7b_reason_seg_val/dataset_demo.sh # Data Visualization Dashboard
./scripts/7b_reason_seg_val/start_tensorboard_uground_llava1.5_ema.sh # Training Visualization Dashboard
We are grateful for the foundational code provided by PixelLM, SESAME, GSVA, READ, LISA, LLaVA, and SAM. Utilizing their resources implies agreement to their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.
If you use our work or our implementation in this repo, or find them helpful, please consider giving a citation.
@article{qian2025UGround,
title={UGround: Towards Unified Visual Grounding with Unrolled Transformers},
author={Qian, Rui and Yin, Xin and Deng, Chuanhang and Peng, Zhiyuan and Xiong, Jian and Zhai, Wei and Dou, Dejing},
journal={arXiv preprint},
year={2025}
}
@inproceedings{qian2025reasoning,
title={Reasoning to Attend: Try to Understand How <SEG> Token Works},
author={Qian, Rui and Yin, Xin and Dou, Dejing},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
If you have any questions, feel free to reach out at qiianruii@gmail.com, xyin@zju.edu.cn, dengch2000@gmail.com, pzy2000@sjtu.edu.cn, jianxxiong@gmail.com, zhaiwei682@gmail.com and dejingdou@gmail.com.