UGround: Towards Unified Visual Grounding with Unrolled Transformers

MIT license arXiv

This repo provides the PyTorch source code of our paper: UGround: Towards Unified Visual Grounding with Unrolled Transformers.

This repo is also the official code release of READ (CVPR'25); see also: [arXiv].

Authors: Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou†.

Abstract

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, unify visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
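The two components can be sketched in a few lines of plain Python. This is an illustrative toy, not the repo's implementation: `sample_layer` stands in for the SSC policy's categorical sampling over unrolled layer indices, and `mask_as_prompt` computes the similarity map between a `<SEG>` embedding and image-token embeddings that serves as the soft logit mask.

```python
import math
import random


def sample_layer(policy_logits, rng=random.Random(0)):
    """SSC (sketch): sample which unrolled transformer layer the <SEG>
    token connects to, from a categorical policy over layer indices."""
    m = max(policy_logits)  # softmax with max-subtraction for stability
    probs = [math.exp(x - m) for x in policy_logits]
    z = sum(probs)
    return rng.choices(range(len(policy_logits)),
                       weights=[p / z for p in probs], k=1)[0]


def mask_as_prompt(seg_token, image_tokens):
    """MasP (sketch): cosine similarity between the <SEG> token and each
    image token yields a soft logit mask; its activation regions provide
    the explicit spatial cue that prompts the mask decoder."""
    def norm(v):
        s = math.sqrt(sum(x * x for x in v))
        return [x / s for x in v]
    q = norm(seg_token)
    return [sum(a * b for a, b in zip(q, norm(t))) for t in image_tokens]
```

In the paper, this soft mask is handed to SAM as a prompt instead of point/box coordinates; here it is just a list of similarity scores over tokens.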

News

  • [2025.10.4] UGround code and UGround-LLaVA-v1.5-7B/13B models are released. Welcome to check them out!
  • [2025.10.3] Paper is released and GitHub repo is created.

Installation Guide

#!/bin/bash
# 1. curl -O https://repo.anaconda.com/archive/Anaconda3-2025.06-0-Linux-x86_64.sh
# 2. bash Anaconda3-2025.06-0-Linux-x86_64.sh
# 3. conda create -n uground python=3.9
# 4. conda activate uground
# 5. chmod +x build.sh 
# 6. ./build.sh
# 7. wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
# 8. pip install flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
# 9. chmod +x install.sh
#10. ./install.sh

For ease of installation, we have encapsulated the setup steps into a single script, build.sh; the environment configuration completes in about five minutes.
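The flash-attn wheel in step 7 must match your Python build (cp39), CUDA toolkit (cu118), and PyTorch version (2.0) exactly, or the pip install in step 8 will fail. As a minimal illustration (not part of the repo), the relevant fields can be pulled out of the wheel filename like this:

```python
import re


def parse_flash_attn_wheel(name):
    """Parse a flash-attn wheel filename (e.g. the v2.6.3 wheel above)
    into the fields that must match the target environment.
    Returns None if the name does not look like a flash-attn wheel."""
    m = re.match(
        r"flash_attn-(?P<version>[\d.]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
        r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<py>cp\d+)-cp\d+-(?P<platform>.+)\.whl",
        name,
    )
    return m.groupdict() if m else None
```

For example, the wheel from step 7 parses to `cuda="118"`, `torch="2.0"`, `py="cp39"`, which matches the `python=3.9` conda env and the cu118 PyTorch install recommended below.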

Model and Dataset Preparation

Currently, we support 8 dataset types, namely: A: sem_seg, B: refer_seg, C: neg_refer_seg, D: correct_refer_seg, E: vqa, F: reason_seg, G: reason_seg_plus, and H: multi_reason_seg. Please visit the UGround dataset page for more details.

A: sem_seg: ade20k||cocostuff||pascal_part||paco_lvis||mapillary

B: refer_seg: refclef||refcoco||refcoco+||refcocog||refzom||grefcoco

C: neg_refer_seg: R-refcoco||R-refcoco+||R-refcocog

D: correct_refer_seg: fprefcoco||fprefcoco+||fprefcocog

E: vqa: llava_instruct_150k

F: reason_seg: ReasonSeg|train

G: reason_seg_plus(LISA++): instance_seg||cot||conversations||caption

H: multi_reason_seg(muse): MultiReasonSeg|train
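For illustration, the `||`-joined specs above (with an optional `|split` suffix, as in `ReasonSeg|train`) can be parsed as follows; the helper name and the `train` default are assumptions for this sketch, not the repo's actual data loader:

```python
def parse_dataset_spec(spec):
    """Split a '||'-joined dataset spec into (name, split) pairs.
    A single '|' separates an optional split name, assumed here to
    default to 'train' when absent."""
    out = []
    for item in spec.split("||"):
        name, _, split = item.partition("|")
        out.append((name, split or "train"))
    return out
```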

| Model Name | gIoU | cIoU | Snapshot | HG-ckpt URL |
|---|---|---|---|---|
| **Results on ReasonSeg** | | | | |
| UGround-LLaVA-v1.5-7B_ema/val | 64.17 | 71.08 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B_ema_mix/val | 65.38 | 75.27 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B_ema/test | 63.36 | 66.09 | archive | weights [logs] |
| UGround-LLaVA-v1.5-13B_ema/val | 67.89 | 74.92 | archive | weights [logs] |
| UGround-LLaVA-v1.5-13B_ema/test | 65.50 | 65.86 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B/val | 66.13 | 72.07 | * archive | weights (Credit: Chuanhang Deng) [logs] |
| UGround-LLaVA-v1.5-7B_ema_mixed/val | 66.69 | 74.13 | * archive | weights (Credit: Chuanhang Deng) [logs] |
| UGround-LLaVA-v1.5-7B/test | 63.55 | 65.44 | archive | weights |
| UGround-LLaVA-v1.5-13B/val | - | - | archive | weights |
| UGround-LLaVA-v1.5-13B/test | 65.03 | 65.47 | archive | weights |
| **Results on ReferSeg** | | | | |
| UGround-LLaVA-v1.5-7B (refcocog/val) | 76.52 | 74.73 | archive | weights |
| **Results on FP-ReferSeg** | | | | |
| UGround-LLaVA-v1.5-7B (fp-refcoco) | See: 85.86 | 62.80 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B (fp-refcoco+) | See: 85.10 | 56.03 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B (fp-refcocog) | See: 86.86 | 58.55 | archive | weights [logs] |
| **Results on gReferSeg** | | | | |
| UGround-LLaVA-v1.5-7B | 74.49 | 66.38 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B | 72.46 | 65.56 | * archive | weights (Credit: Chuanhang Deng) [logs] |
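For reference, the two metrics reported above are the standard ones from the reasoning-segmentation literature: gIoU averages the per-image IoU, while cIoU divides cumulative intersection by cumulative union over the whole split. A minimal sketch on boolean masks (flattened to 1-D lists for brevity; the repo operates on 2-D tensors):

```python
def iou(pred, gt):
    """IoU of two boolean masks; defined as 1.0 when both are empty."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0


def g_ciou(preds, gts):
    """gIoU: mean of per-image IoUs.
    cIoU: cumulative intersection / cumulative union over all images."""
    g = sum(iou(p, t) for p, t in zip(preds, gts)) / len(preds)
    inter = sum(sum(a and b for a, b in zip(p, t)) for p, t in zip(preds, gts))
    union = sum(sum(a or b for a, b in zip(p, t)) for p, t in zip(preds, gts))
    return g, inter / union if union else 1.0
```

Because cIoU pools pixels across images, large objects dominate it, which is why the two columns above can diverge noticeably.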

Tip

Please note that CUDA 11.7 (cu117) and CUDA 11.8 (cu118) may lead to slight differences in results. Based on our tests, both cu117 and cu118 can be installed successfully on NVIDIA A100-SXM4-40GB; however, cu117 fails to install on NVIDIA H800. See also: build.sh. (Credit: Chuanhang Deng)

# CUDA 11.8 (cu118)
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 triton==2.0.0 \
  --index-url https://download.pytorch.org/whl/cu118
# CUDA 11.7 (cu117)
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 triton==2.0.0 \
  --index-url https://download.pytorch.org/whl/cu117

Note

* indicates that the results are obtained using CUDA 11.8.

Experimental results

Training

./scripts/7b_reason_seg_val/train_uground_llava1.5_ema.sh     # for ReasonSeg 7B
./scripts/13b_reason_seg_val/train_uground_llava1.5_ema.sh    # for ReasonSeg 13B

Merge LoRA Weight

./scripts/7b_reason_seg_val/merge_lora_weight_uground_llava1.5_ema.sh     # for ReasonSeg 7B
./scripts/13b_reason_seg_val/merge_lora_weight_uground_llava1.5_ema.sh    # for ReasonSeg 13B
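Conceptually, merging LoRA weights folds the low-rank update back into the base weights, W' = W + (alpha / r) * B A, so that inference no longer needs the adapter. A toy numeric sketch of that arithmetic (the scripts above do this for the released checkpoints; the function here is purely illustrative):

```python
def merge_lora(w, a, b, alpha, r):
    """Fold a LoRA update into a base weight matrix (sketch).
    w: base weight, d_out x d_in
    a: LoRA down-projection, r x d_in
    b: LoRA up-projection, d_out x r
    Returns w + (alpha / r) * (b @ a), computed with plain lists."""
    scale = alpha / r
    rows, cols = len(w), len(w[0])
    delta = [[scale * sum(b[i][k] * a[k][j] for k in range(r))
              for j in range(cols)] for i in range(rows)]
    return [[w[i][j] + delta[i][j] for j in range(cols)] for i in range(rows)]
```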

Validation

./scripts/7b_reason_seg_val/eval_uground_llava1.5_ema.sh     # for ReasonSeg 7B
./scripts/13b_reason_seg_val/eval_uground_llava1.5_ema.sh    # for ReasonSeg 13B

Inference

./scripts/13b_reason_seg_val/chat_uground.sh    # Chat Interface
./scripts/13b_reason_seg_val/app_uground.sh     # UGround Dashboard

Supported Features

  • Full Logging

  • Multi-dataset Evaluation

  • Data Visualization

    ./scripts/7b_reason_seg_val/dataset_demo.sh    # Data Visualization Dashboard

  • Training Visualization

    ./scripts/7b_reason_seg_val/start_tensorboard_uground_llava1.5_ema.sh    # Training Visualization Dashboard

Acknowledgements

We are grateful for the foundational code provided by PixelLM, SESAME, GSVA, READ, LISA, LLaVA, and SAM. Utilizing their resources implies agreement to their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.

Citation

If you use our work or this implementation, or find them helpful, please consider citing:

@inproceedings{qian2025UGround,
  title={UGround: Towards Unified Visual Grounding with Unrolled Transformers},
  author={Qian, Rui and Yin, Xin and Deng, Chuanhang and Peng, Zhiyuan and Xiong, Jian and Zhai, Wei and Dou, Dejing},
  booktitle={arXiv},
  year={2025}
}
@inproceedings{qian2025reasoning,
  title={Reasoning to Attend: Try to Understand How <SEG> Token Works},
  author={Qian, Rui and Yin, Xin and Dou, Dejing},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Contact

If you have any questions, feel free to reach out at qiianruii@gmail.com, xyin@zju.edu.cn, dengch2000@gmail.com, pzy2000@sjtu.edu.cn, jianxxiong@gmail.com, zhaiwei682@gmail.com, or dejingdou@gmail.com.
