UGround: Towards Unified Visual Grounding with Unrolled Transformers

MIT license arXiv

This repo provides the PyTorch source code of our paper: UGround: Towards Unified Visual Grounding with Unrolled Transformers.

This repo is also the official code release of READ (CVPR'25); see also: [arXiv].

Authors: Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou†.

Abstract

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, unify visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
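The two components can be sketched in a few lines of plain Python. This is an illustrative toy, not the repo's implementation: `sample_layer` stands in for the SSC policy's categorical sampling over unrolled layer indices, and `mask_as_prompt` computes the similarity map between a `<SEG>` embedding and image-token embeddings that serves as the soft logit mask.

```python
import math
import random


def sample_layer(policy_logits, rng=random.Random(0)):
    """SSC (sketch): sample which unrolled transformer layer the <SEG>
    token connects to, from a categorical policy over layer indices."""
    m = max(policy_logits)  # softmax with max-subtraction for stability
    probs = [math.exp(x - m) for x in policy_logits]
    z = sum(probs)
    return rng.choices(range(len(policy_logits)),
                       weights=[p / z for p in probs], k=1)[0]


def mask_as_prompt(seg_token, image_tokens):
    """MasP (sketch): cosine similarity between the <SEG> token and each
    image token yields a soft logit mask; its activation regions provide
    the explicit spatial cue that prompts the mask decoder."""
    def norm(v):
        s = math.sqrt(sum(x * x for x in v))
        return [x / s for x in v]
    q = norm(seg_token)
    return [sum(a * b for a, b in zip(q, norm(t))) for t in image_tokens]
```

In the paper, this soft mask is handed to SAM as a prompt instead of point/box coordinates; here it is just a list of similarity scores over tokens.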

News

  • [2025.10.4] UGround code and UGround-LLaVA-v1.5-7B/13B models are released. Welcome to check them out!
  • [2025.10.3] Paper is released and GitHub repo is created.

Installation Guide

#!/bin/bash
# 1. curl -O https://repo.anaconda.com/archive/Anaconda3-2025.06-0-Linux-x86_64.sh
# 2. bash Anaconda3-2025.06-0-Linux-x86_64.sh
# 3. conda create -n uground python=3.9
# 4. conda activate uground
# 5. chmod +x build.sh 
# 6. ./build.sh
# 7. wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
# 8. pip install flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp39-cp39-linux_x86_64.whl
# 9. chmod +x install.sh
#10. ./install.sh

For ease of installation, we have encapsulated the setup steps into a single script, build.sh; the environment configuration completes in about five minutes.
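The flash-attn wheel in step 7 must match your Python build (cp39), CUDA toolkit (cu118), and PyTorch version (2.0) exactly, or the pip install in step 8 will fail. As a minimal illustration (not part of the repo), the relevant fields can be pulled out of the wheel filename like this:

```python
import re


def parse_flash_attn_wheel(name):
    """Parse a flash-attn wheel filename (e.g. the v2.6.3 wheel above)
    into the fields that must match the target environment.
    Returns None if the name does not look like a flash-attn wheel."""
    m = re.match(
        r"flash_attn-(?P<version>[\d.]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
        r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<py>cp\d+)-cp\d+-(?P<platform>.+)\.whl",
        name,
    )
    return m.groupdict() if m else None
```

For example, the wheel from step 7 parses to `cuda="118"`, `torch="2.0"`, `py="cp39"`, which matches the `python=3.9` conda env and the cu118 PyTorch install recommended below.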

Model and Dataset Preparation

Currently, we support 8 dataset types, namely: A: sem_seg, B: refer_seg, C: neg_refer_seg, D: correct_refer_seg, E: vqa, F: reason_seg, G: reason_seg_plus, and H: multi_reason_seg. Please visit the UGround dataset page for more details.

A: sem_seg: ade20k||cocostuff||pascal_part||paco_lvis||mapillary

B: refer_seg: refclef||refcoco||refcoco+||refcocog||refzom||grefcoco

C: neg_refer_seg: R-refcoco||R-refcoco+||R-refcocog

D: correct_refer_seg: fprefcoco||fprefcoco+||fprefcocog

E: vqa: llava_instruct_150k

F: reason_seg: ReasonSeg|train

G: reason_seg_plus(LISA++): instance_seg||cot||conversations||caption

H: multi_reason_seg(muse): MultiReasonSeg|train
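For illustration, the `||`-joined specs above (with an optional `|split` suffix, as in `ReasonSeg|train`) can be parsed as follows; the helper name and the `train` default are assumptions for this sketch, not the repo's actual data loader:

```python
def parse_dataset_spec(spec):
    """Split a '||'-joined dataset spec into (name, split) pairs.
    A single '|' separates an optional split name, assumed here to
    default to 'train' when absent."""
    out = []
    for item in spec.split("||"):
        name, _, split = item.partition("|")
        out.append((name, split or "train"))
    return out
```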

| Model Name | gIoU | cIoU | Snapshot | HG-ckpt URL |
|---|---|---|---|---|
| **Results on ReasonSeg** | | | | |
| UGround-LLaVA-v1.5-7B_ema/val | 64.17 | 71.08 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B_ema_mix/val | 65.38 | 75.27 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B_ema/test | 63.36 | 66.09 | archive | weights [logs] |
| UGround-LLaVA-v1.5-13B_ema/val | 67.89 | 74.92 | archive | weights [logs] |
| UGround-LLaVA-v1.5-13B_ema/test | 65.50 | 65.86 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B/val | 66.13 | 72.07 | * archive | weights (Credit: Chuanhang Deng) [logs] |
| UGround-LLaVA-v1.5-7B_ema_mixed/val | 66.69 | 74.13 | * archive | weights (Credit: Chuanhang Deng) [logs] |
| UGround-LLaVA-v1.5-7B/test | 63.55 | 65.44 | archive | weights |
| UGround-LLaVA-v1.5-13B/val | - | - | archive | weights |
| UGround-LLaVA-v1.5-13B/test | 65.03 | 65.47 | archive | weights |
| **Results on ReferSeg** | | | | |
| UGround-LLaVA-v1.5-7B (refcocog/val) | 76.52 | 74.73 | archive | weights |
| **Results on FP-ReferSeg** | | | | |
| UGround-LLaVA-v1.5-7B (fp-refcoco) | See: 85.86 | 62.80 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B (fp-refcoco+) | See: 85.10 | 56.03 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B (fp-refcocog) | See: 86.86 | 58.55 | archive | weights [logs] |
| **Results on gReferSeg** | | | | |
| UGround-LLaVA-v1.5-7B | 74.49 | 66.38 | archive | weights [logs] |
| UGround-LLaVA-v1.5-7B | 72.46 | 65.56 | * archive | weights (Credit: Chuanhang Deng) [logs] |
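For reference, the two metrics reported above are the standard ones from the reasoning-segmentation literature: gIoU averages the per-image IoU, while cIoU divides cumulative intersection by cumulative union over the whole split. A minimal sketch on boolean masks (flattened to 1-D lists for brevity; the repo operates on 2-D tensors):

```python
def iou(pred, gt):
    """IoU of two boolean masks; defined as 1.0 when both are empty."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0


def g_ciou(preds, gts):
    """gIoU: mean of per-image IoUs.
    cIoU: cumulative intersection / cumulative union over all images."""
    g = sum(iou(p, t) for p, t in zip(preds, gts)) / len(preds)
    inter = sum(sum(a and b for a, b in zip(p, t)) for p, t in zip(preds, gts))
    union = sum(sum(a or b for a, b in zip(p, t)) for p, t in zip(preds, gts))
    return g, inter / union if union else 1.0
```

Because cIoU pools pixels across images, large objects dominate it, which is why the two columns above can diverge noticeably.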

Tip

Please note that CUDA 11.7 (cu117) and CUDA 11.8 (cu118) may lead to slight differences in results. Based on our tests, both cu117 and cu118 can be installed successfully on NVIDIA A100-SXM4-40GB; however, cu117 fails to install on NVIDIA H800. See also: build.sh. (Credit: Chuanhang Deng)

# CUDA 11.8 (cu118)
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 triton==2.0.0 \
  --index-url https://download.pytorch.org/whl/cu118
# CUDA 11.7 (cu117)
pip install torch==2.0.1+cu117 torchvision==0.15.2+cu117 triton==2.0.0 \
  --index-url https://download.pytorch.org/whl/cu117

Note

* indicates that the results are obtained using CUDA 11.8.

Experimental results

Training

./scripts/7b_reason_seg_val/train_uground_llava1.5_ema.sh     # for ReasonSeg 7B
./scripts/13b_reason_seg_val/train_uground_llava1.5_ema.sh    # for ReasonSeg 13B

Merge LoRA Weight

./scripts/7b_reason_seg_val/merge_lora_weight_uground_llava1.5_ema.sh     # for ReasonSeg 7B
./scripts/13b_reason_seg_val/merge_lora_weight_uground_llava1.5_ema.sh    # for ReasonSeg 13B
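Conceptually, merging LoRA weights folds the low-rank update back into the base weights, W' = W + (alpha / r) * B A, so that inference no longer needs the adapter. A toy numeric sketch of that arithmetic (the scripts above do this for the released checkpoints; the function here is purely illustrative):

```python
def merge_lora(w, a, b, alpha, r):
    """Fold a LoRA update into a base weight matrix (sketch).
    w: base weight, d_out x d_in
    a: LoRA down-projection, r x d_in
    b: LoRA up-projection, d_out x r
    Returns w + (alpha / r) * (b @ a), computed with plain lists."""
    scale = alpha / r
    rows, cols = len(w), len(w[0])
    delta = [[scale * sum(b[i][k] * a[k][j] for k in range(r))
              for j in range(cols)] for i in range(rows)]
    return [[w[i][j] + delta[i][j] for j in range(cols)] for i in range(rows)]
```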

Validation

./scripts/7b_reason_seg_val/eval_uground_llava1.5_ema.sh     # for ReasonSeg 7B
./scripts/13b_reason_seg_val/eval_uground_llava1.5_ema.sh    # for ReasonSeg 13B

Inference

./scripts/13b_reason_seg_val/chat_uground.sh    # Chat Interface
./scripts/13b_reason_seg_val/app_uground.sh     # UGround Dashboard

Supported Features

  • Full Logging

  • Multi-dataset Evaluation

  • Data Visualization

    ./scripts/7b_reason_seg_val/dataset_demo.sh    # Data Visualization Dashboard

  • Training Visualization

    ./scripts/7b_reason_seg_val/start_tensorboard_uground_llava1.5_ema.sh    # Training Visualization Dashboard

Acknowledgements

We are grateful for the foundational code provided by PixelLM, SESAME, GSVA, READ, LISA, LLaVA, and SAM. Utilizing their resources implies agreement to their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.

Citation

If you use our work or this implementation, or find them helpful, please consider citing:

@inproceedings{qian2025UGround,
  title={UGround: Towards Unified Visual Grounding with Unrolled Transformers},
  author={Qian, Rui and Yin, Xin and Deng, Chuanhang and Peng, Zhiyuan and Xiong, Jian and Zhai, Wei and Dou, Dejing},
  booktitle={arXiv},
  year={2025}
}
@inproceedings{qian2025reasoning,
  title={Reasoning to Attend: Try to Understand How <SEG> Token Works},
  author={Qian, Rui and Yin, Xin and Dou, Dejing},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Contact

If you have any questions, feel free to reach out at qiianruii@gmail.com, xyin@zju.edu.cn, dengch2000@gmail.com, pzy2000@sjtu.edu.cn, jianxxiong@gmail.com, zhaiwei682@gmail.com, or dejingdou@gmail.com.
