Zhichao Sun, Huazhang Hu, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu
We propose CQ-DINO, a category query-based framework for vast vocabulary object detection.

The recommended configuration is 8 A100 GPUs with CUDA 12.1; other configurations supported by MMDetection should also work.
Please follow the official MMDetection guide for installation and setup:
```shell
conda create --name openmmlab python=3.10.6 -y
conda activate openmmlab
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install -U openmim
mim install mmengine
mim install "mmcv==2.2.0"
git clone [email protected]:RedAIGC/CQ-DINO.git
cd CQ-DINO
pip install -v -e .
```
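To verify the environment, a quick import check can help (a minimal sketch; the versions simply echo the pins above and may differ in your setup):

```python
# Minimal sanity check that the pinned dependencies import correctly.
# Versions follow the install commands above and may differ locally.
import torch
import mmcv
import mmengine
import mmdet

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('mmcv:', mmcv.__version__)
print('mmengine:', mmengine.__version__)
print('mmdet:', mmdet.__version__)
```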
Dataset preparation is the same as for MM Grounding DINO in MMDetection; please refer to its documentation for details.
Please download and prepare the V3Det dataset from the V3Det Homepage and the V3Det GitHub repository. After downloading and unzipping, place the dataset (or create a symbolic link to it) in the data/v3det directory, with the following structure:
```text
CQ-DINO
├── configs
├── data
│   ├── v3det
│   │   ├── annotations
│   │   │   ├── v3det_2023_v1_train.json
│   │   ├── images
│   │   │   ├── a00000066
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...
```
Then use coco2odvg.py to convert it into the ODVG format required for training:
```shell
python tools/dataset_converters/coco2odvg.py data/v3det/annotations/v3det_2023_v1_train.json -d v3det
```
After the program has run, two new files, v3det_2023_v1_train_od.json and v3det_2023_v1_label_map.json, will be created in the data/v3det/annotations directory, with the complete structure as follows:
```text
CQ-DINO
├── configs
├── data
│   ├── v3det
│   │   ├── annotations
│   │   │   ├── v3det_2023_v1_train.json
│   │   │   ├── v3det_2023_v1_train_od.json
│   │   │   ├── v3det_2023_v1_label_map.json
│   │   ├── images
│   │   │   ├── a00000066
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...
```
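If you want to sanity-check the conversion, the sketch below inspects the two generated files. It assumes the label map is a flat JSON dict and the *_od.json file is in JSON Lines form (one image record per line), as in MM Grounding DINO's ODVG format; adjust if the output differs.

```python
import json

# Label map: assumed to be a flat {"id": "category name"} JSON dict.
with open('data/v3det/annotations/v3det_2023_v1_label_map.json') as f:
    label_map = json.load(f)
print('number of categories:', len(label_map))

# ODVG annotations: assumed JSON Lines, one image record per line.
with open('data/v3det/annotations/v3det_2023_v1_train_od.json') as f:
    record = json.loads(f.readline())
print('keys of the first record:', list(record.keys()))
```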
Please download COCO 2017 from the official COCO website or from OpenDataLab. After downloading and unzipping, place the dataset (or create a symbolic link to it) in the data/coco directory. The directory structure is as follows:
```text
CQ-DINO
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
```
Then use coco2odvg.py to convert it into the ODVG format required for training:
```shell
python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017.json -d coco
```
This will generate two new files, instances_train2017_od.json and coco2017_label_map.json, in the data/coco/annotations directory. The complete dataset structure is as follows:
```text
CQ-DINO
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_train2017_od.json
│   │   │   ├── coco2017_label_map.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
```
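As a quick check (a hedged sketch; the label map is assumed to be a flat JSON dict, matching the converter output above), the converted COCO label map should contain the 80 standard categories:

```python
import json

# Assumes coco2017_label_map.json is a flat {"id": "category name"} dict.
with open('data/coco/annotations/coco2017_label_map.json') as f:
    label_map = json.load(f)
assert len(label_map) == 80, f'expected 80 COCO categories, got {len(label_map)}'
print('COCO label map OK:', len(label_map), 'categories')
```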
- Download the first-stage parameters from Google Drive and place them in the stage1 directory.
- Download the category embeddings from Google Drive.
The complete structure is as follows:
```text
CQ-DINO
├── configs
├── stage1
│   ├── cqdino_swinb1k_v3det_stage1.pth
│   ├── cqdino_swinb22k_v3det_stage1.pth
│   ├── cqdino_swinl_coco_stage1.pth
│   ├── cqdino_swinl_v3det_stage1.pth
├── v3det_clip_embeddings.pth
├── coco_clip_embeddings.pth
```
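To confirm the downloaded embeddings load correctly, a minimal sketch follows (the exact keys and tensor layout inside the .pth files are assumptions; inspect what you actually get):

```python
import torch

# Load on CPU; the file may hold a raw tensor or a dict of tensors.
embeddings = torch.load('v3det_clip_embeddings.pth', map_location='cpu')
if isinstance(embeddings, dict):
    for key, value in embeddings.items():
        print(key, getattr(value, 'shape', type(value)))
else:
    # Likely layout: (num_categories, embed_dim), but verify for your file.
    print('shape:', embeddings.shape)
```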
- Download the BERT base model from Hugging Face and replace the `lang_model_name` in the config files. For example:
```python
_base_ = [
    '../_base_/datasets/coco_detection.py',
    '../_base_/schedules/schedule_1x.py', '../_base_/default_runtime.py'
]
q2l_config_name = 'q2l_config.json'
lang_model_name = '/home/huggingface_hub/models--google-bert--bert-base-uncased'
```
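One way to fetch the model locally is via huggingface_hub (a sketch; any download method works, and the destination path is up to you), then set `lang_model_name` to the returned directory:

```python
from huggingface_hub import snapshot_download

# Downloads google-bert/bert-base-uncased into the local HF cache and
# returns the snapshot directory; point lang_model_name at this path.
local_dir = snapshot_download(repo_id='google-bert/bert-base-uncased')
print('set lang_model_name to:', local_dir)
```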
Train with:
```shell
# single gpu
python tools/train.py configs/cqdino/cqdino_tree_swinb22k_v3det.py
# multi gpu
bash tools/dist_train.sh configs/cqdino/cqdino_tree_swinb22k_v3det.py NUM_GPUs
```
To evaluate, download the checkpoint from Google Drive.
Test with:
```shell
# single gpu
python tools/test.py configs/cqdino/cqdino_tree_swinb22k_v3det.py cqdino_swinb22k_v3det.pth
# multi gpu
bash tools/dist_test.sh configs/cqdino/cqdino_tree_swinb22k_v3det.py cqdino_swinb22k_v3det.pth NUM_GPUs
```
If you find CQ-DINO useful, please cite:
```bibtex
@article{sun2025cq,
  title={{CQ-DINO}: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection},
  author={Sun, Zhichao and Hu, Huazhang and Ma, Yidong and Liu, Gang and Chen, Nemo and Tang, Xu and Xu, Yongchao},
  journal={arXiv preprint arXiv:2503.18430},
  year={2025}
}
```