
Commit a8d0e3c

support Oriented GLIP
1 parent 122ef40 commit a8d0e3c

7 files changed: +1799 -0 lines changed

projects/GLIP/README.md

Lines changed: 145 additions & 0 deletions
# [Oriented GLIP] GLIP: Grounded Language-Image Pre-training

> [GLIP: Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)

## Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.

<div align=center>
<img src="https://github.com/open-mmlab/mmyolo/assets/17425982/b87228d7-f000-4a5d-b103-fe535984417a"/>
</div>

## Installation

```shell
cd $MMDETROOT

# source installation
pip install -r requirements/multimodal.txt

# or mim installation
mim install mmdet[multimodal]
```

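After installation, a quick import check helps confirm the multimodal extras are available. A minimal sketch; it only assumes that `mmdet` and the `transformers` dependency pulled in by `requirements/multimodal.txt` are importable:

```python
# Sanity check: the detector package and the BERT dependency both import cleanly.
import mmdet
import transformers

print('mmdet:', mmdet.__version__)
print('transformers:', transformers.__version__)
```
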
- NOTE

GLIP utilizes BERT as the language model, which requires access to https://huggingface.co/. If you encounter connection errors caused by restricted network access, you can download the required files on a machine with internet access, save them locally, and then point the `lang_model_name` field in the config to the local path. Please refer to the following code:

```python
from transformers import BertConfig, BertModel
from transformers import AutoTokenizer

config = BertConfig.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, config=config)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

config.save_pretrained("your path/bert-base-uncased")
model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```

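After saving the files, point the config at them. A minimal sketch of the corresponding config change (the local path below is only a placeholder):

```python
# In your experiment config: replace the Hugging Face identifier with the
# directory where the files were saved (placeholder path shown).
lang_model_name = '/your/local/path/bert-base-uncased'
```
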
## Dataset Preparation

- Step 1: download the NWPU-RESISC45 dataset and organize it as follows (a quick layout check is sketched after this list):

```text
├── NWPU-RESISC45
    └── NWPU-RESISC45
        ├── CLASS 1
        ├── CLASS 2
        └── ...
```

- Step 2: prepare the OVD dataset.

```shell
python projects/GroundingDINO/tools/prepare_ovdg_dataset.py \
    --data_dir data/NWPU-RESISC45/NWPU-RESISC45 \
    --save_path data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json
```

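As referenced in Step 1, a small layout check can catch download or extraction mistakes before running the preparation script. A sketch only; it assumes the default `data/NWPU-RESISC45/NWPU-RESISC45` root used above and simply counts class folders and files:

```python
from pathlib import Path

# Assumed dataset root from Steps 1-2; adjust if your data lives elsewhere.
root = Path('data/NWPU-RESISC45/NWPU-RESISC45')
assert root.is_dir(), f'missing dataset root: {root}'

class_dirs = sorted(p for p in root.iterdir() if p.is_dir())
print(f'{len(class_dirs)} class folders under {root}')
for d in class_dirs[:5]:
    n_files = sum(1 for f in d.iterdir() if f.is_file())
    print(f'  {d.name}: {n_files} files')
```
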
## Quick Start

```shell
bash projects/GLIP/run.sh
```

## Training

> **Note**: we follow a training pipeline similar to CastDet's.

- Step 1: train the base detector

```shell
exp1="glip_atss_r50_a_fpn_dyhead_visdronezsd_base"
python tools/train.py projects/GLIP/configs/$exp1.py
```

- **[Optional]** Step 2: pseudo-labeling (a quick check of the merged file is sketched after Step 3)

```shell
# 2.1 pseudo-labeling
exp2="glip_atss_r50_a_fpn_dyhead_visdronezsd_base_nwpu45_pseudo_labeling"
python tools/test.py \
    projects/GLIP/configs/$exp2.py \
    work_dirs/$exp1/iter_20000.pth

# 2.2 merge predictions
python projects/GroundingDINO/tools/merge_ovdg_preds.py \
    --ann_path data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json \
    --pred_path work_dirs/$exp2/nwpu45_pseudo_labeling_2.bbox.json \
    --save_path work_dirs/$exp2/nwpu45_unlabeled_with_glip_pseudos_2.json

# move to data folder
cp work_dirs/$exp2/nwpu45_unlabeled_with_glip_pseudos_2.json data/NWPU-RESISC45/annotations/nwpu45_unlabeled_with_glip_pseudos_2.json
```

- **[Optional]** Step 3: post-training

```shell
exp3="glip_atss_r50_a_fpn_dyhead_visdronezsd_base_nwpu"
python tools/train.py \
    projects/GLIP/configs/$exp3.py
```

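As noted in Step 2, it can help to sanity-check the merged pseudo-label file before post-training. A minimal sketch, assuming the merged JSON follows the usual COCO-style layout (`images` / `annotations` / `categories`); adjust the keys if the actual format produced by `merge_ovdg_preds.py` differs:

```python
import json

# Path produced in Step 2.2; the COCO-style keys below are an assumption.
path = 'data/NWPU-RESISC45/annotations/nwpu45_unlabeled_with_glip_pseudos_2.json'
with open(path) as f:
    ann = json.load(f)

print('images:', len(ann.get('images', [])))
print('pseudo boxes:', len(ann.get('annotations', [])))
print('categories:', [c.get('name') for c in ann.get('categories', [])])
```
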
## Evaluation

```shell
python tools/test.py \
    projects/GLIP/configs/$exp3.py \
    work_dirs/$exp3/iter_10000.pth \
    --work-dir work_dirs/$exp3/dior_test
```

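To double-check what a given experiment config resolves to before launching a run (for example, the iteration schedule or the test-time NMS settings), it can be loaded with MMEngine. A minimal sketch; the path is the Step 1 base config from the Training section, and the printed fields are only examples:

```python
from mmengine.config import Config

# Base-detector config from Training Step 1; swap in the $exp2/$exp3 config as needed.
cfg = Config.fromfile(
    'projects/GLIP/configs/glip_atss_r50_a_fpn_dyhead_visdronezsd_base.py')

print(cfg.model.bbox_head.type)  # detection head used by this experiment
print(cfg.train_cfg)             # iteration-based schedule (max_iters, val_interval)
print(cfg.model.test_cfg.nms)    # rotated NMS settings applied at test time
```
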
## Acknowledgement

Thanks to the wonderful open-source projects [MMDetection](https://github.com/open-mmlab/mmdetection), [MMRotate](https://github.com/open-mmlab/mmrotate), and [GLIP](https://github.com/microsoft/GLIP)!

## Citation

```bibtex
// Oriented GLIP (this repo)
@misc{li2024exploitingunlabeleddatamultiple,
      title={Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation},
      author={Yan Li and Weiwei Guo and Xue Yang and Ning Liao and Shaofeng Zhang and Yi Yu and Wenxian Yu and Junchi Yan},
      year={2024},
      eprint={2411.02057},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.02057},
}

// GLIP (Horizontal detection)
@inproceedings{li2021grounded,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2022},
      booktitle={CVPR},
}
```

Lines changed: 190 additions & 0 deletions
_base_ = [
    'mmrotate::_base_/datasets/visdronezsd.py',
    'mmrotate::_base_/default_runtime.py'
]
angle_version = 'le90'
lang_model_name = 'bert-base-uncased'
batch_size = 8
num_workers = 2

custom_imports = dict(
    imports=['projects.GLIP.glip'], allow_failed_imports=False)

model = dict(
    type='mmdet.GLIP',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[103.53, 116.28, 123.675],
        std=[57.375, 57.12, 58.395],
        bgr_to_rgb=False,
        pad_size_divisor=32,
        boxtype2tensor=False),
    backbone=dict(
        type='mmdet.ResNet',
        depth=50,
        num_stages=4,
        out_indices=(1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='mmdet.FPN_DropBlock',
        plugin=dict(
            type='mmdet.DropBlock',
            drop_prob=0.3,
            block_size=3,
            warmup_iters=0),
        in_channels=[512, 1024, 2048],
        out_channels=256,
        start_level=0,
        relu_before_extra_convs=True,
        add_extra_convs='on_output',
        num_outs=5),
    bbox_head=dict(
        type='RotatedATSSVLFusionHead',
        lang_model_name=lang_model_name,
        num_classes=20,
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='FakeRotatedAnchorGenerator',
            angle_version=angle_version,
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHTRBBoxCoder',
            angle_version=angle_version,
            norm_factor=None,
            edge_swap=True,
            proj_xy=True,
            target_means=(.0, .0, .0, .0, .0),
            target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),
        loss_cls=dict(
            type='mmdet.FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='RotatedIoULoss', mode='linear', loss_weight=2.0),
        loss_centerness=dict(
            type='mmdet.CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)),
    language_model=dict(type='mmdet.BertModel', name=lang_model_name),
    train_cfg=dict(
        assigner=dict(
            type='RotatedATSSAssigner',
            topk=9,
            iou_calculator=dict(type='RBboxOverlaps2D')),
        sampler=dict(
            type='mmdet.PseudoSampler'),  # Focal loss should use PseudoSampler
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=2000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms_rotated', iou_threshold=0.1),
        max_per_img=2000))

# dataset settings
train_pipeline = [
    dict(type='mmdet.LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(type='mmdet.Resize', scale=(800, 800), keep_ratio=True),
    dict(type='mmdet.FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
    dict(
        type='mmdet.RandomFlip',
        prob=0.75,
        direction=['horizontal', 'vertical', 'diagonal']),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'flip', 'flip_direction', 'text',
                   'custom_entities'))
]

val_pipeline = [
    dict(type='mmdet.LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='mmdet.Resize', scale=(800, 800), keep_ratio=True),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'text', 'custom_entities'))
]

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    sampler=dict(type='DefaultSampler'),
    dataset=dict(
        pipeline=train_pipeline,
        return_classes=True))

val_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    dataset=dict(
        pipeline=val_pipeline,
        return_classes=True))

# test_dataloader = val_dataloader
test_dataloader = dict(
    batch_size=2,
    num_workers=num_workers,
    dataset=dict(
        ann_file='ImageSets/Main/test.txt',
        # data_prefix=dict(img_path='JPEGImages-trainval'),
        pipeline=val_pipeline,
        return_classes=True)
)

# training schedule for 20k iterations
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=20000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1.0 / 3, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=20000,
        by_epoch=False,
        milestones=[16000, 18000],
        gamma=0.1)
]

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001),
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }),
    clip_grad=dict(max_norm=35, norm_type=2))

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=2000, max_keep_ckpts=1))
log_processor = dict(by_epoch=False)

_base_.visualizer.vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend')
]
