
Commit a8d0e3c

support Oriented GLIP
1 parent 122ef40 commit a8d0e3c

7 files changed: +1799 -0 lines changed

projects/GLIP/README.md

Lines changed: 145 additions & 0 deletions
# [Oriented GLIP] GLIP: Grounded Language-Image Pre-training

> [GLIP: Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)

## Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head.

<div align=center>
<img src="https://github.com/open-mmlab/mmyolo/assets/17425982/b87228d7-f000-4a5d-b103-fe535984417a"/>
</div>

## Installation

```shell
cd $MMDETROOT

# source installation
pip install -r requirements/multimodal.txt

# or mim installation
mim install mmdet[multimodal]
```

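After installation, a quick import check helps confirm the multimodal extras are available. A minimal sketch; it only assumes that `mmdet` and the `transformers` dependency pulled in by `requirements/multimodal.txt` are importable:

```python
# Sanity check: the detector package and the BERT dependency both import cleanly.
import mmdet
import transformers

print('mmdet:', mmdet.__version__)
print('transformers:', transformers.__version__)
```
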
- NOTE

GLIP utilizes BERT as the language model, which requires access to https://huggingface.co/. If you encounter connection errors caused by restricted network access, you can download the required files on a machine with internet access, save them locally, and then point the `lang_model_name` field in the config to the local path. Please refer to the following code:

```python
from transformers import BertConfig, BertModel
from transformers import AutoTokenizer

config = BertConfig.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, config=config)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

config.save_pretrained("your path/bert-base-uncased")
model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```

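After saving the files, point the config at them. A minimal sketch of the corresponding config change (the local path below is only a placeholder):

```python
# In your experiment config: replace the Hugging Face identifier with the
# directory where the files were saved (placeholder path shown).
lang_model_name = '/your/local/path/bert-base-uncased'
```
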
## Dataset Preparation

- Step 1: download the NWPU-RESISC45 dataset and organize it as follows (a quick layout check is sketched after this list):

```text
├── NWPU-RESISC45
    └── NWPU-RESISC45
        ├── CLASS 1
        ├── CLASS 2
        └── ...
```

- Step 2: prepare the OVD dataset.

```shell
python projects/GroundingDINO/tools/prepare_ovdg_dataset.py \
    --data_dir data/NWPU-RESISC45/NWPU-RESISC45 \
    --save_path data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json
```

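As referenced in Step 1, a small layout check can catch download or extraction mistakes before running the preparation script. A sketch only; it assumes the default `data/NWPU-RESISC45/NWPU-RESISC45` root used above and simply counts class folders and files:

```python
from pathlib import Path

# Assumed dataset root from Steps 1-2; adjust if your data lives elsewhere.
root = Path('data/NWPU-RESISC45/NWPU-RESISC45')
assert root.is_dir(), f'missing dataset root: {root}'

class_dirs = sorted(p for p in root.iterdir() if p.is_dir())
print(f'{len(class_dirs)} class folders under {root}')
for d in class_dirs[:5]:
    n_files = sum(1 for f in d.iterdir() if f.is_file())
    print(f'  {d.name}: {n_files} files')
```
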
## Quick Start

```shell
bash projects/GLIP/run.sh
```

## Training

> **Note**: we follow a training pipeline similar to CastDet's.

- Step 1: train the base detector

```shell
exp1="glip_atss_r50_a_fpn_dyhead_visdronezsd_base"
python tools/train.py projects/GLIP/configs/$exp1.py
```

- **[Optional]** Step 2: pseudo-labeling (a quick check of the merged file is sketched after Step 3)

```shell
# 2.1 pseudo-labeling
exp2="glip_atss_r50_a_fpn_dyhead_visdronezsd_base_nwpu45_pseudo_labeling"
python tools/test.py \
    projects/GLIP/configs/$exp2.py \
    work_dirs/$exp1/iter_20000.pth

# 2.2 merge predictions
python projects/GroundingDINO/tools/merge_ovdg_preds.py \
    --ann_path data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json \
    --pred_path work_dirs/$exp2/nwpu45_pseudo_labeling_2.bbox.json \
    --save_path work_dirs/$exp2/nwpu45_unlabeled_with_glip_pseudos_2.json

# move to data folder
cp work_dirs/$exp2/nwpu45_unlabeled_with_glip_pseudos_2.json data/NWPU-RESISC45/annotations/nwpu45_unlabeled_with_glip_pseudos_2.json
```

- **[Optional]** Step 3: post-training

```shell
exp3="glip_atss_r50_a_fpn_dyhead_visdronezsd_base_nwpu"
python tools/train.py \
    projects/GLIP/configs/$exp3.py
```

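As noted in Step 2, it can help to sanity-check the merged pseudo-label file before post-training. A minimal sketch, assuming the merged JSON follows the usual COCO-style layout (`images` / `annotations` / `categories`); adjust the keys if the actual format produced by `merge_ovdg_preds.py` differs:

```python
import json

# Path produced in Step 2.2; the COCO-style keys below are an assumption.
path = 'data/NWPU-RESISC45/annotations/nwpu45_unlabeled_with_glip_pseudos_2.json'
with open(path) as f:
    ann = json.load(f)

print('images:', len(ann.get('images', [])))
print('pseudo boxes:', len(ann.get('annotations', [])))
print('categories:', [c.get('name') for c in ann.get('categories', [])])
```
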
## Evaluation

```shell
python tools/test.py \
    projects/GLIP/configs/$exp3.py \
    work_dirs/$exp3/iter_10000.pth \
    --work-dir work_dirs/$exp3/dior_test
```

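To double-check what a given experiment config resolves to before launching a run (for example, the iteration schedule or the test-time NMS settings), it can be loaded with MMEngine. A minimal sketch; the path is the Step 1 base config from the Training section, and the printed fields are only examples:

```python
from mmengine.config import Config

# Base-detector config from Training Step 1; swap in the $exp2/$exp3 config as needed.
cfg = Config.fromfile(
    'projects/GLIP/configs/glip_atss_r50_a_fpn_dyhead_visdronezsd_base.py')

print(cfg.model.bbox_head.type)  # detection head used by this experiment
print(cfg.train_cfg)             # iteration-based schedule (max_iters, val_interval)
print(cfg.model.test_cfg.nms)    # rotated NMS settings applied at test time
```
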
## Acknowledgement

Thanks to the wonderful open-source projects [MMDetection](https://github.com/open-mmlab/mmdetection), [MMRotate](https://github.com/open-mmlab/mmrotate), and [GLIP](https://github.com/microsoft/GLIP)!

## Citation

```bibtex
// Oriented GLIP (this repo)
@misc{li2024exploitingunlabeleddatamultiple,
      title={Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation},
      author={Yan Li and Weiwei Guo and Xue Yang and Ning Liao and Shaofeng Zhang and Yi Yu and Wenxian Yu and Junchi Yan},
      year={2024},
      eprint={2411.02057},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.02057},
}

// GLIP (Horizontal detection)
@inproceedings{li2021grounded,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2022},
      booktitle={CVPR},
}
```

Lines changed: 190 additions & 0 deletions
_base_ = [
    'mmrotate::_base_/datasets/visdronezsd.py',
    'mmrotate::_base_/default_runtime.py'
]
angle_version = 'le90'
lang_model_name = 'bert-base-uncased'
batch_size = 8
num_workers = 2

custom_imports = dict(
    imports=['projects.GLIP.glip'], allow_failed_imports=False)

model = dict(
    type='mmdet.GLIP',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[103.53, 116.28, 123.675],
        std=[57.375, 57.12, 58.395],
        bgr_to_rgb=False,
        pad_size_divisor=32,
        boxtype2tensor=False),
    backbone=dict(
        type='mmdet.ResNet',
        depth=50,
        num_stages=4,
        out_indices=(1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='mmdet.FPN_DropBlock',
        plugin=dict(
            type='mmdet.DropBlock',
            drop_prob=0.3,
            block_size=3,
            warmup_iters=0),
        in_channels=[512, 1024, 2048],
        out_channels=256,
        start_level=0,
        relu_before_extra_convs=True,
        add_extra_convs='on_output',
        num_outs=5),
    bbox_head=dict(
        type='RotatedATSSVLFusionHead',
        lang_model_name=lang_model_name,
        num_classes=20,
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='FakeRotatedAnchorGenerator',
            angle_version=angle_version,
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHTRBBoxCoder',
            angle_version=angle_version,
            norm_factor=None,
            edge_swap=True,
            proj_xy=True,
            target_means=(.0, .0, .0, .0, .0),
            target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),
        loss_cls=dict(
            type='mmdet.FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='RotatedIoULoss', mode='linear', loss_weight=2.0),
        loss_centerness=dict(
            type='mmdet.CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)),
    language_model=dict(type='mmdet.BertModel', name=lang_model_name),
    train_cfg=dict(
        assigner=dict(
            type='RotatedATSSAssigner',
            topk=9,
            iou_calculator=dict(type='RBboxOverlaps2D')),
        sampler=dict(
            type='mmdet.PseudoSampler'),  # Focal loss should use PseudoSampler
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=2000,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms_rotated', iou_threshold=0.1),
        max_per_img=2000))

# dataset settings
train_pipeline = [
    dict(type='mmdet.LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(type='mmdet.Resize', scale=(800, 800), keep_ratio=True),
    dict(type='mmdet.FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
    dict(
        type='mmdet.RandomFlip',
        prob=0.75,
        direction=['horizontal', 'vertical', 'diagonal']),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'flip', 'flip_direction', 'text',
                   'custom_entities'))
]

val_pipeline = [
    dict(type='mmdet.LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='mmdet.Resize', scale=(800, 800), keep_ratio=True),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'text', 'custom_entities'))
]

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    sampler=dict(type='DefaultSampler'),
    dataset=dict(
        pipeline=train_pipeline,
        return_classes=True))

val_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    dataset=dict(
        pipeline=val_pipeline,
        return_classes=True))

# test_dataloader = val_dataloader
test_dataloader = dict(
    batch_size=2,
    num_workers=num_workers,
    dataset=dict(
        ann_file='ImageSets/Main/test.txt',
        # data_prefix=dict(img_path='JPEGImages-trainval'),
        pipeline=val_pipeline,
        return_classes=True)
)

# training schedule for 20k iterations
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=20000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1.0 / 3, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=20000,
        by_epoch=False,
        milestones=[16000, 18000],
        gamma=0.1)
]

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001),
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }),
    clip_grad=dict(max_norm=35, norm_type=2))

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=2000, max_keep_ckpts=1))
log_processor = dict(by_epoch=False)

_base_.visualizer.vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend')
]
