Tri-Axial Scaling in Aerial Object Detection: Model Size, Dataset Size and Quality, and Test-Time Inference in the CADOT Challenge
By Team Double J (Jie): Yi Jie WONG & Jing Jie TAN et al.
Our team ranked 1st globally in the IEEE Big Data Cup 2024 (BEGC2024) challenge! 🏅🎉🥳 Our approach is simple: scale everything! We propose a systematic Tri-Axial Scaling approach to aerial object detection via:
- Model Size
- Dataset Size & Quality
- Test-Time Inference
We achieve this Tri-Axial Scaling by:
- Scaling model size
- Diffusion Augmentation & Balanced Data Sampling
- Test-Time Inference = Test-Time Augmentation + Ensemble Models
From our experiments, we observe that:
- A larger model can learn more effectively from a noisy and imbalanced dataset compared to a smaller model.
- A larger model benefits more from dataset size scaling.
- A smaller model can also achieve performance comparable to a larger model through balanced data sampling.
- A larger model tends to overfit when using a balanced data sampling strategy, but this can be mitigated by increasing the amount of data (hence, data scaling).
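To make the balanced data sampling idea concrete, here is a minimal, self-contained sketch (not our actual training code, which lives in `train_balanced.py`): each sample is drawn with probability inversely proportional to its class frequency, so rare classes appear about as often as common ones. The class names and counts below are made-up toy data.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each sample inversely to its class frequency,
    so rare classes are drawn as often as common ones."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]

# Hypothetical toy labels: "car" heavily outnumbers "helipad"
labels = ["car"] * 90 + ["helipad"] * 10
weights = inverse_frequency_weights(labels)

random.seed(0)
drawn = random.choices(labels, weights=weights, k=10_000)
counts = Counter(drawn)
# With these weights, both classes are drawn roughly equally often,
# even though "car" dominates the raw dataset.
```

Because the total weight per class is equal (90 × 1/90 = 10 × 1/10), the sampler sees a balanced class distribution without duplicating or discarding any images.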
⬆️ Our diffusion augmentation pipeline converts annotations into synthetic images.
This figure is adapted from the method we proposed in another competition.
We modified the pipeline to support bbox → segmentation mask → image generation.
A more up-to-date figure will be added here soon!
To avoid overcomplicating this repo, the code for diffusion augmentation lives in a separate repo.
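To give a rough idea of the bbox → segmentation mask step, here is a minimal, self-contained sketch of rasterizing bounding boxes into a per-pixel class-id mask that could serve as conditioning for an image generator. The function name, box format, and class ids are our own illustrative choices, not the actual pipeline code.

```python
def bboxes_to_mask(width, height, boxes):
    """Rasterize bounding boxes into a per-pixel class-id mask.

    boxes: list of (class_id, x_min, y_min, x_max, y_max) in pixel coords.
    Pixels outside every box stay 0 (background); later boxes
    overwrite earlier ones where they overlap.
    """
    mask = [[0] * width for _ in range(height)]
    for class_id, x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = class_id
    return mask

# Two overlapping boxes on a tiny 8x8 canvas (class ids 1 and 2)
mask = bboxes_to_mask(8, 8, [(1, 0, 0, 4, 4), (2, 3, 3, 8, 8)])
```

A real pipeline would additionally encode the mask as an image (one channel or color per class) before feeding it to the diffusion model, but the core bbox-to-mask mapping is this simple.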
⬆️ Scaling Model Size vs Scaling Data Size vs Scaling Test-Time Inference
A larger model is more effective at learning from an imbalanced dataset.
A larger model also benefits from data size scaling, even in the presence of class imbalance.
⬆️ Scaling Model Size vs Scaling Data Quality vs Scaling Test-Time Inference
Smaller models benefit more from balanced sampling than larger models do.
However, we still see evidence that a larger model (YOLO12s) outperforms a smaller model (YOLO12n).
We hypothesize that a bigger dataset is required to unlock the full potential of YOLO12x.
⬆️ Finally, we unleash the full potential of test-time scaling using an ensemble of models and TTA.
We apply Test-Time Augmentation to all models in our ensemble to increase the detection rate.
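As a rough sketch of how horizontal-flip TTA works (this is an illustration, not the Ultralytics implementation): boxes predicted on the flipped image are mapped back to original-image coordinates, then pooled with the predictions from the unflipped image before fusion. The box format and helper names below are our own.

```python
def hflip_box(box, img_width):
    """Map a box predicted on a horizontally flipped image back to
    original-image coordinates. box = (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return (img_width - x1, y0, img_width - x0, y1)

def tta_merge(preds_original, preds_flipped, img_width):
    """Pool predictions from the original and flipped views.
    Each prediction is (box, score); flipped boxes are un-flipped first."""
    mapped = [(hflip_box(b, img_width), s) for b, s in preds_flipped]
    return preds_original + mapped

# The same object detected in both views of a 640-px-wide image
orig = [((10, 20, 50, 60), 0.9)]
flipped = [((590, 20, 630, 60), 0.8)]  # coordinates in the flipped frame
merged = tta_merge(orig, flipped, img_width=640)
```

After un-flipping, both detections land on the same region, so a downstream fusion step (e.g. weighted box fusion) can merge them into one higher-confidence box.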
A detailed elaboration of our solution can be found in our preprint.
- Step 1: Setup Repo
- Step 2: Setup Dataset
- Step 3 (shortcut): Download Our Trained Models
- Step 3: Training
- Step 4: Inference
👆 Please refer to our Colab link to try out our code seamlessly! You might need Colab Pro to train the larger YOLO variants.
Conda environment

```bash
conda create --name yolo python=3.10.12 -y
conda activate yolo
```

Clone this repo

```bash
# clone this repo
git clone https://github.com/yjwong1999/Double_J_CADOT_Challenge.git
cd Double_J_CADOT_Challenge
```

Install dependencies
```bash
# Please adjust the torch version according to your OS
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121

# install Jupyter Notebook
pip install jupyter notebook==7.1.0

# install this version of ultralytics (for its dependencies)
pip install ultralytics==8.3.111

# uninstall the default ultralytics and install our ultralytics fork that supports Timm pretrained models
pip uninstall ultralytics -y
pip install git+https://github.com/DoubleY-BEGC2024/ultralytics-timm.git

# install this to use weighted box fusion
pip install ensemble-boxes

# remaining dependencies
pip install pycocotools
pip install requests==2.32.3
pip install click==8.1.7
```

Convert the CADOT dataset from the default COCO format to the YOLO annotation format:

```bash
python setup_data.py
```

🚅 To avoid training all our models, which is time-consuming, you can download our trained models using the provided bash script. Alternatively, you can manually download our models from Dropbox (in case the .sh file does not work on a Windows machine).

```bash
bash download_our_model.sh
```

❗Note that due to time constraints, we did not run all possible experiments. In general, our hyperparameters were chosen as follows:
- If trained with balanced sampling: batch size = 8, image size = 960, epochs = 100 for the smallest YOLO12n, 50 for YOLO12s, and 30 for YOLO12x
- If trained without balanced sampling: batch size = 16, image size = 640, epochs = 100
Ideally, we would have set all image sizes to 960, but we only considered this at a later stage. Moreover, a higher image size increases GPU memory requirements, so we had to lower the batch size. As for epochs, we set them all to 100 when training without balanced sampling. When training with balanced sampling, we found that larger models tend to overfit, so we reduced the number of epochs.
```bash
# train ResNext101-YOLO12 naively without tricks
python3 train.py --model-name "../cfg/yolo12-resnext101-timm.yaml" --epoch 100 --batch 16 --imgsz 640

# train yolo12n using balanced sampling
python3 train_balanced.py --model-name "yolo12n.pt" --epoch 100 --batch 8 --imgsz 960

# train yolo12s using balanced sampling
python3 train_balanced.py --model-name "yolo12s.pt" --epoch 50 --batch 8 --imgsz 960

# setup our synthetic dataset (generated via diffusion augmentation)
python setup_synthetic_data.py

# train yolo12x with synthetic data only
python3 train_balanced.py --model-name "yolo12x.pt" --epoch 100 --batch 16 --imgsz 640

# train yolo12x using balanced sampling and synthetic data
python3 train_balanced.py --model-name "yolo12x.pt" --epoch 100 --batch 8 --imgsz 960
```

Move all 5 trained models into the Double_J_CADOT_Challenge/models directory for ensemble model inference:

```bash
python3 move_models.py
```

Ensemble model + Test-Time Augmentation:

```bash
# run the inference code
python3 infer.py --tta all
```

❗Note that:
- Even when using the exact same dependencies (torch/numpy/ultralytics/etc.), you might not obtain the same results.
- This is because different machines, different CUDA versions, and different random seeds can also contribute to variations in results.
- For instance, we tested training the exact same model with the same hyperparameter configuration on the same A100 GPU, but on Google Colab and on Lightning AI.
- The performance discrepancies between the two models trained on different platforms were still noticeable.
- Hence, you might not be able to reproduce our exact results.
- Nevertheless, we believe our results on tri-axial scaling are valuable to the community 🤗
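For readers curious how weighted box fusion combines the ensemble's detections, here is a deliberately simplified, pure-Python sketch of the core idea: boxes that overlap strongly are merged into a single box whose coordinates are score-weighted means. Our actual pipeline uses the ensemble-boxes package, whose implementation differs in detail (normalized coordinates, per-model weights, score rescaling); all names below are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fuse_boxes(preds, iou_thr=0.55):
    """Greedy score-weighted fusion: predictions (box, score) that overlap
    an existing cluster (IoU > iou_thr) join it; each cluster is then
    collapsed to a score-weighted mean box with an averaged score."""
    preds = sorted(preds, key=lambda p: -p[1])  # highest score first
    clusters = []
    for box, score in preds:
        for cluster in clusters:
            if iou(box, cluster[0][0]) > iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    out = []
    for cluster in clusters:
        total = sum(s for _, s in cluster)
        coords = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
        out.append((coords, total / len(cluster)))
    return out

# Two models agree on one object (slightly shifted boxes); a third
# detection elsewhere stays separate.
fused = fuse_boxes([((10, 10, 50, 50), 0.9),
                    ((12, 12, 52, 52), 0.9),
                    ((200, 200, 240, 240), 0.6)])
```

Unlike NMS, which keeps only the single highest-scoring box and discards the rest, this fusion uses every overlapping prediction to refine the final coordinates, which is why it pairs well with TTA and multi-model ensembles.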
We would like to express our gratitude to the CADOT organizers for hosting this exciting challenge!
Our solution has been invited to IEEE ICIP 2025! Please cite our paper if this repo helps your research. The preprint is available here.
```bibtex
@InProceedings{Wong2024,
  title     = {Tri-Axial Scaling in Aerial Object Detection: Model Size, Dataset Size and Quality, and Test-Time Inference in the CADOT Challenge},
  author    = {Yi Jie Wong and Jing Jie Tan and Mau-Luen Tham and Ban-Hoe Kwan and Yan Chai Hum},
  booktitle = {2025 IEEE International Conference on Image Processing (ICIP)},
  year      = {2025}
}
```