
Add DINOv3 Backbone #389

Open
YoussefAboelwafa wants to merge 4 commits into IDEA-Research:main from YoussefAboelwafa:main

Conversation

@YoussefAboelwafa

Overview

This PR adds support for using DINOv3 (self-supervised vision transformers) as backbones in the detrex framework, specifically integrated with the DETA object detection model.

DINOv3 represents a new generation of self-supervised vision models that achieve state-of-the-art performance across various vision tasks. By integrating DINOv3 backbones, users can leverage powerful pretrained representations for object detection.

Motivation

  1. State-of-the-art Pretraining: DINOv3 models are pretrained on a massive dataset (LVD-1689M) using self-supervised learning, providing robust visual representations
  2. Model Variety: Supports multiple architectures (ViT-S/B/L/H and ConvNeXt variants) allowing flexibility in model capacity vs. efficiency tradeoffs
  3. Community Interest: Growing adoption of DINOv3 in computer vision research warrants native support in detrex
  4. Consistent API: Follows detrex's existing backbone patterns (similar to EVA integration)

Changes Made

New Files

  1. detrex/modeling/backbone/dinov3_backbone.py

    • Main backbone wrapper implementing DINOv3Backbone and DINOv3SimpleFeaturePyramid
    • Supports 7 ViT variants and 4 ConvNeXt variants
    • Provides multi-scale feature extraction from intermediate layers
    • Handles checkpoint loading and weight freezing
    • Full documentation with usage examples
  2. projects/deta/configs/models/deta_dinov3.py

    • Model configuration for DETA with DINOv3 backbone
    • Defines default architecture using ViT-Base
    • Configures ChannelMapper neck for feature pyramid generation
    • Sets up DeformableDETR transformer and DETA criterion
  3. projects/deta/configs/deta_dinov3_vitb16.py

    • Complete training configuration for DETA + DINOv3-ViT-Base
    • Includes extensive comments and documentation
    • Provides examples for switching between different DINOv3 variants
    • Configures optimizer, dataloader, and training hyperparameters
  4. projects/deta/configs/README_DINOV3.md

    • Comprehensive documentation for using DINOv3 with DETA
    • Setup instructions and checkpoint download links
    • Configuration examples for different model variants
    • Training tips and best practices
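Based on the file descriptions above, the model config presumably wires the new backbone into DETA via detrex's LazyCall pattern. The sketch below is hypothetical: `DINOv3Backbone` is named in this PR, but every constructor argument shown here is an assumption about the API, not the merged code.

```python
# Hypothetical detrex config fragment (argument names are assumptions):
from detectron2.config import LazyCall as L
from detrex.modeling.backbone import DINOv3Backbone

backbone = L(DINOv3Backbone)(
    model_name="dinov3_vitb16",                        # assumed variant identifier
    pretrained_weights="/path/to/dinov3_vitb16.pth",   # official DINOv3 checkpoint
    out_indices=(3, 5, 7, 11),                         # intermediate layers tapped for multi-scale features
    freeze=True,                                       # frozen-backbone transfer learning
)
```

A config like this would then feed the backbone's outputs into the ChannelMapper neck described in `deta_dinov3.py`.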

Modified Files

  1. detrex/modeling/backbone/__init__.py
    • Added imports for DINOv3Backbone and DINOv3SimpleFeaturePyramid
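With that export in place, the new classes would be importable from the package root; a sketch of the intended usage (class names taken from the PR description):

```python
# After this PR, the new backbones are exposed alongside the existing ones:
from detrex.modeling.backbone import DINOv3Backbone, DINOv3SimpleFeaturePyramid
```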

Features

Supported Models

Vision Transformers (ViT):

  • ViT-Small/16 (384 dims, 12 layers)
  • ViT-Base/16 (768 dims, 12 layers)
  • ViT-Large/16 (1024 dims, 24 layers)
  • ViT-Huge+/16 (1280 dims, 32 layers)

ConvNeXt:

  • ConvNeXt-Tiny
  • ConvNeXt-Small
  • ConvNeXt-Base
  • ConvNeXt-Large
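A variant registry is one natural way to encode the ViT figures listed above. The key strings here are illustrative assumptions, not necessarily the PR's actual identifiers; the dimensions and depths come from the list itself.

```python
# Hypothetical registry of DINOv3 ViT variants (keys are assumed names;
# embed_dim/depth values are taken from the variant list above).
DINOV3_VIT_VARIANTS = {
    "dinov3_vits16": {"embed_dim": 384, "depth": 12},       # ViT-Small/16
    "dinov3_vitb16": {"embed_dim": 768, "depth": 12},       # ViT-Base/16
    "dinov3_vitl16": {"embed_dim": 1024, "depth": 24},      # ViT-Large/16
    "dinov3_vith16plus": {"embed_dim": 1280, "depth": 32},  # ViT-Huge+/16
}

def variant_spec(name):
    """Look up the embedding dimension and depth for a named ViT variant."""
    return DINOV3_VIT_VARIANTS[name]
```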

Key Capabilities

  1. Flexible Feature Extraction: Extract features from any intermediate layer
  2. Frozen or Fine-tuned Modes: Support for transfer learning with frozen backbones or end-to-end fine-tuning
  3. Multi-scale Features: Generate hierarchical features similar to traditional CNN backbones
  4. Checkpoint Loading: Seamless loading of official DINOv3 pretrained weights
  5. Memory Efficient: Optional gradient checkpointing for large models
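Since plain ViTs emit a single-stride feature map, the multi-scale capability is typically realized by resampling one backbone level at several scale factors, as in ViTDet's SimpleFeaturePyramid (which detrex's EVA integration also follows). The helper and scale factors below are an illustrative assumption about how `DINOv3SimpleFeaturePyramid` likely maps patch stride to pyramid strides, not the PR's code.

```python
def pyramid_strides(patch_stride, scale_factors):
    """Output strides of a simple feature pyramid built from one ViT level.

    Each scale factor upsamples (>1) or downsamples (<1) the patch-grid
    features, so the effective stride is patch_stride / scale_factor.
    """
    return [int(patch_stride / s) for s in scale_factors]

# For a ViT/16 backbone with ViTDet-style scale factors:
# pyramid_strides(16, [4.0, 2.0, 1.0, 0.5]) -> strides [4, 8, 16, 32]
```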

Usage Example

```shell
# Basic usage with default ViT-Base
python projects/deta/train_net.py \
    --config-file projects/deta/configs/deta_dinov3_vitb16.py \
    --num-gpus 4
```

Backward Compatibility

  • No breaking changes to existing code
  • Purely additive changes
  • Existing configs and models unaffected
