Comprehensive documentation for object detection models, from two-stage detectors to modern transformer-based approaches.
- DETR - Detection Transformer: End-to-end object detection with transformers
- RT-DETR - Real-Time DETR: Fast transformer detector
- Grounding DINO - Open-set detection with language grounding
- YOLO-World - Open-vocabulary YOLO detector
- Faster R-CNN - Region-based CNN with Region Proposal Network
- Cascade R-CNN - Multi-stage refinement for better localization
- Mask R-CNN - Instance segmentation via a per-RoI mask prediction branch
- Keypoint R-CNN - Human pose estimation extension
- YOLOv10 - Real-time YOLO with NMS-free training (consistent dual assignments)
| Model | Type | Speed (FPS) | mAP | Key Feature |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | 7 | 42.0 | RPN + RoI pooling |
| Cascade R-CNN | Two-stage | 5 | 44.9 | Progressive refinement |
| Mask R-CNN | Two-stage | 5 | 43.2 | Instance segmentation |
| DETR | Transformer | 28 | 42.0 | Set prediction |
| RT-DETR | Transformer | 108 | 53.1 | Real-time |
| Grounding DINO | Transformer | 15 | 52.5 | Open-vocabulary |
| YOLO-World | Single-stage | 52 | 35.4 | Zero-shot detection |
| YOLOv10 | Single-stage | 80 | 54.4 | NMS-free |
- High speed: YOLOv10-N/S or RT-DETR-R18
- Balanced: RT-DETR-R50 or YOLOv10-M
- Best accuracy: RT-DETR-R101
- Closed-set: Cascade R-CNN + Swin-L backbone
- Open-vocabulary: Grounding DINO-L
- With segmentation: Mask R-CNN + FPN
- Text prompts: Grounding DINO
- Category names: YOLO-World
- Phrase grounding: Grounding DINO with BERT
- Instance segmentation: Mask R-CNN
- Keypoint detection: Keypoint R-CNN
- Crowd detection: Cascade R-CNN (handles overlap well)
- Small objects: FPN backbone + multi-scale training
Two-Stage Detectors:

Image → Backbone → FPN → RPN (region proposals) → RoI Align → Detection Head → Boxes + Classes

Transformer Detectors:

Image → Backbone → Transformer Encoder → Object Queries → Decoder → Boxes + Classes

Single-Stage Detectors:

Image → Backbone → Neck (FPN/PAN) → Detection Heads → Boxes + Classes
```python
# Standard augmentation for detectors.
# Transform names are placeholders; import the equivalents from your training framework.
transforms = [
    RandomFlip(prob=0.5),
    RandomResize(scales=[0.8, 1.0, 1.2]),
    RandomCrop(size=640),
    ColorJitter(brightness=0.2, contrast=0.2),
    Mosaic(prob=0.5),  # For YOLO
]
```

- Two-stage: learning rate 1e-3, step decay at epochs 8 and 11 (12 epochs total)
- DETR: learning rate 1e-4, dropped at epoch 100 (150 epochs total)
- YOLO: Cosine decay with warmup
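The schedules listed above can be sketched as plain functions of the epoch or step. This is a minimal illustration, not the Nexus training code: the function names, the decay factor `gamma=0.1`, the YOLO base rate of `1e-2`, and the warmup length are all assumptions chosen for the example.

```python
import math


def step_decay_lr(epoch, base_lr=1e-3, milestones=(8, 11), gamma=0.1):
    """Step decay: multiply base_lr by gamma once per milestone already reached.

    gamma=0.1 is an assumed decay factor; the text only gives the milestones.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def cosine_warmup_lr(step, total_steps, base_lr=1e-2, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr.

    base_lr and warmup_steps are assumed values for illustration.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Two-stage schedule: 12 epochs, decay at epochs 8 and 11.
two_stage = [step_decay_lr(e) for e in range(12)]

# DETR schedule: 150 epochs, single drop at epoch 100.
detr = [step_decay_lr(e, base_lr=1e-4, milestones=(100,)) for e in range(150)]
```

Step decay keeps the rate constant between milestones, which suits the short 12-epoch schedule, while the cosine variant changes the rate at every step and so is usually evaluated per iteration rather than per epoch.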
- Classification: Focal Loss (single-stage), Cross Entropy (two-stage)
- Localization: GIoU Loss or CIoU Loss
- Matching: Hungarian matching (DETR), Max IoU (R-CNN)
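To make the loss and matching terms above concrete, here is a small pure-Python sketch. It assumes boxes in `(x1, y1, x2, y2)` format with positive area, uses the commonly cited focal loss defaults `alpha=0.25` and `gamma=2.0`, and replaces the Hungarian algorithm with a brute-force search that reaches the same optimum on tiny inputs. The function names and the cost matrix are hypothetical, not taken from the Nexus code.

```python
import math
from itertools import permutations


def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def giou_loss(a, b):
    """Localization loss: 0 for identical boxes, larger for worse overlap."""
    return 1.0 - giou(a, b)


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights already well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def match_one_to_one(cost):
    """Lowest-total-cost one-to-one assignment of predictions to ground truths.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Brute force over permutations; the Hungarian algorithm solves the same
    problem in polynomial time. Assumes at least as many predictions as targets.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best = min(
        permutations(range(n_pred), n_gt),
        key=lambda rows: sum(cost[r][c] for c, r in enumerate(rows)),
    )
    return [(r, c) for c, r in enumerate(best)]


# Hypothetical cost matrix: 3 predictions (rows) x 2 ground-truth boxes (columns).
cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
pairs = match_one_to_one(cost)  # list of (prediction index, ground-truth index)
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes because the enclosing-box term still varies with their distance, which is one reason it is used as a regression loss.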
All implementations are available in `Nexus/nexus/models/cv/`:
- DETR: `detr.py`
- R-CNN family: `rcnn/` directory
- RT-DETR: `rt_detr.py`
- Grounding DINO: `grounding_dino.py`
- YOLO-World: `yolo_world.py`
- YOLOv10: `yolov10.py`
Two-Stage Detectors:
| Model | Backbone | mAP | AP50 | AP75 |
|---|---|---|---|---|
| Faster R-CNN | ResNet-50 | 40.2 | 61.0 | 43.8 |
| Cascade R-CNN | ResNet-50 | 43.0 | 61.2 | 46.3 |
| Mask R-CNN | ResNet-50 | 41.0 | 61.7 | 44.9 |
Transformer Detectors:
| Model | Backbone | mAP | AP50 | AP75 | FPS |
|---|---|---|---|---|---|
| DETR | ResNet-50 | 42.0 | 62.4 | 44.2 | 28 |
| RT-DETR-R50 | ResNet-50 | 53.1 | 71.3 | 57.6 | 108 |
| Grounding DINO-T | Swin-T | 48.4 | 67.2 | 52.1 | 15 |
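The AP50 and AP75 columns differ only in how much overlap a prediction needs before it counts as correct. Assuming the tables follow the COCO convention, AP50 uses an IoU threshold of 0.50, AP75 uses 0.75, and mAP averages over thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the thresholding step on made-up IoU values; it does not compute a full precision-recall curve.

```python
# Hypothetical IoUs between five predictions and their matched ground-truth boxes.
ious = [0.92, 0.78, 0.66, 0.55, 0.41]


def true_positives(ious, threshold):
    """Flag each prediction as a true positive if its IoU meets the threshold."""
    return [iou >= threshold for iou in ious]


hits_at_50 = true_positives(ious, 0.50)  # looser criterion used by AP50
hits_at_75 = true_positives(ious, 0.75)  # stricter criterion used by AP75

# COCO-style threshold sweep: 0.50, 0.55, ..., 0.95 (ten values).
coco_thresholds = [(50 + 5 * i) / 100 for i in range(10)]
hits_per_threshold = {t: sum(true_positives(ious, t)) for t in coco_thresholds}
```

Because fewer predictions survive the stricter threshold, AP75 is always at most AP50, and the gap between the two gives a rough sense of how tight a model's boxes are.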
Single-Stage Detectors:
| Model | Input Size | mAP | FPS |
|---|---|---|---|
| YOLOv10-N | 640 | 38.5 | 142 |
| YOLOv10-S | 640 | 46.3 | 120 |
| YOLOv10-M | 640 | 51.1 | 92 |
| YOLOv10-X | 640 | 54.4 | 80 |
See individual model documentation for detailed papers and implementations.