Add rtdetr support (#44)

IgorSusmelj · web-flow · commit 1504b2ae9978 · 2025-09-29T22:55:26.000+02:00
* Add rt-detr v1 and v2 support

* Add docs for rt-detr v1 and v2

* Implement feedback
diff --git a/README.md b/README.md
@@ -24,6 +24,8 @@ and time-consuming. Labelformat aims to solve this pain.
     - [Labelbox](https://labelformat.com/formats/object-detection/labelbox/) (input only)
     - [Lightly](https://labelformat.com/formats/object-detection/lightly/)
     - [PascalVOC](https://labelformat.com/formats/object-detection/pascalvoc/)
+    - [RT-DETR](https://labelformat.com/formats/object-detection/rtdetr/)
+    - [RT-DETRv2](https://labelformat.com/formats/object-detection/rtdetrv2/)
     - [YOLOv5](https://labelformat.com/formats/object-detection/yolov5/)
     - [YOLOv6](https://labelformat.com/formats/object-detection/yolov6/)
     - [YOLOv7](https://labelformat.com/formats/object-detection/yolov7/)
diff --git a/docs/features.md b/docs/features.md
@@ -36,6 +36,8 @@ Labelformat offers a robust set of features tailored to meet the diverse needs o
 - **[Labelbox](formats/object-detection/labelbox.md)** (input only)
 - **[Lightly](formats/object-detection/lightly.md)**
 - **[PascalVOC](formats/object-detection/pascalvoc.md)**
+- **[RT-DETR](formats/object-detection/rtdetr.md)**
+- **[RT-DETRv2](formats/object-detection/rtdetrv2.md)**
 - **[YOLOv5](formats/object-detection/yolov5.md)**
 - **[YOLOv6](formats/object-detection/yolov6.md)**
 - **[YOLOv7](formats/object-detection/yolov7.md)**
diff --git a/docs/formats/index.md b/docs/formats/index.md
@@ -6,6 +6,8 @@
 - [Labelbox](./object-detection/labelbox.md)
 - [Lightly](./object-detection/lightly.md)
 - [PascalVOC](./object-detection/pascalvoc.md)
+- [RT-DETR](./object-detection/rtdetr.md)
+- [RT-DETRv2](./object-detection/rtdetrv2.md)
 - [YOLOv5](./object-detection/yolov5.md)
 - [YOLOv6](./object-detection/yolov6.md)
 - [YOLOv7](./object-detection/yolov7.md)
diff --git a/docs/formats/object-detection/index.md b/docs/formats/object-detection/index.md
@@ -15,6 +15,8 @@ Labelformat supports converting between major object detection annotation format
 - [Labelbox](./labelbox.md)
 - [Lightly](./lightly.md)
 - [PascalVOC](./pascalvoc.md)
+- [RT-DETR](./rtdetr.md)
+- [RT-DETRv2](./rtdetrv2.md)
 - [YOLOv5](./yolov5.md)
 - [YOLOv6](./yolov6.md)
 - [YOLOv7](./yolov7.md)
diff --git a/docs/formats/object-detection/rtdetr.md b/docs/formats/object-detection/rtdetr.md
@@ -0,0 +1,157 @@
+# RT-DETR Object Detection Format
+
+## Overview
+
+**RT-DETR (Real-Time DEtection TRansformer)** is an end-to-end object detection framework introduced in the paper [DETRs Beat YOLOs on Real-time Object Detection](https://arxiv.org/abs/2304.08069). Its authors present it as the first real-time end-to-end object detector to challenge the dominance of YOLO detectors in real-time applications. Unlike traditional detectors that require Non-Maximum Suppression (NMS) post-processing, RT-DETR eliminates NMS entirely while reporting better speed and accuracy than comparable YOLO models.
+
+> **Info:** RT-DETR was introduced through the academic paper "DETRs Beat YOLOs on Real-time Object Detection" published in 2023.
+  For the full paper, see: [arXiv:2304.08069](https://arxiv.org/abs/2304.08069)
+  For implementation details and code, see: [GitHub Repository: lyuwenyu/RT-DETR](https://github.com/lyuwenyu/RT-DETR)
+
+> **Availability:** RT-DETR is now available in multiple frameworks:
+  - [Hugging Face Transformers](https://huggingface.co/docs/transformers/model_doc/rt_detr)
+  - [Ultralytics](https://docs.ultralytics.com/models/rtdetr/)
+
+## Key RT-DETR Model Features
+
+RT-DETR uses the standard **COCO annotation format** while introducing revolutionary architectural innovations for real-time detection:
+
+- **End-to-End Architecture:** First real-time detector to completely eliminate NMS post-processing, providing more stable and predictable inference times.
+- **Efficient Hybrid Encoder:** Novel encoder design that decouples intra-scale interaction and cross-scale fusion to significantly reduce computational overhead.
+- **Uncertainty-Minimal Query Selection:** Advanced query initialization scheme that optimizes both classification and localization confidence for improved detection quality.
+- **Flexible Speed Tuning:** Supports adjustable inference speed by modifying the number of decoder layers without retraining.
+- **Superior Performance:** Achieves state-of-the-art results (e.g., RT-DETR-R50 reaches 53.1% mAP @ 108 FPS on T4 GPU, outperforming YOLOv8-L in both speed and accuracy).
+- **Multiple Model Scales:** Available in various scales (R18, R34, R50, R101) to accommodate different computational requirements.
+
+These architectural innovations are handled internally by the model design and training pipeline, requiring no changes to the standard COCO annotation format described below.
+
+## Specification of RT-DETR Detection Format
+
+RT-DETR uses the standard **COCO format** for annotations, ensuring seamless integration with existing COCO datasets and tools. The format consists of a single JSON file containing three main components:
+
+### `images`
+Defines metadata for each image in the dataset:
+```json
+{
+  "id": 0,                    // Unique image ID
+  "file_name": "image1.jpg",  // Image filename
+  "width": 640,               // Image width in pixels
+  "height": 416               // Image height in pixels
+}
+```
+
+### `categories`
+Defines the object classes:
+```json
+{
+  "id": 0,                    // Unique category ID
+  "name": "cat"               // Category name
+}
+```
+
+### `annotations`
+Defines object instances:
+```json
+{
+  "image_id": 0,              // Reference to image
+  "category_id": 2,           // Reference to category
+  "bbox": [540.0, 295.0, 23.0, 18.0]  // [x, y, width, height] in absolute pixels
+}
+```
+
+## Directory Structure of RT-DETR Dataset
+
+```
+dataset/
+├── images/                   # Image files
+│   ├── image1.jpg
+│   └── image2.jpg
+└── annotations.json         # Single JSON file containing all annotations
+```
+
+## Benefits of RT-DETR Format
+
+- **Standard Compatibility:** Uses the widely adopted COCO format, ensuring compatibility with existing tools and frameworks.
+- **Flexibility:** Supports adjustable inference speeds without retraining, making it adaptable to various real-time scenarios.
+- **Superior Accuracy:** Achieves better accuracy than comparable YOLO detectors while maintaining competitive speed.
+
+## Converting Annotations to RT-DETR Format with Labelformat
+
+Since RT-DETR uses the standard COCO format, converting annotations to RT-DETR format is equivalent to converting to COCO format.
+
+### Installation
+
+First, ensure that Labelformat is installed:
+
+```shell
+pip install labelformat
+```
+
+### Conversion Example: YOLOv8 to RT-DETR
+
+Assume you have annotations in YOLOv8 format and wish to convert them to RT-DETR. Here's how you can achieve this using Labelformat.
+
+**Step 1: Prepare Your Dataset**
+
+Ensure your dataset follows the standard YOLOv8 structure with `data.yaml` and label files.
+
+**Step 2: Run the Conversion Command**
+
+Use the Labelformat CLI to convert YOLOv8 annotations to RT-DETR (COCO format):
+```bash
+labelformat convert \
+    --task object-detection \
+    --input-format yolov8 \
+    --input-file dataset/data.yaml \
+    --input-split train \
+    --output-format rtdetr \
+    --output-file dataset/rtdetr_annotations.json
+```
+
+**Step 3: Verify the Converted Annotations**
+
+After conversion, your dataset structure will be:
+```
+dataset/
+├── images/
+│   ├── image1.jpg
+│   ├── image2.jpg
+│   └── ...
+└── rtdetr_annotations.json    # COCO format annotations for RT-DETR
+```
+
+### Python API Example
+
+```python
+from pathlib import Path
+from labelformat.formats import YOLOv8ObjectDetectionInput, RTDETRObjectDetectionOutput
+
+# Load YOLOv8 format
+label_input = YOLOv8ObjectDetectionInput(
+    input_file=Path("dataset/data.yaml"),
+    input_split="train"
+)
+
+# Convert to RT-DETR format
+RTDETRObjectDetectionOutput(
+    output_file=Path("dataset/rtdetr_annotations.json")
+).save(label_input=label_input)
+```
+
+## Error Handling in Labelformat
+
+Since RT-DETR uses the COCO format, the same validation and error handling apply:
+
+- **Invalid JSON Structure:** Proper error reporting for malformed JSON files
+- **Missing Required Fields:** Validation ensures all required COCO fields are present
+- **Reference Integrity:** Checks that image_id and category_id references are valid
+- **Bounding Box Validation:** Ensures bounding boxes are within image boundaries
+
+Example of a properly formatted annotation:
+```json
+{
+  "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
+  "categories": [{"id": 1, "name": "person"}],
+  "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}]
+}
+```
diff --git a/docs/formats/object-detection/rtdetrv2.md b/docs/formats/object-detection/rtdetrv2.md
@@ -0,0 +1,164 @@
+# RT-DETRv2 Object Detection Format
+
+## Overview
+
+**RT-DETRv2** is an enhanced version of the Real-Time DEtection TRansformer ([RT-DETR](https://arxiv.org/abs/2304.08069)), introduced in the paper [RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer](https://arxiv.org/abs/2407.17140). Building upon the groundbreaking end-to-end object detection framework of the original RT-DETR, RT-DETRv2 continues the legacy of eliminating Non-Maximum Suppression (NMS) post-processing while introducing additional improvements in accuracy and efficiency for real-time object detection scenarios.
+
+> **Info:** RT-DETRv2 was introduced through the technical report "RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer" published in 2024.
+  For the full paper, see: [arXiv:2407.17140](https://arxiv.org/abs/2407.17140)
+  For RT-DETR foundation, see: [RT-DETR Paper (arXiv:2304.08069)](https://arxiv.org/abs/2304.08069)
+  For implementation details and code, see: [GitHub Repository: lyuwenyu/RT-DETR](https://github.com/lyuwenyu/RT-DETR)
+
+> **Availability:** RT-DETRv2 is now available in multiple frameworks:
+  - [Hugging Face Transformers](https://huggingface.co/docs/transformers/model_doc/rt_detr_v2)
+  - [Ultralytics](https://docs.ultralytics.com/models/rtdetr/)
+
+## Key RT-DETRv2 Model Features
+
+RT-DETRv2 maintains compatibility with the standard **COCO annotation format** while introducing specific technical improvements over RT-DETR:
+
+- **Distinct Sampling Points for Different Scales:** Introduces flexible multi-scale feature extraction by setting different numbers of sampling points for features at different scales in the deformable attention module, rather than using the same number across all scales.
+- **Discrete Sampling Operator:** Provides an optional discrete sampling operator to replace the grid_sample operator, removing deployment constraints typically associated with DETRs and improving practical applicability across different deployment platforms.
+- **Dynamic Data Augmentation:** Implements adaptive data augmentation strategy that applies stronger augmentation in early training periods and reduces it in later stages to improve model robustness and target domain adaptation.
+- **Scale-Adaptive Hyperparameters:** Customizes optimizer hyperparameters based on model scale, using higher learning rates for lighter models (e.g., ResNet18) and lower rates for larger models (e.g., ResNet101) to achieve optimal performance.
+- **Bag-of-Freebies Approach:** Incorporates multiple training improvements that enhance performance without increasing inference cost or model complexity.
+- **Consistent Performance Gains:** Achieves improved accuracy across all model scales (S: +1.4 mAP, M: +1.0 mAP, L: +0.3 mAP) while maintaining the same inference speed as RT-DETR.
+
+These enhancements are handled internally by the model design and training pipeline, requiring no changes to the standard COCO annotation format described below.
+
+## Specification of RT-DETRv2 Detection Format
+
+RT-DETRv2 uses the standard **COCO format** for annotations, ensuring complete compatibility with existing COCO datasets and tools. The format specification is identical to the original COCO format:
+
+### `images`
+Defines metadata for each image in the dataset:
+```json
+{
+  "id": 0,                    // Unique image ID
+  "file_name": "image1.jpg",  // Image filename
+  "width": 640,               // Image width in pixels
+  "height": 416               // Image height in pixels
+}
+```
+
+### `categories`
+Defines the object classes:
+```json
+{
+  "id": 0,                    // Unique category ID
+  "name": "cat"               // Category name
+}
+```
+
+### `annotations`
+Defines object instances:
+```json
+{
+  "image_id": 0,              // Reference to image
+  "category_id": 2,           // Reference to category
+  "bbox": [540.0, 295.0, 23.0, 18.0]  // [x, y, width, height] in absolute pixels
+}
+```
+
+## Directory Structure of RT-DETRv2 Dataset
+
+```
+dataset/
+├── images/                   # Image files
+│   ├── image1.jpg
+│   └── image2.jpg
+└── annotations.json         # Single JSON file containing all annotations
+```
+
+## Benefits of RT-DETRv2 Format
+
+- **Standard Compatibility:** Uses the widely adopted COCO format, ensuring compatibility with existing tools and frameworks.
+- **End-to-End Processing:** Maintains the NMS-free architecture for stable and predictable inference performance.
+- **Enhanced Performance:** Improved accuracy over the original RT-DETR at the same inference speed.
+
+## Converting Annotations to RT-DETRv2 Format with Labelformat
+
+Since RT-DETRv2 uses the standard COCO format, converting annotations to RT-DETRv2 format is equivalent to converting to COCO format.
+
+### Installation
+
+First, ensure that Labelformat is installed:
+
+```shell
+pip install labelformat
+```
+
+### Conversion Example: YOLOv8 to RT-DETRv2
+
+**Step 1: Prepare Your Dataset**
+
+Ensure your dataset follows the standard YOLOv8 structure with `data.yaml` and label files.
+
+**Step 2: Run the Conversion Command**
+
+Use the Labelformat CLI to convert YOLOv8 annotations to RT-DETRv2 (COCO format):
+```bash
+labelformat convert \
+    --task object-detection \
+    --input-format yolov8 \
+    --input-file dataset/data.yaml \
+    --input-split train \
+    --output-format rtdetrv2 \
+    --output-file dataset/rtdetrv2_annotations.json
+```
+
+**Step 3: Verify the Converted Annotations**
+
+After conversion, your dataset structure will be:
+```
+dataset/
+├── images/
+│   ├── image1.jpg
+│   ├── image2.jpg
+│   └── ...
+└── rtdetrv2_annotations.json    # COCO format annotations for RT-DETRv2
+```
+
+### Python API Example
+
+```python
+from pathlib import Path
+from labelformat.formats import YOLOv8ObjectDetectionInput, RTDETRv2ObjectDetectionOutput
+
+# Load YOLOv8 format
+label_input = YOLOv8ObjectDetectionInput(
+    input_file=Path("dataset/data.yaml"),
+    input_split="train"
+)
+
+# Convert to RT-DETRv2 format
+RTDETRv2ObjectDetectionOutput(
+    output_file=Path("dataset/rtdetrv2_annotations.json")
+).save(label_input=label_input)
+```
+
+## RT-DETRv2 vs RT-DETR
+
+RT-DETRv2 builds upon the foundation of RT-DETR with several key improvements:
+
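+- **Refined Decoder Sampling:** Scale-specific numbers of sampling points in the deformable attention module, plus an optional discrete sampling operator in place of `grid_sample`
+- **Improved Training:** Dynamic data augmentation and scale-adaptive optimizer hyperparameters
+- **Better Accuracy:** Higher detection accuracy across model scales (S: +1.4 mAP, M: +1.0 mAP, L: +0.3 mAP) at the same inference speed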
+
+## Error Handling in Labelformat
+
+Since RT-DETRv2 uses the COCO format, the same validation and error handling apply:
+
+- **Invalid JSON Structure:** Proper error reporting for malformed JSON files
+- **Missing Required Fields:** Validation ensures all required COCO fields are present
+- **Reference Integrity:** Checks that image_id and category_id references are valid
+- **Bounding Box Validation:** Ensures bounding boxes are within image boundaries
+
+Example of a properly formatted annotation:
+```json
+{
+  "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
+  "categories": [{"id": 1, "name": "person"}],
+  "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}]
+}
+```
diff --git a/docs/formats/object-detection/yolov12.md b/docs/formats/object-detection/yolov12.md
@@ -40,8 +40,8 @@ The **YOLOv12 detection format** remains consistent with previous versions (v5-v
 
 - **Object Representation:**
   Each line in the text file represents a single object detected within the image, following the format: `<class_id> <x_center> <y_center> <width> <height>`
-    - **`<class_id>` (Integer):**   An integer representing the object's class.
-    - **`<x_center>` and `<y_center>` (Float):** The normalized coordinates of the object's center relative to the image's width and    height.
+    - **`<class_id>` (Integer):** An integer representing the object's class.
+    - **`<x_center>` and `<y_center>` (Float):** The normalized coordinates of the object's center relative to the image's width and height.
     - **`<width>` and `<height>` (Float):** The normalized width and height of the bounding box encompassing the object.
 
 - **Normalization of Values:**
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -35,6 +35,8 @@ nav:
               - Labelbox Format: formats/object-detection/labelbox.md
               - Lightly Format: formats/object-detection/lightly.md
               - PascalVOC Format: formats/object-detection/pascalvoc.md
+              - RT-DETR Format: formats/object-detection/rtdetr.md
+              - RT-DETRv2 Format: formats/object-detection/rtdetrv2.md
               - YOLOv5 Format: formats/object-detection/yolov5.md
               - YOLOv6 Format: formats/object-detection/yolov6.md
               - YOLOv7 Format: formats/object-detection/yolov7.md
diff --git a/src/labelformat/cli/cli.py b/src/labelformat/cli/cli.py
@@ -22,7 +22,7 @@ def main() -> None:
 
 Supported label formats for object detection:
 - YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12, YOLOv26
-- COCO
+- COCO, RT-DETR, RT-DETRv2
 - VOC
 - Labelbox
 - and many more
diff --git a/src/labelformat/formats/__init__.py b/src/labelformat/formats/__init__.py
diff --git a/src/labelformat/formats/rtdetr.py b/src/labelformat/formats/rtdetr.py
diff --git a/src/labelformat/formats/rtdetrv2.py b/src/labelformat/formats/rtdetrv2.py
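
The "Error Handling in Labelformat" sections added to `rtdetr.md` and `rtdetrv2.md` list several checks (required fields, reference integrity, bounding boxes within image bounds). As a rough illustration of what such checks amount to on a COCO-style file, here is a minimal standard-library sketch. It is not Labelformat's implementation: `validate_coco` is a hypothetical helper introduced only for this example, and the exact rules Labelformat enforces may differ.

```python
import json
from typing import Any


def validate_coco(data: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a COCO-style dict (hypothetical helper)."""
    problems: list[str] = []

    # Missing required fields: all three top-level lists must be present.
    for key in ("images", "categories", "annotations"):
        if key not in data:
            problems.append(f"missing top-level field: {key}")
    if problems:
        return problems

    images = {img["id"]: img for img in data["images"]}
    category_ids = {cat["id"] for cat in data["categories"]}

    for index, ann in enumerate(data["annotations"]):
        # Reference integrity: image_id and category_id must point at known entries.
        image = images.get(ann["image_id"])
        if image is None:
            problems.append(f"annotation {index}: unknown image_id {ann['image_id']}")
            continue
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {index}: unknown category_id {ann['category_id']}")

        # Bounding box validation: [x, y, width, height] in absolute pixels.
        x, y, w, h = ann["bbox"]
        inside = (
            x >= 0 and y >= 0 and w > 0 and h > 0
            and x + w <= image["width"] and y + h <= image["height"]
        )
        if not inside:
            problems.append(f"annotation {index}: bbox {ann['bbox']} outside image bounds")

    return problems


# The well-formed example from the docs above.
example = json.loads("""
{
  "images": [{"id": 0, "file_name": "image1.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}],
  "annotations": [{"image_id": 0, "category_id": 1, "bbox": [100, 120, 50, 80]}]
}
""")
print(validate_coco(example))  # prints []
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch; it makes it easy to report every issue in a file at once. Malformed JSON would surface earlier, as a `json.JSONDecodeError` from `json.loads`.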