
Commit d663fd7

OTX D-Fine Detection Algorithm Integration (#4142)
* init
* remove convertbox
* Refactor D-FINE detector: remove unused components and update model configuration
* update
* update
* Update
* update recipes
* Add d-fine-m
* Fix recipes
* dfine-l
* Add dfine m - no aug
* format changes
* learnable params + disable teacher distillation
* update
* add recipes
* update
* update
* update recipes
* add dfine_hgnetv2_x
* Update recipes
* add tile DFine recipes
* update recipes and tile batch size
* update
* update LR
* DFine revert LR changes
* make multi-scale optional
* update tile recipes
* update tiling recipes
* add backbone pretrained weights
* update
* update
* loss
* update
* Update
* refactor d-fine criterion
* Fix docstring punctuation and remove unused aux_loss parameter in DFINETransformerModule
* Refactor DFineCriterion
* Update style changes
* conv batchnorm fuse
* update hybrid encoder
* Refactor DFINE HybridEncoderModule to improve code clarity and remove redundant parameters
* minor update
* Refactor D-FINE module structure by removing obsolete detector file and reorganizing imports
* Refactor import paths in D-FINE module and clean up unused code
* Refactor D-FINE module by removing commented code, cleaning up imports, and updating documentation
* Refactor D-FINE module by updating type hints, improving error messages, and enhancing documentation for RandomIoUCrop
* Refactor D-FINE module by improving the weighting function's return structure and updating type hints in DFINECriterion
* Update d-fine unit test
* Refactor D-FINE module by enhancing docstrings for clarity and updating parameter names for consistency
* Add D-Fine Detection Algorithm entries to CHANGELOG and object detection documentation
* Fix device assignment for positional embeddings in HybridEncoderModule
* Refactor D-FINE module by removing unused functions and integrating dfine_bbox2distance in DFINECriterion
* Update codeowners
* Add advanced parameters to optimization config in DFine model
* Remove DFINE M, S, N model configuration files
* disable tiling mem cache
* Update codeowners
* revert codeowner changes
* Remove unused DFINE model configurations from unit tests
* Add heavy unit test workflow and mark tests accordingly
* Add container configuration for Heavy-Unit-Test job in pre_merge.yaml
* Add additional transformations to D-Fine configuration and update test skips for unsupported models
* Reduce batch size and remove heavy markers from unit tests in test_tiling.py
* Revert "Add additional transformations to D-Fine configuration and update test skips for unsupported models" (reverts commit d5c66f5)
* Revert "Reduce batch size and remove heavy markers from unit tests in test_tiling.py" (reverts commit 563e033)
* Add additional transformations to D-Fine configuration in YAML files
* disable pytest heavy tag
* update
* Remove unused DFine-L model configurations and update unit tests
* Add DFine-X model template for class-incremental object detection
* Update docs/source/guide/explanation/algorithms/object_detection/object_detection.rst (Co-authored-by: Samet Akcay <[email protected]>)
* Update copyright years from 2024 to 2025 in multiple files
* Rename heavy unit tests to intense unit tests and update related configurations
* Update container image in pre_merge.yaml for Intense-Unit-Test job
* update pre-merge
* update ubuntu container image
* update container image
* Add new object detection model configuration for DFine HGNetV2 X
* update image
* Update pre-merge workflow to use Ubuntu 24.04 and simplify unit test coverage reporting
* install sqlite
* Remove sudo from apt-get command in pre-merge workflow
* Remove sudo from apt-get command in pre-merge workflow
* Update pre-merge workflow to install additional dependencies and correct model name in converter
* Update detection configuration: increase warmup steps and patience, add min_lr, and remove unused callbacks
* Remove D-Fine model recipes from object detection documentation
* Skip tests for unsupported models: add check for D-Fine
* Skip tests for unsupported models: add check for D-Fine
* Skip tests for unsupported models: add check for DFine
* Refactor DFine model: remove unused checkpoint loading and update optimizer configuration documentation; change reg_scale to float in DFINETransformer

---------

Co-authored-by: Samet Akcay <[email protected]>
1 parent a6d5795 commit d663fd7

File tree

24 files changed (+3736, -11 lines)


.github/workflows/pre_merge.yaml

Lines changed: 32 additions & 0 deletions
```diff
@@ -84,6 +84,38 @@ jobs:
           curl -Os https://uploader.codecov.io/latest/linux/codecov
           chmod +x codecov
           ./codecov -t ${{ secrets.CODECOV_TOKEN }} --sha $COMMIT_ID -U $HTTP_PROXY -f .tox/coverage_unit-test-${{ matrix.tox-env }}.xml -F ${{ matrix.tox-env }}
+  Intense-Unit-Test:
+    runs-on: [otx-gpu-a10g-1]
+    container:
+      image: "ubuntu:24.04"
+    needs: Code-Quality-Checks
+    timeout-minutes: 120
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - python-version: "3.10"
+            tox-env: "py310"
+          - python-version: "3.11"
+            tox-env: "py311"
+    name: Intense-Unit-Test-with-Python${{ matrix.python-version }}
+    steps:
+      - name: Install dependencies
+        run: apt-get update && apt-get install -y libsqlite3-0 libsqlite3-dev libgl1 libglib2.0-0
+      - name: Checkout repository
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+      - name: Install Python
+        uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install tox
+        run: |
+          python -m pip install --require-hashes --no-deps -r .ci/requirements.txt
+          pip-compile --generate-hashes --output-file=/tmp/requirements.txt --extra=ci_tox pyproject.toml
+          python -m pip install --require-hashes --no-deps -r /tmp/requirements.txt
+          rm /tmp/requirements.txt
+      - name: Run unit test
+        run: tox -vv -e intense-unit-test-${{ matrix.tox-env }}
   Integration-Test:
     if: |
       github.event.pull_request.draft == false &&
```
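The new job mirrors the existing unit-test matrix but runs in a bare `ubuntu:24.04` container on a GPU runner, which is presumably why the SQLite, OpenGL, and glib system libraries are installed explicitly before checkout. Assuming the matching tox environments are defined in the repository's tox configuration, the same suite should be reproducible locally with, for example, `tox -vv -e intense-unit-test-py310`.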

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -22,6 +22,8 @@ All notable changes to this project will be documented in this file.
   (<https://github.com/openvinotoolkit/training_extensions/pull/3979>)
 - Add OpenVINO inference for 3D Object Detection task
   (<https://github.com/openvinotoolkit/training_extensions/pull/4017>)
+- Add D-Fine Detection Algorithm
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4142>)
 
 ### Enhancements
```

docs/source/guide/explanation/algorithms/object_detection/object_detection.rst

Lines changed: 2 additions & 0 deletions
```diff
@@ -73,6 +73,8 @@ We support the following ready-to-use model recipes:
 +------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
 | `Object_Detection_ResNeXt101_ATSS <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/recipe/detection/atss_resnext101.yaml>`_ | ResNeXt101-ATSS | 434.75 | 344.0 |
 +------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
+| `D-Fine X Detection <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/recipe/detection/dfine_x.yaml>`_ | D-Fine X | 202.486 | 240.0 |
++------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
 
 Above table can be found using the following command
```
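Assuming the standard OTX CLI (exact flags can vary between OTX versions), the new recipe should be usable directly from the linked YAML, e.g. `otx train --config src/otx/recipe/detection/dfine_x.yaml --data_root <path/to/dataset>`.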

pyproject.toml

Lines changed: 2 additions & 1 deletion
```diff
@@ -398,6 +398,7 @@ convention = "google"
 markers = [
     "gpu", # mark tests which require NVIDIA GPU
     "cpu",
-    "xpu", # mark tests which require Intel dGPU
+    "xpu", # mark tests which require Intel dGPU,
+    "intense", # intense unit tests which require better CI machines
 ]
 python_files = "tests/**/*.py"
```
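To show how the new marker is consumed, here is a minimal sketch; the test function is hypothetical, and only the `intense` marker name comes from the configuration above.

```python
import pytest


@pytest.mark.intense  # registered in pyproject.toml; meant for beefier CI machines
def test_dfine_x_training_smoke() -> None:
    """Hypothetical heavy test, included only to illustrate marker usage."""
    assert True
```

A default run can then exclude the expensive tests with `pytest -m "not intense"`, while a job like Intense-Unit-Test above would presumably select them (e.g. `pytest -m intense`) inside its tox environment.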

src/otx/algo/common/layers/transformer_layers.py

Lines changed: 147 additions & 1 deletion
```diff
@@ -1,4 +1,4 @@
-# Copyright (C) 2024 Intel Corporation
+# Copyright (C) 2024-2025 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 #
 """Implementation of common transformer layers."""
@@ -10,6 +10,7 @@
 from typing import Callable
 
 import torch
+import torch.nn.functional as f
 from otx.algo.common.utils.utils import get_clones
 from otx.algo.modules.transformer import deformable_attention_core_func
 from torch import Tensor, nn
@@ -306,6 +307,151 @@ def forward(
         return self.output_proj(output)
 
 
+class MSDeformableAttentionV2(nn.Module):
+    """Multi-Scale Deformable Attention Module V2.
+
+    Note:
+        This differs from the vanilla MSDeformableAttention in that it uses a
+        distinct number of sampling points for features at different scales.
+        Refer to RTDETRv2.
+
+    Args:
+        embed_dim (int): The number of expected features in the input.
+        num_heads (int): The number of heads in the multi-head attention models.
+        num_levels (int): The number of levels in MSDeformableAttention.
+        num_points_list (list[int]): Number of distinct points for each layer. Defaults to [3, 6, 3].
+    """
+
+    def __init__(
+        self,
+        embed_dim: int = 256,
+        num_heads: int = 8,
+        num_levels: int = 4,
+        num_points_list: list[int] = [3, 6, 3],  # noqa: B006
+    ) -> None:
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.num_heads = num_heads
+        self.num_levels = num_levels
+        self.num_points_list = num_points_list
+
+        num_points_scale = [1 / n for n in num_points_list for _ in range(n)]
+        self.register_buffer(
+            "num_points_scale",
+            torch.tensor(num_points_scale, dtype=torch.float32),
+        )
+
+        self.total_points = num_heads * sum(num_points_list)
+        self.head_dim = embed_dim // num_heads
+
+        self.sampling_offsets = nn.Linear(embed_dim, self.total_points * 2)
+        self.attention_weights = nn.Linear(embed_dim, self.total_points)
+
+        self._reset_parameters()
+
+    def _reset_parameters(self) -> None:
+        """Reset parameters of the model."""
+        init.constant_(self.sampling_offsets.weight, 0)
+        thetas = torch.arange(self.num_heads, dtype=torch.float32) * (2.0 * math.pi / self.num_heads)
+        grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
+        grid_init = grid_init / grid_init.abs().max(-1, keepdim=True).values  # noqa: PD011
+        grid_init = grid_init.reshape(self.num_heads, 1, 2).tile([1, sum(self.num_points_list), 1])
+        scaling = torch.concat([torch.arange(1, n + 1) for n in self.num_points_list]).reshape(1, -1, 1)
+        grid_init *= scaling
+        self.sampling_offsets.bias.data[...] = grid_init.flatten()
+
+        # attention_weights
+        init.constant_(self.attention_weights.weight, 0)
+        init.constant_(self.attention_weights.bias, 0)
+
+    def forward(
+        self,
+        query: Tensor,
+        reference_points: Tensor,
+        value: list[Tensor],
+        value_spatial_shapes: list[list[int]],
+    ) -> Tensor:
+        """Forward function of MSDeformableAttention.
+
+        Args:
+            query (Tensor): [bs, query_length, C]
+            reference_points (Tensor): [bs, query_length, n_levels, 2], range in [0, 1], top-left (0,0),
+                bottom-right (1, 1), including padding area
+            value (list[Tensor]): per-level features, each [bs, n_head, head_dim, H_l * W_l]
+            value_spatial_shapes (list): [n_levels, 2], [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]
+
+        Returns:
+            output (Tensor): [bs, Length_{query}, C]
+        """
+        bs, len_q = query.shape[:2]
+        _, n_head, c, _ = value[0].shape
+        num_points_list = self.num_points_list
+
+        sampling_offsets = self.sampling_offsets(query).reshape(
+            bs,
+            len_q,
+            self.num_heads,
+            sum(self.num_points_list),
+            2,
+        )
+
+        attention_weights = self.attention_weights(query).reshape(
+            bs,
+            len_q,
+            self.num_heads,
+            sum(self.num_points_list),
+        )
+        attention_weights = f.softmax(attention_weights, dim=-1)
+
+        if reference_points.shape[-1] == 2:
+            offset_normalizer = torch.tensor(value_spatial_shapes)
+            offset_normalizer = offset_normalizer.flip([1]).reshape(1, 1, 1, self.num_levels, 1, 2)
+            sampling_locations = (
+                reference_points.reshape(
+                    bs,
+                    len_q,
+                    1,
+                    self.num_levels,
+                    1,
+                    2,
+                )
+                + sampling_offsets / offset_normalizer
+            )
+        elif reference_points.shape[-1] == 4:
+            num_points_scale = self.num_points_scale.to(query).unsqueeze(-1)
+            offset = sampling_offsets * num_points_scale * reference_points[:, :, None, :, 2:] * 0.5
+            sampling_locations = reference_points[:, :, None, :, :2] + offset
+        else:
+            msg = f"Last dim of reference_points must be 2 or 4, but got {reference_points.shape[-1]} instead."
+            raise ValueError(msg)
+
+        # sampling_offsets [8, 480, 8, 12, 2]
+        sampling_grids = 2 * sampling_locations - 1
+
+        sampling_grids = sampling_grids.permute(0, 2, 1, 3, 4).flatten(0, 1)
+        sampling_locations_list = sampling_grids.split(num_points_list, dim=-2)
+
+        sampling_value_list = []
+        for level, (h, w) in enumerate(value_spatial_shapes):
+            value_l = value[level].reshape(bs * n_head, c, h, w)
+            sampling_grid_l = sampling_locations_list[level]
+            sampling_value_l = f.grid_sample(
+                value_l,
+                sampling_grid_l,
+                mode="bilinear",
+                padding_mode="zeros",
+                align_corners=False,
+            )
+
+            sampling_value_list.append(sampling_value_l)
+
+        attn_weights = attention_weights.permute(0, 2, 1, 3).reshape(bs * n_head, 1, len_q, sum(num_points_list))
+        weighted_sample_locs = torch.concat(sampling_value_list, dim=-1) * attn_weights
+        output = weighted_sample_locs.sum(-1).reshape(bs, n_head * c, len_q)
+
+        return output.permute(0, 2, 1)
+
+
 class VisualEncoderLayer(nn.Module):
     """VisualEncoderLayer module consisting of MSDeformableAttention and feed-forward network.
```
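For readers who want to sanity-check the new module, below is a minimal shape-level smoke test. This is a sketch, not part of the PR: the import path is assumed from the file location above, and, following the forward body (which indexes `value` per level), `value` is passed as a list of per-level tensors rather than the single flattened tensor the docstring might suggest.

```python
import torch

# A minimal smoke test for MSDeformableAttentionV2 (sketch; not from the PR).
# The import path is assumed from src/otx/algo/common/layers/transformer_layers.py.
from otx.algo.common.layers.transformer_layers import MSDeformableAttentionV2

embed_dim, num_heads = 256, 8
head_dim = embed_dim // num_heads
spatial_shapes = [[32, 32], [16, 16], [8, 8]]  # (H_l, W_l) for three feature levels

attn = MSDeformableAttentionV2(
    embed_dim=embed_dim,
    num_heads=num_heads,
    num_levels=len(spatial_shapes),
    num_points_list=[3, 6, 3],  # 3 + 6 + 3 = 12 sampling points per head and query
)

bs, len_q = 2, 100
query = torch.randn(bs, len_q, embed_dim)
# forward() reshapes each level to (bs * n_head, c, H, W), so every entry
# must be [bs, num_heads, head_dim, H_l * W_l].
value = [torch.randn(bs, num_heads, head_dim, h * w) for h, w in spatial_shapes]
# Box-style (cx, cy, w, h) reference points in [0, 1]; the singleton level
# dimension broadcasts against the per-level sampling offsets.
reference_points = torch.rand(bs, len_q, 1, 4)

out = attn(query, reference_points, value, spatial_shapes)
print(out.shape)  # torch.Size([2, 100, 256])
```

With `num_points_list=[3, 6, 3]`, each query samples twelve deformable points per head, split 3/6/3 across the three feature levels; that per-scale asymmetry is the difference from the vanilla module that the class docstring refers to.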
