Commit d19fe1d
Support for conversion of legacy anomaly datasets (#1901)
This PR introduces changes needed to convert legacy anomaly datasets to
the new format.
The conversion is tricky because legacy anomaly datasets are
under-specified: there is no trivial way to identify them as anomaly
datasets. We also cannot convert them like for like. The new format
expects a global normal/anomalous label for every sample, whereas the
old format defines a label annotation only for normal samples; anomalous
samples have a mask/bbox/polygon annotation instead. In addition, there
is no easy way to know which label categories are normal or anomalous,
and OTX actually tries to handle multiple conventions (MVTec and Geti).
I found that one way to detect anomaly datasets is to check whether the
dataset has both a global classification label and shapes such as masks
or polygons. I then convert to a (hopefully) well-specified format:
```
# Example anomaly sample
class AnomalySample(Sample):
    image: np.ndarray = image_field(dtype=pl.UInt8)

    # Binary label: normal/anomalous
    label: np.ndarray = label_field(dtype=pl.Int32)

    # Optional semantic segmentation mask for anomalous images
    defect_mask: np.ndarray | None = mask_field(dtype=pl.UInt8, semantic=Semantic.Anomaly)

    # Optional list of bounding boxes and associated labels for anomalous images
    defect_bboxes: np.ndarray | None = bbox_field(dtype=pl.Float32)
    defect_categories: np.ndarray | None = label_field(dtype=pl.Int32, is_list=True)
```
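As a rough illustration of the detection heuristic described above (not the actual converter code), the check can be sketched in plain Python. `LegacyItem`, `SHAPE_KINDS`, `looks_like_anomaly_dataset` and `is_anomalous` are hypothetical names invented for this sketch, and treating any sample that carries a shape annotation as anomalous is an assumption drawn from the description, not a confirmed rule.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for legacy annotation kinds; these are NOT Datumaro types.
LABEL = "label"
SHAPE_KINDS = {"mask", "bbox", "polygon"}  # assumed set of shape annotations


@dataclass
class LegacyItem:
    """Illustrative legacy dataset item: just the kinds of its annotations."""

    annotation_kinds: list[str] = field(default_factory=list)


def looks_like_anomaly_dataset(items: list[LegacyItem]) -> bool:
    """Return True if the dataset has both global labels and shape annotations."""
    has_label = any(LABEL in item.annotation_kinds for item in items)
    has_shape = any(SHAPE_KINDS & set(item.annotation_kinds) for item in items)
    return has_label and has_shape


def is_anomalous(item: LegacyItem) -> bool:
    """Assumption: a sample carrying any shape annotation is anomalous."""
    return bool(SHAPE_KINDS & set(item.annotation_kinds))
```

In this sketch, a dataset containing only labels (plain classification) or only shapes (plain segmentation or detection) is not flagged; only the mix of both is.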
When creating a dataset, we can explicitly specify which label
corresponds to which type of image (normal vs. anomalous):
```
categories = {
    "label": LabelCategories(
        labels=["normal", "anomalous"],
        label_semantics={LabelSemantic.NORMAL: "normal", LabelSemantic.ANOMALOUS: "anomalous"},
    )
}
```
If the desired label index differs from the one in the source dataset,
Datumaro will automatically remap the indexes.
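To make the remapping idea concrete, here is a minimal sketch in plain Python. It assumes labels are matched by name between the source and target orderings; `remap_label_indices` is a hypothetical helper written for this illustration and is not Datumaro's implementation, which may match on label semantics instead.

```python
def remap_label_indices(labels, source_names, target_names):
    """Map label indices from the source ordering to the target ordering by name."""
    target_index = {name: i for i, name in enumerate(target_names)}
    # lookup[i] is the target index of the source label stored at position i.
    lookup = [target_index[name] for name in source_names]
    return [lookup[label] for label in labels]


# The source stores "anomalous" at index 0; the target wants "normal" at index 0.
remapped = remap_label_indices(
    labels=[0, 1, 1],
    source_names=["anomalous", "normal"],
    target_names=["normal", "anomalous"],
)
```

A source label name that is missing from the target raises a `KeyError` here, which seems preferable to silently mislabeling samples.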
To support these changes, I refactored the legacy converters to accept
a `name_prefix` and a `semantic`. This is needed because an anomaly
dataset can have two sets of labels: a global normal/anomalous label
and a per-defect label category. Since the two cannot share the same
name and semantic (by design of the new dataset format), the
`name_prefix` lets the second label type be called `anomaly_labels`
(`name_prefix = "anomaly_"`), and it uses the semantic `Semantic.Anomaly`.
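The effect of the prefix on field naming can be sketched as follows. This is a simplified illustration only: `prefixed` and `build_field_names` are hypothetical helpers, the base name `labels` is inferred from the `anomaly_labels` example, and the sketch checks name uniqueness alone without modeling the semantic.

```python
def prefixed(name: str, name_prefix: str = "") -> str:
    """Compose the final field name from an optional prefix and a base name."""
    return f"{name_prefix}{name}"


def build_field_names(converters):
    """Build field names from (base_name, name_prefix) pairs, rejecting duplicates."""
    names = [prefixed(base_name, name_prefix) for base_name, name_prefix in converters]
    if len(set(names)) != len(names):
        raise ValueError("duplicate field names: " + ", ".join(names))
    return names


# Global label converter without a prefix, per-defect label converter with one.
field_names = build_field_names([("labels", ""), ("labels", "anomaly_")])
```

Without the prefix, the two label converters would both produce `labels` and the call would fail with a `ValueError`.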
The conversion of legacy datasets to the new format is not pretty, but
it is temporary: the ambiguity will disappear once we implement native
import/export in the new dataset class instead of converting from the
old one, which has an impedance mismatch.
Corresponding PR in OTX:
open-edge-platform/training_extensions#4770
### Checklist
- [x] I have added tests to cover my changes or documented any manual
tests.
- [ ] I have updated the
[documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs)
accordingly
---------
Signed-off-by: Grégoire Payen de La Garanderie <[email protected]>
1 parent 10aa9db commit d19fe1d
4 files changed, +240 -103 lines changed, in:
- src/datumaro/experimental
- tests/unit/experimental