Commit d19fe1d
Support for conversion of legacy anomaly datasets (#1901)
This PR introduces the changes needed to convert legacy anomaly datasets to the new format. The conversion is tricky because legacy anomaly datasets are under-specified, so there is no trivial way to identify them as anomaly datasets. We also cannot convert them like for like: the new format expects a global normal/anomalous label for every sample, while the old format defines a label annotation only for normal samples; anomalous samples carry a mask/bbox/polygon annotation instead. In addition, there is no easy way to know which label categories are normal/anomalous, and OTX actually handles multiple conventions (MVTec vs. Geti).

One way to detect anomaly datasets is therefore to check whether the dataset has both a global classification label and shapes such as masks or polygons. Detected datasets are then converted to a (hopefully) well-specified format:

```python
# Example anomaly sample
class AnomalySample(Sample):
    image: np.ndarray = image_field(dtype=pl.UInt8)

    # Binary label: normal/anomalous
    label: np.ndarray = label_field(dtype=pl.Int32)

    # Optional semantic segmentation mask for anomalous images
    defect_mask: np.ndarray | None = mask_field(dtype=pl.UInt8, semantic=Semantic.Anomaly)

    # Optional list of bounding boxes and associated labels for anomalous images
    defect_bboxes: np.ndarray | None = bbox_field(dtype=pl.Float32)
    defect_categories: np.ndarray | None = label_field(dtype=pl.Int32, is_list=True)
```

When creating a dataset, we can explicitly specify which label corresponds to which type of image (normal vs. anomalous):

```python
categories = {
    "label": LabelCategories(
        labels=["normal", "anomalous"],
        label_semantics={LabelSemantic.NORMAL: "normal", LabelSemantic.ANOMALOUS: "anomalous"},
    )
}
```

If the desired label index differs from the one in the source dataset, Datumaro automatically remaps the indexes.

To support these changes, I refactored the legacy converters to accept a `name_prefix` and a `semantic`. An anomaly dataset can carry two sets of labels: a global normal/anomalous label and a per-defect label category. Since both cannot share the same name and semantic (by design of the new dataset format), the `name_prefix` lets the second label type be called `anomaly_labels` (`name_prefix = "anomaly_"`), and we use the semantic `Semantic.Anomaly` for it.

The conversion of legacy datasets to the new format is not pretty, but it is temporary: the ambiguity will disappear once we implement native import/export in the new dataset class instead of converting from the old one, which has an impedance mismatch.

Corresponding PR in OTX: open-edge-platform/training_extensions#4770

### Checklist
- [x] I have added tests to cover my changes or documented any manual tests.
- [ ] I have updated the [documentation](https://github.com/open-edge-platform/datumaro/tree/develop/docs) accordingly

---------

Signed-off-by: Grégoire Payen de La Garanderie <[email protected]>
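The detection heuristic described above (global label plus shape annotations) can be sketched in plain Python. This is a hedged illustration only: the annotation representation below is invented for the example and is not Datumaro's actual data model.

```python
# Hypothetical sketch of the anomaly-dataset detection heuristic: a legacy
# anomaly dataset is recognised by having BOTH a global classification label
# and shape annotations (mask/bbox/polygon). The list-of-annotation-type-lists
# input is an assumption made for this example, not Datumaro's API.

def looks_like_anomaly_dataset(annotations_per_item):
    """Return True if any item carries a global label AND any item carries
    a shape annotation (mask, bbox, or polygon)."""
    has_global_label = False
    has_shapes = False
    for ann_types in annotations_per_item:
        for ann_type in ann_types:
            if ann_type == "label":
                has_global_label = True
            elif ann_type in {"mask", "bbox", "polygon"}:
                has_shapes = True
    return has_global_label and has_shapes

# Normal samples carry only a label; anomalous ones carry shapes instead.
items = [["label"], ["mask"], ["label"], ["polygon"]]
print(looks_like_anomaly_dataset(items))  # True
```

A pure classification dataset (labels only) or a pure segmentation dataset (shapes only) would return `False` under this heuristic, which is exactly why it singles out the mixed legacy anomaly layout.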
1 parent 10aa9db commit d19fe1d

File tree

4 files changed: +240 −103 lines changed

src/datumaro/experimental/dataset.py

Lines changed: 6 additions & 1 deletion
```diff
@@ -397,7 +397,9 @@ def __setitem__(self, row_idx: int, sample: DType):
         )

     def convert_to_schema(
-        self, target_dtype_or_schema: Union[Schema, Type[DTargetType]]
+        self,
+        target_dtype_or_schema: Union[Schema, Type[DTargetType]],
+        target_categories: Dict[str, Categories] = None,
     ) -> "Dataset[DTargetType]":
         """
         Convert this dataset to a new schema using registered converters.
@@ -417,6 +419,9 @@ def convert_to_schema(
         else:
             target_schema = target_dtype_or_schema.infer_schema()

+        if target_categories is not None:
+            target_schema = target_schema.with_categories(target_categories)
+
         # Early return if schemas are already compatible
         if has_schema(self, target_dtype_or_schema):
             # Same schema but mismatching dtype.
```