feat: Remove legacy CvatDatasetBuilder code, use modernized code #174

Merged
cau-git merged 13 commits into main from cau/cvat_preannotation_cleanup on Nov 28, 2025

@cau-git cau-git commented Nov 13, 2025

  • Aligns the CVAT code with the modern CVAT-to-DoclingDocument converter
  • Updates the test cases and call sites accordingly
  • Greatly reduces the disk space required for CVAT folders: page images are no longer duplicated across multiple directories, and DoclingDocument JSON files no longer embed encoded images

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
…n_groundtruth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Replace legacy CVAT annotation processing code (~1400 lines) with modern
convert_cvat_folder_to_docling() converter in CvatDatasetBuilder. This
significantly simplifies the codebase and aligns with the modern CVAT
converter architecture.

- Remove legacy annotation parsing and document creation methods
- Use convert_cvat_folder_to_docling() for CVAT-to-Docling conversion
- Improve path handling in CvatPreannotationBuilder for moved datasets
- Remove unused original and original_prediction storage in json_dataset_joiner
- Update test data annotations to match new converter output

BREAKING CHANGE: CvatDatasetBuilder now requires modern CVAT folder
structure and uses convert_cvat_folder_to_docling() internally.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

mergify bot commented Nov 13, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
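
The title condition above can be tried locally with Python's `re` module. This is only a minimal sketch: the pattern is copied from the rule, while the helper name `is_conventional_title` is an assumption made for illustration and is not part of Mergify, which evaluates the condition on its own side.

```python
import re

# Pattern copied verbatim from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Hypothetical helper for illustration only.
    """
    # re.match anchors at the start of the string, consistent with the leading ^.
    return CONVENTIONAL_TITLE.match(title) is not None


# The title of this pull request, followed by a title without a type prefix.
print(is_conventional_title("feat: Remove legacy CvatDatasetBuilder code, use modernized code"))
print(is_conventional_title("Remove legacy CvatDatasetBuilder code"))
```

The optional `(?:\(.+\))?` group admits a scope such as `feat(cvat):`, and the optional `(!)?` admits the breaking-change marker, as in `feat!:`.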


github-actions bot commented Nov 13, 2025

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git changed the title from "feat: Remove legcay CvatDatasetBuilder code, use modernized code" to "feat: Remove legacy CvatDatasetBuilder code, use modernized code" on Nov 13, 2025
- Refactor cvat_deliveries_to_hf to build single combined dataset with subset tags
- Add tags field to DatasetRecord for subset tracking
- Add page counting to annotation task creation
- Support multipage TIFF and additional image formats (BMP, GIF)
- Add configurable JSON directory names to pipeline
- Fix caption/footnote target detection logic in CVAT converter

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
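
The idea of one combined dataset with subset tags, as described in the commit above, might look roughly like the following. This is a sketch under stated assumptions: `Record` is a hypothetical stand-in for `DatasetRecord` with only the fields needed here, and `combine_subsets` and the subset names are invented for illustration; the actual implementation in `cvat_deliveries_to_hf` may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for DatasetRecord; not the project's real class."""

    doc_id: str
    # Tags recording which subset(s) a record came from.
    tags: List[str] = field(default_factory=list)


def combine_subsets(subsets: Dict[str, List[str]]) -> List[Record]:
    """Merge per-subset document ids into one list, tagging each record
    with the name of the subset it came from (hypothetical helper)."""
    combined: List[Record] = []
    for subset_name, doc_ids in subsets.items():
        for doc_id in doc_ids:
            combined.append(Record(doc_id=doc_id, tags=[subset_name]))
    return combined


# Two made-up subsets merged into a single list of records.
records = combine_subsets({"delivery_a": ["doc1", "doc2"], "delivery_b": ["doc3"]})

# A subset can still be recovered later by filtering on its tag.
delivery_b = [r.doc_id for r in records if "delivery_b" in r.tags]
print(delivery_b)
```

Keeping the subset name as a tag on each record means a single dataset can be published while individual subsets remain selectable by filtering.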
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
- Use safe label mapping access (.get()) to prevent KeyError exceptions
- Filter unsupported labels using DEFAULT_EXPORT_LABELS as safeguard
- Add custom directory name support for CVAT deliveries via CLI flags
- Preserve original filenames in staging directory
- Refactor DeliveryExportKind enum to cleaner value names
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
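
The first two points of the commit above (safe mapping access and label filtering) can be illustrated with a short sketch. The contents of `LABEL_MAP` and `DEFAULT_EXPORT_LABELS` below are assumptions chosen for the example, and `map_labels` is a hypothetical helper; the project's real mapping and label set may differ.

```python
from typing import Dict, List, Optional, Set

# Hypothetical values for illustration; the real mapping and
# DEFAULT_EXPORT_LABELS are defined in the project and may differ.
LABEL_MAP: Dict[str, str] = {"text": "text", "table": "table", "picture": "picture"}
DEFAULT_EXPORT_LABELS: Set[str] = {"text", "table"}


def map_labels(raw_labels: List[str]) -> List[str]:
    """Map raw labels, skipping unknown or unsupported ones instead of failing."""
    mapped: List[str] = []
    for raw in raw_labels:
        # dict.get returns None for a missing key, where LABEL_MAP[raw]
        # would raise KeyError.
        label: Optional[str] = LABEL_MAP.get(raw)
        # Drop labels that are unmapped or outside the supported export set.
        if label is None or label not in DEFAULT_EXPORT_LABELS:
            continue
        mapped.append(label)
    return mapped


print(map_labels(["text", "unknown", "picture", "table"]))
```

Here `"unknown"` is dropped because it has no mapping, and `"picture"` is dropped because it is not in the assumed export set.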

@PeterStaar-IBM PeterStaar-IBM left a comment

🎖️

@cau-git cau-git merged commit 693c224 into main Nov 28, 2025
9 of 10 checks passed
@cau-git cau-git deleted the cau/cvat_preannotation_cleanup branch November 28, 2025 10:16

4 participants