docling-project
diff --git a/docling_eval/campaign_tools/README_cvat_evaluation_pipeline.md b/docling_eval/campaign_tools/README_cvat_evaluation_pipeline.md
Lines changed: 177 additions & 0 deletions
diff --git a/docling_eval/campaign_tools/collect_images_from_cvat_xml.py b/docling_eval/campaign_tools/collect_images_from_cvat_xml.py
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,177 @@
+# CVAT Evaluation Pipeline Utility
+
+A pipeline for evaluating CVAT annotations: it converts CVAT XML files to DoclingDocument format, then runs layout and document structure evaluations.
+
+## Features
+
+- Convert CVAT XML annotations to DoclingDocument JSON format
+- Create ground truth datasets from CVAT annotations
+- Create prediction datasets for evaluation
+- Run layout and document structure evaluations
+- Support for step-by-step or end-to-end execution
+- Configurable evaluation modalities
+
+## Requirements
+
+The utility requires the following inputs:
+1. **Images Directory**: Directory containing PNG image files
+2. **Ground Truth XML**: CVAT XML file with ground truth annotations
+3. **Prediction XML**: CVAT XML file with prediction annotations (different from ground truth)
+4. **Output Directory**: Directory where all pipeline outputs will be saved
+
+## Usage
+
+### Command Line Interface
+
+```bash
+python cvat_evaluation_pipeline.py <images_dir> <output_dir> [OPTIONS]
+```
+
+### Required Arguments
+
+- `images_dir`: Directory containing PNG image files
+- `output_dir`: Output directory for pipeline results
+
+### Optional Arguments
+
+- `--gt-xml PATH`: Path to ground truth CVAT XML file (used by the `gt` and `full` steps)
+- `--pred-xml PATH`: Path to prediction CVAT XML file (used by the `pred` and `full` steps)
+- `--step {gt,pred,eval,full}`: Pipeline step to run (default: full)
+- `--modalities {layout,document_structure}`: Evaluation modalities to run (default: both)
+- `--verbose, -v`: Enable verbose logging
+
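The documented interface can be sketched with `argparse`. This is a hypothetical reconstruction from the argument list above, not the parser in `cvat_evaluation_pipeline.py`; in particular, accepting several `--modalities` values at once is an assumption.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented in the README."""
    parser = argparse.ArgumentParser(prog="cvat_evaluation_pipeline.py")
    parser.add_argument("images_dir", type=Path, help="Directory with PNG images")
    parser.add_argument("output_dir", type=Path, help="Directory for pipeline results")
    parser.add_argument("--gt-xml", type=Path, default=None)
    parser.add_argument("--pred-xml", type=Path, default=None)
    parser.add_argument(
        "--step", choices=["gt", "pred", "eval", "full"], default="full"
    )
    # Assumption: one or more modalities may be given; the default is both.
    parser.add_argument(
        "--modalities",
        nargs="+",
        choices=["layout", "document_structure"],
        default=["layout", "document_structure"],
    )
    parser.add_argument("--verbose", "-v", action="store_true")
    return parser
```

Calling `build_parser().parse_args([...])` with the arguments from the examples in this README should yield a namespace with `step`, `gt_xml`, `pred_xml`, `modalities`, and `verbose` attributes.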
+## Examples
+
+### 1. Run Full Pipeline
+
+Convert both ground truth and prediction CVAT XMLs, create datasets, and run evaluations:
+
+```bash
+python cvat_evaluation_pipeline.py \
+    /path/to/images \
+    /path/to/output \
+    --gt-xml /path/to/ground_truth.xml \
+    --pred-xml /path/to/predictions.xml
+```
+
+### 2. Run Step by Step
+
+**Step 1: Create Ground Truth Dataset**
+```bash
+python cvat_evaluation_pipeline.py \
+    /path/to/images \
+    /path/to/output \
+    --gt-xml /path/to/ground_truth.xml \
+    --step gt
+```
+
+**Step 2: Create Prediction Dataset**
+```bash
+python cvat_evaluation_pipeline.py \
+    /path/to/images \
+    /path/to/output \
+    --pred-xml /path/to/predictions.xml \
+    --step pred
+```
+
+**Step 3: Run Evaluation**
+```bash
+python cvat_evaluation_pipeline.py \
+    /path/to/images \
+    /path/to/output \
+    --step eval
+```
+
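For scripted runs, the three step-by-step invocations above can be generated as argument lists. The helper name `step_commands` is made up for illustration; only the flags shown in the README are used.

```python
def step_commands(
    images_dir: str, output_dir: str, gt_xml: str, pred_xml: str
) -> list[list[str]]:
    """Build the gt, pred, and eval command lines in the order they should run."""
    base = ["python", "cvat_evaluation_pipeline.py", images_dir, output_dir]
    return [
        base + ["--gt-xml", gt_xml, "--step", "gt"],
        base + ["--pred-xml", pred_xml, "--step", "pred"],
        base + ["--step", "eval"],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, so that a failing step stops the sequence before the next one starts.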
+### 3. Run Specific Evaluation Modalities
+
+Run only layout evaluation:
+```bash
+python cvat_evaluation_pipeline.py \
+    /path/to/images \
+    /path/to/output \
+    --gt-xml /path/to/ground_truth.xml \
+    --pred-xml /path/to/predictions.xml \
+    --modalities layout
+```
+
+Run only document structure evaluation:
+```bash
+python cvat_evaluation_pipeline.py \
+    /path/to/images \
+    /path/to/output \
+    --gt-xml /path/to/ground_truth.xml \
+    --pred-xml /path/to/predictions.xml \
+    --modalities document_structure
+```
+
+## Output Structure
+
+The pipeline creates the following directory structure in the output directory:
+
+```
+output_dir/
+├── ground_truth_json/          # Ground truth DoclingDocument JSON files
+│   ├── gt_image1.json
+│   └── gt_image2.json
+├── predictions_json/           # Prediction DoclingDocument JSON files
+│   ├── pred_image1.json
+│   └── pred_image2.json
+├── gt_dataset/                # Ground truth dataset
+│   ├── test/
+│   └── visualizations/
+├── eval_dataset/              # Prediction dataset used for evaluation
+│   ├── test/
+│   └── visualizations/
+└── evaluation_results/        # Evaluation results
+    ├── layout_evaluation/
+    └── document_structure_evaluation/
+```
+
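A quick way to confirm that a run produced the tree above is to check each directory. The sketch below uses only the directory names listed in the README; the function names and dictionary labels are illustrative.

```python
from pathlib import Path


def expected_layout(output_dir: Path) -> dict[str, Path]:
    """Map a short label to each directory in the documented output tree."""
    results = output_dir / "evaluation_results"
    return {
        "gt_json": output_dir / "ground_truth_json",
        "pred_json": output_dir / "predictions_json",
        "gt_dataset": output_dir / "gt_dataset",
        "eval_dataset": output_dir / "eval_dataset",
        "layout_results": results / "layout_evaluation",
        "structure_results": results / "document_structure_evaluation",
    }


def missing_outputs(output_dir: Path) -> list[str]:
    """Return the labels of expected directories that do not exist yet."""
    layout = expected_layout(output_dir)
    return sorted(label for label, path in layout.items() if not path.is_dir())
```

After a partial run (for example `--step gt`), `missing_outputs` would be expected to list the prediction and evaluation directories.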
+## Pipeline Steps Explained
+
+### Step 1: Ground Truth Dataset Creation
+- Converts ground truth CVAT XML to DoclingDocument JSON format
+- Creates a ground truth dataset using FileDatasetBuilder
+- Generates visualizations for quality inspection
+
+### Step 2: Prediction Dataset Creation
+- Converts prediction CVAT XML to DoclingDocument JSON format
+- Creates a prediction dataset using FilePredictionProvider
+- Links predictions to the ground truth dataset for evaluation
+
+### Step 3: Evaluation
+- Runs layout evaluation (mean Average Precision metrics)
+- Runs document structure evaluation (edit distance metrics)
+- Saves detailed evaluation results and visualizations
+
+## Error Handling
+
+The utility handles errors as follows:
+- Validates input paths and file existence
+- Provides clear error messages for missing requirements
+- Continues processing other files if individual conversions fail
+- Logs warnings for failed conversions without stopping the pipeline
+
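The "continue on failure" behaviour described above follows a common pattern, sketched here under the assumption that each file is converted by an independent callable. `convert_all` and `convert_one` are hypothetical names, not functions from docling-eval.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable

logger = logging.getLogger("cvat_pipeline_sketch")


def convert_all(
    items: Iterable[Path], convert_one: Callable[[Path], None]
) -> tuple[list[Path], list[Path]]:
    """Convert every item, logging failures instead of aborting the whole run."""
    succeeded: list[Path] = []
    failed: list[Path] = []
    for item in items:
        try:
            convert_one(item)
            succeeded.append(item)
        except Exception as exc:  # one bad file should not stop the batch
            logger.warning("Conversion failed for %s: %s", item, exc)
            failed.append(item)
    return succeeded, failed
```

Returning the failed items alongside the successful ones lets the caller report a summary at the end of the run.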
+## Logging
+
+The utility provides detailed logging with timestamps:
+- INFO level: Progress updates and results
+- WARNING level: Non-critical issues (e.g., failed conversions)
+- ERROR level: Critical errors that stop execution
+- Use `--verbose` flag for DEBUG level logging
+
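A logging setup matching this description could look like the following. The format string is an assumption; the README only states that messages carry timestamps and that `--verbose` enables DEBUG.

```python
import logging


def configure_logging(verbose: bool = False) -> None:
    """Log at INFO by default and at DEBUG when verbose is requested."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(
        level=level,
        # Assumed format: timestamp, level name, then the message.
        format="%(asctime)s - %(levelname)s - %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
```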
+## Integration with Existing Codebase
+
+This utility is designed to work with the existing docling-eval framework and uses:
+- `docling_eval.cvat_tools.cvat_to_docling` for CVAT conversion
+- `docling_eval.dataset_builders.file_dataset_builder` for dataset creation
+- `docling_eval.prediction_providers.file_provider` for prediction datasets
+- `docling_eval.cli.main.evaluate` for running evaluations
+
+## Tips for Best Results
+
+1. **Image Naming**: Ensure PNG files have consistent naming that matches the CVAT annotations
+2. **XML Validation**: Verify that both ground truth and prediction XML files are valid CVAT exports
+3. **Output Space**: Ensure sufficient disk space for intermediate JSON files and datasets
+4. **Step-by-Step**: For large datasets, consider running steps separately for better resource management
+5. **Visualization**: Check the generated visualizations to verify conversion quality 
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+"""
+Script to collect images from CVAT XML annotation file.
+
+This script:
+1. Parses a CVAT XML annotation file to extract image filenames
+2. Walks the root directory for subdirectories that contain a 'cvat_tasks' folder
+3. Searches recursively inside those subdirectories for the listed images
+4. Copies found images to an output directory
+"""
+
+import argparse
+import shutil
+import sys
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Set
+
+
+def extract_image_filenames(xml_path: Path) -> Set[str]:
+    """Extract image filenames from CVAT XML file."""
+    try:
+        tree = ET.parse(xml_path)
+        root = tree.getroot()
+
+        # Find all image elements and extract their 'name' attributes
+        image_filenames = set()
+        for image_elem in root.findall(".//image"):
+            name_attr = image_elem.get("name")
+            if name_attr:
+                image_filenames.add(name_attr)
+
+        return image_filenames
+    except ET.ParseError as e:
+        print(f"Error parsing XML file: {e}", file=sys.stderr)
+        sys.exit(1)
+    except Exception as e:
+        print(f"Unexpected error reading XML file: {e}", file=sys.stderr)
+        sys.exit(1)
+
+
+def find_images_in_subdirectories(
+    root_dir: Path, image_filenames: Set[str]
+) -> dict[str, Path]:
+    """Find images in subdirectories that contain 'cvat_tasks' folder."""
+    found_images = {}
+
+    # Walk through all subdirectories
+    for subdir in root_dir.rglob("*"):
+        if not subdir.is_dir():
+            continue
+
+        # Check if this subdirectory contains a 'cvat_tasks' folder
+        cvat_tasks_path = subdir / "cvat_tasks"
+        if not cvat_tasks_path.exists() or not cvat_tasks_path.is_dir():
+            continue
+
+        # Search recursively within this subdirectory for images
+        for image_filename in image_filenames:
+            # Keep the first match; skip names already found in an earlier subdirectory
+            if image_filename in found_images:
+                continue
+            # Look for the image in this directory and all its subdirectories
+            for potential_image_path in subdir.rglob(image_filename):
+                if potential_image_path.is_file():
+                    found_images[image_filename] = potential_image_path
+                    break  # Found this image, move to next filename
+
+    return found_images
+
+
+def copy_images_to_output(found_images: dict[str, Path], output_dir: Path) -> None:
+    """Copy found images to output directory."""
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    copied_count = 0
+    for image_filename, source_path in found_images.items():
+        dest_path = output_dir / image_filename
+
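+        try:
+            # Image names in the XML may include subdirectories; create them first
+            dest_path.parent.mkdir(parents=True, exist_ok=True)
+            shutil.copy2(source_path, dest_path)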
+            print(f"Copied: {source_path} -> {dest_path}")
+            copied_count += 1
+        except Exception as e:
+            print(f"Error copying {source_path}: {e}", file=sys.stderr)
+
+    print(f"\nSuccessfully copied {copied_count} images to {output_dir}")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Collect images from CVAT XML annotation file"
+    )
+    parser.add_argument("xml_file", type=Path, help="Path to CVAT XML annotation file")
+    parser.add_argument(
+        "root_dir", type=Path, help="Root directory to search for images"
+    )
+    parser.add_argument(
+        "output_dir", type=Path, help="Output directory for collected images"
+    )
+
+    args = parser.parse_args()
+
+    # Validate input file exists
+    if not args.xml_file.exists():
+        print(f"Error: XML file '{args.xml_file}' does not exist", file=sys.stderr)
+        sys.exit(1)
+
+    if not args.root_dir.is_dir():
+        print(
+            f"Error: Root directory '{args.root_dir}' is not an existing directory",
+            file=sys.stderr,
+        )
+        sys.exit(1)
+
+    print(f"Parsing XML file: {args.xml_file}")
+    image_filenames = extract_image_filenames(args.xml_file)
+    print(f"Found {len(image_filenames)} image filenames in XML")
+
+    print(f"Searching for images in: {args.root_dir}")
+    found_images = find_images_in_subdirectories(args.root_dir, image_filenames)
+    print(
+        f"Found {len(found_images)} images in subdirectories with 'cvat_tasks' folders"
+    )
+
+    if not found_images:
+        print("No images found. Exiting.")
+        return
+
+    # Show which images were found
+    print("\nFound images:")
+    for filename, path in found_images.items():
+        print(f"  {filename} -> {path}")
+
+    # Show missing images
+    missing_images = image_filenames - set(found_images.keys())
+    if missing_images:
+        print(f"\nMissing images ({len(missing_images)}):")
+        for filename in sorted(missing_images):
+            print(f"  {filename}")
+
+    print(f"\nCopying images to: {args.output_dir}")
+    copy_images_to_output(found_images, args.output_dir)
+
+
+if __name__ == "__main__":
+    main()
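
The XML parsing step in `extract_image_filenames` can be exercised in isolation. The sample document below is a made-up minimal example containing only the `<image name="...">` elements the script looks for; real CVAT exports carry more attributes and child elements.

```python
import xml.etree.ElementTree as ET

# Invented sample: three <image> elements, one of them without a name.
SAMPLE_XML = """<annotations>
  <image id="0" name="doc_0001.png" width="100" height="100"/>
  <image id="1" name="doc_0002.png" width="100" height="100"/>
  <image id="2" width="100" height="100"/>
</annotations>"""


def image_names(xml_text: str) -> set[str]:
    """Collect the non-empty 'name' attributes of all <image> elements."""
    root = ET.fromstring(xml_text)
    names = set()
    for elem in root.findall(".//image"):
        name = elem.get("name")
        if name:
            names.add(name)
    return names
```

Elements without a `name` attribute are skipped, which matches the `if name_attr:` guard in the script.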