The HOTOSM-EO-TT (Humanitarian OpenStreetMap Team - Earth Observation - Top Talented) project is a comprehensive deep learning solution for automated building detection, segmentation, and material classification using satellite and aerial imagery. This project is designed to support humanitarian mapping, disaster response, and infrastructure monitoring by providing robust tools for extracting building footprints and classifying roof materials.
- Building Detection: Accurately identify and localize buildings in satellite imagery
- Building Segmentation: Generate precise building footprints and boundaries
- Material Classification: Classify roof materials for infrastructure assessment
- Scalable Pipeline: Create reusable pipelines for different geographic regions
The project is structured into two distinct phases, each addressing specific challenges in geospatial AI:
Phase 1 focuses on developing robust building detection and segmentation capabilities using state-of-the-art computer vision models. The approach combines object detection (YOLO) with segmentation (SAM) to achieve high-precision building extraction.
- Initial Testing: Evaluated multiple YOLO versions (v8, v11, v12)
- Dataset: Trained on RAMP dataset for building detection
- Innovation Team Insights: Leveraged previous team's research and findings
- Final Selection: YOLO v11n and YOLO v11s based on performance metrics
- Results: Achieved 78% mAP on RAMP dataset
- Fine-tuning: Customized SAM model on RAMP dataset
- Integration Strategy: Used YOLO detection results as input to SAM
- Input Methods Tested:
  - Points: single-point prompts for each building
  - Boxes: bounding-box prompts for each building
  - Point-by-Point: individual point processing
  - Box-by-Box: individual bounding-box processing
- Evaluation: Calculated IoU (Intersection over Union) for each method
- HOTOSM Areas: Tested on multiple geographic regions provided by HOTOSM team
- Transfer Learning: Applied RAMP-trained models to new areas
- Performance: Achieved improved mAP and SAM IoU scores
- Conclusion: YOLO + SAM pipeline successfully generalized across different regions
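The detection-to-segmentation hand-off can be sketched as a small driver function. This is a minimal, framework-agnostic sketch: `detect` and `segment` are hypothetical stand-ins for a trained YOLO model and a SAM predictor (e.g. a box-prompted `SamPredictor.predict` call); the real pipeline lives in the YOLO-SAM inference scripts listed below.

```python
import numpy as np

def yolo_sam_pipeline(image, detect, segment, conf_thresh=0.5):
    """Box-by-box hand-off from a detector to a promptable segmenter.

    detect(image)        -> (N, 5) array of [x1, y1, x2, y2, confidence]
    segment(image, box)  -> boolean mask of shape (H, W)
    """
    detections = np.asarray(detect(image), dtype=float).reshape(-1, 5)
    # Keep only confident detections before prompting the segmenter.
    kept = detections[detections[:, 4] >= conf_thresh]
    # Each surviving box becomes one segmentation prompt ("box-by-box").
    return [segment(image, box[:4]) for box in kept]
```

Swapping the `segment` callable is what makes it easy to compare the point, box, point-by-point, and box-by-box prompting strategies under the same IoU evaluation.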
| Model | Training Time | Epochs | mAP | SAM IoU | Status |
|---|---|---|---|---|---|
| YOLO v11n | 72 hrs | 100 | 78% | Improved | ✅ Selected |
| YOLO v11s | 72 hrs | 100 | 78% | Improved | ✅ Selected |
| SAM (Fine-tuned) | - | - | - | High | ✅ Integrated |
Phase 2 extends the building detection capabilities to include material classification, enabling detailed infrastructure assessment through roof material identification.
**Approach 1: Two-Stage Pipeline (YOLO + CNN Classifiers)**
Methodology:
- Stage 1: YOLO object detection to identify buildings
- Stage 2: Crop detected buildings and classify materials using:
  - EfficientNet-B5
  - VGG16
- Dataset: RoofNet (15 material classes)
Results:
- Overall accuracy: <65%
- Multiple classes showed low accuracy
- Status: ❌ Insufficient performance
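Stage 2 of this two-stage approach is a crop-then-classify loop. A minimal sketch assuming numpy image arrays; `classify` is a hypothetical stand-in for an EfficientNet-B5 or VGG16 forward pass:

```python
import numpy as np

def crop_and_classify(image, boxes, classify):
    """Crop each detected building and classify its roof material.

    boxes:    iterable of [x1, y1, x2, y2] pixel coordinates
    classify: crop -> class id (stand-in for a CNN classifier)
    """
    h, w = image.shape[:2]
    labels = []
    for x1, y1, x2, y2 in np.asarray(boxes, dtype=int):
        # Clamp boxes to the image so slicing never goes out of bounds.
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        labels.append(classify(image[y1:y2, x1:x2]))
    return labels
```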
**Approach 2: Multi-Class YOLO Detection and Segmentation**
Methodology:
- Model: YOLO for multi-class object detection and segmentation
- Dataset: Nacala dataset (5 material classes)
Results:
- mAP: 65%
- Segmentation quality: Much better than Approach 1
- Status: ✅ Best performing approach
**Approach 3: RoofNet Converted to Object Detection**
Methodology:
- Dataset Conversion: Transformed RoofNet from a classification dataset into an object detection dataset
- Process:
  - Used YOLO v11n (trained on RAMP) to detect buildings in RoofNet images
  - Created bounding-box annotations by combining the detected buildings with the class implied by each image's folder
- Training: Fine-tuned YOLO on the converted RoofNet dataset
Results:
- Performance: Not promising
- Status: ❌ Insufficient results
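The conversion step amounts to: each RoofNet image yields one YOLO label file whose class comes from the image's folder and whose boxes come from the building detector. A sketch of that logic; the folder-to-class mapping and helper names here are illustrative, not the project's actual code:

```python
def to_yolo_label(class_id, box_xyxy, img_w, img_h):
    """Format one detection as a YOLO label line:
    'class cx cy w h' with all coordinates normalised to [0, 1]."""
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    bw = (x2 - x1) / img_w
    bh = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"

def convert_image(folder_name, detected_boxes, img_w, img_h, class_map):
    """Every building detected in an image inherits the material class
    implied by the image's folder (RoofNet groups images by material)."""
    class_id = class_map[folder_name]
    return [to_yolo_label(class_id, b, img_w, img_h) for b in detected_boxes]
```

Note the weakness this encodes: every detected building in an image gets the folder's single class, which is one plausible reason the converted dataset underperformed.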
**Enhancement: Ground DINO for Zero-Shot Detection**
Innovation: Introduced Ground DINO for building detection.
Advantages:
- Zero-shot capability: No additional training required
- Out-of-the-box performance: Excellent results without fine-tuning
- Comparison: Outperformed YOLO, which required fine-tuning on the HOTOSM areas
Final Approach:
- Backbone: DINOv3 for feature extraction
- Segmentation Head: ViTPerHead for precise segmentation
- Training Data: 5% of RAMP dataset
- Results:
  - IoU: 70%
  - Dice coefficient: 82% (Dice loss: 18%)
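The backbone + head split can be illustrated with a deliberately simplified stand-in: a per-patch linear classifier over the backbone's patch-feature grid, followed by nearest-neighbour upsampling. The project's actual head is ViTPerHead; this numpy sketch only shows the data flow, and all names in it are illustrative.

```python
import numpy as np

def linear_seg_head(patch_features, weight, bias, out_size):
    """Simplified segmentation head over ViT patch features.

    patch_features: (h, w, C) grid of backbone embeddings
    weight, bias:   (C, n_classes) and (n_classes,) linear classifier
    out_size:       (H, W) target pixel resolution
    """
    logits = patch_features @ weight + bias           # (h, w, n_classes)
    h, w, _ = patch_features.shape
    H, W = out_size
    # Nearest-neighbour upsample from the patch grid to pixel resolution.
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return logits[rows][:, cols]                      # (H, W, n_classes)
```

A frozen self-supervised backbone plus a small trainable head is what makes training on only 5% of RAMP plausible.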
| Approach | Model | Dataset | Classes | mAP/IoU | Status |
|---|---|---|---|---|---|
| 1 | YOLO + EfficientNet/VGG16 | RoofNet | 15 | <65% | ❌ |
| 2 | YOLO Multi-class | Nacala | 5 | 65% mAP | ✅ Best |
| 3 | YOLO (Converted) | RoofNet | 15 | Low | ❌ |
| Enhancement | Ground DINO | - | - | High | ✅ Zero-shot |
| Final | DINOv3 + ViTPerHead | RAMP (5%) | - | 70% IoU | ✅ |
**YOLO**
- Versions: v8, v11, v12
- Selected: v11n, v11s
- Architecture: Single-stage object detector
- Advantages: Fast inference, good accuracy
- Use Case: Building detection and material classification
**SAM (Segment Anything Model)**
- Type: Foundation model for segmentation
- Fine-tuning: Customized on RAMP dataset
- Input Types: Points, bounding boxes
- Integration: Receives YOLO detection results
- Output: Precise building segmentation masks
**DINOv3**
- Type: Self-supervised vision transformer
- Use Case: Feature extraction backbone
- Advantages: Strong feature representation
- Integration: Combined with ViTPerHead for segmentation
**Ground DINO**
- Type: Zero-shot object detection model
- Advantages: No training required, excellent generalization
- Use Case: Building detection without fine-tuning
- Performance: Superior to fine-tuned YOLO
**EfficientNet-B5 / VGG16**
- Type: Classification models
- Use Case: Material classification from cropped building images
- Performance: Limited effectiveness for this task
**mAP (Mean Average Precision)**
- Definition: Mean of the per-class average precision (AP)
- Range: 0-100%
- Interpretation: Higher values indicate better detection accuracy
- Our Results: 65% mAP on Nacala dataset
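For reference, per-class average precision can be computed from ranked detections as the area under the precision-recall curve; mAP is then the mean over classes. A minimal all-point-interpolation sketch, not the exact protocol used by the YOLO tooling:

```python
import numpy as np

def average_precision(confidences, is_tp, n_gt):
    """AP as area under the precision-recall curve.

    confidences: score of each detection
    is_tp:       1 if the detection matched a ground-truth box, else 0
    n_gt:        number of ground-truth boxes for this class
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / n_gt
    # Sentinels plus the monotone precision envelope (all-point interpolation).
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```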
**IoU (Intersection over Union)**
- Definition: Ratio of the intersection to the union of the predicted and ground-truth masks
- Range: 0-1 (0-100%)
- Interpretation: Higher values indicate better segmentation quality
- Our Results: 70% IoU with DINOv3 + ViTPerHead
**Dice Loss**
- Definition: 1 - Dice coefficient, where the Dice coefficient measures overlap between the predicted and ground-truth masks
- Range: 0-1
- Interpretation: Lower values indicate better segmentation
- Our Results: 82% Dice coefficient (18% Dice loss)
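Both overlap metrics above are straightforward to compute on binary masks; a minimal numpy sketch:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                 # both masks empty: treat as perfect match
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def dice(pred, gt):
    """Dice coefficient; Dice loss is 1 - dice(pred, gt)."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    total = pred.sum() + gt.sum()
    if total == 0:
        return 1.0
    return 2 * np.logical_and(pred, gt).sum() / total
```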
- Download RAMP Dataset: Download and setup RAMP dataset for building detection
- Download Nacala Dataset: Download Nacala dataset for material classification
- Download RoofNet Dataset: Download RoofNet dataset for roof material classification
- Original Paper Nacala Roof Inference: Reproduce original Nacala paper results
- Original Paper RoofNet Inference: Reproduce original RoofNet paper results
- YOLO Training Pipeline: Complete YOLO training pipeline for building detection
- ETL Pipeline for YOLO: Data preprocessing and annotation conversion for YOLO format
- YOLO Requirements: Dependencies for YOLO training
- SAM Fine-tuning Pipeline: Fine-tune SAM model on RAMP dataset
- ETL Pipeline for SAM: Data preprocessing for SAM training
- SAM Requirements: Dependencies for SAM training
- DINOv3 Training Pipeline: Train DINOv3 model for segmentation
- ETL Pipeline for DINOv3: Data preprocessing for DINOv3 training
- YOLO-SAM Inference Pipeline: Combined YOLO detection + SAM segmentation pipeline
- Ground DINO Inference Pipeline: Zero-shot building detection using Ground DINO
- YOLO-SAM Requirements: Dependencies for inference pipeline
- YOLO-SAM Evaluation: Comprehensive evaluation of YOLO-SAM pipeline results
- Evaluation Requirements: Dependencies for evaluation
- Material Classification Pipeline: Complete pipeline for roof material classification
- Material Classification Config: Configuration file for material classification
- Nacala Object Detection Pipeline: YOLO-based material detection on Nacala dataset
- RoofNet Object Detection Pipeline: YOLO-based material detection on RoofNet dataset
- RoofNet to Object Detection Mapping: Convert RoofNet classification dataset to object detection format
**Building Detection Models**
| Model | Training Required | mAP | Inference Speed | Generalization |
|---|---|---|---|---|
| YOLO v11n | Yes (RAMP + HOTOSM) | High | Fast | Good |
| YOLO v11s | Yes (RAMP + HOTOSM) | High | Fast | Good |
| Ground DINO | No (Zero-shot) | High | Medium | Excellent |
**Segmentation Models**
| Model | IoU | Dice Coefficient | Training Data | Specialization |
|---|---|---|---|---|
| SAM (Fine-tuned) | High | High | RAMP | Building-specific |
| DINOv3 + ViTPerHead | 70% | 82% | RAMP (5%) | General |
**Material Classification Approaches**
| Approach | Dataset | Classes | Accuracy | Complexity |
|---|---|---|---|---|
| YOLO + EfficientNet/VGG16 | RoofNet | 15 | <65% | High |
| YOLO Multi-class | Nacala | 5 | 65% mAP | Medium |
- Use Ground DINO: For new regions, start with Ground DINO for zero-shot building detection
- Fine-tune SAM: For specific regions, fine-tune SAM on local data
- Material Classification: Focus on Nacala dataset approach (5 classes) for better results
- DINOv3 Integration: Consider DINOv3 + ViTPerHead for high-precision segmentation
- YOLO - Object detection framework
- Segment Anything Model (SAM) - Foundation model for segmentation
- DINOv3 - Self-supervised vision transformer
- Ground DINO - Zero-shot object detection