[Refactor] Replace Corrupted MNIST with EuroSAT dataset #564

@SkafteNicki

Description

Overview

Replace the corrupted MNIST dataset with EuroSAT (satellite imagery classification) throughout the entire DTU MLOps course. This change will provide richer data engineering opportunities and better align with the new data engineering modules being introduced in #549.

Motivation

Why Replace Corrupted MNIST?

The current corrupted MNIST dataset (rotated MNIST digits) has several limitations:

  • Too simple: Just 28x28 grayscale images with artificial rotation
  • Limited data engineering potential: Single format, small size, no real-world complexity
  • Not engaging: Students aren't satisfied with digit classification
  • Misaligned with new modules: Doesn't showcase data pipelines, versioning, or labeling effectively

Why EuroSAT?

EuroSAT is a land use/land cover classification dataset from Sentinel-2 satellite imagery:

Dataset Characteristics:

  • 10 classes: AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake
  • 27,000 labeled images (2,000-3,000 per class)
  • Two versions:
    • RGB: 64x64x3 images (~90MB) - torchvision support available
    • Multi-Spectral: 64x64x13 images (~2GB) - requires custom loader or TorchGeo
  • Real-world application: Satellite imagery classification
  • Data engineering rich: Multiple bands, larger size, geospatial metadata
  • MIT License: Freely available on Zenodo

Benefits for the Course:

  1. ✅ Real-world relevance (satellite imagery for land use monitoring)
  2. ✅ Progressive complexity (start with RGB, introduce MS for advanced topics)
  3. ✅ Perfect for new data engineering modules (pipelines, versioning, validation)
  4. ✅ Built-in torchvision support for RGB version
  5. ✅ Larger dataset enables meaningful data loading/pipeline exercises
  6. ✅ Multiple spectral bands enable data preprocessing/feature engineering
  7. ✅ Better alignment with MLOps lifecycle (monitoring Earth observation changes)

Alignment with Issue #549:
The new S4 Data Engineering session includes data labeling, pipelines, and validation. EuroSAT provides:

  • Multi-spectral data for versioning exercises (track different band combinations)
  • Larger dataset size makes data pipelines relevant (vs 5KB MNIST chunks)
  • Geospatial context for data quality/validation exercises
  • Realistic preprocessing needs (band normalization, composites)

Scope of Changes

Based on comprehensive codebase analysis, corrupted MNIST appears in:

Documentation Files (20+ files)

  • S1: deep_learning_software.md - Main project introduction
  • S2: code_structure.md, dvc.md - Project organization
  • S3: config_files.md, docker.md - Reproducibility
  • S4: debugging.md, profiling.md, logging.md, boilerplate.md - Debugging/logging
  • S5: unittesting.md, cml.md - Testing and CI/CD
  • S6: using_the_cloud.md - Cloud storage
  • S7: testing_apis.md - Deployment
  • S8: data_drifting.md - Monitoring
  • S10: hyperparameters.md - Optimization

Python Files (30+ files)

  • Data loaders: data_solution.py, data.py, dataset.py
  • Training scripts: main_solution.py, train_solution.py, evaluate_solution.py
  • W&B integration: weights_and_bias_solution*.py
  • Teaching examples: vae_mnist*.py (6 files) - used for debugging/profiling examples
  • Tools: corrupt_mnist.py - dataset generation script

Jupyter Notebooks (6 files)

  • PyTorch introduction notebooks (S1)
  • Fashion MNIST notebooks (keep as-is, used for different exercises)

Supporting Files

  • Test files in S5
  • CML scripts in S5
  • Drift detection scripts in S8

Implementation Plan

Phase 1: Create EuroSAT Infrastructure

1.1 Create Dataset Generation Tool

File: tools/prepare_eurosat.py

"""
Download and prepare EuroSAT dataset for the course.

Creates both RGB and Multi-Spectral versions with proper splits and formats.
"""
import torch
from torchvision import datasets, transforms
from pathlib import Path

# RGB Version (for most exercises)
def prepare_eurosat_rgb(output_dir="data/eurosat"):
    """Download RGB version using torchvision"""
    # torchvision's EuroSAT takes no train/test argument: it returns all images
    dataset = datasets.EuroSAT(
        root=output_dir,
        download=True
    )
    # Create train/val/test splits
    # Save in .pt format for consistency with current course structure
    ...

# Multi-Spectral Version (for advanced data engineering exercises)
def prepare_eurosat_ms(output_dir="data/eurosat_ms"):
    """Download MS version from Zenodo, prepare for advanced exercises"""
    # Download from Zenodo: https://zenodo.org/record/7711810
    # Process 13-band imagery
    # Demonstrate data pipeline preprocessing
    ...

Tasks:

  • Create tools/prepare_eurosat.py script
  • Implement RGB version download (torchvision)
  • Implement MS version download (Zenodo)
  • Create train/val/test splits (60/20/20)
  • Save in .pt format for backward compatibility
  • Add data statistics and visualization
  • Document band information for MS version
  • Test on multiple platforms
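
The 60/20/20 split in the task list could be sketched as follows. This is a minimal, non-stratified version using only the standard library; the function name `split_indices` and the seed are assumptions for illustration, not part of the existing course code.

```python
import random

def split_indices(n_samples, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test

train_idx, val_idx, test_idx = split_indices(27_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 16200 5400 5400
```

Calling `split_indices(27_000, fractions=(0.8, 0.0, 0.2))` instead yields 21,600 training and 5,400 test indices. Because per-class counts vary between 2,000 and 3,000, a stratified split (shuffling within each class) may be preferable if class proportions should match across splits.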

1.2 Update Data Loading Infrastructure

Files:

  • s1_development_environment/exercise_files/final_exercise/data_solution.py
  • s2_organisation_and_version_control/exercise_files/data_solution.py

Replace corrupt_mnist() function with eurosat():

def eurosat(rgb_only=True):
    """
    Load EuroSAT dataset for land use classification.
    
    Args:
        rgb_only: If True, load RGB version. If False, load Multi-Spectral.
    
    Returns:
        train_dataset, test_dataset: PyTorch Dataset objects
    """
    DATA_PATH = "data/eurosat" if rgb_only else "data/eurosat_ms"
    # Load train/test splits
    # Return TensorDataset objects
    ...

Tasks:

  • Update data_solution.py in S1 with EuroSAT loader
  • Update data.py template in S1 for students
  • Update normalized dataset class in S2
  • Replace the MnistDataset class with an EuroSATDataset class
  • Update data statistics (27K images, 10 classes, 64x64x3)
  • Add data exploration utilities (show class distribution, sample images)
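
A class-distribution utility of the kind mentioned in the last task might look like the sketch below. It assumes integer labels that index the class names in alphabetical order; that mapping is an assumption and should be checked against the loader actually used.

```python
from collections import Counter

# Assumed label order: alphabetical, as listed under "Dataset Characteristics".
EUROSAT_CLASSES = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]

def class_distribution(labels, class_names=EUROSAT_CLASSES):
    """Return {class_name: count} for integer labels, including empty classes."""
    counts = Counter(labels)
    unknown = set(counts) - set(range(len(class_names)))
    if unknown:
        raise ValueError(f"labels outside 0..{len(class_names) - 1}: {sorted(unknown)}")
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}

# Toy labels for illustration only; real labels would come from the dataset.
toy_labels = [0, 0, 1, 9, 9, 9]
for name, count in class_distribution(toy_labels).items():
    print(f"{name:<22}{count}")
```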

1.3 Create Reference Solutions

Tasks:

  • Update model_solution.py - CNN architecture for 64x64 RGB input
  • Update main_solution.py - training/evaluation with EuroSAT
  • Update all W&B integration scripts
  • Update Lightning boilerplate examples
  • Test solutions achieve reasonable accuracy (target: >90%)
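
Moving from 28x28 to 64x64 input changes the size of the flattened feature map feeding the first linear layer. The sketch below computes that size with the standard convolution output formula; the three-block architecture is hypothetical and only serves to show the arithmetic.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(input_size, blocks, out_channels):
    """Feature count after a stack of (conv, pool) blocks.

    blocks: list of (conv_kernel, conv_padding, pool_kernel) tuples; the pool
    stride is assumed equal to its kernel size.
    """
    size = input_size
    for conv_kernel, conv_padding, pool_kernel in blocks:
        size = conv_out_size(size, conv_kernel, padding=conv_padding)
        size = conv_out_size(size, pool_kernel, stride=pool_kernel)
    return out_channels * size * size

# Hypothetical architecture: three 3x3 convs (padding 1), each followed by 2x2 pooling.
blocks = [(3, 1, 2)] * 3
print(flattened_features(64, blocks, out_channels=64))  # 4096 for 64x64 EuroSAT input
print(flattened_features(28, blocks, out_channels=64))  # 576 for 28x28 MNIST input
```

A model whose linear layer was sized for MNIST would therefore fail with a shape mismatch on EuroSAT unless this dimension is recomputed or replaced by adaptive pooling.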

Phase 2: Update All Modules Systematically

2.1 Module S1 - Development Environment

File: s1_development_environment/deep_learning_software.md

Changes Needed:

  • Lines 179-421: Replace corrupted MNIST introduction with EuroSAT
  • Update dataset description:
    • Old: "rotated MNIST digits (28x28 grayscale)"
    • New: "EuroSAT satellite imagery for land use classification (64x64 RGB)"
  • Update download instructions (remove Google Drive, use new script)
  • Update file structure: data/corruptmnist/ → data/eurosat/
  • Update class information: 10 digit classes → 10 land use classes
  • Update accuracy target: ≥85% → ≥90% (EuroSAT is well-structured)
  • Update starter code templates (model input size, channels)
  • Add EuroSAT class descriptions and visualization
  • Remove "identify the corruption" exercise (no longer relevant)

Notebooks:

  • Keep notebooks 2-6 as-is (use standard MNIST/Fashion MNIST for PyTorch intro)
  • These are teaching examples, not the main project

2.2 Module S2 - Organisation and Version Control

File: s2_organisation_and_version_control/code_structure.md

Changes Needed:

  • Line 220, 307: Update references to "MNIST classifier" → "EuroSAT classifier"
  • Update data path: ../data/corruptmnist → ../data/eurosat
  • Update processing instructions (normalization for RGB imagery)

File: s2_organisation_and_version_control/dvc.md

Changes Needed:

2.3 Module S3 - Reproducibility

File: s3_reproducibility/config_files.md

Changes Needed:

  • Lines 130-276: Update all "MNIST" references to "EuroSAT"
  • Keep VAE MNIST examples as-is (teaching example, not main project)
  • Update final exercise: "Make your EuroSAT code reproducible!"
  • Update hyperparameters in config examples (image size, channels, etc.)
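
The dataset-dependent values that config examples would need to change can be gathered in one place. The sketch below uses a plain dataclass rather than any particular config library; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DataConfig:
    """Dataset-dependent settings that change when switching datasets."""
    name: str
    image_size: int
    in_channels: int
    num_classes: int

    @property
    def flat_dim(self):
        # Length of one image flattened to a vector
        return self.in_channels * self.image_size * self.image_size

corrupt_mnist = DataConfig("corruptmnist", image_size=28, in_channels=1, num_classes=10)
eurosat_rgb = DataConfig("eurosat", image_size=64, in_channels=3, num_classes=10)

print(asdict(eurosat_rgb))
print(corrupt_mnist.flat_dim, eurosat_rgb.flat_dim)  # 784 12288
```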

File: s3_reproducibility/docker.md

Changes Needed:

  • Lines 69, 176: Update "MNIST repository" → "EuroSAT repository"
  • Update dockerfile examples (data paths, model architecture)

2.4 Module S4 - Debugging and Logging

Files: Keep VAE MNIST examples unchanged (teaching tools)

  • vae_mnist_bugs.py
  • vae_mnist_working.py
  • vae_mnist_pytorch_profiler_solution.py

These are separate teaching examples for debugging/profiling, not the main project.

File: s4_debugging_and_logging/logging.md

Changes Needed:

  • Lines 274, 512: Update W&B project name
    • Old: "corrupt_mnist", corrupt_mnist_models
    • New: "eurosat", eurosat_models
  • Update logged images (satellite imagery instead of digits)

File: s4_debugging_and_logging/boilerplate.md

Changes Needed:

  • Lines 59, 157, 176: Update examples to use EuroSAT
  • Update LightningModule for RGB input

Files: W&B solution scripts

  • weights_and_bias_solution.py
  • weights_and_bias_solution2.py
  • weights_and_bias_solution3.py

Changes:

  • Update project name: "corrupt_mnist" → "eurosat"
  • Update artifact name: "corrupt_mnist_model" → "eurosat_model"
  • Update description: "classify corrupt MNIST images" → "classify satellite imagery"
  • Update logged images (RGB satellite images)

2.5 Module S5 - Continuous Integration

File: s5_continuous_integration/unittesting.md

Changes Needed:

  • Lines 56, 128-144: Update test specifications
    • Old: 30,000/50,000 train, 5,000 test, [1,28,28] or [784], labels 0-9
    • New: ~21,600 train, ~5,400 test, [3,64,64] or [12288], 10 land use classes
  • Update import: from my_project.data import eurosat
  • Update test examples (shape assertions, class validation)
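
The updated shape and class assertions could follow the pattern sketched here. To keep it runnable without the dataset, it checks (shape, label) pairs from stand-in data; in a real test the pairs would come from the `eurosat` loader, and `check_split` is a hypothetical helper name.

```python
EXPECTED_SHAPE = (3, 64, 64)
NUM_CLASSES = 10

def check_split(samples, expected_len):
    """Assert that a split has the expected length, image shape, and label range.

    samples: sequence of (shape, label) pairs; with real tensors the shape
    would be tuple(image.shape).
    """
    assert len(samples) == expected_len, f"expected {expected_len} samples, got {len(samples)}"
    seen = set()
    for shape, label in samples:
        assert tuple(shape) == EXPECTED_SHAPE, f"unexpected image shape {shape}"
        assert 0 <= label < NUM_CLASSES, f"label {label} out of range"
        seen.add(label)
    assert seen == set(range(NUM_CLASSES)), "not every class is represented"

# Stand-in data so the sketch runs without the dataset.
fake_split = [((3, 64, 64), i % NUM_CLASSES) for i in range(50)]
check_split(fake_split, expected_len=50)
print("split checks passed")
```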

File: s5_continuous_integration/cml.md

Changes Needed:

  • Lines 95, 218: Update dataset class
    • Old: Dataset for corrupted MNIST
    • New: Dataset for EuroSAT
  • Update visualization: mnist_images.png → eurosat_images.png
  • Update statistics generation (RGB channel stats, class distribution)

File: s5_continuous_integration/exercise_files/dataset.py

Changes:

  • Rename MnistDataset → EuroSATDataset
  • Update file loading (new .pt file structure)
  • Update image shapes and channels

File: s5_continuous_integration/exercise_files/dataset_statistics.py

Changes:

  • Update to use EuroSATDataset
  • Generate eurosat_images.png
  • Update statistics (channel means/stds for RGB)
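
Per-channel means and standard deviations can be accumulated image by image, as in the sketch below. It uses nested lists to stay dependency-free; a real implementation would more likely use tensor operations, and the toy values are made up.

```python
import math

def channel_stats(images):
    """Per-channel mean and (population) std over an iterable of images.

    Each image is a nested list indexed as [channel][row][col]. Sums are
    accumulated image by image, so the whole dataset never has to be in memory.
    """
    sums, sq_sums, count = None, None, 0
    for image in images:
        if sums is None:
            sums = [0.0] * len(image)
            sq_sums = [0.0] * len(image)
        for c, channel in enumerate(image):
            for row in channel:
                for value in row:
                    sums[c] += value
                    sq_sums[c] += value * value
        count += len(image[0]) * len(image[0][0])  # pixels per channel
    if count == 0:
        raise ValueError("no images given")
    means = [s / count for s in sums]
    stds = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(sq_sums, means)]
    return means, stds

# Two tiny 3-channel 2x2 "images" with made-up values in [0, 1].
toy = [
    [[[0.0, 0.0], [0.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]]],
    [[[1.0, 1.0], [1.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]]],
]
means, stds = channel_stats(toy)
print(means)  # [0.5, 0.5, 1.0]
print(stds)   # [0.5, 0.0, 0.0]
```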

2.6 Module S6 - Cloud

File: s6_the_cloud/using_the_cloud.md

Changes Needed:

  • Line 282: Update "corrupt MNIST repository" → "EuroSAT repository"
  • Update bucket examples (larger dataset size, different structure)

2.7 Module S7 - Deployment

File: s7_deployment/testing_apis.md

Changes Needed:

  • Lines 94, 111: Update API messages
    • Old: "Welcome to the MNIST model inference API!"
    • New: "Welcome to the EuroSAT model inference API!"
  • Update example payloads (RGB image format)
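
One possible shape for an RGB request payload is sketched below: raw uint8 pixels, base64-encoded inside JSON. The field names (`shape`, `dtype`, `data`) are assumptions for illustration; the course API may use a file upload or a different schema.

```python
import base64
import json

HEIGHT, WIDTH, CHANNELS = 64, 64, 3

def encode_image(pixels: bytes) -> str:
    """Wrap raw uint8 RGB bytes (row-major, HWC) in a JSON request body."""
    if len(pixels) != HEIGHT * WIDTH * CHANNELS:
        raise ValueError(f"expected {HEIGHT * WIDTH * CHANNELS} bytes, got {len(pixels)}")
    return json.dumps({
        "shape": [HEIGHT, WIDTH, CHANNELS],
        "dtype": "uint8",
        "data": base64.b64encode(pixels).decode("ascii"),
    })

def decode_image(body: str) -> bytes:
    """Server-side counterpart: validate the payload and return the raw bytes."""
    payload = json.loads(body)
    if payload["shape"] != [HEIGHT, WIDTH, CHANNELS] or payload["dtype"] != "uint8":
        raise ValueError("unsupported image format")
    pixels = base64.b64decode(payload["data"])
    if len(pixels) != HEIGHT * WIDTH * CHANNELS:
        raise ValueError("payload size does not match declared shape")
    return pixels

dummy = bytes(HEIGHT * WIDTH * CHANNELS)  # all-black placeholder image
body = encode_image(dummy)
assert decode_image(body) == dummy
print(len(dummy), "raw bytes ->", len(body), "characters of JSON")
```

Validating the declared shape on the server side gives API tests a clear failure case to exercise, for example a 28x28 grayscale image sent to the new endpoint.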

2.8 Module S8 - Monitoring

File: s8_monitoring/data_drifting.md

Changes Needed:

  • Line 286: Update drift detection example
    • Consider: EuroSAT RGB vs EuroSAT MS bands
    • Or: EuroSAT vs different satellite imagery dataset
  • Opportunity: Use multi-spectral bands to demonstrate feature drift

File: s8_monitoring/exercise_files/image_drift.py

Changes:

  • Replace MNIST vs FashionMNIST comparison
  • Use EuroSAT RGB vs MS, or different band combinations
  • Update feature extraction for RGB
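
As a lightweight illustration of comparing band statistics between a reference set and new data, the sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand. This is a stand-in technique chosen for simplicity, not the method used in the existing exercise, and the intensity values are made up.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:  # the maximum is attained at an observed value
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

# Made-up per-image mean intensities of one band, for illustration only.
reference = [0.30, 0.32, 0.35, 0.31, 0.33]
same = [0.31, 0.30, 0.34, 0.33, 0.32]
shifted = [0.50, 0.52, 0.55, 0.51, 0.53]

print(ks_statistic(reference, same))
print(ks_statistic(reference, shifted))  # 1.0: the two samples do not overlap
```

A statistic near 0 suggests similar distributions and a value near 1 suggests strong drift; running the check per band would show which spectral channels have shifted.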

2.9 Module S9 - Scalable Applications

File: s9_scalable_applications/distributed_training.md

Changes:

  • Line 97: Keep Fashion MNIST example as-is (teaching example)
  • Update main project references to EuroSAT

2.10 Module S10 - Extra

File: s10_extra/hyperparameters.md

Changes Needed:

  • Lines 65, 115, 206: Update main project references
  • Keep sklearn digits and Fashion MNIST examples (teaching tools)
  • Update hyperparameter tuning to work with EuroSAT

Phase 3: Update Supporting Materials

3.1 Documentation

Tasks:

  • Update README.md - course overview
  • Update pages/timeplan.md - project descriptions
  • Update pages/overview.md - dataset mention
  • Update pages/projects.md - project tips and examples
  • Update reports/README.md - project report template

3.2 Assets and Images

Tasks:

  • Replace s1_development_environment/exercise_files/assets/mnist.png
  • Add EuroSAT class visualization
  • Update any architecture diagrams (input dimensions)

3.3 Helper Functions

File: s1_development_environment/exercise_files/helper.py

Changes:

  • Lines 40, 50: Update view_classify() for RGB images
  • Update visualization for 10 land use classes

Phase 4: Clean Up and Deprecation

4.1 Remove Old Tools

Tasks:

  • Delete or archive tools/corrupt_mnist.py
  • Add deprecation notice explaining the transition

4.2 Update Dependencies

File: pyproject.toml

Changes:

  • Ensure torchvision>=0.25 (for EuroSAT support)
  • Add any new dependencies for MS version (if using TorchGeo)
  • Document optional dependencies

4.3 Migration Guide

Create: docs/MIGRATION_EUROSAT.md

Contents:

  • Explanation of why we switched
  • Comparison table (MNIST vs EuroSAT)
  • Migration guide for existing students
  • Troubleshooting common issues
  • Links to EuroSAT papers and documentation

Data Engineering Integration (Issue #549)

Perfect Alignment with New S4 Data Engineering Session

EuroSAT provides excellent opportunities for the new data engineering modules:

Module S4.1 - Data Labeling (Label Studio)

EuroSAT Use Cases:

  • Label new satellite tiles collected from different regions
  • Demonstrate active learning (label uncertain predictions)
  • Multi-class labeling workflow for land use
  • Export and version labeled datasets

Module S4.2 - Data Pipelines (Prefect/MageAI)

EuroSAT Use Cases:

  • RGB Pipeline: Download → Extract → Split → Normalize → Store
  • MS Pipeline: Download → Band Selection → Composite → Normalize → Store
  • Progressive Pipeline: Start with RGB, add MS bands incrementally
  • Data Versioning: Track different band combinations (RGB, NIR+RGB, All 13 bands)
  • Scheduled Updates: Simulate new satellite imagery arriving
  • Preprocessing: Multi-band normalization, composite generation
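
The "Band Selection → Normalize" part of the MS pipeline can be sketched as a chain of plain functions, independent of any orchestration tool. The band order in `BAND_NAMES` and the normalization scale are assumptions and should be verified against the dataset documentation; B04, B03, and B02 are the Sentinel-2 red, green, and blue bands.

```python
# Assumed band order; verify against the dataset documentation before relying on it.
BAND_NAMES = ["B01", "B02", "B03", "B04", "B05", "B06", "B07",
              "B08", "B8A", "B09", "B10", "B11", "B12"]

def select_bands(image, bands):
    """Pick channels by band name from an image stored as a list of 13 channels."""
    if len(image) != len(BAND_NAMES):
        raise ValueError(f"expected {len(BAND_NAMES)} bands, got {len(image)}")
    return [image[BAND_NAMES.index(b)] for b in bands]

def normalize(image, scale):
    """Divide every value by a fixed scale (channels are flat lists here)."""
    return [[v / scale for v in channel] for channel in image]

def run_pipeline(image, steps):
    """Apply the steps in order, mirroring the 'A → B → C' notation above."""
    for step in steps:
        image = step(image)
    return image

# Toy 13-band "image": channel i holds two pixels with value 1000 * i.
toy = [[1000 * i, 1000 * i] for i in range(13)]
rgb = run_pipeline(toy, [
    lambda img: select_bands(img, ["B04", "B03", "B02"]),  # red, green, blue
    lambda img: normalize(img, scale=10_000),              # hypothetical scale
])
print(rgb)  # [[0.3, 0.3], [0.2, 0.2], [0.1, 0.1]]
```

Keeping each step a pure function makes it straightforward to wrap the same steps as tasks in a pipeline framework later, and to version datasets by the list of selected bands.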

Module S4.3 - Data Quality/Validation (Optional)

EuroSAT Use Cases:

  • Validate band ranges and distributions
  • Check for class imbalance
  • Detect corrupted or missing bands
  • Geospatial metadata validation
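
The first three checks could be prototyped as below before adopting a validation framework. The [0, 1] value range assumes already-normalized data, and treating a constant band as possibly corrupted is a heuristic; both are assumptions for illustration.

```python
from collections import Counter

def validate_sample(image, n_bands=13, lo=0.0, hi=1.0):
    """Return a list of problems found in one image ([band][pixel] lists)."""
    problems = []
    if len(image) != n_bands:
        problems.append(f"expected {n_bands} bands, got {len(image)}")
    for i, band in enumerate(image):
        if not band:
            problems.append(f"band {i} is empty")
        elif min(band) < lo or max(band) > hi:
            problems.append(f"band {i} has values outside [{lo}, {hi}]")
        elif min(band) == max(band):
            problems.append(f"band {i} is constant (possibly corrupted)")
    return problems

def imbalance_ratio(labels):
    """Largest class count divided by smallest; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

good = [[0.1, 0.2, 0.3]] * 13
bad = [[0.1, 0.2, 0.3]] * 11 + [[0.0, 0.0, 0.0], [0.5, 1.7, 0.2]]
print(validate_sample(good))  # []
print(validate_sample(bad))
print(imbalance_ratio([0] * 3000 + [1] * 2000))  # 1.5
```

With 2,000 to 3,000 images per class, the imbalance ratio of the full dataset cannot exceed 1.5, which gives a natural threshold for a class-balance check.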

Recommended Approach:

  1. S1-S3: Use RGB version (64x64x3) - torchvision support, easy to start
  2. S4: Introduce MS version (64x64x13) - demonstrate data pipelines
  3. S5-S10: Continue with RGB, optionally use MS for advanced exercises

Testing Strategy

Validation Checklist

  • All exercises work with EuroSAT RGB version
  • Model architectures handle 64x64x3 input
  • Accuracy targets are achievable (>90% on RGB)
  • Data loading is efficient (comparable to MNIST)
  • All cross-references updated
  • No broken links
  • Docker builds work
  • Cloud storage integration works
  • W&B logging displays correctly
  • Unit tests pass
  • CML workflows function
  • API deployment works

Performance Benchmarks

  • Training time on CPU (acceptable for course)
  • Training time on GPU (should be fast)
  • Data download time (< 5 minutes)
  • Storage requirements (< 200MB for RGB)
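
When checking the storage benchmark, note that the ~90MB figure presumably refers to the compressed image download, while dense tensors saved in .pt format are larger. The sketch below computes uncompressed sizes; the dtypes (uint8 and float32 for RGB, uint16 for MS) are assumptions about how the tensors might be stored.

```python
def raw_size_bytes(n_images, height, width, channels, bytes_per_value):
    """Size of the dataset as one dense, uncompressed array."""
    return n_images * height * width * channels * bytes_per_value

N, H, W = 27_000, 64, 64
for label, channels, nbytes in [
    ("RGB uint8", 3, 1),
    ("RGB float32", 3, 4),
    ("MS (13 bands) uint16", 13, 2),
]:
    size = raw_size_bytes(N, H, W, channels, nbytes)
    print(f"{label:<22}{size / 1e6:>10.1f} MB")
```

This gives roughly 332 MB for RGB as uint8, 1.3 GB for RGB as float32, and 2.9 GB for the 13-band version as uint16, so the chosen storage dtype largely decides whether the 200MB target is reachable.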

Implementation Timeline

Recommended Phased Rollout:

  1. Week 1-2: Phase 1 - Infrastructure (tools, data loaders, solutions)
  2. Week 3-4: Phase 2.1-2.5 - Core modules (S1-S5)
  3. Week 5-6: Phase 2.6-2.10 - Advanced modules (S6-S10)
  4. Week 7: Phase 3 - Supporting materials, documentation
  5. Week 8: Phase 4 - Clean up, migration guide, testing
  6. Week 9: Integration with the data engineering modules from #549 ([Refactor] New learning session on data engineering)
  7. Week 10: Final validation and course pilot

Parallel Workstreams:


Success Metrics

  • All 20+ markdown files updated
  • All 30+ Python files updated
  • 0 broken links or references
  • Student feedback positive (more engaging than rotated digits)
  • Data engineering modules benefit from richer dataset
  • Course learning objectives still achieved
  • No increase in course difficulty for beginners
  • Advanced students have more to explore (MS version)

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Larger images increase training time | Medium | Provide pre-computed features; recommend GPU |
| Students struggle with new domain | Low | Add satellite imagery primer; visualize classes |
| Torchvision API changes | Low | Pin torchvision version; document alternatives |
| MS version too complex | Medium | Make MS optional; focus on RGB for core course |
| Download bandwidth issues | Medium | Provide cached downloads; multiple mirrors |
| Existing student projects break | High | Provide migration guide; support both for 1 semester |

Dependencies and Blockers

Depends On:

Blocks:

  • Course material updates for next semester
  • Student project template repository

Coordinate With:


References


Next Steps

  1. Review and approve this plan
  2. Create sub-issues for each phase (optional, or tackle as one large issue)
  3. Begin Phase 1 implementation
  4. Coordinate with #549 ([Refactor] New learning session on data engineering) for S4 data engineering integration
  5. Set up testing environment
  6. Create student migration guide
