[Refactor] Replace Corrupted MNIST with EuroSAT dataset

## Overview

Replace the corrupted MNIST dataset with **EuroSAT** (satellite imagery classification) throughout the DTU MLOps course. EuroSAT offers richer data engineering opportunities and aligns better with the new data engineering modules being introduced in #549.

## Motivation

### Why Replace Corrupted MNIST?

The current corrupted MNIST dataset (rotated MNIST digits) has several limitations:
- **Too simple**: Just 28x28 grayscale images with artificial rotation
- **Limited data engineering potential**: Single format, small size, no real-world complexity
- **Not engaging**: Students aren't satisfied with digit classification
- **Misaligned with new modules**: Doesn't showcase data pipelines, versioning, or labeling effectively

### Why EuroSAT?

[EuroSAT](https://github.com/phelber/eurosat) is a land use/land cover classification dataset from Sentinel-2 satellite imagery:

**Dataset Characteristics:**
- **10 classes**: AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake
- **27,000 labeled images** (2,000-3,000 per class)
- **Two versions**:
  - RGB: 64x64x3 images (~90MB) - torchvision support available
  - Multi-Spectral: 64x64x13 images (~2GB) - requires custom loader or TorchGeo
- **Real-world application**: Satellite imagery classification
- **Data engineering rich**: Multiple bands, larger size, geospatial metadata
- **MIT License**: Freely available on Zenodo

**Benefits for the Course:**
1. ✅ Real-world relevance (satellite imagery for land use monitoring)
2. ✅ Progressive complexity (start with RGB, introduce MS for advanced topics)
3. ✅ Perfect for new data engineering modules (pipelines, versioning, validation)
4. ✅ Built-in torchvision support for RGB version
5. ✅ Larger dataset enables meaningful data loading/pipeline exercises
6. ✅ Multiple spectral bands enable data preprocessing/feature engineering
7. ✅ Better alignment with MLOps lifecycle (monitoring Earth observation changes)

**Alignment with Issue #549:**
The new S4 Data Engineering session includes data labeling, pipelines, and validation. EuroSAT provides:
- Multi-spectral data for versioning exercises (track different band combinations)
- Larger dataset size makes data pipelines relevant (vs 5KB MNIST chunks)
- Geospatial context for data quality/validation exercises
- Realistic preprocessing needs (band normalization, composites)

## Scope of Changes

Based on comprehensive codebase analysis, corrupted MNIST appears in:

### Documentation Files (20+ files)
- **S1**: `deep_learning_software.md` - Main project introduction
- **S2**: `code_structure.md`, `dvc.md` - Project organization
- **S3**: `config_files.md`, `docker.md` - Reproducibility
- **S4**: `debugging.md`, `profiling.md`, `logging.md`, `boilerplate.md` - Debugging/logging
- **S5**: `unittesting.md`, `cml.md` - Testing and CI/CD
- **S6**: `using_the_cloud.md` - Cloud storage
- **S7**: `testing_apis.md` - Deployment
- **S8**: `data_drifting.md` - Monitoring
- **S10**: `hyperparameters.md` - Optimization

### Python Files (30+ files)
- **Data loaders**: `data_solution.py`, `data.py`, `dataset.py`
- **Training scripts**: `main_solution.py`, `train_solution.py`, `evaluate_solution.py`
- **W&B integration**: `weights_and_bias_solution*.py`
- **Teaching examples**: `vae_mnist*.py` (6 files) - used for debugging/profiling examples
- **Tools**: `corrupt_mnist.py` - dataset generation script

### Jupyter Notebooks (6 files)
- PyTorch introduction notebooks (S1)
- Fashion MNIST notebooks (keep as-is, used for different exercises)

### Supporting Files
- Test files in S5
- CML scripts in S5
- Drift detection scripts in S8

## Implementation Plan

### Phase 1: Create EuroSAT Infrastructure

#### 1.1 Create Dataset Generation Tool
**File**: `tools/prepare_eurosat.py`

```python
"""
Download and prepare EuroSAT dataset for the course.

Creates both RGB and Multi-Spectral versions with proper splits and formats.
"""
import torch
from torchvision import datasets, transforms
from pathlib import Path

# RGB Version (for most exercises)
def prepare_eurosat_rgb(output_dir="data/eurosat"):
    """Download RGB version using torchvision"""
    dataset_train = datasets.EuroSAT(
        root=output_dir, 
        download=True
    )
    # Create train/val/test splits
    # Save in .pt format for consistency with current course structure
    ...

# Multi-Spectral Version (for advanced data engineering exercises)
def prepare_eurosat_ms(output_dir="data/eurosat_ms"):
    """Download MS version from Zenodo, prepare for advanced exercises"""
    # Download from Zenodo: https://zenodo.org/record/7711810
    # Process 13-band imagery
    # Demonstrate data pipeline preprocessing
    ...
```

**Tasks:**
- [ ] Create `tools/prepare_eurosat.py` script
- [ ] Implement RGB version download (torchvision)
- [ ] Implement MS version download (Zenodo)
- [ ] Create train/val/test splits (60/20/20)
- [ ] Save in `.pt` format for backward compatibility
- [ ] Add data statistics and visualization
- [ ] Document band information for MS version
- [ ] Test on multiple platforms
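
The 60/20/20 split in the task list could be made reproducible and class-balanced along the following lines. This is a minimal sketch rather than the course implementation: `stratified_split` and its parameters are hypothetical names, and it operates on a plain list of integer labels so that it does not depend on how the images are stored.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved.

    Hypothetical helper: `labels` is any sequence of class ids, one per image.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Group sample indices by class so that each class is split separately.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local RNG keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to test
    return train, val, test


if __name__ == "__main__":
    # Toy labels: 10 classes with 2,700 samples each (27,000 in total).
    # The real class sizes differ (2,000-3,000 per class).
    toy_labels = [c for c in range(10) for _ in range(2700)]
    train_idx, val_idx, test_idx = stratified_split(toy_labels)
    print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed and splitting per class means every student gets the same split with the same class balance, which keeps accuracy numbers comparable across exercises.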

#### 1.2 Update Data Loading Infrastructure
**Files**: 
- `s1_development_environment/exercise_files/final_exercise/data_solution.py`
- `s2_organisation_and_version_control/exercise_files/data_solution.py`

Replace `corrupt_mnist()` function with `eurosat()`:

```python
def eurosat(rgb_only=True):
    """
    Load EuroSAT dataset for land use classification.
    
    Args:
        rgb_only: If True, load RGB version. If False, load Multi-Spectral.
    
    Returns:
        train_dataset, test_dataset: PyTorch Dataset objects
    """
    DATA_PATH = "data/eurosat" if rgb_only else "data/eurosat_ms"
    # Load train/test splits
    # Return TensorDataset objects
    ...
```

**Tasks:**
- [ ] Update `data_solution.py` in S1 with EuroSAT loader
- [ ] Update `data.py` template in S1 for students
- [ ] Update normalized dataset class in S2
- [ ] Rename the `MnistDataset` class to `EuroSATDataset`
- [ ] Update data statistics (27K images, 10 classes, 64x64x3)
- [ ] Add data exploration utilities (show class distribution, sample images)

#### 1.3 Create Reference Solutions
**Tasks:**
- [ ] Update `model_solution.py` - CNN architecture for 64x64 RGB input
- [ ] Update `main_solution.py` - training/evaluation with EuroSAT
- [ ] Update all W&B integration scripts
- [ ] Update Lightning boilerplate examples
- [ ] Test solutions achieve reasonable accuracy (target: >90%)
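
Moving from 28x28 to 64x64 inputs changes the size of the first fully connected layer in the reference CNN. The sketch below applies the standard output-size formula for convolution and pooling layers to show how the flattened feature count grows. The layer stack and the 128 output channels are hypothetical, chosen only for illustration and not taken from `model_solution.py`.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size, channels_out, layers):
    """Apply (kernel, stride, padding) layers in order and return C * H * W."""
    size = input_size
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return channels_out * size * size


# Hypothetical stack: three 3x3 convs (stride 1, no padding), then one 2x2 max-pool.
LAYERS = [(3, 1, 0), (3, 1, 0), (3, 1, 0), (2, 2, 0)]

mnist_features = flattened_features(28, 128, LAYERS)    # 28 -> 26 -> 24 -> 22 -> 11
eurosat_features = flattened_features(64, 128, LAYERS)  # 64 -> 62 -> 60 -> 58 -> 29
print(mnist_features, eurosat_features)
```

The number of input channels (1 versus 3) only affects the weights of the first convolution; the spatial arithmetic above is what determines the size of the linear layer.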

---

### Phase 2: Update All Modules Systematically

#### 2.1 Module S1 - Development Environment
**File**: `s1_development_environment/deep_learning_software.md`

**Changes Needed:**
- [ ] Lines 179-421: Replace corrupted MNIST introduction with EuroSAT
- [ ] Update dataset description:
  - Old: "rotated MNIST digits (28x28 grayscale)"
  - New: "EuroSAT satellite imagery for land use classification (64x64 RGB)"
- [ ] Update download instructions (remove Google Drive, use new script)
- [ ] Update file structure: `data/corruptmnist/` → `data/eurosat/`
- [ ] Update class information: 10 digit classes → 10 land use classes
- [ ] Update accuracy target: ≥85% → ≥90% (EuroSAT is well-structured)
- [ ] Update starter code templates (model input size, channels)
- [ ] Add EuroSAT class descriptions and visualization
- [ ] Remove "identify the corruption" exercise (no longer relevant)

**Notebooks:**
- [ ] Keep notebooks 2-6 as-is: they use standard MNIST/Fashion MNIST as PyTorch teaching examples and are not part of the main project

#### 2.2 Module S2 - Organisation and Version Control
**File**: `s2_organisation_and_version_control/code_structure.md`

**Changes Needed:**
- [ ] Lines 220 and 307: Update references to "MNIST classifier" → "EuroSAT classifier"
- [ ] Update data path: `../data/corruptmnist` → `../data/eurosat`
- [ ] Update processing instructions (normalization for RGB imagery)

**File**: `s2_organisation_and_version_control/dvc.md`

**Changes Needed:**
- [ ] Line 121: Update "In your MNIST repository" → "In your EuroSAT repository"
- [ ] **Note**: This module will be removed in #560, but update for now

#### 2.3 Module S3 - Reproducibility
**File**: `s3_reproducibility/config_files.md`

**Changes Needed:**
- [ ] Lines 130-276: Update all "MNIST" references to "EuroSAT"
- [ ] Keep VAE MNIST examples as-is (teaching example, not main project)
- [ ] Update final exercise: "Make your EuroSAT code reproducible!"
- [ ] Update hyperparameters in config examples (image size, channels, etc.)

**File**: `s3_reproducibility/docker.md`

**Changes Needed:**
- [ ] Lines 69, 176: Update "MNIST repository" → "EuroSAT repository"
- [ ] Update dockerfile examples (data paths, model architecture)

#### 2.4 Module S4 - Debugging and Logging
**Files**: Keep VAE MNIST examples unchanged (teaching tools)
- `vae_mnist_bugs.py`
- `vae_mnist_working.py`
- `vae_mnist_pytorch_profiler_solution.py`

These are separate teaching examples for debugging/profiling, not the main project.

**File**: `s4_debugging_and_logging/logging.md`

**Changes Needed:**
- [ ] Lines 274, 512: Update W&B project name
  - Old: `"corrupt_mnist"`, `corrupt_mnist_models`
  - New: `"eurosat"`, `eurosat_models`
- [ ] Update logged images (satellite imagery instead of digits)

**File**: `s4_debugging_and_logging/boilerplate.md`

**Changes Needed:**
- [ ] Lines 59, 157, 176: Update examples to use EuroSAT
- [ ] Update LightningModule for RGB input

**Files**: W&B solution scripts
- `weights_and_bias_solution.py`
- `weights_and_bias_solution2.py`
- `weights_and_bias_solution3.py`

**Changes:**
- [ ] Update project name: `"corrupt_mnist"` → `"eurosat"`
- [ ] Update artifact name: `"corrupt_mnist_model"` → `"eurosat_model"`
- [ ] Update description: "classify corrupt MNIST images" → "classify satellite imagery"
- [ ] Update logged images (RGB satellite images)

#### 2.5 Module S5 - Continuous Integration
**File**: `s5_continuous_integration/unittesting.md`

**Changes Needed:**
- [ ] Lines 56, 128-144: Update test specifications
  - Old: 30,000/50,000 train, 5,000 test, [1,28,28] or [784], labels 0-9
  - New: ~16,200 train, ~5,400 validation, ~5,400 test (60/20/20 split from Phase 1.1), [3,64,64] or [12288], 10 land use classes
- [ ] Update import: `from my_project.data import eurosat`
- [ ] Update test examples (shape assertions, class validation)
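
The expected sizes and shapes follow directly from the dataset constants, so the tests can derive them instead of hard-coding numbers. The sketch below assumes the 60/20/20 split listed in Phase 1.1. The helper names are hypothetical, and a real test would call the course's `eurosat()` loader rather than these stand-ins.

```python
N_IMAGES = 27_000
IMAGE_SHAPE = (3, 64, 64)  # channels, height, width
N_CLASSES = 10
SPLIT = {"train": 0.6, "val": 0.2, "test": 0.2}  # assumed, from the Phase 1.1 task list


def expected_sizes(n_images=N_IMAGES, split=SPLIT):
    """Approximate samples per split (exact counts depend on per-class rounding)."""
    return {name: round(n_images * frac) for name, frac in split.items()}


def check_sample(image_shape, label):
    """Assertions a unit test could make on a single (image, label) pair."""
    assert tuple(image_shape) in {IMAGE_SHAPE, (3 * 64 * 64,)}, image_shape
    assert 0 <= label < N_CLASSES, label


def test_expected_sizes():
    sizes = expected_sizes()
    assert sizes == {"train": 16_200, "val": 5_400, "test": 5_400}
    assert sum(sizes.values()) == N_IMAGES


def test_check_sample():
    check_sample((3, 64, 64), 0)  # image kept as a [3, 64, 64] tensor
    check_sample((12288,), 9)     # image flattened to a vector
```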

**File**: `s5_continuous_integration/cml.md`

**Changes Needed:**
- [ ] Lines 95, 218: Update dataset class
  - Old: Dataset for corrupted MNIST
  - New: Dataset for EuroSAT
- [ ] Update visualization: `mnist_images.png` → `eurosat_images.png`
- [ ] Update statistics generation (RGB channel stats, class distribution)

**File**: `s5_continuous_integration/exercise_files/dataset.py`

**Changes:**
- [ ] Rename `MnistDataset` → `EuroSATDataset`
- [ ] Update file loading (new .pt file structure)
- [ ] Update image shapes and channels

**File**: `s5_continuous_integration/exercise_files/dataset_statistics.py`

**Changes:**
- [ ] Update to use `EuroSATDataset`
- [ ] Generate `eurosat_images.png`
- [ ] Update statistics (channel means/stds for RGB)
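
Per-channel means and standard deviations can be accumulated batch by batch, so the statistics script never needs all 27,000 images in memory at once. The following is a pure-Python sketch of that idea. `ChannelStats` is a hypothetical helper, and in the course code the same running sums would more likely be computed with tensor operations.

```python
import math


class ChannelStats:
    """Accumulate per-channel mean and standard deviation over many pixels."""

    def __init__(self, n_channels=3):
        self.count = 0
        self.total = [0.0] * n_channels
        self.total_sq = [0.0] * n_channels

    def update(self, pixels):
        """Add an iterable of pixels, each a sequence with one value per channel."""
        for pixel in pixels:
            self.count += 1
            for c, value in enumerate(pixel):
                self.total[c] += value
                self.total_sq[c] += value * value

    def mean(self):
        return [t / self.count for t in self.total]

    def std(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        return [
            math.sqrt(max(sq / self.count - m * m, 0.0))
            for sq, m in zip(self.total_sq, self.mean())
        ]


stats = ChannelStats()
stats.update([(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)])  # first batch of RGB pixels
stats.update([(0.0, 0.5, 1.0), (1.0, 0.5, 0.0)])  # second batch
print(stats.mean(), stats.std())
```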

#### 2.6 Module S6 - Cloud
**File**: `s6_the_cloud/using_the_cloud.md`

**Changes Needed:**
- [ ] Line 282: Update "corrupt MNIST repository" → "EuroSAT repository"
- [ ] Update bucket examples (larger dataset size, different structure)

#### 2.7 Module S7 - Deployment
**File**: `s7_deployment/testing_apis.md`

**Changes Needed:**
- [ ] Lines 94, 111: Update API messages
  - Old: "Welcome to the MNIST model inference API!"
  - New: "Welcome to the EuroSAT model inference API!"
- [ ] Update example payloads (RGB image format)
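
One possible payload format for an RGB tile is sketched below: the raw 8-bit pixel bytes are base64-encoded and sent as JSON together with the image shape. The field names `image_b64` and `shape` are assumptions made for this example and are not part of the existing course API.

```python
import base64
import json

HEIGHT, WIDTH, CHANNELS = 64, 64, 3


def encode_payload(pixels: bytes) -> str:
    """Wrap raw uint8 RGB bytes (row-major, HWC order) in a JSON request body."""
    expected = HEIGHT * WIDTH * CHANNELS
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    return json.dumps({
        "image_b64": base64.b64encode(pixels).decode("ascii"),
        "shape": [HEIGHT, WIDTH, CHANNELS],
    })


def decode_payload(body: str) -> bytes:
    """Server-side counterpart: validate the shape and recover the raw bytes."""
    data = json.loads(body)
    if data["shape"] != [HEIGHT, WIDTH, CHANNELS]:
        raise ValueError(f"unexpected shape {data['shape']}")
    return base64.b64decode(data["image_b64"])


dummy_image = bytes([128]) * (HEIGHT * WIDTH * CHANNELS)  # uniform grey tile
body = encode_payload(dummy_image)
assert decode_payload(body) == dummy_image
```

Uploading the image file directly as multipart form data would be an equally reasonable choice; the JSON variant is shown only because it is easy to exercise from API tests.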

#### 2.8 Module S8 - Monitoring
**File**: `s8_monitoring/data_drifting.md`

**Changes Needed:**
- [ ] Line 286: Update drift detection example
  - Consider: EuroSAT RGB vs EuroSAT MS bands
  - Or: EuroSAT vs different satellite imagery dataset
- [ ] **Opportunity**: Use multi-spectral bands to demonstrate feature drift

**File**: `s8_monitoring/exercise_files/image_drift.py`

**Changes:**
- [ ] Replace MNIST vs FashionMNIST comparison
- [ ] Use EuroSAT RGB vs MS, or different band combinations
- [ ] Update feature extraction for RGB
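
Whichever comparison is chosen, drift on a scalar image feature (for example the mean value of one band per image) can be quantified with a two-sample Kolmogorov-Smirnov statistic. The sketch below implements the statistic in plain Python on simulated feature values. The Gaussian parameters are invented for illustration, and this is not necessarily the method the existing exercise uses.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


rng = random.Random(0)
reference = [rng.gauss(0.40, 0.10) for _ in range(1000)]  # e.g. mean red-band value per image
same = [rng.gauss(0.40, 0.10) for _ in range(1000)]       # new data, same distribution
shifted = [rng.gauss(0.50, 0.10) for _ in range(1000)]    # simulated drift

print(ks_statistic(reference, same), ks_statistic(reference, shifted))
```

A value near 0 means the two samples look alike and a value near 1 means they barely overlap, so the shifted sample should score clearly higher than the unshifted one.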

#### 2.9 Module S9 - Scalable Applications
**File**: `s9_scalable_applications/distributed_training.md`

**Changes:**
- [ ] Line 97: Keep Fashion MNIST example as-is (teaching example)
- [ ] Update main project references to EuroSAT

#### 2.10 Module S10 - Extra
**File**: `s10_extra/hyperparameters.md`

**Changes Needed:**
- [ ] Lines 65, 115, 206: Update main project references
- [ ] Keep sklearn digits and Fashion MNIST examples (teaching tools)
- [ ] Update hyperparameter tuning to work with EuroSAT

---

### Phase 3: Update Supporting Materials

#### 3.1 Documentation
**Tasks:**
- [ ] Update `README.md` - course overview
- [ ] Update `pages/timeplan.md` - project descriptions
- [ ] Update `pages/overview.md` - dataset mention
- [ ] Update `pages/projects.md` - project tips and examples
- [ ] Update `reports/README.md` - project report template

#### 3.2 Assets and Images
**Tasks:**
- [ ] Replace `s1_development_environment/exercise_files/assets/mnist.png`
- [ ] Add EuroSAT class visualization
- [ ] Update any architecture diagrams (input dimensions)

#### 3.3 Helper Functions
**File**: `s1_development_environment/exercise_files/helper.py`

**Changes:**
- [ ] Lines 40, 50: Update `view_classify()` for RGB images
- [ ] Update visualization for 10 land use classes

---

### Phase 4: Clean Up and Deprecation

#### 4.1 Remove Old Tools
**Tasks:**
- [ ] Delete or archive `tools/corrupt_mnist.py`
- [ ] Add deprecation notice explaining the transition

#### 4.2 Update Dependencies
**File**: `pyproject.toml`

**Changes:**
- [ ] Ensure `torchvision>=0.25` and confirm the pinned version provides `datasets.EuroSAT`
- [ ] Add any new dependencies for MS version (if using TorchGeo)
- [ ] Document optional dependencies

#### 4.3 Migration Guide
**Create**: `docs/MIGRATION_EUROSAT.md`

**Contents:**
- [ ] Explanation of why we switched
- [ ] Comparison table (MNIST vs EuroSAT)
- [ ] Migration guide for existing students
- [ ] Troubleshooting common issues
- [ ] Links to EuroSAT papers and documentation

---

## Data Engineering Integration (Issue #549)

### Perfect Alignment with New S4 Data Engineering Session

EuroSAT provides excellent opportunities for the new data engineering modules:

#### Module S4.1 - Data Labeling (Label Studio)
**EuroSAT Use Cases:**
- Label new satellite tiles collected from different regions
- Demonstrate active learning (label uncertain predictions)
- Multi-class labeling workflow for land use
- Export and version labeled datasets

#### Module S4.2 - Data Pipelines (Prefect/MageAI)
**EuroSAT Use Cases:**
- **RGB Pipeline**: Download → Extract → Split → Normalize → Store
- **MS Pipeline**: Download → Band Selection → Composite → Normalize → Store
- **Progressive Pipeline**: Start with RGB, add MS bands incrementally
- **Data Versioning**: Track different band combinations (RGB, NIR+RGB, All 13 bands)
- **Scheduled Updates**: Simulate new satellite imagery arriving
- **Preprocessing**: Multi-band normalization, composite generation
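
The band selection, composite, and normalize steps could look roughly like this. It is a sketch in plain Python on a toy tile, not a Prefect or MageAI flow, and all function names are hypothetical. It assumes a band-first tile whose first eight channels are ordered B01 to B08; that ordering should be verified against the downloaded files. In Sentinel-2, B02, B03, B04, and B08 are the blue, green, red, and near-infrared bands.

```python
# Assumed band order for the first eight channels of an MS tile: B01..B08.
# Verify against the files actually downloaded before relying on these indices.
BAND_INDEX = {"B02": 1, "B03": 2, "B04": 3, "B08": 7}  # blue, green, red, near-infrared


def select_bands(tile, names):
    """Band selection step: keep only the named bands, in the given order."""
    return [tile[BAND_INDEX[name]] for name in names]


def ndvi(tile, eps=1e-6):
    """Composite step: NDVI = (NIR - red) / (NIR + red), computed per pixel."""
    red, nir = select_bands(tile, ["B04", "B08"])
    return [(n - r) / (n + r + eps) for r, n in zip(red, nir)]


def min_max_normalize(band):
    """Normalize step: rescale one band to [0, 1]; a constant band maps to zeros."""
    lo, hi = min(band), max(band)
    if hi == lo:
        return [0.0 for _ in band]
    return [(v - lo) / (hi - lo) for v in band]


def rgb_pipeline(tile):
    """Band selection -> normalize, returning three bands in R, G, B order."""
    return [min_max_normalize(b) for b in select_bands(tile, ["B04", "B03", "B02"])]


# Toy tile: 13 bands with 4 pixels each (a real tile has 64 * 64 pixels per band).
toy_tile = [[float(band * 100 + px) for px in range(4)] for band in range(13)]
rgb = rgb_pipeline(toy_tile)
vegetation_index = ndvi(toy_tile)
```

Because each step is a small pure function, the same pieces can later be wrapped as tasks in whichever orchestration tool the module settles on, and each band combination can be versioned as a separate pipeline output.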

#### Module S4.3 - Data Quality/Validation (Optional)
**EuroSAT Use Cases:**
- Validate band ranges and distributions
- Check for class imbalance
- Detect corrupted or missing bands
- Geospatial metadata validation
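
Several of these checks need very little code. The sketch below covers band count, value range, NaN detection, and class imbalance for band-first tiles. The function names are hypothetical and the default value range is a placeholder chosen for illustration; the real bounds should be measured from the data.

```python
from collections import Counter

N_BANDS = 13


def validate_tile(tile, value_range=(0.0, 10000.0)):
    """Return a list of problems found in one band-first tile (empty list means OK)."""
    problems = []
    if len(tile) != N_BANDS:
        problems.append(f"expected {N_BANDS} bands, found {len(tile)}")
    lo, hi = value_range  # placeholder bounds, not a documented sensor range
    for b, band in enumerate(tile):
        if not band:
            problems.append(f"band {b} is empty")
        elif any(v != v for v in band):  # NaN is the only value not equal to itself
            problems.append(f"band {b} contains NaN")
        elif min(band) < lo or max(band) > hi:
            problems.append(f"band {b} outside [{lo}, {hi}]")
    return problems


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


good_tile = [[1.0, 2.0] for _ in range(N_BANDS)]
print(validate_tile(good_tile), imbalance_ratio([0] * 3000 + [1] * 2000))
```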

**Recommended Approach:**
1. **S1-S3**: Use RGB version (64x64x3) - torchvision support, easy to start
2. **S4**: Introduce MS version (64x64x13) - demonstrate data pipelines
3. **S5-S10**: Continue with RGB, optionally use MS for advanced exercises

---

## Testing Strategy

### Validation Checklist
- [ ] All exercises work with EuroSAT RGB version
- [ ] Model architectures handle 64x64x3 input
- [ ] Accuracy targets are achievable (>90% on RGB)
- [ ] Data loading is efficient (comparable to MNIST)
- [ ] All cross-references updated
- [ ] No broken links
- [ ] Docker builds work
- [ ] Cloud storage integration works
- [ ] W&B logging displays correctly
- [ ] Unit tests pass
- [ ] CML workflows function
- [ ] API deployment works

### Performance Benchmarks
- [ ] Training time on CPU (acceptable for course)
- [ ] Training time on GPU (should be fast)
- [ ] Data download time (< 5 minutes)
- [ ] Storage requirements (< 200MB for RGB)
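
The storage target is worth checking against the planned `.pt` export, because the ~90MB figure refers to the compressed image archive. A quick estimate of the uncompressed tensor sizes follows. It assumes dense arrays, and 16-bit storage for the multi-spectral bands, which should be confirmed against the actual files.

```python
N_IMAGES = 27_000
HEIGHT = WIDTH = 64


def raw_size_mb(channels, bytes_per_value):
    """Uncompressed size in megabytes (10**6 bytes) of the dataset as one dense array."""
    return N_IMAGES * HEIGHT * WIDTH * channels * bytes_per_value / 1e6


rgb_uint8 = raw_size_mb(channels=3, bytes_per_value=1)    # 8-bit RGB
rgb_float32 = raw_size_mb(channels=3, bytes_per_value=4)  # float32 tensors
ms_uint16 = raw_size_mb(channels=13, bytes_per_value=2)   # 13 bands, assumed 16-bit
print(f"{rgb_uint8:.0f} MB, {rgb_float32:.0f} MB, {ms_uint16:.0f} MB")
```

Under these assumptions, a dense float32 export of the RGB images comes to roughly 1.3GB and even uint8 storage to roughly 330MB, so the < 200MB target looks realistic only for the compressed archive, not for uncompressed `.pt` tensors.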

---

## Implementation Timeline

**Recommended Phased Rollout:**

1. **Week 1-2**: Phase 1 - Infrastructure (tools, data loaders, solutions)
2. **Week 3-4**: Phase 2.1-2.5 - Core modules (S1-S5)
3. **Week 5-6**: Phase 2.6-2.10 - Advanced modules (S6-S10)
4. **Week 7**: Phase 3 - Supporting materials, documentation
5. **Week 8**: Phase 4 - Clean up, migration guide, testing
6. **Week 9**: Integration with #549 data engineering modules
7. **Week 10**: Final validation and course pilot

**Parallel Workstreams:**
- Can work on S1-S3 immediately (independent)
- S4 data engineering modules (#549) can proceed in parallel
- S5-S10 depend on S1-S3 completion

---

## Success Metrics

- [ ] All 20+ markdown files updated
- [ ] All 30+ Python files updated  
- [ ] 0 broken links or references
- [ ] Student feedback positive (more engaging than rotated digits)
- [ ] Data engineering modules benefit from richer dataset
- [ ] Course learning objectives still achieved
- [ ] No increase in course difficulty for beginners
- [ ] Advanced students have more to explore (MS version)

---

## Risks and Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| Larger images increase training time | Medium | Provide pre-computed features; recommend GPU |
| Students struggle with new domain | Low | Add satellite imagery primer; visualize classes |
| Torchvision API changes | Low | Pin torchvision version; document alternatives |
| MS version too complex | Medium | Make MS optional; focus on RGB for core course |
| Download bandwidth issues | Medium | Provide cached downloads; multiple mirrors |
| Existing student projects break | High | Provide migration guide; support both for 1 semester |

---

## Dependencies and Blockers

**Depends On:**
- Issue #549 (Data Engineering Session) - should coordinate but not block
- Issue #560 (Remove DVC) - can proceed independently

**Blocks:**
- Course material updates for next semester
- Student project template repository

**Coordinate With:**
- #549 Sub-issues for S4 data engineering modules
- Any ongoing course restructuring

---

## References

- **EuroSAT Dataset**: https://github.com/phelber/eurosat
- **EuroSAT Paper**: [ResearchGate](https://www.researchgate.net/publication/319463676_EuroSAT_A_Novel_Dataset_and_Deep_Learning_Benchmark_for_Land_Use_and_Land_Cover_Classification)
- **Zenodo Download**: https://zenodo.org/record/7711810
- **Torchvision Docs**: https://pytorch.org/vision/stable/generated/torchvision.datasets.EuroSAT.html
- **TorchGeo** (for MS version): https://torchgeo.readthedocs.io/

---

## Next Steps

1. Review and approve this plan
2. Create sub-issues for each phase (optional, or tackle as one large issue)
3. Begin Phase 1 implementation
4. Coordinate with #549 for S4 data engineering integration
5. Set up testing environment
6. Create student migration guide
