Overview
Replace the corrupted MNIST dataset with EuroSAT (satellite imagery classification) throughout the entire DTU MLOps course. This change will provide richer data engineering opportunities and better align with the new data engineering modules being introduced in #549.
Motivation
Why Replace Corrupted MNIST?
The current corrupted MNIST dataset (rotated MNIST digits) has several limitations:
Too simple: Just 28x28 grayscale images with artificial rotation
Limited data engineering potential: Single format, small size, no real-world complexity
Not engaging: Students aren't satisfied with digit classification
Misaligned with new modules: Doesn't showcase data pipelines, versioning, or labeling effectively
Why EuroSAT?
EuroSAT is a land use/land cover classification dataset from Sentinel-2 satellite imagery:
Dataset Characteristics:
10 classes: AnnualCrop, Forest, HerbaceousVegetation, Highway, Industrial, Pasture, PermanentCrop, Residential, River, SeaLake
27,000 labeled images (2,000-3,000 per class)
Two versions:
RGB: 64x64x3 images (~90MB) - torchvision support available
Multi-Spectral: 64x64x13 images (~2GB) - requires custom loader or TorchGeo
Real-world application: Satellite imagery classification
Rich in data engineering opportunities: Multiple bands, larger size, geospatial metadata
MIT License: Freely available on Zenodo
Benefits for the Course:
✅ Real-world relevance (satellite imagery for land use monitoring)
✅ Progressive complexity (start with RGB, introduce MS for advanced topics)
✅ Perfect for new data engineering modules (pipelines, versioning, validation)
✅ Built-in torchvision support for RGB version
✅ Larger dataset enables meaningful data loading/pipeline exercises
✅ Multiple spectral bands enable data preprocessing/feature engineering
✅ Better alignment with MLOps lifecycle (monitoring Earth observation changes)
Alignment with Issue #549:
The new S4 Data Engineering session includes data labeling, pipelines, and validation. EuroSAT provides:
Multi-spectral data for versioning exercises (track different band combinations)
Larger dataset size makes data pipelines relevant (vs 5KB MNIST chunks)
Geospatial context for data quality/validation exercises
Realistic preprocessing needs (band normalization, composites)
Scope of Changes
Based on comprehensive codebase analysis, corrupted MNIST appears in:
Documentation Files (20+ files)
S1: deep_learning_software.md - Main project introduction
S2: code_structure.md, dvc.md - Project organization
S3: config_files.md, docker.md - Reproducibility
S4: debugging.md, profiling.md, logging.md, boilerplate.md - Debugging/logging
S5: unittesting.md, cml.md - Testing and CI/CD
S6: using_the_cloud.md - Cloud storage
S7: testing_apis.md - Deployment
S8: data_drifting.md - Monitoring
S10: hyperparameters.md - Optimization
Python Files (30+ files)
Data loaders: data_solution.py, data.py, dataset.py
Training scripts: main_solution.py, train_solution.py, evaluate_solution.py
W&B integration: weights_and_bias_solution*.py
Teaching examples: vae_mnist*.py (6 files) - used for debugging/profiling examples
Tools: corrupt_mnist.py - dataset generation script
Jupyter Notebooks (6 files)
PyTorch introduction notebooks (S1)
Fashion MNIST notebooks (keep as-is, used for different exercises)
Supporting Files
Test files in S5
CML scripts in S5
Drift detection scripts in S8
Implementation Plan
Phase 1: Create EuroSAT Infrastructure
1.1 Create Dataset Generation Tool
File: tools/prepare_eurosat.py
    """
    Download and prepare EuroSAT dataset for the course.
    Creates both RGB and Multi-Spectral versions with proper splits and formats.
    """
    import torch
    from torchvision import datasets, transforms
    from pathlib import Path


    # RGB Version (for most exercises)
    def prepare_eurosat_rgb(output_dir="data/eurosat"):
        """Download RGB version using torchvision"""
        dataset_train = datasets.EuroSAT(
            root=output_dir,
            download=True,
        )
        # Create train/val/test splits
        # Save in .pt format for consistency with current course structure
        ...


    # Multi-Spectral Version (for advanced data engineering exercises)
    def prepare_eurosat_ms(output_dir="data/eurosat_ms"):
        """Download MS version from Zenodo, prepare for advanced exercises"""
        # Download from Zenodo: https://zenodo.org/record/7711810
        # Process 13-band imagery
        # Demonstrate data pipeline preprocessing
        ...
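The "Create train/val/test splits" step in the skeleton above is left open. One possible way to do it is a per-class (stratified) split, sketched here with the standard library only. The function name `stratified_split`, the 70/10/20 fractions, and the toy label list are illustrative assumptions, not part of the plan.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Return (train, val, test) index lists that keep class proportions.

    `fractions` is a hypothetical default; the plan does not fix split sizes.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever is left after rounding goes to the test split.
        test += indices[n_train + n_val:]
    return train, val, test


# Toy stand-in for the real label list: 10 classes with 100 samples each.
labels = [c for c in range(10) for _ in range(100)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With the real dataset, the returned index lists could be used to slice the image and label tensors before saving each split.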
Tasks:
1.2 Update Data Loading Infrastructure
Files:
s1_development_environment/exercise_files/final_exercise/data_solution.py
s2_organisation_and_version_control/exercise_files/data_solution.py
Replace corrupt_mnist() function with eurosat():
    def eurosat(rgb_only=True):
        """
        Load EuroSAT dataset for land use classification.

        Args:
            rgb_only: If True, load RGB version. If False, load Multi-Spectral.

        Returns:
            train_dataset, test_dataset: PyTorch Dataset objects
        """
        DATA_PATH = "data/eurosat" if rgb_only else "data/eurosat_ms"
        # Load train/test splits
        # Return TensorDataset objects
        ...
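A minimal sketch of how the loader body might be filled in, assuming the preparation script has saved each split as a pair of `.pt` tensors. The file names (`train_images.pt`, `train_target.pt`, and so on) and the extra `root` parameter are hypothetical; the demo writes small random tensors to a temporary directory so it runs without downloading anything.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import TensorDataset


def eurosat(rgb_only=True, root="data"):
    """Load train/test splits saved as .pt tensors (file names are assumed)."""
    data_path = Path(root) / ("eurosat" if rgb_only else "eurosat_ms")
    splits = []
    for split in ("train", "test"):
        images = torch.load(data_path / f"{split}_images.pt")
        targets = torch.load(data_path / f"{split}_target.pt")
        if len(images) != len(targets):
            raise ValueError(f"{split}: image/label count mismatch")
        splits.append(TensorDataset(images.float(), targets.long()))
    return tuple(splits)


# Demo with synthetic data standing in for the prepared dataset.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "eurosat"
    data_dir.mkdir()
    for split, n in (("train", 8), ("test", 4)):
        torch.save(torch.rand(n, 3, 64, 64), data_dir / f"{split}_images.pt")
        torch.save(torch.randint(0, 10, (n,)), data_dir / f"{split}_target.pt")
    train_set, test_set = eurosat(rgb_only=True, root=tmp)

print(len(train_set), len(test_set))
```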
Tasks:
1.3 Create Reference Solutions
Tasks:
Phase 2: Update All Modules Systematically
2.1 Module S1 - Development Environment
File: s1_development_environment/deep_learning_software.md
Changes Needed:
Notebooks:
2.2 Module S2 - Organisation and Version Control
File: s2_organisation_and_version_control/code_structure.md
Changes Needed:
File: s2_organisation_and_version_control/dvc.md
Changes Needed:
2.3 Module S3 - Reproducibility
File: s3_reproducibility/config_files.md
Changes Needed:
File: s3_reproducibility/docker.md
Changes Needed:
2.4 Module S4 - Debugging and Logging
Files: Keep VAE MNIST examples unchanged (teaching tools)
vae_mnist_bugs.py
vae_mnist_working.py
vae_mnist_pytorch_profiler_solution.py
These are separate teaching examples for debugging/profiling, not the main project.
File: s4_debugging_and_logging/logging.md
Changes Needed:
File: s4_debugging_and_logging/boilerplate.md
Changes Needed:
File: Update W&B solution scripts:
weights_and_bias_solution.py
weights_and_bias_solution2.py
weights_and_bias_solution3.py
Changes:
2.5 Module S5 - Continuous Integration
File: s5_continuous_integration/unittesting.md
Changes Needed:
Lines 56, 128-144: Update test specifications
Old: 30,000/50,000 train, 5,000 test, [1,28,28] or [784], labels 0-9
New: ~21,600 train, ~5,400 test, [3,64,64] or [12288], 10 land use classes
Update import: from my_project.data import eurosat
Update test examples (shape assertions, class validation)
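The updated test specification could look roughly like the sketch below. It runs against a small synthetic stand-in rather than the real dataset, so the helper `make_fake_eurosat` and the sample counts are purely illustrative; in the course the datasets would come from `from my_project.data import eurosat`.

```python
import torch
from torch.utils.data import TensorDataset

N_CLASSES = 10
IMAGE_SHAPE = (3, 64, 64)
FLAT_SHAPE = (3 * 64 * 64,)  # 12288


def make_fake_eurosat(n_samples):
    """Hypothetical stand-in for the real loader, used only in this sketch."""
    images = torch.rand(n_samples, *IMAGE_SHAPE)
    targets = torch.arange(n_samples) % N_CLASSES  # every class appears
    return TensorDataset(images, targets)


def check_dataset(dataset, expected_len):
    """Shape and label checks matching the new specification."""
    assert len(dataset) == expected_len
    for image, label in dataset:
        assert tuple(image.shape) in (IMAGE_SHAPE, FLAT_SHAPE)
        assert 0 <= int(label) < N_CLASSES


def test_data():
    # Real sizes would be roughly 21,600 and 5,400; small numbers keep this fast.
    train, test = make_fake_eurosat(216), make_fake_eurosat(54)
    check_dataset(train, 216)
    check_dataset(test, 54)
    # All 10 land use classes should be represented in both splits.
    assert torch.unique(train.tensors[1]).tolist() == list(range(N_CLASSES))
    assert torch.unique(test.tensors[1]).tolist() == list(range(N_CLASSES))


test_data()
```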
File: s5_continuous_integration/cml.md
Changes Needed:
File: s5_continuous_integration/exercise_files/dataset.py
Changes:
File: s5_continuous_integration/exercise_files/dataset_statistics.py
Changes:
2.6 Module S6 - Cloud
File: s6_the_cloud/using_the_cloud.md
Changes Needed:
2.7 Module S7 - Deployment
File: s7_deployment/testing_apis.md
Changes Needed:
2.8 Module S8 - Monitoring
File: s8_monitoring/data_drifting.md
Changes Needed:
File: s8_monitoring/exercise_files/image_drift.py
Changes:
2.9 Module S9 - Scalable Applications
File: s9_scalable_applications/distributed_training.md
Changes:
2.10 Module S10 - Extra
File: s10_extra/hyperparameters.md
Changes Needed:
Phase 3: Update Supporting Materials
3.1 Documentation
Tasks:
3.2 Assets and Images
Tasks:
3.3 Helper Functions
File: s1_development_environment/exercise_files/helper.py
Changes:
Phase 4: Clean Up and Deprecation
4.1 Remove Old Tools
Tasks:
4.2 Update Dependencies
File: pyproject.toml
Changes:
4.3 Migration Guide
Create: docs/MIGRATION_EUROSAT.md
Contents:
Data Engineering Integration (Issue #549)
Perfect Alignment with New S4 Data Engineering Session
EuroSAT provides excellent opportunities for the new data engineering modules:
Module S4.1 - Data Labeling (Label Studio)
EuroSAT Use Cases:
Label new satellite tiles collected from different regions
Demonstrate active learning (label uncertain predictions)
Multi-class labeling workflow for land use
Export and version labeled datasets
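The active learning use case ("label uncertain predictions") can be illustrated with entropy-based uncertainty sampling, which is one common selection strategy and not necessarily the one the module will adopt. The function names and the toy probability rows below are assumptions for illustration; in practice the rows would be softmax outputs of a trained classifier on unlabeled tiles.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy of one predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(prob_rows, k):
    """Return indices of the k most uncertain predictions, most uncertain first."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: predictive_entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy predictions for four unlabeled tiles over three classes.
predictions = [
    [0.97, 0.01, 0.02],  # confident
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]
to_label = select_for_labeling(predictions, k=2)
print(to_label)
```

The selected indices would then be sent to the labeling tool, and the newly labeled tiles added to the training set for the next round.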
Module S4.2 - Data Pipelines (Prefect/MageAI)
EuroSAT Use Cases:
RGB Pipeline: Download → Extract → Split → Normalize → Store
MS Pipeline: Download → Band Selection → Composite → Normalize → Store
Progressive Pipeline: Start with RGB, add MS bands incrementally
Data Versioning: Track different band combinations (RGB, NIR+RGB, All 13 bands)
Scheduled Updates: Simulate new satellite imagery arriving
Preprocessing: Multi-band normalization, composite generation
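The band selection and normalization steps of the MS pipeline might be sketched as follows. Two things here are assumptions that should be checked against the actual files: the band order in `BAND_NAMES`, and the channels-first `(N, C, H, W)` layout (chosen to match the usual PyTorch convention). The random batch stands in for real imagery.

```python
import numpy as np

# ASSUMPTION: band order of the 13-band EuroSAT files; verify before relying on it.
BAND_NAMES = (
    "B01", "B02", "B03", "B04", "B05", "B06", "B07",
    "B08", "B8A", "B09", "B10", "B11", "B12",
)


def select_bands(images, names, band_order=BAND_NAMES):
    """Pick bands by name from an (N, C, H, W) array, in the order given."""
    indices = [band_order.index(name) for name in names]
    return images[:, indices]


def normalize_per_band(images, eps=1e-8):
    """Standardize each band to zero mean and unit variance over the batch."""
    mean = images.mean(axis=(0, 2, 3), keepdims=True)
    std = images.std(axis=(0, 2, 3), keepdims=True)
    return (images - mean) / (std + eps)


# Synthetic batch standing in for 13-band 64x64 tiles.
rng = np.random.default_rng(0)
batch = rng.uniform(0, 10000, size=(8, 13, 64, 64)).astype(np.float32)

# One of the band combinations mentioned above: NIR + RGB.
nir_rgb = select_bands(batch, ("B08", "B04", "B03", "B02"))
normalized = normalize_per_band(nir_rgb)
print(nir_rgb.shape, normalized.shape)
```

Each band combination produced this way could be stored as a separate versioned artifact for the data versioning exercise.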
Module S4.3 - Data Quality/Validation (Optional)
EuroSAT Use Cases:
Validate band ranges and distributions
Check for class imbalance
Detect corrupted or missing bands
Geospatial metadata validation
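Some of the checks listed above can be expressed as a small validation function. This is a sketch under assumptions: the value range of 0 to 10000, the imbalance threshold, and the rule that a constant band counts as corrupted are all hypothetical choices, and geospatial metadata validation is not covered.

```python
from collections import Counter

import numpy as np


def validate_batch(images, labels, n_bands=13, n_classes=10,
                   value_range=(0.0, 10000.0), max_imbalance=2.0):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if images.ndim != 4 or images.shape[1] != n_bands:
        problems.append(f"expected (N, {n_bands}, H, W), got {images.shape}")
        return problems  # remaining checks assume the right layout

    finite_mask = np.isfinite(images)
    if not finite_mask.all():
        problems.append("non-finite values (NaN or inf) found")

    lo, hi = value_range
    finite = images[finite_mask]
    if finite.size and (finite.min() < lo or finite.max() > hi):
        problems.append(f"values outside assumed range [{lo}, {hi}]")

    # A band whose pixels are all identical is treated as corrupted or missing.
    flat = images.reshape(images.shape[0], images.shape[1], -1)
    constant = flat.max(axis=2) == flat.min(axis=2)
    if constant.any():
        problems.append(f"{int(constant.sum())} constant band(s) found")

    counts = Counter(int(label) for label in labels)
    missing = sorted(set(range(n_classes)) - set(counts))
    if missing:
        problems.append(f"classes with no samples: {missing}")
    if counts and max(counts.values()) / min(counts.values()) > max_imbalance:
        problems.append("class imbalance exceeds threshold")
    return problems


rng = np.random.default_rng(0)
clean = rng.uniform(1, 9999, size=(20, 13, 64, 64)).astype(np.float32)
labels = np.arange(20) % 10

corrupted = clean.copy()
corrupted[0, 3] = 0.0  # simulate a missing band in one tile

print(validate_batch(clean, labels))
print(validate_batch(corrupted, labels))
```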
Recommended Approach:
S1-S3: Use RGB version (64x64x3) - torchvision support, easy to start
S4: Introduce MS version (64x64x13) - demonstrate data pipelines
S5-S10: Continue with RGB, optionally use MS for advanced exercises
Testing Strategy
Validation Checklist
Performance Benchmarks
Implementation Timeline
Recommended Phased Rollout:
Week 1-2: Phase 1 - Infrastructure (tools, data loaders, solutions)
Week 3-4: Phase 2.1-2.5 - Core modules (S1-S5)
Week 5-6: Phase 2.6-2.10 - Advanced modules (S6-S10)
Week 7: Phase 3 - Supporting materials, documentation
Week 8: Phase 4 - Clean up, migration guide, testing
Week 9: Integration with the data engineering modules from #549 ([Refactor] New learning session on data engineering)
Week 10: Final validation and course pilot
Parallel Workstreams:
Success Metrics
Risks and Mitigations
| Risk | Impact | Mitigation |
| --- | --- | --- |
| Larger images increase training time | Medium | Provide pre-computed features; recommend GPU |
| Students struggle with new domain | Low | Add satellite imagery primer; visualize classes |
| Torchvision API changes | Low | Pin torchvision version; document alternatives |
| MS version too complex | Medium | Make MS optional; focus on RGB for core course |
| Download bandwidth issues | Medium | Provide cached downloads; multiple mirrors |
| Existing student projects break | High | Provide migration guide; support both for 1 semester |
Dependencies and Blockers
Depends On:
Blocks:
Course material updates for next semester
Student project template repository
Coordinate With:
References
Next Steps
Review and approve this plan
Create sub-issues for each phase (optional, or tackle as one large issue)
Begin Phase 1 implementation
Coordinate with #549 ([Refactor] New learning session on data engineering) for S4 data engineering integration
Set up testing environment
Create student migration guide
Implementation Plan: Task Items
Phase 1: Create EuroSAT Infrastructure
1.1 Create Dataset Generation Tool
File: tools/prepare_eurosat.py
Tasks:
tools/prepare_eurosat.py script
.pt format for backward compatibility
1.2 Update Data Loading Infrastructure
Tasks:
data_solution.py in S1 with EuroSAT loader
data.py template in S1 for students
MnistDataset → EuroSATDataset class
1.3 Create Reference Solutions
Tasks:
model_solution.py - CNN architecture for 64x64 RGB input
main_solution.py - training/evaluation with EuroSAT
Phase 2: Update All Modules Systematically
2.1 Module S1 - Development Environment
File: s1_development_environment/deep_learning_software.md
Changes Needed:
data/corruptmnist/ → data/eurosat/
2.2 Module S2 - Organisation and Version Control
File: s2_organisation_and_version_control/code_structure.md
Changes Needed:
../data/corruptmnist → ../data/eurosat
2.4 Module S4 - Debugging and Logging
File: s4_debugging_and_logging/logging.md
Changes Needed:
"corrupt_mnist", corrupt_mnist_models → "eurosat", eurosat_models
File: Update W&B solution scripts:
Changes:
"corrupt_mnist" → "eurosat"
"corrupt_mnist_model" → "eurosat_model"
2.5 Module S5 - Continuous Integration
File: s5_continuous_integration/cml.md
Changes Needed:
mnist_images.png → eurosat_images.png
File: s5_continuous_integration/exercise_files/dataset.py
Changes:
MnistDataset → EuroSATDataset
File: s5_continuous_integration/exercise_files/dataset_statistics.py
Changes:
EuroSATDataset
eurosat_images.png
Phase 3: Update Supporting Materials
3.1 Documentation
Tasks:
README.md - course overview
pages/timeplan.md - project descriptions
pages/overview.md - dataset mention
pages/projects.md - project tips and examples
reports/README.md - project report template
3.2 Assets and Images
Tasks:
s1_development_environment/exercise_files/assets/mnist.png
3.3 Helper Functions
File: s1_development_environment/exercise_files/helper.py
Changes:
view_classify() for RGB images
Phase 4: Clean Up and Deprecation
4.1 Remove Old Tools
Tasks:
tools/corrupt_mnist.py
4.2 Update Dependencies
File: pyproject.toml
Changes:
torchvision>=0.25 (for EuroSAT support)