PaulH97/Sen12Landslides

License: CC BY 4.0 Paper: Nature Dataset: Hugging Face

Sen12Landslides: Spatio-Temporal Landslide & Anomaly Detection Dataset

A large-scale, multi-modal, multi-temporal collection of 128×128 px Sentinel-1/2 + DEM patches at 10 m spatial resolution, with 75k landslide annotations.

Quick Start

# Clone & setup
git clone https://github.com/PaulH97/Sen12Landslides.git
cd Sen12Landslides
pip install -e .
pip install --upgrade huggingface_hub

# Authenticate (only once)
hf auth login

# Download the harmonized dataset (use --include "data_raw/**" for the raw dataset instead)
mkdir -p data
hf download paulhoehn/Sen12Landslides \
  --repo-type dataset \
  --local-dir data \
  --include "data_harmonized/**"

# Extract and clean up archives
for sensor in s1asc s1dsc s2; do
  for archive in data/data_harmonized/$sensor/*.tar.gz; do
    [ -f "$archive" ] && tar -xzf "$archive" -C "data/data_harmonized/$sensor" && rm "$archive"
  done
done

Set root_dir in the global configuration file (Sen12Landslides/configs/config.yaml) to the path of your Sen12Landslides folder, and make sure that folder contains the data/ directory created above.
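To confirm that extraction worked before training, a quick sanity check along these lines may help. It is a minimal sketch, not part of the repository: `count_patches` is a hypothetical helper, and it assumes the `.nc` patches end up somewhere below `data/<version>/<sensor>/` as shown in the Data Structure section.

```python
from pathlib import Path

SENSORS = ("s1asc", "s1dsc", "s2")


def count_patches(root_dir, version="data_harmonized"):
    """Count extracted .nc patches and leftover .tar.gz archives per sensor.

    Hypothetical helper; assumes the layout <root_dir>/data/<version>/<sensor>/.
    """
    base = Path(root_dir) / "data" / version
    report = {}
    for sensor in SENSORS:
        folder = base / sensor
        if not folder.is_dir():
            # Sensor folder missing: nothing downloaded or extracted yet.
            report[sensor] = {"patches": 0, "archives": 0}
            continue
        report[sensor] = {
            # rglob also finds patches if an archive unpacked into a subfolder.
            "patches": sum(1 for _ in folder.rglob("*.nc")),
            "archives": sum(1 for _ in folder.glob("*.tar.gz")),
        }
    return report
```

Calling `count_patches("/path/to/Sen12Landslides")` should report zero leftover archives and a non-zero patch count for every sensor you downloaded.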

Dataset Overview

Full Dataset

| Modality | Samples | Annotated | Ann. Rate |
|----------|---------|-----------|-----------|
| S1-asc   | 13,306  | 6,492     | 48.8%     |
| S1-dsc   | 12,622  | 6,347     | 50.3%     |
| S2       | 13,628  | 6,737     | 49.4%     |
| Aligned  | 11,719  | 6,026     | 51.4%     |

Task Splits

| Modality | S12LS-LD     | S12LS-AD       |
|----------|--------------|----------------|
| S1-asc   | 4,793 (100%) | 13,306 (48.8%) |
| S1-dsc   | 4,666 (100%) | 12,622 (50.3%) |
| S2       | 4,988 (100%) | 13,628 (49.4%) |
| Aligned  | 4,392 (100%) | 11,719 (51.4%) |

  • S12LS-LD: Landslide detection with only annotated patches (>50 annotated pixels per patch)
  • S12LS-AD: Anomaly detection with mixed annotated/non-annotated samples to learn normal vs. anomalous patterns
  • See Sen12Landslides/tasks/<task>/config.json for split details

Dataset Versions

Harmonized (recommended)

The harmonized version contains radiometrically consistent data that has been pre-processed and bounded for stable model training:

  • Sentinel-1 (Backscatter):
    • VH and VV bands converted from linear power to decibels (dB) via $10 \cdot \log_{10}(x)$
    • Values bounded to [-50, 10] dB to remove extreme noise and specular outliers
  • Sentinel-2 (Reflectance):
    • Bands B02–B12 corrected for the +1000 DN radiometric offset introduced by ESA Baseline 04.00 (January 25, 2022 onward)
    • Values bounded to [0, 10000] DN to ensure physical reflectance consistency
  • DEM (Elevation):
    • Values bounded to [0, 8800] m to maintain a global terrain baseline
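The three harmonization steps above can be sketched in NumPy as follows. This is an illustration of the described transformations, not the repository's `utils.py` code: the function names, the per-time-step date comparison, and the dtype choices are assumptions.

```python
import numpy as np

S2_OFFSET_DN = 1000                                # offset of processing baseline 04.00
S2_OFFSET_START = np.datetime64("2022-01-25")      # first day with the offset


def s1_to_db(linear, lo=-50.0, hi=10.0):
    """Convert linear backscatter power to dB and clip to [lo, hi]."""
    linear = np.asarray(linear, dtype=np.float32)
    with np.errstate(divide="ignore", invalid="ignore"):
        db = 10.0 * np.log10(linear)
    # Zero power gives -inf, which the clip maps to the lower bound.
    # NaN values (negative power or missing data) pass through unchanged.
    return np.clip(db, lo, hi)


def s2_remove_offset(dn, acquisition_dates, lo=0, hi=10000):
    """Subtract the +1000 DN offset from time steps on or after 2022-01-25.

    dn has shape (time, ...); acquisition_dates has one entry per time step.
    """
    dn = np.asarray(dn, dtype=np.int32)            # widen before subtracting
    dates = np.asarray(acquisition_dates, dtype="datetime64[D]")
    offset = np.where(dates >= S2_OFFSET_START, S2_OFFSET_DN, 0)
    offset = offset.reshape((-1,) + (1,) * (dn.ndim - 1))   # broadcast over space
    return np.clip(dn - offset, lo, hi).astype(np.int16)


def clip_dem(dem, lo=0, hi=8800):
    """Bound elevation values to [lo, hi] metres."""
    return np.clip(np.asarray(dem), lo, hi)
```

For example, a linear backscatter value of 100 corresponds to 20 dB and is clipped to the 10 dB upper bound, while a value of 0 lands on the -50 dB lower bound.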

Raw (original)

The raw version preserves the data exactly as published in the original dataset paper, ensuring full reproducibility of reported results:

  • Sentinel-1: Linear power scale (not converted to dB)
  • Sentinel-2: No radiometric offset correction applied
  • DEM: Unmodified

The conversion functions for both corrections are available in the utils.py file of the GitHub repository.

Data Structure

Sen12Landslides/
├── data/
│   ├── data_harmonized/                    ← recommended for training
│   │   ├── inventories.shp.zip
│   │   ├── s1asc/                          Sentinel-1 Ascending (dB)
│   │   │   └── <region>_s1asc_<id>.nc
│   │   ├── s1dsc/                          Sentinel-1 Descending (dB)
│   │   │   └── <region>_s1dsc_<id>.nc
│   │   └── s2/                             Sentinel-2 (offset corrected)
│   │       └── <region>_s2_<id>.nc
│   └── data_raw/                           ← original paper version
│       ├── inventories.shp.zip
│       ├── s1asc/
│       ├── s1dsc/
│       └── s2/
├── tasks/
│   ├── S12LS-LD/                           Landslide detection
│   │   ├── config.json
│   │   ├── harmonized/
│   │   │   └── <modality>/
│   │   │       ├── splits.json
│   │   │       ├── norm.json
│   │   │       └── patch_locations.geojson
│   │   └── raw/
│   │       └── <modality>/
│   │           ├── splits.json
│   │           ├── norm.json
│   │           └── patch_locations.geojson
│   └── S12LS-AD/                           Anomaly detection
│       ├── harmonized/
│       │   └── ...
│       └── raw/
│           └── ...
└── src/                                    Data loaders, models, training

Patch Format

Each .nc file contains 128×128 px across 15 time steps:

| Modality       | Bands                 | Additional     |
|----------------|-----------------------|----------------|
| Sentinel-1-NRB | VV, VH                | DEM, MASK      |
| Sentinel-2-L2A | B02-B08, B8A, B11-B12 | SCL, DEM, MASK |

>>> import xarray as xr
>>> ds = xr.open_dataset("Sen12Landslides/data/s2/italy_s2_6982.nc")
>>> ds
<xarray.Dataset> Size: 6MB
Dimensions:      (time: 15, x: 128, y: 128)
Coordinates:
  * x            (x) float64 1kB 7.552e+05 ... 7.565e+05
  * y            (y) float64 1kB 4.882e+06 ... 4.881e+06
  * time         (time) datetime64[ns] 2022-10-05 ... 2023-09-10
Data variables: (12/14)
    B02          (time, x, y) int16
    B03          (time, x, y) int16 …
    …
    B12          (time, x, y) int16
    SCL          (time, x, y) int16
    MASK         (time, x, y) uint8
    DEM          (time, x, y) int16
    spatial_ref  int64 8B
Attributes:
    ann_id:           41125,41124,…
    ann_bbox:         (755867.58,4880640.0,…)
    event_date:       2023-05-16
    date_confidence:  1.0
    pre_post_dates:   {'pre': 7, 'post': 8}
    annotated:        True
    satellite:        s2
    center_lat:       4881280.0
    center_lon:       755840.0
    crs:              EPSG:32632
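For model input, the per-band variables usually need to be stacked into a single array. The sketch below shows one way to do that; the helper names are hypothetical, and two things are assumed rather than confirmed by the README: that `pre_post_dates` holds indices into the time axis, and that it may be stored as a string in the NetCDF attributes.

```python
import ast

import numpy as np

# Band names taken from the Patch Format table (B02-B08, B8A, B11-B12).
S2_BANDS = ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"]


def stack_bands(ds, bands):
    """Stack per-band (time, x, y) arrays into one (time, band, x, y) array."""
    return np.stack([np.asarray(ds[name]) for name in bands], axis=1)


def parse_pre_post(value):
    """Return pre/post entries as a dict, whether stored as a dict or a string."""
    return value if isinstance(value, dict) else ast.literal_eval(value)


def load_patch(path, bands=S2_BANDS):
    """Open one patch and return (images, mask, pre_post)."""
    import xarray as xr  # imported lazily so the helpers above work without xarray

    with xr.open_dataset(path) as ds:
        images = stack_bands(ds, bands)          # (time, band, x, y)
        mask = np.asarray(ds["MASK"])            # (time, x, y)
        pre_post = parse_pre_post(ds.attrs["pre_post_dates"])
    return images, mask, pre_post
```

For a Sentinel-2 patch this yields an `images` array of shape (15, 10, 128, 128). For Sentinel-1, pass `bands=["VV", "VH"]`.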

Tasks

We provide two task-specific configurations: S12LS-LD (landslide detection) and S12LS-AD (anomaly detection).

Creating custom splits:

python src/data/create_splits.py  # Configure in configs/splits/config.yaml

Always Generated (Root Level)

| File        | Description                                                 |
|-------------|-------------------------------------------------------------|
| config.json | Filter criteria, split ratios, and stratification settings |

Per-Satellite Folders (s1asc/, s1dsc/, s2/)

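| File                    | Description                                                                    |
|-------------------------|--------------------------------------------------------------------------------|
| splits.json             | Train/val/test splits for this satellite modality                              |
| norm.json               | Per-band normalization statistics (mean/std) for this satellite                |
| patch_locations.geojson | Geographic patch locations with train/val/test assignments for this satellite |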

Multi-Modal Files

| File                            | Description                                                                   |
|---------------------------------|-------------------------------------------------------------------------------|
| splits_aligned.json             | Train/val/test splits containing only patches available across all satellites |
| norm_aligned.json               | Normalization statistics computed from aligned patches only                   |
| patch_locations_aligned.geojson | Geographic locations of patches available across all satellites               |
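Conceptually, the aligned files keep only patches that exist for every satellite, which amounts to a set intersection. The sketch below illustrates the idea on file names. It assumes that the same `<region>` and `<id>` in `<region>_<sensor>_<id>.nc` identify the same patch across sensors, which the README does not state explicitly, and it is not the logic used by `create_splits.py`.

```python
from pathlib import Path


def patch_key(filename):
    """Split '<region>_<sensor>_<id>.nc' into a sensor-independent (region, id) key."""
    # rsplit from the right so that region names containing underscores stay intact.
    region, _sensor, patch_id = Path(filename).stem.rsplit("_", 2)
    return region, patch_id


def aligned_keys(files_by_sensor):
    """Return the (region, id) keys that are present for every sensor."""
    key_sets = [{patch_key(f) for f in files} for files in files_by_sensor.values()]
    return set.intersection(*key_sets) if key_sets else set()
```

Given file lists for `s1asc`, `s1dsc`, and `s2`, `aligned_keys` returns only the patches that all three share.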

Usage:

  • Single-modal training: Load <satellite>/splits.json + <satellite>/norm.json
  • Multi-modal training: Load splits_aligned.json + norm_aligned.json for cross-modal fusion
  • Visualization: Open patch_locations.geojson in QGIS or mapping tools
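A single-modal setup might load the two files and standardize a patch as sketched below. The JSON schemas are assumptions made for illustration only: `splits.json` is taken to map split names to lists of patches, and `norm.json` is taken to map each band name to an object with `mean` and `std` keys. Check the actual files before relying on this.

```python
import json
from pathlib import Path

import numpy as np


def load_task_files(task_dir, satellite):
    """Read splits.json and norm.json from <task_dir>/<satellite>/."""
    folder = Path(task_dir) / satellite
    splits = json.loads((folder / "splits.json").read_text())
    norm = json.loads((folder / "norm.json").read_text())
    return splits, norm


def normalize(images, bands, norm):
    """Standardize a (time, band, x, y) array band by band.

    Assumes norm looks like {"B02": {"mean": ..., "std": ...}, ...} (hypothetical schema).
    """
    mean = np.array([norm[b]["mean"] for b in bands], dtype=np.float32)
    std = np.array([norm[b]["std"] for b in bands], dtype=np.float32)
    # Reshape the statistics so they broadcast over the time and spatial axes.
    return (images.astype(np.float32) - mean[None, :, None, None]) / std[None, :, None, None]
```

With the layout from the Data Structure section, `task_dir` would be something like `tasks/S12LS-LD/harmonized` and `satellite` one of `s1asc`, `s1dsc`, or `s2`.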

Training

This project uses Hydra for configuration management; see the Hydra documentation for details. Values in the configuration files override the default parameters of some classes, so always update the files under configs/ to match your hardware and requirements.

Available Configurations

| Config     | Options                               |
|------------|---------------------------------------|
| model      | utae, convgru, unet3d, unet_convlstm  |
| dataset    | sen12ls_s2, sen12ls_s1asc, sen12ls_s1dsc |
| trainer    | cpu, gpu, ddp                         |
| lit_module | binary, multiclass                    |

Examples

# Train ConvGRU on Sentinel-2
python src/pipeline/train.py model=convgru dataset=sen12ls_s2

# Train UTAE on Sentinel-1 with DEM
python src/pipeline/train.py model=utae dataset=sen12ls_s1asc dataset.dem=true dataset.num_channels=3

# Multi-GPU training
python src/pipeline/train.py trainer.devices=4 trainer.strategy=ddp dataset=sen12ls_s2

# Multirun with three models
python src/pipeline/train.py --multirun model=utae,unet_convlstm,convgru dataset=sen12ls_s2

Baselines

Because of the class imbalance (~3% landslides), we provide binary metrics on the landslide class in addition to the macro-averaged metrics reported in the paper, so that results can be benchmarked against other detection methods.

Note: To compare landslide detection performance, use the binary metrics below rather than the macro-averaged metrics from the paper.
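The difference between the two kinds of metrics can be illustrated with a small NumPy sketch. It is not the repository's evaluation code. The function names are hypothetical, and label 1 is assumed to denote the landslide class.

```python
import numpy as np


def class_scores(y_true, y_pred, positive):
    """Precision, recall, F1 and IoU with one class treated as the positive class."""
    t = np.asarray(y_true) == positive
    p = np.asarray(y_pred) == positive
    tp = int(np.sum(t & p))
    fp = int(np.sum(~t & p))
    fn = int(np.sum(t & ~p))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return float(np.mean([class_scores(y_true, y_pred, c)["f1"] for c in classes]))
```

With 3 landslide pixels out of 100, one of them detected and one false alarm, the landslide-class F1 is 0.40 while the macro-averaged F1 is about 0.69, because the near-perfect background score dominates the average.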

Benchmark Results (S12LS-LD)

Benchmark using paper architectures with binary metrics on Sentinel-2 + DEM:

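| Model      | AP    | F1    | IoU   | Precision | Recall |
|------------|-------|-------|-------|-----------|--------|
| U-ConvLSTM | 65.13 | 61.95 | 44.88 | 60.59     | 63.92  |
| Unet3d     | 62.08 | 58.82 | 41.66 | 55.75     | 62.56  |
| ConvGRU    | 60.00 | 59.06 | 41.91 | 56.72     | 61.77  |
| U-TAE      | 67.75 | 61.80 | 44.74 | 53.19     | 74.90  |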

Three training runs (seed=42,123,777) were performed for each model on the harmonized S12LS-LD split with lit_module=binary for 75 epochs. Test metrics were averaged across seeds on the held-out test set. See configs/ for full settings.

⚠️ These models are a "quick-run" proof of concept evaluated at a baseline decision threshold of 0.5. Optimizing this threshold significantly improves the metrics, and future architectural scaling and feature fusion leave further room for improvement.
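Threshold optimization can be as simple as a grid search over candidate thresholds. The following is a minimal sketch of that idea, not the procedure used for the numbers above. The function names and the candidate grid are assumptions, and the search should be run on validation predictions so the test set stays untouched.

```python
import numpy as np


def f1_at_threshold(probs, labels, threshold):
    """Binary F1 of the positive class after thresholding the probabilities."""
    pred = np.asarray(probs) >= threshold
    labels = np.asarray(labels).astype(bool)
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, candidates=None):
    """Return (threshold, f1) for the candidate with the highest F1."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)   # 0.05, 0.10, ..., 0.95
    scores = [f1_at_threshold(probs, labels, t) for t in candidates]
    best = int(np.argmax(scores))                   # first maximum wins on ties
    return float(candidates[best]), float(scores[best])
```

Flatten the validation probabilities and masks to one dimension before calling `best_threshold`, then apply the chosen threshold once on the test set.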

Reproducibility

# Train all baselines
python src/pipeline/train.py --multirun \
  model=unet3d,convgru,utae,unet_convlstm \
  seed=42,123,777 \
  dataset=sen12ls_s2
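Hydra's default multirun sweeper launches one job per combination of the comma-separated values, so the command above results in 4 models × 3 seeds = 12 training runs. The sketch below reproduces that expansion with the standard library only; `expand_sweep` is a hypothetical helper for illustration and does not call Hydra.

```python
from itertools import product


def expand_sweep(sweep):
    """Expand {"key": [values, ...]} into one list of overrides per job."""
    keys = list(sweep)
    return [
        [f"{key}={value}" for key, value in zip(keys, combo)]
        for combo in product(*(sweep[key] for key in keys))
    ]


jobs = expand_sweep({
    "model": ["unet3d", "convgru", "utae", "unet_convlstm"],
    "seed": [42, 123, 777],
    "dataset": ["sen12ls_s2"],
})
```

Here `len(jobs)` is 12, which is useful for estimating total compute before starting the sweep.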

Challenges

Several properties make this dataset demanding and a useful testbed for new methods that aim to beat the baselines:

  1. Severe class imbalance (~3% landslides)
  2. Small spatial extent - landslides often span only a few pixels at 10 m resolution
  3. Multi-temporal complexity - effective temporal fusion remains challenging
  4. Geographic diversity - varied terrain, vegetation, and climate

About

Official code repository of Sen12Landslides
