Commit 06df224

emmanuelmathot, d-v-b, and lhoupert authored
Update documentation for Sentinel-2 optimized conversion (#83)
* add option for using new multiscales convention in optimized conversion
* remove OOP converter and use plain functions
* use pydantic models for fingerprinting sentinel2 product
* fix multiscales to use da.coarsen and propagate encoding
* ensure that dtype is preserved after resampling
* add new multiscales JSON example
* add mypy pydantic plugin
* lint
* add s1 and s2 demo data to tests, and don't test against remote urls
* fix e2e tests
* remove network test workflow from CI
* remove extra type definition and update tests
* remove explicit zarr groups in favor of dynamic test fixtures
* docstrings
* Enhance CRS initialization and update S2 optimization commands
  - Added `initialize_crs_from_dataset` function to extract CRS from dataset metadata.
  - Updated S2 optimization commands to include new CRS handling.
  - Removed unused arguments related to geometry and meteorology groups.
  - Added comprehensive tests for CRS initialization from various sources.
* Refactor code formatting for clarity in S2 optimization functions
* fix failing / warning tests
* add strict JSON schema equality check to e2e tests
* support both flavors of multiscale metadata
* don't manage return codes in cli functions
* add s2 optimized test
* add optimized geozarr example hierarchies
* format JSON documents
* mid-debug of e2e tests
* WIP e2e fixes
* make cf standard name validator become a pass-through when no internet connection
* update example schemas
* narrow type to just tuples in types.py
* refactor consolidation
* use consolidated=False in conversion
* update tests
* lint
* add both multiscales types to output
* update comments in tests
* add Sentinel-2 optimization functions and update documentation
* update documentation for Sentinel-2 optimized conversion, detailing V1 approach and differences from V0
* update Sentinel-2 band analysis examples to include V1 approach and deprecate V0 structure
* update quickstart guide to reflect V1 converter structure and usage for Sentinel-2 data
* update documentation for multiscale dataset creation and resolution levels in GeoZarr converter

---------

Co-authored-by: Davis Vann Bennett <[email protected]>
Co-authored-by: Loïc Houpert <[email protected]>
1 parent 260a99a commit 06df224

7 files changed

+348
-41
lines changed

.vscode/launch.json

Lines changed: 4 additions & 2 deletions
@@ -169,9 +169,11 @@
169169
"args": [
170170
"convert-s2-optimized",
171171
// "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202509-s02msil2a/08/products/cpm_v256/S2A_MSIL2A_20250908T100041_N0511_R122_T32TQM_20250908T115116.zarr",
172-
"https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202511-s02msil2a-eu/15/products/cpm_v262/S2B_MSIL2A_20251115T091139_N0511_R050_T35SLU_20251115T111807.zarr",
172+
// "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202511-s02msil2a-eu/15/products/cpm_v262/S2B_MSIL2A_20251115T091139_N0511_R050_T35SLU_20251115T111807.zarr",
173+
"https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202511-s02msil2a-eu/16/products/cpm_v262/S2A_MSIL2A_20251116T085431_N0511_R107_T35SQD_20251116T103813.zarr",
173174
// "s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a-opt/S2A_MSIL2A_20250908T100041_N0511_R122_T32TQM_20250908T115116.zarr",
174-
"s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a-pr75/S2B_MSIL2A_20251115T091139_N0511_R050_T35SLU_20251115T111807.zarr",
175+
// "s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a-pr75/S2B_MSIL2A_20251115T091139_N0511_R050_T35SLU_20251115T111807.zarr",
176+
"s3://esa-zarr-sentinel-explorer-fra/tests-output/sentinel-2-l2a-opt/S2A_MSIL2A_20251116T085431_N0511_R107_T35SQD_20251116T103813.zarr",
175177
// "./tests-output/eopf_geozarr/s2l2_optimized.zarr",
176178
"--spatial-chunk", "512",
177179
"--compression-level", "5",

docs/api-reference.md

Lines changed: 116 additions & 0 deletions
@@ -53,6 +53,122 @@ dt_geozarr = create_geozarr_dataset(
5353
)
5454
```
5555

56+
## Sentinel-2 Optimization Functions
57+
58+
### convert_s2_optimized
59+
60+
Main function for optimized Sentinel-2 conversion with multiscale pyramid generation.
61+
62+
```python
63+
# test: skip
64+
def convert_s2_optimized(
65+
dt_input: xr.DataTree,
66+
output_path: str,
67+
enable_sharding: bool = True,
68+
spatial_chunk: int = 256,
69+
compression_level: int = 3,
70+
validate_output: bool = True,
71+
max_retries: int = 3
72+
) -> xr.DataTree
73+
```
74+
75+
**Parameters:**
76+
77+
- `dt_input` (xr.DataTree): Input Sentinel-2 DataTree
78+
- `output_path` (str): Output path for optimized dataset
79+
- `enable_sharding` (bool, optional): Enable Zarr v3 sharding. Default: True
80+
- `spatial_chunk` (int, optional): Spatial chunk size. Default: 256
81+
- `compression_level` (int, optional): Compression level 1-9. Default: 3
82+
- `validate_output` (bool, optional): Validate output after conversion. Default: True
83+
- `max_retries` (int, optional): Maximum retry attempts for operations. Default: 3
84+
85+
**Returns:**
86+
87+
- `xr.DataTree`: Optimized DataTree with multiscale pyramid
88+
89+
**Example:**
90+
91+
```python
92+
# test: skip
93+
from eopf_geozarr.s2_optimization.s2_converter import convert_s2_optimized
94+
import xarray as xr
95+
96+
dt = xr.open_datatree("s2_product.zarr", engine="zarr")
97+
dt_optimized = convert_s2_optimized(
98+
dt_input=dt,
99+
output_path="s2_optimized.zarr",
100+
enable_sharding=True,
101+
spatial_chunk=256
102+
)
103+
```
104+
105+
### create_multiscale_from_datatree
106+
107+
Creates a multiscale pyramid from a DataTree, reusing the native resolution groups.
108+
109+
```python
110+
# test: skip
111+
def create_multiscale_from_datatree(
112+
dt_input: xr.DataTree,
113+
output_path: str,
114+
enable_sharding: bool,
115+
spatial_chunk: int,
116+
crs: CRS | None = None
117+
) -> dict[str, dict]
118+
```
119+
120+
**Parameters:**
121+
122+
- `dt_input` (xr.DataTree): Input DataTree containing native resolution groups (e.g., r10m, r20m, r60m)
123+
- `output_path` (str): Output path for the multiscale dataset
124+
- `enable_sharding` (bool): Enable Zarr v3 sharding for improved performance
125+
- `spatial_chunk` (int): Spatial chunk size for arrays
126+
- `crs` (CRS | None, optional): Coordinate reference system. If None, CRS is extracted from input
127+
128+
**Returns:**
129+
130+
- `dict[str, dict]`: Nested dictionary structure organizing the multiscale levels:
131+
```python
132+
{
133+
"measurements": {
134+
"reflectance": {
135+
"r10m": Dataset, # Native 10m resolution
136+
"r20m": Dataset, # Native 20m resolution
137+
"r60m": Dataset, # Native 60m resolution
138+
"r120m": Dataset, # Computed 120m overview
139+
"r360m": Dataset, # Computed 360m overview
140+
"r720m": Dataset # Computed 720m overview
141+
}
142+
}
143+
}
144+
```
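The nested return structure can be walked generically to list every resolution level. The following is a minimal sketch only: `iter_levels` is a hypothetical helper that is not part of the library, and placeholder strings stand in for the `xarray.Dataset` leaves of the real return value.

```python
from typing import Any, Iterator


def iter_levels(tree: dict[str, Any], prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (path, leaf) pairs from a nested dict of resolution levels."""
    for name, node in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(node, dict):
            # Intermediate group: descend one level deeper.
            yield from iter_levels(node, path)
        else:
            # Leaf: a Dataset in the real structure, a placeholder here.
            yield path, node


# Placeholder strings stand in for the Datasets (assumption for illustration).
multiscale_dict = {
    "measurements": {
        "reflectance": {
            "r10m": "<Dataset 10m>",
            "r20m": "<Dataset 20m>",
            "r60m": "<Dataset 60m>",
            "r120m": "<Dataset 120m>",
            "r360m": "<Dataset 360m>",
            "r720m": "<Dataset 720m>",
        }
    }
}

level_paths = [path for path, _ in iter_levels(multiscale_dict)]
for path in level_paths:
    print(path)
```

Because the helper only relies on the dictionary nesting, it would keep working if further groups appeared next to `reflectance`.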
145+
146+
**Example:**
147+
148+
```python
149+
# test: skip
150+
from eopf_geozarr.s2_optimization.s2_multiscale import create_multiscale_from_datatree
151+
from pyproj import CRS
152+
import xarray as xr
153+
154+
# Load Sentinel-2 DataTree with native resolutions
155+
dt = xr.open_datatree("s2_input.zarr", engine="zarr")
156+
157+
# Create multiscale pyramid
158+
multiscale_dict = create_multiscale_from_datatree(
159+
dt_input=dt,
160+
output_path="s2_multiscale.zarr",
161+
enable_sharding=True,
162+
spatial_chunk=256,
163+
crs=CRS.from_epsg(32633) # UTM Zone 33N
164+
)
165+
166+
# Access specific resolution level
167+
r360m_reflectance = multiscale_dict["measurements"]["reflectance"]["r360m"]
168+
```
169+
170+
**Note:** The S2 optimization downsamples with xarray's built-in `.coarsen()` method. Because `.coarsen()` integrates with xarray's lazy evaluation, overview levels of Dask-backed arrays can be computed chunk by chunk instead of from fully loaded arrays.
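To make the downsampling step concrete, the sketch below reproduces in plain Python what a block-mean coarsening computes. It is a stand-in, not the library's code: `coarsen_mean` is a hypothetical name, and rounding back to integers is an illustrative choice for keeping an integer dtype. On an xarray `DataArray` with dimensions named `y` and `x` (an assumption), the corresponding call would be `da.coarsen(y=2, x=2, boundary="trim").mean()`.

```python
def coarsen_mean(grid: list[list[int]], factor: int) -> list[list[int]]:
    """Block-average a 2D grid by `factor`, trimming rows/columns that do not fill a block."""
    rows = len(grid) // factor * factor
    cols = len(grid[0]) // factor * factor
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            mean = sum(block) / len(block)
            # The mean is a float; round it back so the output stays integer-valued.
            coarse_row.append(round(mean))
        coarse.append(coarse_row)
    return coarse


# A 4x5 grid: the fifth column does not fill a 2x2 block and is trimmed.
grid = [
    [10, 12, 20, 22, 99],
    [14, 16, 24, 26, 99],
    [30, 32, 40, 42, 99],
    [34, 36, 44, 46, 99],
]
coarse = coarsen_mean(grid, 2)
print(coarse)
```

The trimmed column corresponds to `boundary="trim"` in xarray; `boundary="pad"` would instead pad the ragged edge with NaN before reducing.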
171+
56172
## Conversion Functions
57173

58174
### setup_datatree_metadata_geozarr_spec_compliant

docs/architecture.md

Lines changed: 22 additions & 5 deletions
@@ -112,12 +112,19 @@ def calculate_aligned_chunk_size(
112112

113113
**Downsampling:**
114114

115+
The library downsamples with xarray's built-in `.coarsen()` method. Because `.coarsen()` integrates with xarray's lazy evaluation, overview levels of Dask-backed arrays can be computed chunk by chunk instead of from fully loaded arrays.
116+
117+
**Sentinel-2 Optimization:**
118+
119+
The S2 optimization module uses a functional programming approach with stateless functions for improved testability and maintainability:
120+
115121
```python
116122
# test: skip
117-
def downsample_2d_array(
118-
data: np.ndarray,
119-
factor: int = 2
120-
) -> np.ndarray
123+
def convert_s2_optimized(
124+
dt_input: xr.DataTree,
125+
output_path: str,
126+
**kwargs
127+
) -> xr.DataTree
121128
```
122129

123130
### 4. Command Line Interface (`cli.py`)
@@ -422,7 +429,17 @@ Flexible configuration through:
422429
- Real dataset processing
423430
- Cloud environment testing
424431

425-
### 3. Validation Tests
432+
### 3. Local Test Data
433+
434+
The library uses an efficient testing approach with **lightweight JSON-based Zarr groups** that contain only the structure and metadata (no chunked array data). This provides:
435+
436+
- **Faster Test Execution**: Tests run locally without downloading large datasets
437+
- **No Remote Dependencies**: Eliminates need for network access during testing
438+
- **Lightweight Fixtures**: JSON files define Zarr group structure using `pydantic-zarr`
439+
440+
Test fixtures are created from JSON schemas stored in `tests/test_data_api/{s1_examples,s2_examples}/` directories, making the test suite both comprehensive and efficient.
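As a rough illustration of what a metadata-only fixture carries, the sketch below walks a small JSON group description using only the standard library. The layout (`attributes`, `members`, `shape`, `dtype`) is an assumption modelled on how `pydantic-zarr` group and array specs are commonly serialized; the group names, array name, and shape are made up for the example and are not taken from the repository's fixtures.

```python
import json
from typing import Any, Iterator

# Hypothetical fixture: structure and metadata only, no chunk data.
fixture_json = """
{
  "attributes": {},
  "members": {
    "measurements": {
      "attributes": {},
      "members": {
        "r10m": {
          "attributes": {},
          "members": {
            "b04": {"attributes": {}, "shape": [10980, 10980], "dtype": "<u2"}
          }
        }
      }
    }
  }
}
"""


def iter_arrays(node: dict[str, Any], path: str = "") -> Iterator[tuple[str, tuple[int, ...]]]:
    """Yield (path, shape) for every array node in a nested group description."""
    if "shape" in node:
        # Array node: report its shape and stop descending.
        yield path, tuple(node["shape"])
        return
    for name, child in (node.get("members") or {}).items():
        yield from iter_arrays(child, f"{path}/{name}")


arrays = dict(iter_arrays(json.loads(fixture_json)))
print(arrays)
```

A test can assert on group layout and array shapes from such a description without reading any pixel data, which is what keeps the fixtures small.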
441+
442+
### 4. Validation Tests
426443

427444
- GeoZarr specification compliance
428445
- Metadata accuracy verification

docs/converter.md

Lines changed: 105 additions & 0 deletions
@@ -101,6 +101,111 @@ The converter supports multiscale datasets, creating overview levels with /2 dow
101101

102102
The converter maintains the native coordinate reference system (CRS) of the dataset, avoiding reprojection to Web Mercator.
103103

104+
## Sentinel-2 Optimized Conversion (V1)
105+
106+
The Sentinel-2 optimized converter (`convert_s2_optimized`) is the refined (V1) approach. It builds a multiscale pyramid by **reusing the original multi-resolution data** (r10m, r20m, r60m) without duplication, and adds coarser overview levels (r120m, r360m, r720m) for visualization at lower resolutions.
107+
108+
### V0 vs V1 Converter: Key Differences
109+
110+
The old (V0) and new (V1) converters differ in how they structure the output:
111+
112+
#### V0 Approach (Deprecated - `create_geozarr_dataset`)
113+
114+
Creates **pyramids within each resolution group**:
115+
116+
```
117+
output.zarr/
118+
└── measurements/
119+
    └── reflectance/
120+
        ├── r10m/
121+
        │   ├── 0/    # Native 10m data
122+
        │   ├── 1/    # Downsampled to 20m
123+
        │   ├── 2/    # Downsampled to 40m
124+
        │   ├── 3/    # Downsampled to 80m
125+
        │   ├── 4/    # Downsampled to 160m
126+
        │   └── 5/    # Downsampled to 320m
127+
        ├── r20m/
128+
        │   ├── 0/    # Native 20m data
129+
        │   ├── 1/    # Downsampled to 40m
130+
        │   ├── 2/    # Downsampled to 80m
131+
        │   ├── 3/    # Downsampled to 160m
132+
        │   └── 4/    # Downsampled to 320m
133+
        └── r60m/
134+
            ├── 0/    # Native 60m data
135+
            ├── 1/    # Downsampled to 120m
136+
            └── 2/    # Downsampled to 240m
137+
```
138+
139+
**Issues with V0:**
140+
- Creates redundant data at overlapping resolutions (e.g., r10m/1 ≈ r20m/0)
141+
- Inefficient storage due to duplication
142+
- Complex hierarchy with nested levels within each resolution group
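The duplication can be counted directly from the V0 tree above. This is a small sketch with names chosen for illustration (`V0_GROUPS`, `v0_resolutions`); it counts stored pyramid levels, not bytes on disk.

```python
from collections import Counter

# Native resolution (metres) and number of pyramid levels per V0 group,
# read off the V0 tree above.
V0_GROUPS = {"r10m": (10, 6), "r20m": (20, 5), "r60m": (60, 3)}


def v0_resolutions(native_m: int, n_levels: int) -> list[int]:
    """Resolutions produced by repeated /2 downsampling from the native level."""
    return [native_m * 2**level for level in range(n_levels)]


# Count how many times each resolution is stored across all groups.
stored = Counter(
    res
    for native_m, n_levels in V0_GROUPS.values()
    for res in v0_resolutions(native_m, n_levels)
)
duplicated = sorted(res for res, count in stored.items() if count > 1)

print("levels stored:", sum(stored.values()))
print("distinct resolutions:", len(stored))
print("stored more than once (m):", duplicated)
```

Under these counts V0 stores 14 levels for 9 distinct resolutions, with 20m through 320m each held twice.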
143+
144+
#### V1 Approach (Current - `convert_s2_optimized`)
145+
146+
Creates a **consolidated pyramid** by reusing native resolutions and adding coarser levels:
147+
148+
```
149+
output.zarr/
149+
└── measurements/
150+
    └── reflectance/
151+
        ├── r10m/     # Native 10m data (reused as-is)
152+
        ├── r20m/     # Native 20m data (reused as-is)
153+
        ├── r60m/     # Native 60m data (reused as-is)
154+
        ├── r120m/    # Computed from r60m (2x downsampling)
155+
        ├── r360m/    # Computed from r120m (3x downsampling)
156+
        └── r720m/    # Computed from r360m (2x downsampling)
158+
```
159+
160+
**Why these specific resolution levels?**
161+
162+
The resolution levels are chosen to balance data preservation with storage optimization:
163+
164+
- **Native ESA resolutions (10m, 20m, 60m)**: These are the original resolutions delivered by ESA for Sentinel-2 data and are reused as-is to preserve the source data without any loss
165+
- **Computed overview levels (120m, 360m, 720m)**: These additional levels were chosen because their downsampling factors allow the data to be chunked and sharded in complete pieces:
166+
- **120m** (2x from 60m): Standard doubling for the first computed overview
167+
- **360m** (3x from 120m): Selected for optimal chunking alignment
168+
- **720m** (2x from 360m): Final level for global-scale visualization
169+
170+
This approach maintains the integrity of ESA's original multi-resolution data while adding computationally efficient overview levels for performance at coarser scales.
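The overview resolutions follow from chaining the downsampling factors onto the coarsest native level. The sketch below only reproduces that arithmetic; the constant names are illustrative and are not taken from the library.

```python
from itertools import accumulate
from operator import mul

NATIVE_RESOLUTIONS_M = [10, 20, 60]  # delivered by ESA, reused as-is
OVERVIEW_FACTORS = [2, 3, 2]         # 60m -> 120m -> 360m -> 720m

# Cumulative factor of each overview relative to the 60m level.
cumulative = list(accumulate(OVERVIEW_FACTORS, mul))
overview_resolutions = [NATIVE_RESOLUTIONS_M[-1] * factor for factor in cumulative]

# Group names as they appear under /measurements/reflectance/.
level_names = [f"r{res}m" for res in NATIVE_RESOLUTIONS_M + overview_resolutions]

print(cumulative)
print(overview_resolutions)
print(level_names)
```

Each overview is derived from the level directly above it, so only one coarsening step is applied per level.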
171+
172+
**Benefits of V1:**
173+
- No data duplication - native resolutions are reused directly
174+
- More efficient storage
175+
- Simpler, flatter hierarchy
176+
- Natural fit for Sentinel-2's multi-resolution data model
177+
178+
### Key Capabilities
179+
180+
- **Smart Resolution Consolidation**: Combines Sentinel-2's native multi-resolution structure (10m, 20m, 60m) into a unified multiscale pyramid
181+
- **Non-Duplicative Downsampling**: Reuses original resolution data instead of recreating it, adding only the coarser levels (120m, 360m, 720m)
182+
- **Variable-Aware Processing**: Applies appropriate resampling methods for different data types (reflectance, classification, quality masks, probabilities)
183+
- **Efficient Testing**: Improved test infrastructure for faster local development
184+
185+
> **Note:** The V0 converter (`create_geozarr_dataset`) is deprecated and will be removed in future versions. All new projects should use the V1 converter (`convert_s2_optimized`).
186+
187+
### Usage Example
188+
189+
```python
190+
from eopf_geozarr.s2_optimization.s2_converter import convert_s2_optimized
191+
import xarray as xr
192+
193+
# Load Sentinel-2 DataTree
194+
dt_input = xr.open_datatree("path/to/s2/product.zarr", engine="zarr")
195+
196+
# Convert to optimized multiscale structure
197+
dt_optimized = convert_s2_optimized(
198+
dt_input=dt_input,
199+
output_path="path/to/output/optimized.zarr",
200+
enable_sharding=True,
201+
spatial_chunk=256,
202+
compression_level=3,
203+
validate_output=True
204+
)
205+
```
206+
207+
The result is a space-efficient multiscale pyramid: `/measurements/reflectance/{r10m, r20m, r60m, r120m, r360m, r720m}` where the native resolutions are preserved as-is and only the coarser levels are computed.
208+
104209
## Error Handling
105210

106211
The converter includes robust error handling and retry logic for network operations, ensuring reliable processing even in challenging environments.

docs/examples.md

Lines changed: 55 additions & 17 deletions
@@ -76,43 +76,81 @@ for group_name in dt_geozarr.groups:
7676

7777
### Sentinel-2 Band Analysis
7878

79-
Access and analyze specific bands from the converted dataset:
79+
> **Note on V0 vs V1:** This example shows both V0 (deprecated) and V1 (current) approaches. See [converter documentation](converter.md#v0-vs-v1-converter-key-differences) for structural differences.
80+
81+
#### V1 Approach (Recommended - `convert_s2_optimized`)
82+
83+
Access bands from the consolidated pyramid structure:
8084

8185
```python
8286
import xarray as xr
8387
import matplotlib.pyplot as plt
88+
from eopf_geozarr.s2_optimization.s2_converter import convert_s2_optimized
89+
90+
# Convert using V1 optimizer (recommended)
91+
dt_input = xr.open_datatree("s2_l2a_input.zarr", engine="zarr")
92+
dt = convert_s2_optimized(
93+
dt_input=dt_input,
94+
output_path="s2_l2a_v1.zarr",
95+
spatial_chunk=256
96+
)
8497

85-
# Open converted GeoZarr dataset
86-
dt = xr.open_datatree("s2_l2a_geozarr.zarr", engine="zarr")
87-
88-
# Access 10m resolution native data
89-
ds_10m = dt["/measurements/r10m/0"].ds
98+
# Access data from different resolution levels
99+
ds_10m = dt["/measurements/reflectance/r10m"].ds # Native 10m
100+
ds_20m = dt["/measurements/reflectance/r20m"].ds # Native 20m
101+
ds_60m = dt["/measurements/reflectance/r60m"].ds # Native 60m
102+
ds_120m = dt["/measurements/reflectance/r120m"].ds # Computed 120m
90103

91-
# Extract RGB bands for visualization
104+
# Extract RGB bands for visualization (10m resolution)
92105
red = ds_10m["b04"] # Red band
93106
green = ds_10m["b03"] # Green band
94107
blue = ds_10m["b02"] # Blue band
95108

96109
# Create RGB composite
97110
rgb = xr.concat([red, green, blue], dim="band")
98111

99-
# Plot the result
100-
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
112+
# Plot comparison of different resolutions
113+
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
101114

102-
# Native resolution
103-
rgb.plot.imshow(ax=ax1, robust=True)
104-
ax1.set_title("Native Resolution (10m)")
115+
# 10m resolution
116+
rgb.plot.imshow(ax=axes[0], robust=True)
117+
axes[0].set_title("10m Resolution (Native)")
105118

106-
# Overview level 1
107-
ds_overview = dt["/measurements/r10m/1"].ds
108-
rgb_overview = xr.concat([ds_overview["b04"], ds_overview["b03"], ds_overview["b02"]], dim="band")
109-
rgb_overview.plot.imshow(ax=ax2, robust=True)
110-
ax2.set_title("Overview Level 1 (20m)")
119+
# 20m resolution (reused native data)
120+
rgb_20m = xr.concat([ds_20m["b04"], ds_20m["b03"], ds_20m["b02"]], dim="band")
121+
rgb_20m.plot.imshow(ax=axes[1], robust=True)
122+
axes[1].set_title("20m Resolution (Native)")
123+
124+
# 60m resolution (reused native data)
125+
rgb_60m = xr.concat([ds_60m["b04"], ds_60m["b03"], ds_60m["b02"]], dim="band")
126+
rgb_60m.plot.imshow(ax=axes[2], robust=True)
127+
axes[2].set_title("60m Resolution (Native)")
111128

112129
plt.tight_layout()
113130
plt.show()
114131
```
115132

133+
#### V0 Approach (Deprecated - `create_geozarr_dataset`)
134+
135+
For reference, the V0 structure with nested pyramid levels:
136+
137+
```python
138+
import xarray as xr
139+
import matplotlib.pyplot as plt
140+
141+
# Open V0 converted GeoZarr dataset (deprecated structure)
142+
dt = xr.open_datatree("s2_l2a_v0.zarr", engine="zarr")
143+
144+
# Access 10m resolution with nested pyramid levels
145+
ds_10m_native = dt["/measurements/r10m/0"].ds # Level 0: native 10m
146+
ds_10m_level1 = dt["/measurements/r10m/1"].ds # Level 1: downsampled to ~20m
147+
ds_10m_level2 = dt["/measurements/r10m/2"].ds # Level 2: downsampled to ~40m
148+
149+
# Note: This creates redundant data since r10m/1 ≈ r20m/0
150+
```
151+
152+
> **Migration Note:** V0 is deprecated. Use V1 (`convert_s2_optimized`) for new projects.
153+
116154
## Cloud Storage Examples
117155

118156
### AWS S3 Integration
