wietzesuijker
Collaborator

@wietzesuijker wietzesuijker commented Oct 9, 2025

Enables optional Dask distributed computing for large-scale GeoZarr conversions with automatic fallback to single-node mode.

Changes

  • Docker image includes dask[distributed] via eopf-geozarr dependency
  • Library automatically detects and uses Dask when available
  • Single-node mode remains default (no configuration changes)
  • Fixed S1 preview query test assertion to match actual implementation
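The detect-and-fall-back behaviour described in the list above could look roughly like the sketch below. It is not taken from eopf-geozarr: the function names `dask_available` and `select_mode`, and the idea of an explicit opt-in argument, are assumptions made for illustration only.

```python
from typing import Optional


def dask_available() -> bool:
    """Return True if dask.distributed imports cleanly in this environment."""
    try:
        import dask.distributed  # noqa: F401  (import is the availability check)
    except ImportError:
        return False
    return True


def select_mode(cluster_requested: bool, available: Optional[bool] = None) -> str:
    """Choose an execution mode; single-node remains the default.

    `available` can be passed explicitly (e.g. in tests); otherwise it is
    detected by attempting the import.
    """
    if available is None:
        available = dask_available()
    if cluster_requested and available:
        return "dask"
    # Fall back whenever Dask is missing or was not requested.
    return "single-node"
```

With this shape, a missing Dask installation never raises: a request for cluster mode simply degrades to single-node processing.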

Performance

Single-node mode is suitable for most datasets (<10 GB). Dask mode enables a 3-5× speedup for large scenes when a cluster is available.

Testing

uv run pytest tests/unit/test_augment_stac_item.py -v
docker run data-pipeline:v26 python3 -c "import dask.distributed"

Impact

Backward compatible with existing workflows. Ready for future Dask cluster deployment without code changes.

@wietzesuijker wietzesuijker force-pushed the feat/performance-validation branch from fcccde0 to dea64b2 Compare October 9, 2025 03:46
@wietzesuijker wietzesuijker force-pushed the feat/dask-integration branch 3 times, most recently from 8b391f0 to bdc6d4b Compare October 9, 2025 04:09
@wietzesuijker wietzesuijker force-pushed the feat/performance-validation branch from dea64b2 to 269e0b9 Compare October 9, 2025 04:15
@wietzesuijker wietzesuijker force-pushed the feat/dask-integration branch from bdc6d4b to adb2ae6 Compare October 9, 2025 04:15
Enable parallel chunk processing with Dask distributed:

- Add --dask-cluster flag to conversion workflow
- Update to v26 image with Dask support
- Add validation task between convert and register stages

Initial test shows 1.6× speedup (320s vs 516s baseline).
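For illustration, a flag like the one added in this commit might be wired up as follows. The commit names `--dask-cluster` but not its semantics, so treating it as a boolean switch is an assumption, and `build_parser` is a hypothetical helper rather than the project's actual entry point.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser exposing the --dask-cluster switch."""
    parser = argparse.ArgumentParser(
        description="Hypothetical conversion entry point (illustrative only)"
    )
    parser.add_argument(
        "--dask-cluster",
        action="store_true",  # assumption: the flag takes no value
        help="Request Dask distributed processing for the conversion",
    )
    return parser


# Omitting the flag keeps the single-node default.
default_args = build_parser().parse_args([])
cluster_args = build_parser().parse_args(["--dask-cluster"])
print(default_args.dask_cluster, cluster_args.dask_cluster)
```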

Task was defined but never referenced in DAG (lines 25-37).

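A defined-but-unreferenced task of this kind can be caught with a small consistency check. The sketch below models a workflow as plain Python data; the task names mirror the stages mentioned in this PR, but the data layout and the helper names `unreferenced_tasks` and `execution_order` are assumptions, not the workflow engine's API.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set


def unreferenced_tasks(defined: Iterable[str], dag: Dict[str, Set[str]]) -> List[str]:
    """Return tasks that have a definition but no entry in the DAG."""
    return sorted(set(defined) - set(dag))


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order; each task maps to its dependencies."""
    return list(TopologicalSorter(dag).static_order())


# Hypothetical state before the fix: "validate" is defined but not wired in.
defined = {"convert", "validate", "register"}
broken_dag = {"convert": set(), "register": {"convert"}}
print(unreferenced_tasks(defined, broken_dag))

# Hypothetical state after the fix: validate sits between convert and register.
fixed_dag = {"convert": set(), "validate": {"convert"}, "register": {"validate"}}
print(execution_order(fixed_dag))
```

Running a check like this in CI would flag the omission before a workflow is submitted.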
Add workflow parameters:
- stac_api_url, raster_api_url (API endpoints)
- s3_endpoint, s3_output_bucket, s3_output_prefix (S3 config)

Replace all hardcoded values with parameter references for:
- STAC/raster API URLs in register/augment tasks
- S3 endpoint in all tasks
- S3 bucket/prefix in convert/validate/register tasks

Enables easy environment switching (dev/staging/prod) via parameter override.
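The override pattern can be sketched as defaults merged with per-environment values. The parameter names below come from the commit message; every value is a placeholder, and `resolve_parameters` and `output_uri` are hypothetical helpers, not part of the workflow itself.

```python
from typing import Dict, Optional

# Placeholder defaults; real endpoints and bucket names are not given in the PR.
DEFAULTS: Dict[str, str] = {
    "stac_api_url": "https://stac.example.invalid",
    "raster_api_url": "https://raster.example.invalid",
    "s3_endpoint": "https://s3.example.invalid",
    "s3_output_bucket": "example-bucket",
    "s3_output_prefix": "example-prefix",
}


def resolve_parameters(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Merge overrides onto the defaults, rejecting unknown parameter names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown workflow parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def output_uri(params: Dict[str, str]) -> str:
    """Build the S3 output location from the resolved parameters."""
    return f"s3://{params['s3_output_bucket']}/{params['s3_output_prefix']}"


# Switching environments only changes the overrides, not the task definitions.
staging = resolve_parameters({"s3_output_bucket": "staging-bucket"})
print(output_uri(staging))
```

Rejecting unknown names is a design choice in this sketch: a misspelled override fails loudly instead of silently leaving the default in place.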
@wietzesuijker wietzesuijker force-pushed the feat/dask-integration branch from adb2ae6 to f84c5ba Compare October 9, 2025 15:29