287 changes: 65 additions & 222 deletions README.md
# EOPF GeoZarr Data Pipeline

Automated pipeline for converting Sentinel Zarr datasets to cloud-optimized GeoZarr format with STAC catalog integration and interactive visualization.

## Quick Start (30 seconds)

```bash
# 1. Submit workflow
export KUBECONFIG=.work/kubeconfig
kubectl create -f workflows/run-s1-test.yaml -n devseed-staging

# 2. Monitor
kubectl logs -n devseed-staging -l workflows.argoproj.io/workflow=<name> -c main -f
```



[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://github.com/EOPF-Explorer/data-pipeline/workflows/Tests/badge.svg)](https://github.com/EOPF-Explorer/data-pipeline/actions)

📖 **New here?** [GETTING_STARTED.md](GETTING_STARTED.md) • **Details:** [Full docs below](#submitting-workflows)

## What It Does

**Input:** STAC item URL → **Output:** Interactive web map in ~15-20 minutes

```
Convert (15 min) → Register (30 sec) → Augment (10 sec)
```
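The Augment stage attaches viewer links pointing at the raster API. A minimal sketch of building such a link, assuming the `url` query parameter seen in the pipeline's viewer URLs:

```python
from urllib.parse import urlencode

RASTER_API = "https://api.explorer.eopf.copernicus.eu/raster"

def viewer_url(geozarr_url: str) -> str:
    """Build the TiTiler viewer link the Augment step adds to a STAC item."""
    return f"{RASTER_API}/viewer?{urlencode({'url': geozarr_url})}"
```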

**Supports:** Sentinel-1 GRD (SAR) • Sentinel-2 L2A (optical)

**Prerequisites:** Kubernetes with [platform-deploy](https://github.com/EOPF-Explorer/platform-deploy) • Python 3.11+ • [GETTING_STARTED.md](GETTING_STARTED.md) for full setup

## Submitting Workflows

| Method | Best For | Setup | Status |
|--------|----------|-------|--------|
| 🎯 **kubectl** | Testing, CI/CD | None | ✅ Recommended |
| 📓 **Jupyter** | Learning, exploration | 2 min | ✅ Working |
| ⚡ **Event-driven** | Production (auto) | In-cluster | ✅ Running |
| 🐍 **Python CLI** | Scripting | Port-forward | ⚠️ Advanced |

<details>
<summary><b>kubectl</b> (recommended)</summary>

```bash
export KUBECONFIG=.work/kubeconfig
kubectl create -f workflows/run-s1-test.yaml -n devseed-staging -o name
kubectl logs -n devseed-staging -l workflows.argoproj.io/workflow=<wf-name> -c main -f
```
Edit `workflows/run-s1-test.yaml` with your STAC URL and collection.
</details>

<details>
<summary><b>Jupyter</b></summary>

```bash
uv sync --extra notebooks
cp notebooks/.env.example notebooks/.env
uv run jupyter lab notebooks/operator.ipynb
```
</details>

<details>
<summary><b>Event-driven</b> (production)</summary>

Publish to RabbitMQ `geozarr` exchange:
```json
{"source_url": "https://stac.../items/S1A_...", "item_id": "S1A_IW_GRDH_...", "collection": "sentinel-1-l1-grd-dp-test"}
```
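Messages can be published with any AMQP client. A sketch using `pika`; the routing key shown is an assumption within the `eopf.items.*` pattern, and `publish` requires a port-forwarded RabbitMQ with `AMQP_PASSWORD` set:

```python
import json
import os

def build_message(source_url: str, item_id: str, collection: str) -> bytes:
    """Serialize the payload expected on the `geozarr` exchange."""
    return json.dumps(
        {"source_url": source_url, "item_id": item_id, "collection": collection}
    ).encode()

def publish(body: bytes, routing_key: str = "eopf.items.sentinel-1") -> None:
    import pika  # third-party AMQP client

    params = pika.URLParameters(
        f"amqp://user:{os.environ['AMQP_PASSWORD']}@localhost:5672/"
    )
    conn = pika.BlockingConnection(params)
    try:
        conn.channel().basic_publish(
            exchange="geozarr", routing_key=routing_key, body=body
        )
    finally:
        conn.close()
```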
</details>

<details>
<summary><b>Python CLI</b></summary>

```bash
kubectl port-forward -n core svc/rabbitmq 5672:5672
export AMQP_PASSWORD=$(kubectl get secret rabbitmq-password -n core -o jsonpath='{.data.rabbitmq-password}' | base64 -d)
uv run python examples/submit.py --stac-url "..." --collection sentinel-2-l2a
```
</details>

**Related:** [data-model](https://github.com/EOPF-Explorer/data-model) • [platform-deploy](https://github.com/EOPF-Explorer/platform-deploy) • [Testing report](docs/WORKFLOW_SUBMISSION_TESTING.md)

## Configuration

<details>
<summary><b>S3 & RabbitMQ</b></summary>

```bash
# S3 credentials
kubectl create secret generic geozarr-s3-credentials -n devseed \
  --from-literal=AWS_ACCESS_KEY_ID="<key>" \
  --from-literal=AWS_SECRET_ACCESS_KEY="<secret>"

# RabbitMQ password
kubectl get secret rabbitmq-password -n core -o jsonpath='{.data.rabbitmq-password}' | base64 -d
```
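The same credentials can be exercised from Python. A sketch using `boto3`, with the endpoint, bucket, and region as configured for this deployment:

```python
import os

# Deployment settings (OVH object storage backing the pipeline outputs)
S3_ENDPOINT = "https://s3.de.io.cloud.ovh.net"
BUCKET = "esa-zarr-sentinel-explorer-fra"
REGION = "de"

def make_s3_client():
    """Client for listing or verifying pipeline outputs; credentials from env."""
    import boto3  # third-party; imported lazily so the module loads without it

    return boto3.client(
        "s3",
        endpoint_url=S3_ENDPOINT,
        region_name=REGION,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
```

For example, `make_s3_client().list_objects_v2(Bucket=BUCKET, MaxKeys=5)` is a quick way to confirm the secret works.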

**Endpoints:** S3: `s3.de.io.cloud.ovh.net/esa-zarr-sentinel-explorer-fra` • RabbitMQ: `geozarr` exchange • [UIs](https://workspace.devseed.hub-eopf-explorer.eox.at/): [Argo](https://argo-workflows.hub-eopf-explorer.eox.at) • [STAC](https://api.explorer.eopf.copernicus.eu/stac) • [Viewer](https://api.explorer.eopf.copernicus.eu/raster)
</details>

## Troubleshooting

<details>
<summary><b>Logs & Issues</b></summary>

```bash
# Workflow status and main-container logs
kubectl get wf -n devseed-staging -w
kubectl logs -n devseed-staging <pod-name> -c main -f

# Sensor (message processing)
kubectl logs -n devseed -l sensor-name=geozarr-sensor --tail=50

# EventSource (RabbitMQ connection)
kubectl logs -n devseed -l eventsource-name=rabbitmq-geozarr --tail=50
```

**Common fixes:** Workflow not starting → check sensor logs • S3 denied → verify `geozarr-s3-credentials` secret • RabbitMQ refused → `kubectl port-forward -n core svc/rabbitmq 5672:5672` • Pod pending → check resources
</details>

## Development

```bash
uv sync --all-extras && pre-commit install
make test # or: pytest tests/ -v -k e2e
```

### Project Structure

```
├── docker/                  # Container images
│   ├── Dockerfile           # Pipeline runtime
│   └── Dockerfile.test      # Test environment
├── scripts/                 # Python pipeline scripts
│   ├── register_stac.py     # STAC catalog registration
│   ├── augment_stac_item.py # Add visualization links
│   └── get_zarr_url.py      # Extract Zarr URL from STAC
├── workflows/               # Argo workflow definitions
│   ├── template.yaml        # Main pipeline WorkflowTemplate
│   ├── eventsource.yaml     # RabbitMQ AMQP event source
│   ├── sensor.yaml          # Workflow trigger on messages
│   └── rbac.yaml            # Service account permissions
├── examples/                # Usage examples
│   └── submit.py            # Submit job via RabbitMQ
├── tests/                   # Unit & integration tests
└── notebooks/               # Operator utilities
```

**Deploy:** Edit `workflows/template.yaml` or `scripts/*.py` → `pytest tests/ -v` → `docker buildx build --platform linux/amd64 -t ghcr.io/eopf-explorer/data-pipeline:dev -f docker/Dockerfile . --push` → `kubectl apply -f workflows/template.yaml -n devseed` • [CONTRIBUTING.md](CONTRIBUTING.md)

## License

Apache License 2.0. See [LICENSE](LICENSE).
3 changes: 3 additions & 0 deletions docker/Dockerfile
ARG SCRIPTS_VERSION=2025-10-09T00:00:00Z
COPY scripts/ /app/scripts/
RUN chmod +x /app/scripts/*.py

# Copy workflows (example payloads and templates)
COPY workflows/ /app/workflows/

CMD ["/bin/bash"]