wietzesuijker

Prometheus metrics for observability: conversion timing, STAC operations, HTTP requests, AMQP events.

Metrics (7 total):

  • geozarr_conversion_seconds, geozarr_conversion_bytes: Conversion performance
  • stac_registration_total: Success/failure counts
  • stac_http_request_seconds: API latency
  • preview_generation_seconds: Rendering time
  • preview_http_request_seconds: Service performance
  • amqp_publish_total: Publishing success/failure

Files:

  • scripts/metrics.py (new): Metrics definitions, HTTP server
  • scripts/register_stac.py: Add timing/counters
  • scripts/augment_stac_item.py: Add preview timing
  • workflows/template.yaml: Expose :8000/metrics endpoint
  • docs/prometheus-metrics.md: Usage guide

Base: test/e2e-s1

- Configure pytest pythonpath to enable script imports (unblocks 90 tests)
- Add exception tracebacks to get_conversion_params error handlers
- Add error trap to validate-setup.sh for line-level diagnostics
- Replace timestamp-based Docker cache with commit SHA for precision
- Add pre-commit hooks (ruff, mypy) for code quality enforcement

Test results: 90/90 passing, 32% coverage

- Add integration-tests job in GitHub Actions (runs on PRs only)
- Add explicit resource requests/limits to all workflow templates
  - convert-geozarr: 6Gi/10Gi memory, 2/4 CPU
  - validate: 2Gi/4Gi memory, 1/2 CPU
  - register-stac: 1Gi/2Gi memory, 500m/1 CPU
  - augment-stac: 1Gi/2Gi memory, 500m/1 CPU

Prevents pod eviction and enables predictable scheduling.

Add OVERRIDE_GROUPS, OVERRIDE_EXTRA_FLAGS, OVERRIDE_SPATIAL_CHUNK,
OVERRIDE_TILE_WIDTH environment variables to override collection registry
defaults at runtime.

Enables production parameter tuning and testing without code deployment.
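
A minimal sketch of how such overrides could be layered on top of registry defaults. The function name `apply_overrides`, the parameter keys (`groups`, `extra_flags`, `spatial_chunk`, `tile_width`), and the integer parsing are assumptions for illustration, not the actual contents of get_conversion_params.py:

```python
from __future__ import annotations

import os
from typing import Any, Callable, Mapping

# Hypothetical mapping: environment variable -> (parameter key, parser).
# The keys and value types are assumed, not taken from the real registry.
_OVERRIDES: dict[str, tuple[str, Callable[[str], Any]]] = {
    "OVERRIDE_GROUPS": ("groups", str),
    "OVERRIDE_EXTRA_FLAGS": ("extra_flags", str),
    "OVERRIDE_SPATIAL_CHUNK": ("spatial_chunk", int),
    "OVERRIDE_TILE_WIDTH": ("tile_width", int),
}


def apply_overrides(
    defaults: Mapping[str, Any],
    environ: Mapping[str, str] | None = None,
) -> dict[str, Any]:
    """Return a copy of `defaults` with any OVERRIDE_* values applied."""
    env = os.environ if environ is None else environ
    params = dict(defaults)  # never mutate the registry entry itself
    for var, (key, parse) in _OVERRIDES.items():
        raw = env.get(var)
        if raw:  # unset or empty variables keep the registry default
            params[key] = parse(raw)
    return params
```

Passing `environ` explicitly keeps the function testable without touching the process environment; in a workflow pod it would fall back to `os.environ`.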

Tests: 97 passing (+7), coverage: 91% for get_conversion_params.py

Replace logger.error() calls with logger.exception() to capture full stack
traces in production Kubernetes logs. Adds structured context via extra={}
for improved observability:
- load_payload: Include file path on FileNotFoundError/JSONDecodeError
- format_routing_key: Show template and available fields on KeyError
- main: Include exchange/routing_key/host on publish failure
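
The pattern above, sketched for two of the helpers. The function bodies, log messages, and `extra` key names are assumptions; only the function names and the logged context come from this change:

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_payload(path: Path) -> dict:
    """Read a JSON payload, logging the file path if it cannot be loaded."""
    try:
        return json.loads(path.read_text())
    except FileNotFoundError:
        # logger.exception() logs at ERROR level and appends the traceback.
        logger.exception("Payload file not found", extra={"payload_path": str(path)})
        raise
    except json.JSONDecodeError:
        logger.exception("Payload is not valid JSON", extra={"payload_path": str(path)})
        raise


def format_routing_key(template: str, payload: dict) -> str:
    """Fill a routing-key template from payload fields."""
    try:
        return template.format(**payload)
    except KeyError:
        logger.exception(
            "Routing key template references a missing field",
            extra={"template": template, "available_fields": sorted(payload)},
        )
        raise
```

Keys passed via `extra` become attributes on the log record, so they must not collide with built-in record attributes such as `message` or `filename`; a structured (for example JSON) formatter is needed for them to appear in the output.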

Closes phase3-analysis #1 (error logging completion)

- Add scripts/metrics.py with 7 Prometheus metrics definitions
- Add CLI timing logs to register_stac.py
- Expose metrics endpoint in workflow pods (port 8000)
- Add prometheus-client dependency
- Background metrics server with trap cleanup

Instrument register_stac.py and augment_stac_item.py with Prometheus
metrics for production observability.

Metrics:
- stac_registration_total: track create/update/skip/replace operations
- stac_http_request_duration_seconds: STAC API latency
- preview_generation_duration_seconds: augmentation timing
- preview_http_request_duration_seconds: preview API latency
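
A rough sketch of how two of these metrics might be defined and used with prometheus-client. The label names (`operation`, `status`, `method`), the help strings, the dedicated registry, and the `register_item` wrapper are assumptions for illustration and may differ from scripts/metrics.py:

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A dedicated registry keeps the example isolated; the real module may
# simply use the library's default registry.
registry = CollectorRegistry()

STAC_REGISTRATION_TOTAL = Counter(
    "stac_registration_total",
    "STAC item registration operations",
    ["operation", "status"],  # assumed label names
    registry=registry,
)
STAC_HTTP_REQUEST_DURATION = Histogram(
    "stac_http_request_duration_seconds",
    "STAC API request latency in seconds",
    ["method"],  # assumed label name
    registry=registry,
)


def register_item(send, operation="create"):
    """Time one STAC API call and count its outcome.

    `send` is a hypothetical zero-argument callable that performs the request.
    """
    status = "success"
    try:
        with STAC_HTTP_REQUEST_DURATION.labels(method="POST").time():
            return send()
    except Exception:
        status = "failure"
        raise
    finally:
        STAC_REGISTRATION_TOTAL.labels(operation=operation, status=status).inc()


register_item(lambda: "ok")
# Text exposition format, as a scrape of /metrics would return it.
print(generate_latest(registry).decode())
```

Counting in `finally` means every attempt is recorded exactly once, whether the request succeeds or raises.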

SLOs: success >99%, STAC API <500ms, preview <10s

Docs: docs/prometheus-metrics.md with queries, alerts, dashboards

Add metrics calls to register_stac.py and augment_stac_item.py:
- Wrap HTTP operations with duration timers
- Increment operation counters (create/update/skip/replace)
- Track preview generation duration
- All 42 unit tests pass, coverage: augment 24%, register 42%, metrics 65%
- Fix Dockerfile: install from pyproject.toml (ensures dep sync)
- Update workflow template to use feat-prometheus-metrics image
- Add metrics port 8000 to register-stac container

Verified: docker smoke tests, metrics endpoint (7 metrics), k8s template applied