Skip to content

Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186

Open
alexhambley wants to merge 16 commits into
developfrom
refactor-production-readiness
Open

Production-readiness refactor: drop MinIO, make Crate IDs more robust, add async validation and offline cache#186
alexhambley wants to merge 16 commits into
developfrom
refactor-production-readiness

Conversation

@alexhambley

Copy link
Copy Markdown
Contributor

TLDR

This is a big refactor that strengthens the validation service by adding:

  • An agnostic S3 storage layer;
  • Reworking the "Crate ID" model;
  • Adding consistent result/error types;
  • Adding a faster metadata path;
  • Fixing some major production concerns (retries, structured logging, health checks, offline validation);
  • Removing reliance on minio-py SDK

PR

I've broken down the changes in more detail below:

Storage and RO-Crates

  • Added a "vendor-neutral" StorageBackend abstraction-layer with a boto3 S3Backend (this should work with AWS / MinIO / RustFS / etc.) and an in-memory fake for tests. minio-py SDK is removed;
  • Added stricter/safer Crate-ID validation (charset/length, no / or ..). This fixes the bugs/fragility with .zip-in-ID, by treating IDs as opaque;
  • Added deterministic crate resolution by direct key checks and explicit zip vs directory, ambiguous (409) and not-found (404) raised;
  • Set a canonical layout with separate crate and result prefixes.

Validation

  • There is now a single ValidationOutcome type (valid/invalid/error) that replaces the old ValidationResult | str; one boundary to rocrate-validator;
  • Metadata-only validation is now synchronous (removes the blocking result.get());
  • Bumped rocrate-validator to 0.10.0 for (opt-in) offline validation. Cache warming at build, and --extra-profiles-path.

APIs

  • ID endpoints take server-side credentials, so there are no storage secrets in request bodies/logs!
  • Resolve-before-queue;
  • The celery worker runs a staged fetch → validate → persist → webhook with retries and persisted error outcomes;
  • Webhook delivery retries with backoff and surfaces any failures.

DevOps Stuff

  • JSON logging with request IDs and secret redaction;
  • /healthz and /readyz added to check the health and readiness for the storage + Redis broker ;
  • I've tidied up dependencies. pyproject.toml + pip-tools locks, ruff lint/format, CI updated;
  • There is now a single parameterised Dockerfile;
  • Flask entrypoint is renamed cratey.pywsgi.py for convention;
  • Image and labels renamed to ro-crate-validation-service;
  • The README.md has been rewritten and has some architecture/flow mermaid diagrams to help future external collaborators with understanding how the validator works;
  • Tests are now organised properly to mirror app/.

Breaking Changes

  • Environment variables are renamed;
  • ID endpoints no longer accept minio_config / root_path.
  • Status codes are clearer
  • Storage layout is explicit: crates under crates/, results under validation-results/.
  • Requires rocrate-validator >= 0.10.0

Closes #170

- Replace the import-time Config classes with a validated Settings dataclass built in create_app().
- Fails early with an aggregated error when storage is enabled but the required S3/Celery vars are missing.
- Gate storage routes on settings.storage_enabled; Settings injected in tests instead of monkeypatching.
- Add app/storage as a "vendor-neutral" abstraction (so, works with MinIO, RustFS, or any other S3 service)
- Closes #171, #172
Implement S3Backend using boto3. Supports MinIO SDK features such as pagination.
Includes tests
Refs #172
Add app/crates/ids.py: validate_crate_id / is_valid_crate_id and an InvalidCrateId exception. IDs are constrained to a safe charset (start alphanumeric, [A-Za-z0-9._-], max 128, no '/' or '..'). Fixes the ".zip in crate ID" fragility. IDs are no longer parsed.
Closes #174
Add app/crates/layout.py (separate crate/result key prefixes) and resolver.py. `resolve_crate()` locates a crate by direct stat checks on "canonical keys" instead of prefix-listing.
- Distinguishes zip vs directory,
- confirms ro-crate-metadata.json for directories, and
- reports "AmbiguousCrate" / "CrateNotFound"
- Fixes sibling false-match, .zip-in-id, and some missing-metadata gaps.
Closes #175
- Add `app/validation/results.py` (`ValidationOutcome` with valid/invalid/error status, detail/error, serialisation) and `runner.py` wrapping rocrate_validator so both entry points always return an outcome;
- Replaces the ValidationResult|str / isinstance pattern.
Closes #176
- replaces metadata path with `run_metadata_validation()`, which validates inline via the runner and returns a `ValidationOutcome` (200 valid/invalid, 422 unvalidatable).
- remove the metadata Celery task and `perform_metadata_validation`, which addresses the result.get() blocking and the return-in-finally bug.

closes #169
- Move POST/task/GET onto StorageBackend + resolver + runner with server-side credentials. POST resolves before queueing (400/404/409/503 via app error handlers);
- the Celery task (`run_validation_job`) runs fetch->validate->persist->webhook with retries and persists error outcomes; GET reads the results prefix.
- Add s3_crate_prefix/s3_results_prefix to Settings. Delete minio_utils and test_minio, remove minio from requirements and recompile the lock.
- Covers the store-backed validation flow

Closes #177, #178, #179
- Refactor `send_webhook_notification` to retry failures with exponential backoff, add a request timeout, and raise `WebhookDeliveryError` on terminal failure instead of accepting it. - Results are persisted so a delivery failure can be surfaced without losing the result.

Closes #180
Rewrite logging_service with a JsonFormatter, a RedactionFilter that masks configured S3 credentials, and a RequestIdFilter
closes #177
- Adds endpoints to check if the S3 is reachable and Celery / broker connected;
- Returns 503 with a per-check breakdown when a dependency is down.
- When storage is disabled both checks report "disabled".

Closes #181
- Consolidate packaging and tooling into pyproject.toml and standardise linting and formatting with ruff.
- Move metadata and deps into pyproject.toml; compile requirements.txt (runtime) and requirements-dev.txt (runtime + dev: pytest, pytest-mock, moto, ruff) from it via pip-tools; retire requirements.in.
- Move pytest config into [tool.pytest.ini_options] and remove pytest.ini.
- Add [tool.ruff] config; apply ruff check --fix and ruff format across code.
- CI: unit_tests installs requirements-dev.txt; add a lint workflow running ruff check and ruff format --check on PRs to develop.

Closes #183, #173
- Replace the MinIO dev container with a vendor-neutral "objectstore" service running rustfs/rustfs.
- Update .env/example.env to S3_* app config plus RUSTFS_* container credentials; the app/worker stay vendor-neutral (S3_* only).
- Start with `docker compose --profile objectstore up`.
- Move unit tests into app-mirroring sub-packages (storage/, crates/, etc.) and switch pytest to importlib import mode with pythonpath=".".
- Rewrite the integration test for the new stack
- Update the integration workflow deps

Closes #184
… and image

A few final things to bundle together:
- bump `roc-validator` 0.9.0 -> 0.10.0 to use offline/cache/extra-profiles
- rename the WSGI entrypoint and align image/label naming.
- Add `VALIDATION_OFFLINE`, `CACHE_PATH` and `EXTRA_PROFILES_PATH` settings
- Offline validation is opt-in (default off).
- `extra_profiles_path` ADDS to the bundled profiles (roc-validator 0.10.0 fix)
- Warm validator HTTP cache at image build (rocrate-validator cache warm)
- Replace the site-packages profile bake with extraction to /app/extra-profiles loaded via `EXTRA_PROFILES_PATH`
- This is passed as a build arg from the profiles workflow.
- Rename entrypoint `cratey.py` -> `wsgi.py` to fit convention
- Rename the published image and the five-safes label to ro-crate-validation-service; point image.source at the renamed repo.

Closes #162, #163, #164
@alexhambley alexhambley requested review from AnnieZQH and douglowe June 18, 2026 09:47
@elichad

elichad commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@alexhambley to make this slightly more manageable to review, are you able to pop the last three commits (particularly the last one because it touches many files) off into a separate PR that depends on this one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Production-readiness refactor

2 participants