feat(download): new v2 download service replacing sda-download #2187
Codecov Report ❌
@@ Coverage Diff @@
## main #2187 +/- ##
==========================================
- Coverage 43.06% 42.29% -0.77%
==========================================
Files 98 120 +22
Lines 9825 12242 +2417
==========================================
+ Hits 4231 5178 +947
- Misses 5029 6432 +1403
- Partials 565 632 +67
I know it is not in the swagger spec, but in my opinion we should probably add an endpoint for downloading a whole dataset. As far as I remember, the only way to download a whole dataset is with the sda-cli tool. What do you think?
Would it be possible to get built-in caching of database query responses already now? There are several good options that make this fairly easy (the old sda-download used ristretto for a session cache, and it's used in sdafs to handle data caching). (For the streaming use case, even if the query and response are cached in the database, having to do several round trips for each fetch is not nice, even ignoring the unneeded load caused.)
…eencrypt
New download service at sda/cmd/download/ replacing sda-download/.
Core components:
- main.go: entry point with production safety guards, TLS, graceful shutdown
- config/: 50+ config flags via internal/config/v2 (pflag/viper framework)
- health/: gRPC health server for K8s probes + HTTP ready/live endpoints
- streaming/: Range header parsing (RFC 9110), If-Range ETag support,
  combined header+body and body-only streaming with seek-based positioning
- reencrypt/: gRPC client for crypt4gh header re-encryption with lazy
  connection and TLS support
- internal/config/v2/: shared config registration framework
- swagger_v2.yml: OpenAPI spec for the v2 REST API
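As a rough illustration of the Range handling mentioned for streaming/, a single-range parser could look like the sketch below. `parseSingleRange` and `errRange` are hypothetical names and this is not the code from the PR; it only covers the single-range forms (`bytes=a-b`, `bytes=a-`, `bytes=-n`) and rejects multi-range requests.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errRange is a hypothetical sentinel error for this sketch.
var errRange = errors.New("invalid or unsatisfiable range")

// parseSingleRange is a hypothetical helper, not code from this PR.
// It resolves a "bytes=" Range header against a known body size and
// returns inclusive start and end offsets. Multi-range is rejected.
func parseSingleRange(header string, size int64) (start, end int64, err error) {
	spec, ok := strings.CutPrefix(header, "bytes=")
	if !ok || size <= 0 || strings.Contains(spec, ",") {
		return 0, 0, errRange
	}
	first, last, ok := strings.Cut(spec, "-")
	if !ok {
		return 0, 0, errRange
	}
	if first == "" { // suffix form: "bytes=-N" means the last N bytes
		n, convErr := strconv.ParseInt(last, 10, 64)
		if convErr != nil || n <= 0 {
			return 0, 0, errRange
		}
		if n > size {
			n = size
		}
		return size - n, size - 1, nil
	}
	start, err = strconv.ParseInt(first, 10, 64)
	if err != nil || start < 0 || start >= size {
		return 0, 0, errRange
	}
	end = size - 1 // open-ended form: "bytes=N-"
	if last != "" {
		e, convErr := strconv.ParseInt(last, 10, 64)
		if convErr != nil || e < start {
			return 0, 0, errRange
		}
		if e < end { // clamp an end offset past the body to the last byte
			end = e
		}
	}
	return start, end, nil
}

func main() {
	start, end, err := parseSingleRange("bytes=900-", 1000)
	fmt.Println(start, end, err)
}
```

The resolved offsets would then drive the seek-based positioning; a real parser would also need to be stricter about digit-only offsets than `strconv.ParseInt` is.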
Database components:
- database.go: PostgreSQL interface with prepared statements, keyset
  cursor pagination via (submission_file_path, stable_id) composite
  cursor, LATERAL json_agg for checksum aggregation (prevents row
  multiplication), LIKE prefix escaping, and CheckDatasetExists for
  no-existence-leakage pattern
- cache.go: Ristretto-based cache wrapper (lock-free) for file lookups,
  permission checks, and dataset queries. Paginated queries bypass
  cache due to cursor variability.
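To make the LIKE prefix escaping concrete, here is a minimal sketch. `escapeLikePrefix` is a hypothetical name, and the sketch assumes backslash as the LIKE escape character (the PostgreSQL default); the actual helper in database.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeLikePrefix is a hypothetical helper, not code from this PR.
// It escapes the LIKE metacharacters (\, %, _) in a user-supplied
// prefix, then appends the trailing wildcard for a prefix match.
// Assumes backslash is the LIKE escape character.
func escapeLikePrefix(prefix string) string {
	r := strings.NewReplacer(`\`, `\\`, `%`, `\%`, `_`, `\_`)
	return r.Replace(prefix) + "%"
}

func main() {
	// The result is meant to be bound as a query parameter,
	// e.g. "... WHERE submission_file_path LIKE $1".
	fmt.Println(escapeLikePrefix("data_set/50%"))
}
```

Without the escaping, a pathPrefix such as `data_set/` would also match `dataXset/`, because `_` matches any single character in LIKE.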
Authentication middleware:
- Structure-based JWT detection via looksLikeJWT() (3 dot-segments with
  base64url JSON header+payload)
- JWT path: validate locally via loaded keyset, optional issuer match
- Opaque path: call UserinfoClient.FetchUserinfo for subject resolution
- Session cookie cache (sda_session + legacy sda_session_key)
- Token-keyed cache (sha256(token)) with TTL bounded by
  min(token.exp, min(visa.exp), configTTL)
- Permission model: combined/visa/ownership dataset population
- SameSite=Lax on session cookies
- Audit denial events (download.denied) on all 401 paths
GA4GH visa support:
- validator.go: GetVisaDatasets() extracts datasets from visa JWTs,
  enforces (iss, jku) allowlist, verifies signatures via cached JWKS,
  validates ControlledAccessGrants (by, value, source, conditions,
  asserted), supports broker-bound/strict-sub/strict-iss-sub identity
  binding modes, detects multi-identity scenarios
- trust.go: LoadTrustedIssuers from JSON with conditional HTTPS
  enforcement for JKU URLs
- userinfo.go: UserinfoClient with HTTP cache, io.LimitReader safety,
  GA4GH passport v1 claim extraction
- jwks_cache.go: JWK cache with per-request fetch limits and (iss, jku)
  allowlist enforcement
- types.go: Identity, TrustedIssuer, VisaClaim, UserinfoResponse
- Pre-validation limits: max-visas (200), max-visa-size (16KB),
  max-jwks-per-request (10)
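A minimal sketch of an (iss, jku) allowlist with conditional HTTPS enforcement is shown below. The JSON layout, the lowercase type and function names, and the `requireHTTPS` parameter are all assumptions for illustration; the real TrustedIssuer type and LoadTrustedIssuers signature in trust.go may look different.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// trustedIssuer is a hypothetical shape for one allowlist entry.
type trustedIssuer struct {
	Iss string `json:"iss"`
	Jku string `json:"jku"`
}

// loadTrustedIssuers is a sketch, not the PR's LoadTrustedIssuers.
// It parses a JSON array of entries and, when requireHTTPS is set,
// rejects any JKU URL that is not https.
func loadTrustedIssuers(data []byte, requireHTTPS bool) (map[trustedIssuer]struct{}, error) {
	var list []trustedIssuer
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	allow := make(map[trustedIssuer]struct{}, len(list))
	for _, ti := range list {
		u, err := url.Parse(ti.Jku)
		if err != nil || u.Host == "" {
			return nil, fmt.Errorf("invalid jku %q", ti.Jku)
		}
		if requireHTTPS && u.Scheme != "https" {
			return nil, fmt.Errorf("jku %q must use https", ti.Jku)
		}
		allow[ti] = struct{}{}
	}
	return allow, nil
}

// isTrusted checks the exact (iss, jku) pair; matching only one of
// the two values is not enough.
func isTrusted(allow map[trustedIssuer]struct{}, iss, jku string) bool {
	_, ok := allow[trustedIssuer{Iss: iss, Jku: jku}]
	return ok
}

func main() {
	data := []byte(`[{"iss":"https://aai.example.org","jku":"https://aai.example.org/jwks"}]`)
	allow, err := loadTrustedIssuers(data, true)
	fmt.Println(err, isTrusted(allow, "https://aai.example.org", "https://aai.example.org/jwks"))
}
```

Keying the map on the pair means a visa signed with keys from an unlisted JKU is rejected before any JWKS fetch is attempted, even if its issuer is known.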
…-info
REST API endpoints:
- GET /datasets: paginated dataset list with HMAC-signed page tokens
- GET /datasets/:datasetId: dataset metadata (date, files, size)
- GET /datasets/:datasetId/files: keyset-paginated file list with
  filePath/pathPrefix filters (mutually exclusive, 4096 char limit)
- HEAD/GET /files/:fileId: combined download with re-encrypted header
- HEAD/GET /files/:fileId/header: re-encrypted header only
- HEAD/GET /files/:fileId/content: raw archive body (no pubkey needed)
- GET /service-info: GA4GH service-info metadata
- GET /health/ready, /health/live: health probes
Cross-cutting:
- RFC 9457 Problem Details for all errors (application/problem+json)
- No existence leakage: 403 for both "not found" and "no access"
- Content-Disposition with .c4gh extension
- Cache-Control headers on all data endpoints
- UseRawPath=true for URL-encoded slash dataset IDs
- Audit denied/failed events on 403 and server errors
- Correlation ID middleware (X-Correlation-ID)
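One way to picture the HMAC-signed page tokens is the sketch below. The token layout (base64url cursor, a dot, base64url HMAC-SHA256) and the names `signPageToken` and `verifyPageToken` are assumptions; the PR does not spell out its actual token format here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

var (
	b64         = base64.RawURLEncoding
	errBadToken = errors.New("invalid page token")
)

// signPageToken is a hypothetical helper: it encodes the cursor and
// appends an HMAC-SHA256 tag so clients cannot forge or alter cursors.
func signPageToken(secret []byte, cursor string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(cursor))
	return b64.EncodeToString([]byte(cursor)) + "." + b64.EncodeToString(mac.Sum(nil))
}

// verifyPageToken recomputes the tag and compares it in constant time
// before returning the cursor.
func verifyPageToken(secret []byte, token string) (string, error) {
	payload, sig, ok := strings.Cut(token, ".")
	if !ok {
		return "", errBadToken
	}
	cursor, err := b64.DecodeString(payload)
	if err != nil {
		return "", errBadToken
	}
	gotSig, err := b64.DecodeString(sig)
	if err != nil {
		return "", errBadToken
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(cursor)
	if !hmac.Equal(gotSig, mac.Sum(nil)) {
		return "", errBadToken
	}
	return string(cursor), nil
}

func main() {
	secret := []byte("example-secret") // placeholder, not a real secret
	token := signPageToken(secret, "dir/file.c4gh|file-0001")
	cursor, err := verifyPageToken(secret, token)
	fmt.Println(cursor, err)
}
```

A handler would treat a failed verification as an invalid pageToken request rather than running a query with an untrusted cursor.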
Audit logging:
- audit.Logger interface with StdoutLogger (JSON lines) and NoopLogger
- download.complete/content/header events on successful operations
- download.denied events on 401/403 (middleware + handlers)
- download.failed events on storage/streaming errors with ErrorReason
- Correlation ID propagated to all audit events
Production guards (app.environment=production):
- Fail startup if jwt.allow-all-data is enabled
- Fail startup if pagination.hmac-secret is empty
- Fail startup if gRPC client TLS certs are missing
- validateProductionConfig extracted for unit testing
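The three production guards listed above could be expressed roughly like this. The `config` struct and its field names are hypothetical stand-ins for the real configuration type; only the function name validateProductionConfig and the three conditions come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical, trimmed-down view of the service settings.
type config struct {
	Environment    string // app.environment
	AllowAllData   bool   // jwt.allow-all-data
	HMACSecret     string // pagination.hmac-secret
	GRPCClientCert string // gRPC client TLS certificate path
	GRPCClientKey  string // gRPC client TLS key path
}

// validateProductionConfig is a sketch of the startup guards. It only
// enforces anything when the environment is "production" and reports
// every violated guard at once.
func validateProductionConfig(c config) error {
	if c.Environment != "production" {
		return nil
	}
	var errs []error
	if c.AllowAllData {
		errs = append(errs, errors.New("jwt.allow-all-data must be disabled in production"))
	}
	if c.HMACSecret == "" {
		errs = append(errs, errors.New("pagination.hmac-secret must be set in production"))
	}
	if c.GRPCClientCert == "" || c.GRPCClientKey == "" {
		errs = append(errs, errors.New("gRPC client TLS certs must be configured in production"))
	}
	return errors.Join(errs...) // nil when no guard was violated
}

func main() {
	err := validateProductionConfig(config{Environment: "production", AllowAllData: true})
	fmt.Println(err)
}
```

Keeping the check in a pure function that takes the config and returns an error is what makes it straightforward to unit test without starting the service.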
Docker Compose integration test environment:
- postgres, minio (S3), mock-aai, mockoidc (OIDC + visa JWTs),
  reencrypt (gRPC), download service under test
- database_seed with test dataset + file
- make_download_credentials.sh: RSA keypair, JWT token, trusted issuers
- mockoidc.py: OIDC discovery, JWKS, userinfo with visa datasets
33 integration tests covering:
- Health, auth (JWT + opaque), session cookies, service-info
- Dataset listing, file listing, pagination, invalid pageSize/pageToken
- Encoded-slash dataset IDs (UseRawPath verification)
- Range requests, multi-range rejection, If-Range ETag contract
- Content-Disposition, pathPrefix filter with SQL wildcard escaping
- Expired token rejection, long-transfer resume scenario
- Problem Details format, access control (no existence leakage)
Environment capability probes (SetupSuite):
- REQUIRES_REENCRYPT, REQUIRES_STORAGE_FILE, REQUIRES_SESSION_CACHE
- Tests skip on missing prerequisites, hard-fail on regressions
Benchmark infrastructure:
- benchmark.go: concurrent load tester comparing old (sda-download)
vs new (sda/cmd/download) services with auto-discovery, JWT auth,
configurable iterations/concurrency, percentile stats (p50/p95/p99)
- sda-benchmark.yml: Docker Compose with both services, shared
pipeline, and benchmark runner
- seed_benchmark_data.sh: uploads crypt4gh files through real ingest
pipeline for realistic test data
- Makefile targets: benchmark-download-{up,seed,run,down}
Result: NEW service is ~255% faster than OLD (67 vs 19 req/s).
…rflow hint
- 55_download_test.sh: update from v1 API paths (/info/datasets,
/file/{id}, public_key header) to v2 (/datasets, /files/{id},
X-C4GH-Public-Key). Fixes CI failure in sda (s3) integration job.
- middleware/auth.go: use len(a) instead of len(a)+len(b) as map
capacity hint in mergeDatasets to silence CodeQL overflow warning.
- benchmark.go: fmt.Errorf → errors.New where no format verbs,
  nlreturn blank lines, ifElseChain → switch, nolint for gosec G402
  (TLS skip in benchmark tool) and revive deep-exit
- handlers/files.go: remove duplicate ArchivePath check
- middleware/auth.go: extract tokenExpiry() to reduce nestif
- visa/: nolint:gosec for SSRF (URLs validated against allowlist)
Rewrites the service documentation to match the actual v2 API routes, authentication model, GA4GH visa support, and full configuration reference. The previous version documented draft/old routes that did not match the implementation.
Update curl examples to use v2 API routes (/datasets, /files) instead of old /info/ and /file/ routes. Fix public key header name from public_key to X-C4GH-Public-Key.
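For Go clients, the equivalent of the updated curl examples might look like the sketch below. The base URL, the file ID, the use of a Bearer Authorization header, and the helper name `newFileRequest` are assumptions; only the /files/{id} route and the X-C4GH-Public-Key header name come from this PR. The public key value is passed through unchanged, in whatever encoding the service expects.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newFileRequest is a hypothetical helper that builds a v2 download
// request. It does not send anything over the network.
func newFileRequest(baseURL, fileID, token, pubKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/files/"+url.PathEscape(fileID), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token) // assumed auth scheme
	req.Header.Set("X-C4GH-Public-Key", pubKey)      // replaces the old public_key header
	return req, nil
}

func main() {
	req, err := newFileRequest("https://download.example.org", "file-0001", "TOKEN", "PUBKEY")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```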
Summary
New download service at sda/cmd/download/ replacing sda-download/. Complete implementation with v2 REST API, GA4GH visa support, keyset pagination, audit logging, and ~255% performance improvement over the old service.
Related issues
What's included
Core service
internal/config/v2 (pflag/viper framework), 50+ flags
Database
Authentication
GA4GH visa validation
v2 REST API
- GET /datasets: paginated dataset list with HMAC-signed page tokens
- GET /datasets/:datasetId: dataset metadata
- GET /datasets/:datasetId/files: keyset-paginated file list with filePath/pathPrefix filters
- HEAD/GET /files/:fileId: combined download with re-encrypted header
- HEAD/GET /files/:fileId/header: re-encrypted header only
- HEAD/GET /files/:fileId/content: raw archive body (no pubkey needed)
- GET /service-info: GA4GH service-info
- UseRawPath=true for URL-encoded slash dataset IDs
Security
Testing
How to test
Commits
- feat(download): add service foundation (config, main, health, streaming, reencrypt)
- feat(download): add database layer (queries, caching, keyset pagination)
- feat(download): add auth middleware (JWT, opaque, session cache)
- feat(download): add GA4GH visa validation (validator, trust, userinfo, JWKS cache)
- feat(download): implement v2 API endpoints (handlers, pagination, Problem Details)
- feat(download): add audit logging and production guards (audit events, startup guards)
- test(download): add integration tests (Docker Compose, 33 tests, capability probes)
- feat(download): add benchmark tool (old vs new comparison)