Description
Please describe the feature
As a data recipient using htsget-compatible tools (e.g. samtools), I want
htsget-rs to be able to resolve and serve files from the SDA archive so
that I can use standard genomics tools transparently (US8).
Background
The old sda-download service exposed /s3-encrypted/{dataset}/{filepath} —
a path-based endpoint that htsget-rs used via UrlStorage. The v2 download
API replaces this with ID-based endpoints (/files/{fileId}/content) and
does not expose a path-based download endpoint.
htsget-rs currently works with UrlStorage, which constructs URLs from a
base URL + file path. It has no concept of resolving a filepath to an ID
before constructing ticket URLs.
Who needs path-based download?
| Consumer | Addresses files via | Needs path-based download? |
|---|---|---|
| BigPicture | accession ID (confirmed from deployment code) | No |
| sda-cli | accession ID / list+filter | No |
| htsget-rs | {dataset}/{filepath} via UrlStorage | Yes (only consumer) |
Upstream feedback
We filed umccr/htsget-rs#356
proposing a DrsStorage backend. The htsget-rs maintainer confirmed the feature
is in-scope and welcomed a contribution, but suggested a different approach
than our original two-step resolution proposal:
Our original proposal: htsget-rs resolves {dataset}/{filepath} → fileId
via GET /datasets/{id}/files?filePath={fp}, then constructs ticket URLs
to /files/{fileId}/content.
Maintainer preference: stick closer to the DRS spec. Instead of a
list-based query that returns a fileId, the SDA API should expose a
GET /objects/{id} endpoint returning a DrsObject with pre-resolved
access_methods[].access_url. This gives htsget-rs a standard GA4GH
object to work with — no custom resolution logic needed.
Key quotes from the maintainer response:
"I'd also like to propose the possibility of having this stick close to
the DRS spec. I think matching that is beneficial as we wouldn't have to
use custom schemes for how objects are listed and returned."

"the DRS spirit would be more in-line with getting the
access_methods[].access_url field from the returned DrsObject"

"I'm happy for this to be kept as simple as possible for now. E.g. only
assuming access_methods[].access_url exists and is of type https, and
ignoring complexities around authorization and passports by just using
forward_headers."
Chosen approach: minimal DRS object endpoint
Based on the upstream feedback, the plan is:
1. SDA download API: add GET /objects/{datasetId}/{filePath}
A minimal DRS-compatible endpoint that resolves {dataset}/{filepath} to
a file and returns a DrsObject with an access_url pointing to the
content endpoint. This performs the resolution step internally — htsget-rs
gets a single standard response with everything it needs.
Example request:
GET /objects/EGAD00001000001/samples/controls/sample1.bam.c4gh
Authorization: Bearer <token>
Example response (minimal DrsObject):
{
  "id": "urn:neic:001-002-003",
  "self_uri": "drs://download.example.org/urn:neic:001-002-003",
  "size": 1048576,
  "created_time": "2026-01-15T10:30:00Z",
  "checksums": [
    { "checksum": "a1b2c3d4...", "type": "sha-256" }
  ],
  "access_methods": [
    {
      "type": "https",
      "access_url": {
        "url": "https://download.example.org/files/urn:neic:001-002-003/content"
      }
    }
  ]
}

DRS 1.5 required fields mapped to SDA data:
| DRS field | Source | Notes |
|---|---|---|
| id | fileId (from DB) | Maps to DRS object_id |
| self_uri | Constructed as drs://{host}/{fileId} | Required by DRS 1.5 |
| size | Encrypted file size (from DB) | DRS spec: blob size in bytes |
| created_time | File creation timestamp (from DB) | RFC 3339 |
| checksums | Decrypted file checksums (from DB) | {checksum, type} per DRS format |
| access_methods[].type | "https" | Fixed for our API |
| access_methods[].access_url.url | {base}/files/{fileId}/content | Pre-resolved URL |
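As an illustration of the field mapping, here is a minimal Python sketch that assembles the DrsObject from metadata already fetched from the database. The function name, its parameters, and the shape of the `checksums` input are hypothetical. The real handler would live in the SDA download service and may look quite different.

```python
from datetime import datetime, timezone


def build_drs_object(file_id, size, created, checksums, host, base_url):
    """Assemble a minimal DrsObject dict (hypothetical helper, not SDA code).

    `created` is assumed to be a timezone-aware datetime.
    `checksums` is assumed to be a list of (type, value) pairs.
    """
    return {
        "id": file_id,
        "self_uri": f"drs://{host}/{file_id}",
        "size": size,
        # RFC 3339 timestamp, normalised to UTC
        "created_time": created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "checksums": [{"checksum": value, "type": kind} for kind, value in checksums],
        "access_methods": [
            {
                "type": "https",  # fixed for this API
                "access_url": {"url": f"{base_url}/files/{file_id}/content"},
            }
        ],
    }
```

Called with the values from the example response, this yields the same `self_uri` and `access_url.url` as shown there.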
The endpoint uses the same auth and permission model as other protected
endpoints (403 for both access denied and not found, to prevent existence
leakage).
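The "403 for both" rule can be sketched as a single decision function. This is only an illustration: the function name and its two boolean inputs are hypothetical, and the handling of unauthenticated requests is deliberately left out because the text above does not specify it.

```python
from http import HTTPStatus


def object_lookup_status(file_exists: bool, caller_has_access: bool) -> HTTPStatus:
    """Return the status for GET /objects/... (hypothetical helper).

    A missing file and a forbidden file are indistinguishable to the
    caller, so the response never reveals whether a path exists.
    """
    if file_exists and caller_has_access:
        return HTTPStatus.OK
    return HTTPStatus.FORBIDDEN
```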
2. htsget-rs: contribute DrsStorage backend
A new DrsStorage backend (feature-gated as #[cfg(feature = "drs")]) that:
- Calls GET {api_url}/objects/{dataset}/{filepath} to get a DrsObject
- Extracts access_methods[0].access_url.url as the content URL
- Uses that URL for ticket construction (range_url()) and internal
  fetches (get(), head())
- Caches the resolution (files are immutable per fileId in SDA)
- Forwards auth headers via forward_headers (existing htsget-rs pattern)
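The resolve, extract, and cache steps can be sketched as follows. The actual backend would be written in Rust inside htsget-rs; this Python version only illustrates the logic. `DrsResolver`, `extract_access_url`, and the injected `fetch` callable are hypothetical names. Caching by path alone is an assumption: a real backend would need to decide whether entries may be shared across callers, bearing in mind that the content endpoint still enforces its own authorization.

```python
import json
from typing import Callable, Dict


def extract_access_url(drs_object: dict) -> str:
    """Return access_methods[0].access_url.url, assuming an https method."""
    methods = drs_object.get("access_methods") or []
    if not methods:
        raise ValueError("DrsObject has no access_methods")
    first = methods[0]
    if first.get("type") != "https":
        raise ValueError("only https access methods are supported in this sketch")
    url = (first.get("access_url") or {}).get("url")
    if not url:
        raise ValueError("access method has no access_url.url")
    return url


class DrsResolver:
    """Resolve {dataset}/{filepath} to a content URL and cache the result."""

    def __init__(self, api_url: str, fetch: Callable[[str, Dict[str, str]], str]):
        self.api_url = api_url.rstrip("/")
        self.fetch = fetch  # (url, headers) -> response body as text
        self._cache: Dict[str, str] = {}

    def content_url(self, dataset: str, filepath: str, headers: Dict[str, str]) -> str:
        key = f"{dataset}/{filepath}"
        if key not in self._cache:
            # Forward the caller's headers (e.g. Authorization) to the SDA API.
            body = self.fetch(f"{self.api_url}/objects/{key}", headers)
            self._cache[key] = extract_access_url(json.loads(body))
        return self._cache[key]
```

Injecting `fetch` keeps the sketch free of any particular HTTP client and makes the caching behaviour easy to check with a stub.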
Example htsget-rs config:
[[locations]]
regex = "^(?P<dataset>[^/]+)/(?P<filepath>.+)$"
substitution_string = "$dataset/$filepath"
backend.kind = "Drs"
backend.api_url = "http://download-internal:8080"
backend.response_url = "https://download.example.org"
backend.forward_headers = true
backend.header_blacklist = ["Host"]
Data flow
1. Client → htsget-rs: GET /reads/EGAD00001/sample1.bam
2. htsget-rs → SDA API: GET /objects/EGAD00001/sample1.bam.c4gh
3. SDA API: Resolves filepath → fileId internally
4. SDA API → htsget-rs: DrsObject { access_url: ".../files/{fileId}/content" }
5. htsget-rs: Reads BAM index via access_url, computes byte ranges
6. htsget-rs → Client: Ticket with URLs pointing to access_url + Range headers
7. Client → SDA API: GET /files/{fileId}/content (Range: bytes=X-Y)
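Step 6 can be illustrated with a short Python sketch that turns a resolved access_url and a list of byte ranges into an htsget ticket. The function name and parameters are hypothetical, and the byte ranges are assumed to have been computed already from the index in step 5. How htsget-rs computes ranges for Crypt4GH-encrypted files is not covered here.

```python
def build_ticket(access_url, byte_ranges, fmt="BAM", extra_headers=None):
    """Build an htsget ticket dict (hypothetical helper).

    `byte_ranges` is a list of inclusive (start, end) byte offsets.
    `extra_headers` are headers the client should replay, if any.
    """
    urls = []
    for start, end in byte_ranges:
        headers = dict(extra_headers or {})
        headers["Range"] = f"bytes={start}-{end}"
        urls.append({"url": access_url, "headers": headers})
    return {"htsget": {"format": fmt, "urls": urls}}
```

Each entry in `urls` corresponds to one request the client makes in step 7, all against the same content endpoint but with a different Range header.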
Acceptance criteria
- SDA download API exposes GET /objects/{datasetId}/{filePath} returning
  a minimal DRS 1.5 DrsObject with access_methods[].access_url
- DrsObject includes all DRS 1.5 required fields (id, self_uri,
  size, created_time, checksums)
- access_url.url points to the content endpoint with the resolved fileId
- Endpoint uses the same auth/permission model as other protected endpoints
- htsget-rs DrsStorage backend calls this endpoint and constructs valid
  ticket URLs from the access_url
- Auth headers forwarded to backend/ticket fetch path per htsget-rs config
- Client public key forwarded via Htsget-Context-Public-Key header
- Range requests work correctly for partial file access
- End-to-end test: samtools/sda-cli → htsget-rs → SDA download v2
- DRS endpoint added to the v2 API spec (swagger_v2.yml)
Work breakdown
| # | What | Where | Depends on |
|---|---|---|---|
| 1 | Add DRS object endpoint to v2 spec | SDA repo | — |
| 2 | Implement DRS object handler | SDA repo | 1 |
| 3 | Contribute DrsStorage backend | umccr/htsget-rs | 1 |
| 4 | Integration tests | SDA repo | 2, 3 |
Additional context
- Upstream issue: umccr/htsget-rs#356
- The v2 spec documents Htsget-Context-Public-Key as the header alias
  for htsget-rs compatibility
- GA4GH DRS 1.5 spec: https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.5.0/docs/
- htsget-rs source: https://github.com/umccr/htsget-rs
Estimation of size
medium — DRS endpoint in SDA (small) + DrsStorage backend in htsget-rs (medium)
Estimation of priority
high (blocks US8)