[download] Enable htsget-rs integration with the v2 download API (US8) #2246

@jhagberg

Please describe the feature

As a data recipient using htsget-compatible tools (e.g. samtools), I want
htsget-rs to be able to resolve and serve files from the SDA archive so
that I can use standard genomics tools transparently (US8).

Background

The old sda-download service exposed /s3-encrypted/{dataset}/{filepath},
a path-based endpoint that htsget-rs used via UrlStorage. The v2 download
API replaces this with ID-based endpoints (/files/{fileId}/content) and
does not expose a path-based download endpoint.

htsget-rs currently works with UrlStorage, which constructs URLs from a
base URL + file path. It has no concept of resolving a filepath to an ID
before constructing ticket URLs.

Who needs path-based download?

| Consumer | Addresses files via | Needs path-based download? |
|---|---|---|
| BigPicture | accession ID (confirmed from deployment code) | No |
| sda-cli | accession ID / list+filter | No |
| htsget-rs | {dataset}/{filepath} via UrlStorage | Yes (only consumer) |

Upstream feedback

We filed umccr/htsget-rs#356
proposing a DrsStorage backend. The htsget-rs maintainer confirmed the feature
is in-scope and welcomed a contribution, but suggested a different approach
than our original two-step resolution proposal:

Our original proposal: htsget-rs resolves {dataset}/{filepath} → fileId
via GET /datasets/{id}/files?filePath={fp}, then constructs ticket URLs
to /files/{fileId}/content.

Maintainer preference: stick closer to the DRS spec. Instead of a
list-based query that returns a fileId, the SDA API should expose a
GET /objects/{id} endpoint returning a DrsObject with pre-resolved
access_methods[].access_url. This gives htsget-rs a standard GA4GH
object to work with — no custom resolution logic needed.

Key quotes from the maintainer response:

"I'd also like to propose the possibility of having this stick close to
the DRS spec. I think matching that is beneficial as we wouldn't have to
use custom schemes for how objects are listed and returned."

"the DRS spirit would be more in-line with getting the
access_methods[].access_url field from the returned DrsObject"

"I'm happy for this to be kept as simple as possible for now. E.g. only
assuming access_methods[].access_url exists and is of type https, and
ignoring complexities around authorization and passports by just using
forward_headers."
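The simplification the maintainer describes (assume an https access method exists and read its access_url) can be sketched as below. This is only an illustration in Python, not htsget-rs code; the function name https_access_url is hypothetical, and the field names follow the example DrsObject in this issue.

```python
from typing import Any, Dict, Optional


def https_access_url(drs_object: Dict[str, Any]) -> Optional[str]:
    """Return the first access_methods[].access_url.url of type "https".

    Returns None when no such access method is present.
    """
    for method in drs_object.get("access_methods", []):
        if method.get("type") != "https":
            continue
        url = (method.get("access_url") or {}).get("url")
        if url:
            return url
    return None


example = {
    "id": "urn:neic:001-002-003",
    "access_methods": [
        {
            "type": "https",
            "access_url": {
                "url": "https://download.example.org/files/urn:neic:001-002-003/content"
            },
        }
    ],
}
print(https_access_url(example))
```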

Chosen approach: minimal DRS object endpoint

Based on the upstream feedback, the plan is:

1. SDA download API: add GET /objects/{datasetId}/{filePath}

A minimal DRS-compatible endpoint that resolves {dataset}/{filepath} to
a file and returns a DrsObject with an access_url pointing to the
content endpoint. This performs the resolution step internally — htsget-rs
gets a single standard response with everything it needs.

Example request:

GET /objects/EGAD00001000001/samples/controls/sample1.bam.c4gh
Authorization: Bearer <token>

Example response (minimal DrsObject):

{
  "id": "urn:neic:001-002-003",
  "self_uri": "drs://download.example.org/urn:neic:001-002-003",
  "size": 1048576,
  "created_time": "2026-01-15T10:30:00Z",
  "checksums": [
    { "checksum": "a1b2c3d4...", "type": "sha-256" }
  ],
  "access_methods": [
    {
      "type": "https",
      "access_url": {
        "url": "https://download.example.org/files/urn:neic:001-002-003/content"
      }
    }
  ]
}

DRS 1.5 required fields mapped to SDA data:

| DRS field | Source | Notes |
|---|---|---|
| id | fileId (from DB) | Maps to DRS object_id |
| self_uri | Constructed as drs://{host}/{fileId} | Required by DRS 1.5 |
| size | Encrypted file size (from DB) | DRS spec: blob size in bytes |
| created_time | File creation timestamp (from DB) | RFC 3339 |
| checksums | Decrypted file checksums (from DB) | {checksum, type} per DRS format |
| access_methods[].type | "https" | Fixed for our API |
| access_methods[].access_url.url | {base}/files/{fileId}/content | Pre-resolved URL |
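The field mapping above could be applied roughly as follows. This is a sketch under assumptions: FileRecord is a hypothetical stand-in for whatever the database layer returns (the real SDA schema may differ), and to_drs_object, host and base_url are illustrative names, not part of the existing API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class FileRecord:
    """Hypothetical file row; the real SDA schema may differ."""
    file_id: str
    encrypted_size: int
    created: datetime  # assumed to be timezone-aware
    sha256: str        # checksum of the decrypted file


def to_drs_object(rec: FileRecord, host: str, base_url: str) -> Dict[str, Any]:
    """Build a minimal DrsObject from a file record, per the mapping table."""
    created_utc = rec.created.astimezone(timezone.utc)
    return {
        "id": rec.file_id,
        "self_uri": f"drs://{host}/{rec.file_id}",
        "size": rec.encrypted_size,
        # RFC 3339 timestamp in UTC
        "created_time": created_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "checksums": [{"checksum": rec.sha256, "type": "sha-256"}],
        "access_methods": [
            {
                "type": "https",
                "access_url": {"url": f"{base_url}/files/{rec.file_id}/content"},
            }
        ],
    }


record = FileRecord(
    file_id="urn:neic:001-002-003",
    encrypted_size=1048576,
    created=datetime(2026, 1, 15, 10, 30, tzinfo=timezone.utc),
    sha256="a1b2c3d4...",
)
drs = to_drs_object(record, "download.example.org", "https://download.example.org")
```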

The endpoint uses the same auth and permission model as other protected
endpoints (403 for both access denied and not found, to prevent existence
leakage).
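To make the "403 for both" rule concrete, here is a small sketch with in-memory stand-ins for the file index and the permission check. FILES, PERMISSIONS and resolve_status are hypothetical names for illustration only; the real handler would use the existing auth middleware and database.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical in-memory stand-ins for the DB lookup and the permission model.
FILES: Dict[Tuple[str, str], str] = {
    ("EGAD00001000001", "samples/controls/sample1.bam.c4gh"): "urn:neic:001-002-003",
}
PERMISSIONS: Dict[str, Set[str]] = {"alice": {"EGAD00001000001"}}


def resolve_status(user: str, dataset: str, file_path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, fileId).

    Access denied and not found both yield 403, so a caller cannot tell
    whether a given path exists.
    """
    allowed = dataset in PERMISSIONS.get(user, set())
    file_id = FILES.get((dataset, file_path))
    if not allowed or file_id is None:
        return 403, None
    return 200, file_id
```

The point of the design is that the two failure cases are indistinguishable from the outside, which is why both branches collapse into a single condition.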

2. htsget-rs: contribute DrsStorage backend

A new DrsStorage backend (feature-gated as #[cfg(feature = "drs")]) that:

  1. Calls GET {api_url}/objects/{dataset}/{filepath} to get a DrsObject
  2. Extracts access_methods[0].access_url.url as the content URL
  3. Uses that URL for ticket construction (range_url()) and internal
    fetches (get(), head())
  4. Caches the resolution (files are immutable per fileId in SDA)
  5. Forwards auth headers via forward_headers (existing htsget-rs pattern)
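The resolution, caching and header-forwarding steps above can be sketched as follows. The actual backend would be written in Rust against the htsget-rs storage traits; this Python version only illustrates the control flow. DrsResolver and the injected fetch callable are hypothetical, so the sketch makes no claim about any real HTTP client or htsget-rs API.

```python
from typing import Any, Callable, Dict, Iterable

# fetch(url, headers) -> parsed JSON body; injected so the sketch needs no HTTP client.
Fetch = Callable[[str, Dict[str, str]], Dict[str, Any]]


class DrsResolver:
    def __init__(self, api_url: str, fetch: Fetch, header_blacklist: Iterable[str] = ()):
        self.api_url = api_url.rstrip("/")
        self.fetch = fetch
        self.blacklist = {h.lower() for h in header_blacklist}
        self._cache: Dict[str, str] = {}

    def forwarded(self, headers: Dict[str, str]) -> Dict[str, str]:
        """Drop blacklisted headers (case-insensitive) before forwarding."""
        return {k: v for k, v in headers.items() if k.lower() not in self.blacklist}

    def content_url(self, key: str, headers: Dict[str, str]) -> str:
        """Resolve "{dataset}/{filepath}" to the content URL, caching the result."""
        if key not in self._cache:
            obj = self.fetch(f"{self.api_url}/objects/{key}", self.forwarded(headers))
            self._cache[key] = obj["access_methods"][0]["access_url"]["url"]
        return self._cache[key]
```

One design question this sketch leaves open: the cache here is keyed by path only, so a cached resolution is shared between callers. The content endpoint still enforces access on every fetch, but a real implementation may want to decide whether the cache should be per caller.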

Example htsget-rs config:

[[locations]]
regex = "^(?P<dataset>[^/]+)/(?P<filepath>.+)$"
substitution_string = "$dataset/$filepath"

backend.kind = "Drs"
backend.api_url = "http://download-internal:8080"
backend.response_url = "https://download.example.org"
backend.forward_headers = true
backend.header_blacklist = ["Host"]
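For readers unfamiliar with the location regex, the sketch below shows how it splits an htsget ID into dataset and file path. It uses Python's re module, which accepts the same (?P<name>...) named-group syntax; the $dataset/$filepath substitution is imitated with an f-string. The function name to_object_path is hypothetical.

```python
import re
from typing import Optional

# Same pattern as the `regex` key in the config above.
LOCATION = re.compile(r"^(?P<dataset>[^/]+)/(?P<filepath>.+)$")


def to_object_path(htsget_id: str) -> Optional[str]:
    """Mimic substitution_string = "$dataset/$filepath" for a matching ID."""
    m = LOCATION.match(htsget_id)
    if m is None:
        return None
    return f"{m.group('dataset')}/{m.group('filepath')}"


m = LOCATION.match("EGAD00001000001/samples/controls/sample1.bam")
print(m.group("dataset"))   # first path segment
print(m.group("filepath"))  # everything after the first slash
```

Since [^/]+ stops at the first slash, the dataset is always the first path segment and any further slashes belong to the file path.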

Data flow

1. Client → htsget-rs:      GET /reads/EGAD00001/sample1.bam
2. htsget-rs → SDA API:     GET /objects/EGAD00001/sample1.bam.c4gh
3. SDA API:                  Resolves filepath → fileId internally
4. SDA API → htsget-rs:     DrsObject { access_url: ".../files/{fileId}/content" }
5. htsget-rs:                Reads BAM index via access_url, computes byte ranges
6. htsget-rs → Client:      Ticket with URLs pointing to access_url + Range headers
7. Client → SDA API:        GET /files/{fileId}/content (Range: bytes=X-Y)
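Step 6 of the flow can be illustrated with a minimal ticket builder. The response shape (an "htsget" object with "format" and a "urls" list whose entries carry "url" and optional "headers") follows the htsget protocol; build_ticket and the byte ranges are made up for illustration, and Crypt4GH handling of the encrypted file is deliberately left out.

```python
from typing import Any, Dict, List, Tuple


def build_ticket(access_url: str, byte_ranges: List[Tuple[int, int]], fmt: str = "BAM") -> Dict[str, Any]:
    """Build an htsget ticket with one URL entry per inclusive byte range."""
    urls = [
        {"url": access_url, "headers": {"Range": f"bytes={start}-{end}"}}
        for start, end in byte_ranges
    ]
    return {"htsget": {"format": fmt, "urls": urls}}


ticket = build_ticket(
    "https://download.example.org/files/urn:neic:001-002-003/content",
    [(0, 65535), (131072, 196607)],  # illustrative ranges only
)
```

The client then issues one request per entry, sending the listed Range header, and concatenates the responses in order.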

Acceptance criteria

  • SDA download API exposes GET /objects/{datasetId}/{filePath} returning
    a minimal DRS 1.5 DrsObject with access_methods[].access_url
  • DrsObject includes all DRS 1.5 required fields (id, self_uri,
    size, created_time, checksums)
  • access_url.url points to the content endpoint with the resolved fileId
  • Endpoint uses the same auth/permission model as other protected endpoints
  • htsget-rs DrsStorage backend calls this endpoint and constructs valid
    ticket URLs from the access_url
  • Auth headers forwarded to backend/ticket fetch path per htsget-rs config
  • Client public key forwarded via Htsget-Context-Public-Key header
  • Range requests work correctly for partial file access
  • End-to-end test: samtools/sda-cli → htsget-rs → SDA download v2
  • DRS endpoint added to the v2 API spec (swagger_v2.yml)

Work breakdown

| # | What | Where | Depends on |
|---|---|---|---|
| 1 | Add DRS object endpoint to v2 spec | SDA repo | |
| 2 | Implement DRS object handler | SDA repo | 1 |
| 3 | Contribute DrsStorage backend | umccr/htsget-rs | 1 |
| 4 | Integration tests | SDA repo | 2, 3 |

Additional context

Estimation of size

medium — DRS endpoint in SDA (small) + DrsStorage backend in htsget-rs (medium)

Estimation of priority

high (blocks US8)
