
[download] Investigate rclone compatibility for dataset downloads (US11) #2247

@jhagberg

Please describe the feature

As a data recipient, I want to be able to use rclone to copy a complete dataset so I can easily manage downloads (US11, priority: could).

Background

rclone supports many backends beyond S3. The most relevant for SDA:

  • HTTP — simplest. Needs a page with file links (directory listing). rclone supports custom headers (--header), so Authorization and X-C4GH-Public-Key can be passed.
  • WebDAV — more capable (directory listing, metadata), but requires implementing a WebDAV interface.
  • S3 — most complex. Requires full S3 API compatibility.
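For the HTTP option, the invocation might look like the following minimal sketch. It only assembles the command line and does not run rclone. `--http-url` and `--header` are existing rclone flags; the `Bearer` scheme, the format of the public key value, and the example URL are assumptions made for illustration, not confirmed details of the download API.

```python
import shlex


def build_rclone_copy_command(base_url, dest_dir, token, c4gh_public_key):
    """Build an argv list for `rclone copy` against an on-the-fly HTTP remote."""
    return [
        "rclone", "copy",
        ":http:",                # on-the-fly HTTP remote, no rclone.conf entry needed
        dest_dir,
        "--http-url", base_url,  # URL expected to serve the directory listing
        # Assumption: the API accepts a bearer token in the Authorization header.
        "--header", f"Authorization: Bearer {token}",
        # Assumption: the public key is passed as a single header value.
        "--header", f"X-C4GH-Public-Key: {c4gh_public_key}",
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_rclone_copy_command(
        "https://download.example.org/datasets/EXAMPLE-DATASET/",
        "./EXAMPLE-DATASET",
        "TOKEN",
        "PUBLIC_KEY",
    )
    print(shlex.join(cmd))
```

Returning an argv list rather than a shell string avoids quoting problems with header values that contain spaces.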

As noted in #1680: "rclone does not require full S3 support, there are lots of standard protocols that rclone supports, maybe it's possible with simple http for example."

Questions to investigate

  1. Can rclone's HTTP backend work with the current v2 API as-is? The API already provides dataset file listing (GET /datasets/{datasetId}/files) and file download (GET /files/{fileId}). The gap is that rclone's HTTP backend expects a browsable HTML directory listing, not a JSON API.
  2. Is a thin adapter needed? E.g. a lightweight layer that serves an HTML directory listing from the dataset files endpoint, with download links pointing to file endpoints.
  3. Header forwarding — rclone supports --header for custom headers, but does it forward them correctly on redirects? Same concern as htsget-rs.
  4. Crypt4GH complication — downloaded files are re-encrypted per recipient. rclone would download encrypted files. Is the user expected to decrypt locally with their private key, or do we need integration with crypt4gh-aware tooling?
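The thin adapter from question 2 could be sketched roughly as below. This is an illustration under assumptions, not a proposed implementation: `render_listing` is a hypothetical helper, the file names are assumed to be flat (no subdirectories), and the hrefs are kept relative because, as far as I understand it, rclone's HTTP backend derives object names from links located under the listing URL. A real adapter would therefore also have to map each name back to its `fileId` when serving the download.

```python
import html
from urllib.parse import quote


def render_listing(dataset_id, file_names):
    """Render a minimal HTML index with one relative link per file name."""
    rows = []
    for name in sorted(file_names):
        # Percent-encode the name so spaces and similar characters survive in the href.
        href = quote(name)
        rows.append(f'<li><a href="{href}">{html.escape(name)}</a></li>')
    title = html.escape(f"Index of {dataset_id}")
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head><body>\n"
        "<ul>\n" + "\n".join(rows) + "\n</ul>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical dataset ID and file names, for illustration only.
    print(render_listing("EXAMPLE-DATASET", ["sample1.bam.c4gh", "sample2.bam.c4gh"]))
```

The names passed in would come from the JSON returned by GET /datasets/{datasetId}/files; the field that holds them is not specified here, so the sketch takes plain strings.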

Acceptance criteria

  • Document which rclone backend(s) are compatible with the download API
  • If an adapter is needed, implement it (internal, not public API)
  • Verify rclone can list and download a complete dataset with auth headers
  • Document the rclone configuration for end users
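To verify the header-forwarding concern from question 3, a small local probe could record which headers arrive after a redirect. The sketch below is one possible test harness: the server, the `/start` and `/final` paths, and the header values are all hypothetical, and it only covers a same-host redirect. The built-in client shown here merely checks that the probe works; for the actual investigation, rclone would be pointed at the probe instead.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_headers = {}  # request path -> lower-cased headers received on that request


class RedirectProbe(BaseHTTPRequestHandler):
    def do_GET(self):
        seen_headers[self.path] = {k.lower(): v for k, v in self.headers.items()}
        if self.path == "/start":
            # Redirect to /final so the follow-up request can be inspected.
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the probe quiet


def start_probe():
    """Start the probe on a free local port and return the server object."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), RedirectProbe)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_probe()
    port = server.server_address[1]
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/start",
        headers={"Authorization": "Bearer TOKEN", "X-C4GH-Public-Key": "KEY"},
    )
    opener.open(req).read()
    print(seen_headers.get("/final"))  # headers that survived the redirect
    server.shutdown()
```

A cross-host redirect, which is where clients most often drop `Authorization`, would need a second probe instance on another host name or port.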

Additional context

  • US11 is priority could, so this is lower priority than htsget-rs (US8, must)
  • sda-cli already handles dataset downloads (US9) — rclone would be an alternative for users who prefer standard tooling
  • The investigation outcome may be "current API already works with rclone HTTP backend + custom headers" — in which case only documentation is needed

Estimation of size

small (investigation) / medium (if adapter needed)

Estimation of priority

low (US11 is could)

Labels: enhancement (New feature or request)
