Fix S3 lookup unbounded pagination with double call #6851

Merged
pditommaso merged 3 commits into master from fix/s3-lookup-unbounded-pagination-with-double-call on Feb 23, 2026

jorgee (Contributor) commented Feb 20, 2026

Problem

S3ObjectSummaryLookup.lookup() used an unbounded while(true) pagination loop that iterated through all objects sharing a given prefix (fetching 250 keys per page). On S3 buckets with large prefixes containing millions of objects, this caused excessive LIST API calls, high latency, and potential timeouts — just to check whether a single path exists.
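
To put the cost in numbers, here is a rough worst-case estimate of how many LIST requests the old loop could issue when no matching key turns up early. It is only back-of-the-envelope arithmetic based on the 250-keys-per-page figure above; the class and method names are hypothetical and do not come from the Nextflow code base.

```java
public class ListCallEstimate {

    /** Pages needed to walk objectCount keys at pageSize keys per page (ceiling division). */
    static long pages(long objectCount, int pageSize) {
        return (objectCount + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // Assumed prefix sizes, chosen only for illustration.
        System.out.println(pages(1_000_000L, 250)); // 4000
        System.out.println(pages(5_000_000L, 250)); // 20000
    }
}
```

Under the same assumptions, the bounded approach issues at most 2 LIST requests regardless of prefix size.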

Solution

Replace the unbounded loop with at most two bounded listObjects calls:

  1. Call 1: prefix(key), maxKeys(2). Covers the common cases where the exact key or its first directory child appears within the first 2 lexicographic results.

  2. Call 2 (fallback): prefix(key + "/"), maxKeys(1). Needed because S3 lists keys in lexicographic (UTF-8 byte) order, and characters like - (0x2D) and . (0x2E) sort before / (0x2F). This means sibling keys such as a-a/ and a.txt appear before a/ in the listing, potentially pushing the directory child outside Call 1's result window. Call 2 searches with prefix key/ directly, bypassing those siblings.
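
The two calls above can be sketched against an in-memory stand-in for a bucket. This is a minimal illustration, not the code merged in this PR: the `list`, `matches`, and `lookup` helpers are hypothetical, a `TreeSet` plays the role of the sorted S3 listing, and keys are assumed to be ASCII so that Java's string order coincides with UTF-8 byte order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class BoundedLookupSketch {

    /** Hypothetical stand-in for one LIST call: up to maxKeys keys starting with prefix, in sorted order. */
    static List<String> list(TreeSet<String> bucket, String prefix, int maxKeys) {
        List<String> page = new ArrayList<>();
        // tailSet starts at the first key >= prefix; keys sharing the prefix are contiguous from there.
        for (String k : bucket.tailSet(prefix, true)) {
            if (!k.startsWith(prefix) || page.size() == maxKeys) break;
            page.add(k);
        }
        return page;
    }

    /** True when candidate is the key itself or a child under key + "/". */
    static boolean matches(String candidate, String key) {
        return candidate.equals(key) || candidate.startsWith(key + "/");
    }

    /** Returns a matching key, or null when neither the object nor a directory child exists. */
    static String lookup(TreeSet<String> bucket, String key) {
        // Call 1: prefix(key), maxKeys(2)
        for (String k : list(bucket, key, 2)) {
            if (matches(k, key)) return k;
        }
        // Call 2 (fallback): prefix(key + "/"), maxKeys(1)
        List<String> children = list(bucket, key + "/", 1);
        return children.isEmpty() ? null : children.get(0);
    }

    public static void main(String[] args) {
        TreeSet<String> bucket = new TreeSet<>(List.of("a-a/file-3", "a.txt", "a/file-1", "b"));
        System.out.println(lookup(bucket, "a")); // a/file-1 (found by Call 2)
        System.out.println(lookup(bucket, "b")); // b (found by Call 1)
        System.out.println(lookup(bucket, "c")); // null
    }
}
```

Whatever the bucket holds, `lookup` performs at most two `list` calls and reads at most three keys in total.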

Example of the lexicographic ordering issue

Given keys a-a/file-3, a.txt, and a/file-1, S3 returns them as:

a-a/file-3   ← '-' (0x2D) < '/' (0x2F)
a.txt         ← '.' (0x2E) < '/' (0x2F)
a/file-1      ← '/' (0x2F) — the actual directory child

With maxKeys(2), Call 1 only sees a-a/file-3 and a.txt — neither matches. Call 2 with prefix a/ finds a/file-1, confirming that a is a directory.
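
The ordering in this example can be reproduced locally by sorting the three keys by unsigned UTF-8 bytes, the order the description attributes to S3 listings. This is an illustrative sketch only; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyOrderDemo {

    /** Sorts keys by unsigned UTF-8 byte order. */
    static List<String> sortedByUtf8(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((x, y) -> Arrays.compareUnsigned(
                x.getBytes(StandardCharsets.UTF_8),
                y.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> listing = sortedByUtf8(List.of("a/file-1", "a.txt", "a-a/file-3"));
        System.out.println(listing); // [a-a/file-3, a.txt, a/file-1]

        // Call 1 window: only the first two keys with prefix "a" are returned.
        List<String> window = listing.subList(0, 2);
        boolean found = window.stream().anyMatch(k -> k.equals("a") || k.startsWith("a/"));
        System.out.println(found); // false, so the fallback call is required
    }
}
```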

Alternative to #6849

pditommaso and others added 2 commits February 19, 2026 21:59
The lookup method paginated through all objects under an S3 prefix
(maxKeys=250) to check path existence. On prefixes with millions of
objects this caused the main thread to hang for minutes parsing massive
XML responses.

Observed in production: nf-schema parameter validation calls
Files.exists() on an S3 outdir path, which triggers
S3ObjectSummaryLookup.lookup. With a large prefix like
s3://bucket/results containing many objects from previous runs,
the pagination loop iterated indefinitely.

Fix: use maxKeys=2 and remove pagination. The matchName check only
needs to find the exact key or its first child (key + "/"), which
are guaranteed to appear in the first results due to S3 lexicographic
ordering.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
…refix and smaller lexico order characters than /

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify bot commented Feb 20, 2026

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: 4d5fd24
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/699889e920365d00089931fa

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
pditommaso (Member) left a comment

Well done. Considering this is really a tricky issue, I took the liberty of extending the docs/comments.

pditommaso merged commit a2e67eb into master Feb 23, 2026
7 checks passed
pditommaso deleted the fix/s3-lookup-unbounded-pagination-with-double-call branch February 23, 2026 11:37