[branch-45] Fix sequential metadata fetching in ListingTable causing high latency#2

Merged
gabotechs merged 1 commit into branch-45 from geoffrey.claude/concurrent_heads
Mar 3, 2025

@geoffreyclaude geoffreyclaude commented Feb 27, 2025

Which issue does this PR close?

Rationale for this change

When scanning an exact list of remote Parquet files, the ListingTable was fetching file metadata (via head calls) sequentially. This was due to using stream::iter(file_list).flatten(), which processes each one-item stream in order. For remote blob stores, where each head call can take tens to hundreds of milliseconds, this sequential behavior significantly increased the time to create the physical plan.
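To give a rough sense of scale, here is a back-of-the-envelope model of the time spent on head calls during planning. It is only a sketch: the function name, the file count, the 50 ms latency, and the in-flight limit of 32 are illustrative assumptions rather than measurements, and the model ignores network variance and per-request overhead.

```rust
/// Hypothetical model: `files` head calls of `head_ms` each, with at most
/// `in_flight` requests running at once (1 means fully sequential).
fn estimated_head_latency_ms(files: u64, head_ms: u64, in_flight: u64) -> u64 {
    // Treat a zero limit as sequential to avoid dividing by zero.
    let in_flight = in_flight.max(1);
    // Requests complete in "waves" of at most `in_flight` calls (ceiling division).
    let waves = (files + in_flight - 1) / in_flight;
    waves * head_ms
}

fn main() {
    // Assumed workload: 100 files, 50 ms per head call.
    let (files, head_ms) = (100, 50);
    let sequential = estimated_head_latency_ms(files, head_ms, 1);
    // 32 is an assumed in-flight limit, chosen for illustration.
    let bounded = estimated_head_latency_ms(files, head_ms, 32);
    println!("sequential: {sequential} ms, bounded: {bounded} ms");
}
```

Under these assumptions the sequential case costs 100 × 50 = 5000 ms, while 32 requests in flight need 4 waves, or about 200 ms.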

What changes are included in this PR?

This commit replaces the sequential flattening with concurrent merging using stream::iter(file_list).flatten_unordered(meta_fetch_concurrency). With this change, the head requests run concurrently (up to the configured meta_fetch_concurrency limit), which reduces latency when creating the physical plan.
The ordering loss introduced by flatten_unordered is acceptable because the file list is fully sorted by path in split_files before being returned.
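The ordering argument can be illustrated with a small dependency-free sketch. It deliberately uses OS threads and a channel instead of the futures stream combinators from the actual change, so it is an analogy and not the ListingTable code. `fake_head` and `fetch_all` are hypothetical names, and the sleep stands in for a remote head call.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a remote head call: sleeps, then returns (path, size).
fn fake_head(path: &str) -> (String, u64) {
    // Vary the delay so completion order can differ from submission order.
    let delay_ms = 5 + (path.len() as u64 % 3) * 5;
    thread::sleep(Duration::from_millis(delay_ms));
    (path.to_string(), path.len() as u64)
}

/// Fetches metadata with at most `concurrency` calls in flight, then sorts by path.
fn fetch_all(paths: Vec<String>, concurrency: usize) -> Vec<(String, u64)> {
    let queue = Arc::new(Mutex::new(paths.into_iter()));
    let (tx, rx) = mpsc::channel();
    let workers: Vec<_> = (0..concurrency.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement, before the slow call.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(path) => tx.send(fake_head(&path)).unwrap(),
                    None => break,
                }
            })
        })
        .collect();
    drop(tx); // Close the channel once every worker has finished.

    // Results arrive in completion order, which is not deterministic.
    let mut metas: Vec<(String, u64)> = rx.iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }
    // Sorting by path afterwards restores a deterministic order.
    metas.sort_by(|a, b| a.0.cmp(&b.0));
    metas
}

fn main() {
    let paths: Vec<String> = ["data/c.parquet", "data/a.parquet", "data/bb.parquet"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    for (path, size) in fetch_all(paths, 4) {
        println!("{path}: {size}");
    }
}
```

Because the final sort depends only on the paths, the result is the same whether the limit is 1 or 4; only the time to collect the metadata changes.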

Are these changes tested?

No tests were added in this fork, to reduce the risk of future git conflicts. Tests have been updated in the upstream PR apache#14918.

Are there any user-facing changes?

No user-facing changes besides reducing the latency in this particular situation.

@geoffreyclaude geoffreyclaude force-pushed the geoffrey.claude/concurrent_heads branch from c0290b4 to 6da3517 Compare February 27, 2025 14:15
When scanning an exact list of remote Parquet files, the ListingTable was fetching file metadata (via head calls) sequentially. This was due to using `stream::iter(file_list).flatten()`, which processes each one-item stream in order. For remote blob stores, where each head call can take tens to hundreds of milliseconds, this sequential behavior significantly increased the time to create the physical plan.

This commit replaces the sequential flattening with concurrent merging using `stream::iter(file_list).flatten_unordered(meta_fetch_concurrency)`. With this change, the `head` requests run concurrently (up to the configured `meta_fetch_concurrency` limit), which reduces latency when creating the physical plan.
The ordering loss introduced by `flatten_unordered` is acceptable because the file list is fully sorted by path in `split_files` before being returned.

Additionally, tests have been updated to ensure that metadata fetching occurs concurrently.
@geoffreyclaude geoffreyclaude force-pushed the geoffrey.claude/concurrent_heads branch from 6da3517 to bbfd5b0 Compare February 27, 2025 20:14
@gabotechs gabotechs merged commit 262c9fb into branch-45 Mar 3, 2025
gabotechs pushed a commit that referenced this pull request Mar 4, 2025
…#2)
gabotechs pushed a commit that referenced this pull request Mar 4, 2025
…#2)
@gabotechs gabotechs changed the title Fix sequential metadata fetching in ListingTable causing high latency [branch-45] Fix sequential metadata fetching in ListingTable causing high latency Mar 20, 2025
@geoffreyclaude geoffreyclaude deleted the geoffrey.claude/concurrent_heads branch December 8, 2025 09:37