[branch-45] Fix sequential metadata fetching in ListingTable causing high latency by geoffreyclaude · Pull Request #2 · DataDog/datafusion

geoffreyclaude · 2025-02-27T13:58:16Z

Which issue does this PR close?

Closes apache/datafusion#14916 (Slow Physical Plan Creation for Remote Parquet Files).

Rationale for this change

When scanning an exact list of remote Parquet files, `ListingTable` fetched file metadata (via `head` calls) sequentially. The cause was `stream::iter(file_list).flatten()`, which drains each one-item stream in order before polling the next one. For remote blob stores, where each `head` call can take tens to hundreds of milliseconds, this sequential behavior significantly increased the time needed to create the physical plan.

What changes are included in this PR?

This commit replaces the sequential flattening with concurrent merging using `stream::iter(file_list).flatten_unordered(meta_fetch_concurrency)`. With this change, the `head` requests run concurrently (up to the configured `meta_fetch_concurrency` limit), reducing latency when creating the physical plan.
Note that the ordering loss introduced by `flatten_unordered` is acceptable because the file list is fully sorted by path in `split_files` before being returned.
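To illustrate why losing the completion order is harmless, here is a minimal sketch of the same idea: fetch metadata with a bounded number of requests in flight, collect results in whatever order they finish, then sort by path. It deliberately uses plain `std` threads rather than the `futures` stream combinators from this PR so that it has no dependencies. `FileMeta`, `fake_head`, and `fetch_all` are hypothetical names invented for this example and are not part of DataFusion.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the metadata returned by a `head` call.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileMeta {
    pub path: String,
    pub size: u64,
}

/// Hypothetical stand-in for a remote `head` request.
/// It sleeps to simulate network latency and returns made-up metadata.
fn fake_head(path: &str, latency: Duration) -> FileMeta {
    thread::sleep(latency);
    FileMeta {
        path: path.to_string(),
        size: path.len() as u64,
    }
}

/// Fetches metadata for every path with at most `concurrency` requests
/// in flight. Results arrive in completion order and are sorted by path
/// at the end, so the returned order does not depend on scheduling.
pub fn fetch_all(paths: Vec<String>, concurrency: usize, latency: Duration) -> Vec<FileMeta> {
    // Shared work queue: each worker pulls the next path when it is free.
    let queue = Arc::new(Mutex::new(paths.into_iter()));
    let (tx, rx) = mpsc::channel();

    let workers: Vec<_> = (0..concurrency.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the slow request below runs without holding the lock.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(path) => tx.send(fake_head(&path, latency)).unwrap(),
                    None => break,
                }
            })
        })
        .collect();
    // Drop the original sender so the receiver ends once all workers finish.
    drop(tx);

    // Completion order: not deterministic when concurrency > 1.
    let mut metas: Vec<FileMeta> = rx.iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }

    // Sorting by path restores a deterministic order, which plays the role
    // that the PR description attributes to `split_files`.
    metas.sort_by(|a, b| a.path.cmp(&b.path));
    metas
}
```

Calling `fetch_all` with a concurrency of 1 behaves like the old sequential path, while a higher limit overlaps the simulated latency. Both return the same sorted list, which is the property the PR relies on.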

Are these changes tested?

No tests were added in this fork, to limit the risk of future git conflicts. Tests have been updated in the upstream PR apache#14918.

Are there any user-facing changes?

No user-facing changes other than reduced latency in this particular situation.

When scanning an exact list of remote Parquet files, the ListingTable was fetching file metadata (via head calls) sequentially. This was due to using `stream::iter(file_list).flatten()`, which processes each one-item stream in order. For remote blob stores, where each head call can take tens to hundreds of milliseconds, this sequential behavior significantly increased the time to create the physical plan. This commit replaces the sequential flattening with concurrent merging using `stream::iter(file_list).flatten_unordered(meta_fetch_concurrency)`. With this change, the `head` requests are executed in parallel (up to the configured `meta_fetch_concurrency` limit), reducing latency when creating the physical plan. Note that the ordering loss introduced by `flatten_unordered` is acceptable because the file list is fully sorted by path in `split_files` before being returned. Additionally, tests have been updated to ensure that metadata fetching occurs concurrently.

geoffreyclaude force-pushed the geoffrey.claude/concurrent_heads branch from c0290b4 to 6da3517 Compare February 27, 2025 14:15

geoffreyclaude force-pushed the geoffrey.claude/concurrent_heads branch from 6da3517 to bbfd5b0 Compare February 27, 2025 20:14

gabotechs approved these changes Mar 3, 2025

gabotechs merged commit 262c9fb into branch-45 Mar 3, 2025

gabotechs mentioned this pull request Mar 4, 2025

[branch-46] Branch 46 update #3

Merged

gabotechs changed the title ~~Fix sequential metadata fetching in ListingTable causing high latency~~ [branch-45] Fix sequential metadata fetching in ListingTable causing high latency Mar 20, 2025

geoffreyclaude deleted the geoffrey.claude/concurrent_heads branch December 8, 2025 09:37