Multi-GPU implementation of Q21 #758

wence- · 2025-12-17T17:16:31Z

The general data flow is to read the lineitem table twice, with a different column selection each time, which minimises memory pressure. We also use a latch-based system so that the "wide" read is only released once we're ready to process it.
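To illustrate the latch idea, here is a minimal sketch in plain Python `asyncio`. It is not the code in this PR and does not use the rapidsmpf API: `narrow_read`, `wide_read`, and `run_pipeline` are hypothetical names, and an `asyncio.Event` stands in for whatever latch mechanism the implementation actually uses.

```python
import asyncio


async def narrow_read(latch: asyncio.Event, log: list) -> None:
    # First pass: read only the few columns needed early in the query.
    log.append("narrow read: start")
    await asyncio.sleep(0)  # stand-in for real I/O and processing
    log.append("narrow read: done")
    latch.set()  # signal that downstream is ready for the wide read


async def wide_read(latch: asyncio.Event, log: list) -> None:
    # Second pass: do not start until the latch opens, so no wide
    # chunks sit in memory waiting for a consumer.
    await latch.wait()
    log.append("wide read: start")


async def run_pipeline() -> list:
    latch = asyncio.Event()
    log = []
    # The wide read is scheduled first on purpose: the ordering is
    # enforced by the latch, not by task creation order.
    await asyncio.gather(wide_read(latch, log), narrow_read(latch, log))
    return log
```

Running `asyncio.run(run_pipeline())` shows the wide read starting only after the narrow read has finished, even though it was scheduled first.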

Similarly, we avoid shuffling all of the large orders table by pre-filtering it with a bloom filter before the shuffle. As with the lineitem read, the arrival of the bloom filter acts as a latch: we only start reading the orders table once we're ready to process it, rather than reading up front and waiting with data in device memory.
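The pre-filtering step can be sketched roughly as follows. This is a toy, pure-Python bloom filter, not the GPU implementation in this PR; `BloomFilter`, `prefilter_rows`, and `build_keys` are hypothetical names, and which side of the join builds the filter is an assumption here. The point is only that rows whose key cannot possibly match are dropped before the (expensive) shuffle, at the cost of a few false positives.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key: int):
        # Derive `num_hashes` bit positions by hashing the key with a seed.
        for seed in range(self.num_hashes):
            data = f"{seed}:{key}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def prefilter_rows(rows, build_keys):
    """Keep only rows whose key (first field) might match `build_keys`."""
    bloom = BloomFilter()
    for key in build_keys:
        bloom.add(key)
    return [row for row in rows if bloom.might_contain(row[0])]
```

Every row with a key in `build_keys` survives the filter; a small number of non-matching rows may also survive and are discarded later by the join itself.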

Also includes two small fixes to the ndsh utilities for problems I noticed while implementing the query.

TomAugspurger · 2025-12-17T18:16:32Z

One question (which I'll look into if I get a chance): have you tried tuning num_producers in the read parquet nodes at all? In this nsys profile, there's an idle thread near the start of the query, which lines up with another thread lower down doing its own read_parquet, which makes me think we're using up all the tickets we've configured it to use.

But if we produce too much data too fast, we'll exhaust our memory, and I haven't looked at what files are being read to know whether reading it earlier would improve the overall runtime.
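For readers unfamiliar with the trade-off, here is a rough sketch of ticket-limited reading in plain `asyncio`. It is not how `num_producers` is implemented in the read parquet nodes: `read_chunks` and `num_tickets` are hypothetical names, and a semaphore stands in for the ticket pool. More tickets mean more reads in flight (less idle time) but also more data resident at once.

```python
import asyncio


async def read_chunks(num_chunks: int, num_tickets: int) -> int:
    """Read `num_chunks` chunks with at most `num_tickets` in flight.

    Returns the peak number of simultaneously in-flight reads.
    """
    tickets = asyncio.Semaphore(num_tickets)
    in_flight = 0
    peak = 0

    async def read_one() -> None:
        nonlocal in_flight, peak
        async with tickets:  # take a ticket; wait if none are free
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the actual read
            in_flight -= 1

    await asyncio.gather(*(read_one() for _ in range(num_chunks)))
    return peak
```

The returned peak never exceeds `num_tickets`, which is the knob that bounds memory use; if every ticket is taken, any additional reader simply waits, which would look like an idle thread in a profile.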

wence- · 2025-12-17T20:43:38Z

> One question (which I'll look into if I get a chance): have you tried tuning num_producers in the read parquet nodes at all? In this nsys profile, there's an idle thread near the start of the query, which lines up with another thread lower down doing its own read_parquet, which makes me think we're using up all the tickets we've configured it to use.
>
> But if we produce too much data too fast, we'll exhaust our memory, and I haven't looked at what files are being read to know whether reading it earlier would improve the overall runtime.

I did a bit, but not a lot. You're right that too many tickets blow through memory, though.

wence- added 3 commits December 17, 2025 16:59

Fix broadcast join sequence numbers

eb3faab

Fix spill device limit parsing

2da51f7

Forgot to divide by 100.

Implement q21

34c74a9

wence- requested review from a team as code owners December 17, 2025 17:16
