Multi-GPU implementation of Q21 #758
Open
+972
−8

The general data flow reads the lineitem table twice, with a different column selection each time, which minimises memory pressure. A latch holds back the "wide" read so that it is only released once we are ready to process it.
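To illustrate the latch idea in isolation, here is a minimal Python sketch (not the code in this PR). It assumes a toy setup in which the first, narrow pass collects a set of keys and the second, wide pass needs that set before it can do anything useful. The function name `run_two_pass_read`, the chunk layout and the use of `threading.Event` as the latch are all hypothetical choices made for the example.

```python
import threading


def run_two_pass_read(narrow_chunks, wide_chunks):
    """Toy two-pass read: the wide pass is gated on a latch set by the narrow pass.

    narrow_chunks: list of lists of keys (stand-in for a narrow column selection).
    wide_chunks:   list of lists of (key, payload) rows (stand-in for the wide read).
    Returns a log of (pass_name, row_count) entries in the order work happened.
    """
    ready_for_wide = threading.Event()  # the "latch" (assumption: one-shot event)
    keys = set()
    log = []

    def narrow_pass():
        for chunk in narrow_chunks:
            keys.update(chunk)
            log.append(("narrow", len(chunk)))
        # Only now is the consumer ready, so release the wide read.
        ready_for_wide.set()

    def wide_pass():
        # Block before reading anything, so no wide data sits in memory unused.
        ready_for_wide.wait()
        for chunk in wide_chunks:
            kept = [row for row in chunk if row[0] in keys]
            log.append(("wide", len(kept)))

    threads = [threading.Thread(target=wide_pass), threading.Thread(target=narrow_pass)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Even though the wide thread is started first, it does no work until the latch is set, so every "narrow" entry in the log precedes every "wide" entry.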
Similarly, we avoid shuffling the whole of the large orders table by pre-filtering it with a bloom filter before the shuffle. As with the lineitem read, the arrival of the bloom filter acts as a latch: the orders read is only released when we are ready to process it, rather than reading up front and waiting with the data held in device memory.
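The pre-filtering step can be sketched in plain Python as follows. This is an illustration under assumptions, not the implementation in this PR: the `BloomFilter` class, its sizing, the hash scheme and `prefilter_orders` are hypothetical, and the sketch assumes the filter is built from the order keys that survive on the lineitem side. The property that matters is that a bloom filter can report false positives but never false negatives, so dropping rows it rejects before the shuffle cannot lose a matching order.

```python
import hashlib


class BloomFilter:
    """Small illustrative bloom filter (sizes chosen arbitrarily for the example)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                repr(key).encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def prefilter_orders(orders, lineitem_orderkeys):
    """Keep only order rows whose key might appear on the lineitem side.

    orders: list of (orderkey, payload) rows.
    The result may contain a few false positives, which the later join removes.
    """
    bloom = BloomFilter()
    for key in lineitem_orderkeys:
        bloom.add(key)
    return [row for row in orders if bloom.might_contain(row[0])]
```

Only the rows returned by `prefilter_orders` would then need to be shuffled, which is typically a small fraction of the table when the filter is selective.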
Also includes two small fixes to the ndsh utilities that I noticed while implementing the query.