Conversation
|
Excellent PR, thank you for doing it! Please add a comment in #2036 to mention that TB maps need to be added as a pass. Part of the reason it's crashing, I assume, is that you didn't use a proper recursive test for the existence of TB maps. Please check cuda.py (or infer_types.py) for how to use the existing functionality. |
Yes, it is probably that. I used a variation of this transformation for the auto-tiling passes, therefore, I think the transformation should exist. We could have a pass version of the same transformation such that it aligns in style with other preprocessing passes, then we can have both. |
I have a new question, I updated to check If we have a 2D map [0:N, 0:N] and pass block size [32, 8, 2] I would expect the generated code to be launched [32, 8]. But of course, this is the default value, I guess the correct way for this pass to collapse if the default value is used from config, but if we are using a value set by the user we should check input dimensions match the map dimensions (if 1D map, then tblock is also 1 dimensional at max) |
| print_warning = True | ||
| self.thread_block_size_y = 1 | ||
| self.thread_block_size_z = 1 | ||
| new_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z) |
There was a problem hiding this comment.
I am thinking that it would be simpler to write something similar to:
old_block = (self.thread_block_size_x, self.thread_block_size_y, self.thread_block_size_z)
new_block = list(old_block)
for d in range(3, num_dims_in_map, -1):
new_block[d-1] *= new_block[d]
new_block[d] = 1
new_block = tuple(new_block)
if new_block != old_block:
warnings.warn ...It may not be immediately as readable as the current code though, so this just a suggestion.
I think this is a UX issue. I like the collapsing approach, but I have no strong opinion here. I also think that we should consider revisiting how thread-blocks are matched to map dimensions in light of new CUDA developments. |
I think collapsing is also fine, I think the issues are currently related to GPU_Persistent maps (and GPU Device maps within) |
I am testing how many existing test cases would break if we were to add a "thread block" map to all gpu kernels, if it does not exist, to enable simplification on codegen by removing case distinctions. [WIP]