Implement the new tuning API for DeviceScan#7565
Implement the new tuning API for DeviceScan#7565griwes wants to merge 42 commits intoNVIDIA:mainfrom
DeviceScan#7565Conversation
This comment has been minimized.
This comment has been minimized.
bernhardmgruber
left a comment
There was a problem hiding this comment.
This looks really good already! Great work!
…feature/new-tuning-api/scan
|
Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel and that will be the primary test right now. I'll lease a machine with a relevant GPU if that fails, or if there's anything that's clearly wrong to someone's eyes in review. Edit: also seems I messed up some constexprness 😅 |
This comment has been minimized.
This comment has been minimized.
|
I have been thinking a bit about how the check whether a single stage fits into 48KiB SMEM, and I wondered whether we actually need this check in CCCL.C. The main purpose of the check is to ensure forward compatibility of compiled binaries. So if you compile for The second reason we have this check is that a user could provide us with an input type, or an accumulator type (as dictated by the scan operator), that is so huge that we go beyond 48KiB SMEM even with a conservative tuning policy, and we should just fall back to the old scan, because it's not possible to run the warpspeed scan. Now I wondered, is the set of types that CCCL.C will use open or closed? Because if we know all types that warpspeed scan will be used from CCCL.C, we can just test if it fits into SMEM in a unit test and entirely omit the entire compile time checking for CCCL.C. We would just drop the SMEM check from the |
|
I just realized we still need the runtime computation to know how much SMEM we must request :S |
This comment has been minimized.
This comment has been minimized.
|
There is SASS changes. Here's a random assortment of kernels compared: https://gist.github.com/griwes/a94e3daf0d2b58faaeebea1932e0c1b0. I believe that there's a whole bunch of codegen artifacts here + some loss/gain of uniform instructions (presumably because the changes made it both easier and harder for the compiler to reason about uniformity...). I have not spotted any significant changes in the hot paths. There's also two specific cases that seem to now be producing LMEM instructions, though as far as I can tell it's not in the hot loop either: https://gist.github.com/griwes/e0bc6107675b9a55fc3efabdc7244564. |
This comment has been minimized.
This comment has been minimized.
This currently makes thrust.test.scan fail, which needs to be investigated, since it worked before in the presence of the warpspeed implementation
|
While investigating the SASS changes, I noticed that some symbols in the CUB benchmarks contained the use of |
|
There are now no SASS changes for |
🥳 CI Workflow Results🟩 Finished in 2h 59m: Pass: 100%/306 | Total: 11d 06h | Max: 2h 58m | Hits: 74%/272688See results here. |
|
pre-commit.ci autofix |
|
/ok to test 3e55bd0 |
There was a problem hiding this comment.
I have collected a few more pieces of refactorings, but I think those should go to a separate PR after this one.
I dislike some of the host code butchery that was required for CCCL.C, but in most cases I don't see how it could have been done better.
Since there are no SASS diffs and the tests pass, I think this is good to go in!
Description
Resolves #7521
Resolves #7476
Resolves #6821
Ready for review, still planning to do SASS inspection in some crucial places.
Sidenote: this exact type of task seems to fit Codex really, really well.
Checklist