Skip to content

Implement the new tuning API for DeviceScan#7565

Open
griwes wants to merge 42 commits intoNVIDIA:mainfrom
griwes:feature/new-tuning-api/scan
Open

Implement the new tuning API for DeviceScan#7565
griwes wants to merge 42 commits intoNVIDIA:mainfrom
griwes:feature/new-tuning-api/scan

Conversation

@griwes
Copy link
Contributor

@griwes griwes commented Feb 8, 2026

Description

Resolves #7521
Resolves #7476
Resolves #6821

Ready for review, still planning to do SASS inspection in some crucial places.

Sidenote: this exact type of task seems to fit Codex really, really well.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners February 8, 2026 05:44
@griwes griwes requested a review from shwina February 8, 2026 05:44
@griwes griwes requested a review from elstehle February 8, 2026 05:44
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 8, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 8, 2026
@github-actions

This comment has been minimized.

Copy link
Contributor

@bernhardmgruber bernhardmgruber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good already! Great work!

@bernhardmgruber
Copy link
Contributor

@griwes we just merged #6811, which also touches the scan tunings. This will probably create some more work for this PR. Issue #6821 also tracks making the new scan implementation available to CCCL.C. Do you think you can handle this as well?

@bernhardmgruber
Copy link
Contributor

@griwes I pulled out the delay constructor refactoring in #7668 so I can better stack my refactorings on top, in case this PR takes a bit longer (sorry again for the extra work with warpspeed!)

@griwes
Copy link
Contributor Author

griwes commented Feb 18, 2026

Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel and that will be the primary test right now. I'll lease a machine with a relevant GPU if that fails, or if there's anything that's clearly wrong to someone's eyes in review.

Edit: also seems I messed up some constexprness 😅

@github-actions

This comment has been minimized.

@bernhardmgruber
Copy link
Contributor

I have been thinking a bit about how the check whether a single stage fits into 48KiB SMEM, and I wondered whether we actually need this check in CCCL.C. The main purpose of the check is to ensure forward compatibility of compiled binaries. So if you compile for sm_100 today and run that binary in 10 years on a GPU that really only has 48KiB SMEM, it should still work. We don't need that guarantee for CCCL.C, since we don't keep around binaries.

The second reason we have this check is that a user could provide us with an input type, or an accumulator type (as dictated by the scan operator), that is so huge that we go beyond 48KiB SMEM even with a conservative tuning policy, and we should just fall back to the old scan, because it's not possible to run the warpspeed scan.

Now I wondered, is the set of types that CCCL.C will use open or closed? Because if we know all types that warpspeed scan will be used from CCCL.C, we can just test if it fits into SMEM in a unit test and entirely omit the entire compile time checking for CCCL.C. We would just drop the SMEM check from the scan_use_warpspeed predicate. That would make this PR a lot simpler.

@bernhardmgruber
Copy link
Contributor

I just realized we still need the runtime computation to know how much SMEM we must request :S

@github-actions

This comment has been minimized.

@griwes
Copy link
Contributor Author

griwes commented Mar 16, 2026

There is SASS changes. Here's a random assortment of kernels compared: https://gist.github.com/griwes/a94e3daf0d2b58faaeebea1932e0c1b0. I believe that there's a whole bunch of codegen artifacts here + some loss/gain of uniform instructions (presumably because the changes made it both easier and harder for the compiler to reason about uniformity...). I have not spotted any significant changes in the hot paths.

There's also two specific cases that seem to now be producing LMEM instructions, though as far as I can tell it's not in the hot loop either: https://gist.github.com/griwes/e0bc6107675b9a55fc3efabdc7244564.

@github-actions

This comment has been minimized.

This currently makes thrust.test.scan fail, which needs to be investigated, since it worked before in the presence of the warpspeed implementation
@bernhardmgruber bernhardmgruber requested a review from a team as a code owner March 23, 2026 14:02
@bernhardmgruber
Copy link
Contributor

While investigating the SASS changes, I noticed that some symbols in the CUB benchmarks contained the use of policy_selector_from_hub, which we should no longer see (its only use to support users directly accessing the dispatcher). I found out that those came from Thrust, so I ported the Thrust CUB backend to use cub::detail::scan::dispatch directly. But that is leading to test failures now in thrust.test.scan, which is super odd.

@bernhardmgruber
Copy link
Contributor

There are now no SASS changes for cub.bench.scan.exclusive.sum.base on SM75;80;86;90;100;120

@github-actions
Copy link
Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 59m: Pass: 100%/306 | Total: 11d 06h | Max: 2h 58m | Hits: 74%/272688

See results here.

@bernhardmgruber
Copy link
Contributor

pre-commit.ci autofix

@bernhardmgruber
Copy link
Contributor

/ok to test 3e55bd0

Copy link
Contributor

@bernhardmgruber bernhardmgruber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have collected a few more pieces of refactorings, but I think those should go to a separate PR after this one.

I dislike some of the host code butchery that was required for CCCL.C, but in most cases I don't see how it could have been done better.

Since there are no SASS diffs and the tests pass, I think this is good to go in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Implement the new tuning API for DeviceScan Refactor cccl.c scan to use tuning API Make warpspeed scan work in CCCL.C

3 participants