Implement the new tuning API for `DeviceScan` by griwes · Pull Request #7565 · NVIDIA/cccl

griwes · 2026-02-08T05:44:48Z

Description

Resolves #7521
Resolves #7476
Resolves #6821

Ready for review, still planning to do SASS inspection in some crucial places.

Sidenote: this exact type of task seems to fit Codex really, really well.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

bernhardmgruber

This looks really good already! Great work!

c/parallel/src/scan.cu

cub/benchmarks/bench/scan/policy_selector.h

cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh

cub/cub/device/dispatch/dispatch_scan.cuh

cub/cub/device/dispatch/tuning/tuning_scan.cuh

bernhardmgruber · 2026-02-10T22:03:07Z

@griwes we just merged #6811, which also touches the scan tunings. This will probably create some more work for this PR. Issue #6821 also tracks making the new scan implementation available to CCCL.C. Do you think you can handle this as well?

bernhardmgruber · 2026-02-13T13:03:55Z

@griwes I pulled out the delay constructor refactoring in #7668 so I can better stack my refactorings on top, in case this PR takes a bit longer (sorry again for the extra work with warpspeed!)

griwes · 2026-02-18T23:52:04Z

Note, the warpspeed integration is still largely untested; I've added an rtxpro6000 test job to c.parallel and that will be the primary test right now. I'll lease a machine with a relevant GPU if that fails, or if there's anything that's clearly wrong to someone's eyes in review.

Edit: also seems I messed up some constexprness 😅

bernhardmgruber · 2026-03-13T14:04:53Z

I have been thinking a bit about the check whether a single stage fits into 48KiB SMEM, and I wondered whether we actually need this check in CCCL.C. The main purpose of the check is to ensure forward compatibility of compiled binaries. So if you compile for sm_100 today and run that binary in 10 years on a GPU that really only has 48KiB SMEM, it should still work. We don't need that guarantee for CCCL.C, since we don't keep around binaries.

The second reason we have this check is that a user could provide us with an input type, or an accumulator type (as dictated by the scan operator), that is so huge that we go beyond 48KiB SMEM even with a conservative tuning policy, and we should just fall back to the old scan, because it's not possible to run the warpspeed scan.

Now I wondered, is the set of types that CCCL.C will use open or closed? Because if we know all types that the warpspeed scan will be used with from CCCL.C, we can just test whether it fits into SMEM in a unit test and omit the compile-time check for CCCL.C entirely. We would just drop the SMEM check from the scan_use_warpspeed predicate. That would make this PR a lot simpler.
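To make the idea concrete, here is a minimal sketch of such a compile-time fit check, together with a stand-in for the proposed unit test over a closed set of types. Everything in it is an assumption made for illustration: the `sketch` namespace, the function names, the 128x4 tile shape, and the per-stage layout (one tile of inputs plus one accumulator per thread) are hypothetical and are not taken from the CUB sources or from the actual scan_use_warpspeed predicate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch, not the CUB implementation: all names are invented for illustration.
namespace sketch
{
// Conservative budget from the discussion: 48 KiB of shared memory per block.
constexpr std::size_t conservative_smem_bytes = 48 * 1024;

// Assumed per-stage layout: one tile of inputs plus one accumulator per thread.
template <typename InputT, typename AccumT>
constexpr std::size_t stage_bytes(int threads, int items_per_thread)
{
  return sizeof(InputT) * static_cast<std::size_t>(threads) * static_cast<std::size_t>(items_per_thread)
       + sizeof(AccumT) * static_cast<std::size_t>(threads);
}

// True if a single stage fits into the conservative budget.
template <typename InputT, typename AccumT>
constexpr bool single_stage_fits(int threads, int items_per_thread)
{
  return stage_bytes<InputT, AccumT>(threads, items_per_thread) <= conservative_smem_bytes;
}

// A deliberately oversized element type, standing in for a huge user-provided type.
struct huge_t
{
  char bytes[4096];
};

// Compile-time flavor of the check: small types pass, the huge type would force a fallback.
static_assert(single_stage_fits<int, int>(128, 4), "int tile should fit");
static_assert(!single_stage_fits<huge_t, huge_t>(128, 4), "huge tile should not fit");

// Unit-test flavor: if the set of types is closed, each one can be checked once.
inline void check_closed_type_set()
{
  assert((single_stage_fits<char, int>(128, 4)));
  assert((single_stage_fits<float, float>(128, 4)));
  assert((single_stage_fits<double, double>(128, 4)));
}
} // namespace sketch
```

With a closed type set, only something like `check_closed_type_set` would need to run in a test; with an open set, the predicate has to stay in the dispatch path so that oversized types fall back to the old scan.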

bernhardmgruber · 2026-03-13T14:28:12Z

I just realized we still need the runtime computation to know how much SMEM we must request :S
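A rough sketch of what that runtime computation could look like, under stated assumptions: the `sketch_rt` namespace, the `smem_request` struct, and `compute_smem_request` are hypothetical names, and the policy of "take as many stages as fit, up to a maximum" is an illustrative guess rather than what the warpspeed scan actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: names and the stage-selection policy are assumptions, not CUB code.
namespace sketch_rt
{
struct smem_request
{
  int stages;        // number of pipeline stages that fit
  std::size_t bytes; // dynamic shared memory to request at launch
};

// Picks the largest stage count (up to max_stages) that fits into the device's limit.
// Returns {0, 0} if not even one stage fits, which would mean falling back to the old scan.
inline smem_request compute_smem_request(std::size_t bytes_per_stage, int max_stages, std::size_t device_limit_bytes)
{
  assert(bytes_per_stage > 0 && max_stages > 0);
  const std::size_t fitting = device_limit_bytes / bytes_per_stage;
  const int stages          = static_cast<int>(std::min<std::size_t>(fitting, static_cast<std::size_t>(max_stages)));
  return smem_request{stages, static_cast<std::size_t>(stages) * bytes_per_stage};
}
} // namespace sketch_rt
```

In a real launch path, the resulting `bytes` would be passed as the dynamic shared memory size of the kernel launch. Requests above 48 KiB additionally require opting in through `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize`, and the device limit could be queried via `cudaDevAttrMaxSharedMemoryPerBlockOptin`.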

griwes · 2026-03-16T21:56:16Z

There are SASS changes. Here's a random assortment of kernels compared: https://gist.github.com/griwes/a94e3daf0d2b58faaeebea1932e0c1b0. I believe that there's a whole bunch of codegen artifacts here + some loss/gain of uniform instructions (presumably because the changes made it both easier and harder for the compiler to reason about uniformity...). I have not spotted any significant changes in the hot paths.

There are also two specific cases that now seem to produce LMEM instructions, though as far as I can tell they are not in the hot loop either: https://gist.github.com/griwes/e0bc6107675b9a55fc3efabdc7244564.

bernhardmgruber · 2026-03-23T14:05:35Z

While investigating the SASS changes, I noticed that some symbols in the CUB benchmarks contained the use of policy_selector_from_hub, which we should no longer see (its only use is to support users directly accessing the dispatcher). I found out that those came from Thrust, so I ported the Thrust CUB backend to use cub::detail::scan::dispatch directly. But that is leading to test failures now in thrust.test.scan, which is super odd.

cub/cub/device/dispatch/kernels/kernel_scan.cuh

bernhardmgruber · 2026-03-23T14:57:28Z

There are now no SASS changes for cub.bench.scan.exclusive.sum.base on SM75;80;86;90;100;120

github-actions · 2026-03-23T18:09:00Z

🥳 CI Workflow Results

🟩 Finished in 2h 59m: Pass: 100%/306 | Total: 11d 06h | Max: 2h 58m | Hits: 74%/272688

bernhardmgruber · 2026-03-23T18:28:12Z

pre-commit.ci autofix

bernhardmgruber · 2026-03-23T18:39:25Z

/ok to test 3e55bd0

bernhardmgruber

I have collected a few more refactorings, but I think those should go into a separate PR after this one.

I dislike some of the host code butchery that was required for CCCL.C, but in most cases I don't see how it could have been done better.

Since there are no SASS diffs and the tests pass, I think this is good to go in!

griwes added 5 commits February 7, 2026 20:58

Base changes in scan and tests.

ad1c1df

Update benchmarks.

6371339

Update copyright years.

9e346b5

Merge remote-tracking branch 'origin/main' into feature/new-tuning-api/scan

910d511

c.parallel: centralize the handling of common cub types.

e9467af

griwes requested review from a team as code owners February 8, 2026 05:44

griwes requested a review from shwina February 8, 2026 05:44

github-project-automation bot added this to CCCL Feb 8, 2026

griwes requested a review from elstehle February 8, 2026 05:44

github-project-automation bot moved this to Todo in CCCL Feb 8, 2026

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 8, 2026

bernhardmgruber reviewed Feb 9, 2026

bernhardmgruber mentioned this pull request Feb 9, 2026

Move delay constructor policies somewhere central #7530

Closed

griwes added 2 commits February 12, 2026 18:09

Resolve review comments.

c4c0c09

Fix c.parallel radix_sort breakage.

2c2db7c

This was referenced Feb 13, 2026

Implement the new tuning API for Dispatch[Streaming]ReduceByKey #7667

Merged

Centralize delay_constructor policy helpers #7668

Merged

griwes added 2 commits February 19, 2026 00:44

integrate warpspeed: Merge remote-tracking branch 'origin/main' into feature/new-tuning-api/scan

a288da0

Merge remote-tracking branch 'origin/main' into feature/new-tuning-api/scan

9eec3e2

griwes added 4 commits February 19, 2026 01:15

Compilation fixes.

2a0ddf4

Go through dispatch_arch, unify dispatch paths for scan.

22ece56

Remove cuda::std::optional from policies.

d7f5333

Pull scan_warpspeed_policy out into its own file.

497638c

griwes added 4 commits March 12, 2026 20:56

Codegen fixes.

565a017

Review comments.

9cd3ca0

Merge remote-tracking branch 'origin/main' into feature/new-tuning-api/scan

f6af88d

More abstraction layers to restore constexprness.

81b7a7f

Correctly check for the constants.

029b195

griwes added 4 commits March 13, 2026 11:41

Another abstraction layer, to remove a constexpr reference to this.

ac03691

I kinda hate this but I think it has to be like this.

a6ed3cd

Silence a warning.

3bb8169

Silence MSVC unreachable code warning.

5dbcfd6

Merge remote-tracking branch 'origin/main' into feature/new-tuning-api/scan

203e807

Move Thrust to new detail::scan::dispatch*

59f726a

This currently makes thrust.test.scan fail, which needs to be investigated, since it worked before in the presence of the warpspeed implementation

bernhardmgruber requested a review from a team as a code owner March 23, 2026 14:02

FIX

3cd8c7d

bernhardmgruber reviewed Mar 23, 2026

cub/cub/device/dispatch/kernels/kernel_scan.cuh (outdated, resolved)

Fix const

e48eb66

This was referenced Mar 23, 2026

Fix OOB in warpspeed scan kernel on last partial tiles #8134

Draft

[BUG] warpspeed scan causes OOB reads in some Thrust tests #8136

Open

Merge branch 'main' into feature/new-tuning-api/scan

3e55bd0

bernhardmgruber approved these changes Mar 23, 2026
