New tuning API for DeviceScanByKey by griwes · Pull Request #8164 · NVIDIA/cccl

griwes · 2026-03-25T03:07:12Z

Description

Implements the new, improved tuning API for ScanByKey.

Resolves #7640.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

griwes · 2026-03-25T03:18:27Z

Note: verifying SASS is still TODO, will get to it shortly.

griwes · 2026-03-25T06:42:00Z

After the above policy realignment, there are no more SASS differences.

griwes · 2026-03-25T08:22:40Z

Note: it turns out there's still more work to do on the policy values here.

griwes · 2026-03-25T20:25:42Z

I think that fixes the policies; there are currently still no SASS diffs.

github-actions · 2026-03-25T22:41:08Z

🥳 CI Workflow Results

🟩 Finished in 1h 55m: Pass: 100%/249 | Total: 8d 06h | Max: 1h 32m | Hits: 72%/160202

See results here.

bernhardmgruber · 2026-03-25T10:15:54Z

cub/cub/device/dispatch/dispatch_scan_by_key.cuh

@@ -119,7 +125,37 @@ template <typename ChainedPolicyT,
          typename OffsetT,
          typename AccumT,
          typename KeyT = cub::detail::it_value_t<KeysInputIteratorT>>


Important: those parameters are unused, let's remove them.

bernhardmgruber · 2026-03-25T10:16:35Z

cub/cub/device/dispatch/dispatch_scan_by_key.cuh

+    OffsetT,
+    AccumT,
+    KeyT>(),
+  1)


Important: This 1 was not there before and could impact SASS. Maybe stick to the old arguments and remove it.

bernhardmgruber · 2026-03-25T10:21:20Z

cub/cub/device/dispatch/dispatch_scan_by_key.cuh

+  template <typename PolicyGetter, typename PolicySelectorT>
+  CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t __invoke(PolicyGetter policy_getter, const PolicySelectorT&)


Important: I think PolicySelectorT is unused, so it can be removed again.

Suggested change

-  template <typename PolicyGetter, typename PolicySelectorT>
-  CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t __invoke(PolicyGetter policy_getter, const PolicySelectorT&)
+  template <typename PolicyGetter>
+  CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t __invoke(PolicyGetter policy_getter)

bernhardmgruber · 2026-03-26T10:03:41Z

cub/benchmarks/bench/scan/exclusive/by_key.cu

 #include <nvbench_helper.cuh>

+#include "../policy_selector.h"
+


I don't think the policy selector there handles scan by key, since it returns a scan_policy and not a scan_by_key_policy. I think we need a dedicated policy selector for this benchmark here.
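A mismatch like this can be caught at compile time by checking what the selector returns. The following is only a minimal sketch under stated assumptions: every type and name in it (`scan_policy_sketch`, `scan_by_key_policy_sketch`, `arch_sketch`, both selector structs, and `returns_scan_by_key_policy`) is a hypothetical stand-in and not the actual CUB definition.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the two policy types discussed in the review.
struct scan_policy_sketch
{
  int items_per_thread;
};
struct scan_by_key_policy_sketch
{
  int items_per_thread;
};

// Hypothetical architecture enum, not ::cuda::arch_id.
enum class arch_sketch
{
  sm_80 = 80
};

// A selector written for plain scan: it returns the wrong policy type for scan by key.
struct scan_selector_sketch
{
  constexpr scan_policy_sketch operator()(arch_sketch) const
  {
    return {8}; // placeholder value, not a real tuning
  }
};

// A dedicated selector for scan by key.
struct scan_by_key_selector_sketch
{
  constexpr scan_by_key_policy_sketch operator()(arch_sketch) const
  {
    return {9}; // placeholder value, not a real tuning
  }
};

// True only if calling the selector yields the scan-by-key policy type.
template <typename SelectorT>
constexpr bool returns_scan_by_key_policy =
  std::is_same<decltype(std::declval<const SelectorT&>()(arch_sketch::sm_80)), scan_by_key_policy_sketch>::value;
```

A benchmark could place a `static_assert` on such a trait next to the selector it includes, so that picking up the wrong header fails to compile instead of silently benchmarking a different tuning.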

bernhardmgruber · 2026-03-26T10:11:07Z

cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh

+      if (primitive_op_t == primitive_op::yes)
+      {
+        switch (key_size)
+        {
+          case 1:
+            switch (value_size)
+            {
+              case 1:
+                if (primitive_value_t == primitive_accum::yes)
+                {


Suggestion: can we simplify this? Maybe it would be more readable if we did if/else clauses like:

Suggested change

-      if (primitive_op_t == primitive_op::yes)
-      {
-        switch (key_size)
-        {
-          case 1:
-            switch (value_size)
-            {
-              case 1:
-                if (primitive_value_t == primitive_accum::yes)
-                {
+      const bool prim_op = primitive_op_t == primitive_op::yes;
+      const bool prim_val = primitive_value_t == primitive_accum::yes;
+      if (prim_op && key_size == 1 && value_size == 1 && prim_val) {
+        ...
+      }
+      else if (prim_op && key_size == 1 && value_size == 2 && prim_val) {
+        ...
+      }
+      ...
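For illustration, here is a self-contained version of the flattened if/else shape suggested above. It is a sketch, not the PR's code: the enums mirror the names in the snippet but are redefined locally, `tuning_sketch` and `select_tuning` are hypothetical, and the numbers are placeholders rather than real tunings.

```cpp
// Local stand-ins that mirror the names in the review snippet; NOT the CUB definitions.
enum class primitive_op
{
  no,
  yes
};
enum class primitive_accum
{
  no,
  yes
};

// Hypothetical tuning record with placeholder fields.
struct tuning_sketch
{
  int block_threads;
  int items_per_thread;
};

constexpr tuning_sketch
select_tuning(primitive_op primitive_op_t, primitive_accum primitive_value_t, int key_size, int value_size)
{
  // Evaluate the shared conditions once, then list one flat condition per tuning.
  const bool prim_op  = primitive_op_t == primitive_op::yes;
  const bool prim_val = primitive_value_t == primitive_accum::yes;

  if (prim_op && key_size == 1 && value_size == 1 && prim_val)
  {
    return {128, 12}; // placeholder numbers
  }
  else if (prim_op && key_size == 1 && value_size == 2 && prim_val)
  {
    return {128, 10}; // placeholder numbers
  }

  // Anything without a dedicated tuning falls back to a default.
  return {256, 8}; // placeholder numbers
}
```

Each branch now states its full precondition on one line, at the cost of repeating `prim_op` and `prim_val` in every condition.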

bernhardmgruber · 2026-03-26T10:11:57Z

cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh

+        }
+      }
+
+      arch = ::cuda::arch_id::sm_80;


Important: I think it's very confusing if we change the arch argument during this already very long function. Let's try to avoid that.
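One way to avoid reassigning the argument is to move each architecture's table into its own helper and express the fallback through the order of the checks. The sketch below assumes that structure; `arch_sketch`, `lookup_result`, the `tuning_for_*` helpers, and all numbers are hypothetical and do not come from the PR.

```cpp
// Hypothetical architecture enum, not ::cuda::arch_id.
enum class arch_sketch
{
  sm_80  = 80,
  sm_90  = 90,
  sm_100 = 100
};

// Result of a per-architecture lookup: found is false when no tuning exists for the inputs.
struct lookup_result
{
  bool found;
  int items_per_thread;
};

// Each helper owns the table for exactly one architecture (placeholder contents).
constexpr lookup_result tuning_for_sm90(int key_size)
{
  return key_size == 1 ? lookup_result{true, 12} : lookup_result{false, 0};
}

constexpr lookup_result tuning_for_sm80(int key_size)
{
  return key_size <= 8 ? lookup_result{true, 9} : lookup_result{false, 0};
}

constexpr int select_items_per_thread(arch_sketch arch, int key_size)
{
  // arch is read but never reassigned; a miss simply continues to the next, older table.
  if (arch >= arch_sketch::sm_90)
  {
    const lookup_result r = tuning_for_sm90(key_size);
    if (r.found)
    {
      return r.items_per_thread;
    }
  }
  if (arch >= arch_sketch::sm_80)
  {
    const lookup_result r = tuning_for_sm80(key_size);
    if (r.found)
    {
      return r.items_per_thread;
    }
  }
  return 4; // placeholder default when no table matches
}
```

Splitting the tables this way also shortens the main function, which addresses the length concern raised in the same comment.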

bernhardmgruber · 2026-03-26T10:14:00Z

cub/cub/device/dispatch/dispatch_scan_by_key.cuh

+  template <typename PolicyGetter, typename PolicySelectorT>
+  CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t
+  invoke(PolicyGetter policy_getter, const PolicySelectorT& policy_selector)
+  {
+    return __invoke(policy_getter, policy_selector);
+  }


Q: Why do we need this function? Can't we just call __invoke directly?

bernhardmgruber · 2026-03-26T10:17:30Z

cub/cub/device/dispatch/dispatch_scan_by_key.cuh

+         "Dispatching DeviceScanByKey to arch %d with tuning: %s\n", static_cast<int>(arch_id), ss.str().c_str());))
+#endif
+
+    return detail::dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {


Suggestion: IIUC, the active tuning policy is never needed as a compile-time value, so we can skip dispatch_arch, query the policy directly, and keep it a runtime value:

const scan_by_key_policy active_policy = policy_selector(arch_id);

This would also reduce template instantiations.
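To make the runtime-value idea concrete, here is a host-side sketch in which the policy is a plain struct returned by calling the selector. All names (`arch_sketch`, `scan_by_key_policy_sketch`, `policy_selector_sketch`, `tile_items`, `dispatch_sketch`) and numbers are hypothetical; the sketch does not model how the real dispatch launches kernels, which may still need compile-time values elsewhere.

```cpp
// Hypothetical architecture enum, not ::cuda::arch_id.
enum class arch_sketch
{
  sm_80 = 80,
  sm_90 = 90
};

// Hypothetical policy: an ordinary value type, not a type-level policy.
struct scan_by_key_policy_sketch
{
  int block_threads;
  int items_per_thread;
};

// Hypothetical selector: a callable mapping an architecture to a policy value.
struct policy_selector_sketch
{
  constexpr scan_by_key_policy_sketch operator()(arch_sketch arch) const
  {
    // Placeholder numbers, not real tunings.
    return arch >= arch_sketch::sm_90 ? scan_by_key_policy_sketch{128, 12} : scan_by_key_policy_sketch{256, 8};
  }
};

// Consumes the policy as a normal argument, so it is not instantiated once per policy.
constexpr int tile_items(const scan_by_key_policy_sketch& policy)
{
  return policy.block_threads * policy.items_per_thread;
}

template <typename PolicySelectorT>
constexpr int dispatch_sketch(const PolicySelectorT& policy_selector, arch_sketch arch_id)
{
  // Same shape as the line suggested in the review: one call, one runtime value.
  const scan_by_key_policy_sketch active_policy = policy_selector(arch_id);
  return tile_items(active_policy);
}
```

In this shape the only template parameter left is the selector type, so the code that consumes the policy is instantiated once per selector rather than once per architecture.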

New tuning API for DeviceScanByKey

bc76d6a

griwes requested review from a team as code owners March 25, 2026 03:07

griwes requested review from davebayer and shwina March 25, 2026 03:07

github-project-automation bot added this to CCCL Mar 25, 2026

github-project-automation bot moved this to Todo in CCCL Mar 25, 2026

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 25, 2026

griwes added 2 commits March 24, 2026 23:38

Clean up no longer used variable.

a87cf35

Policy realignment with policy_hub.

d8d2492

griwes added 2 commits March 25, 2026 15:19

Fix policy snafus.

d1341bc

Transplant policy benchmarking comments.

fd4af22

griwes added 2 commits March 25, 2026 15:31

Merge remote-tracking branch 'origin/main' into feature/new-tuning-ap…

5c2813f

…i/scan-by-key

Fix incomplete merge.

14d51c7

griwes enabled auto-merge (squash) March 25, 2026 23:13

bernhardmgruber reviewed Mar 26, 2026
