Skip to content

New tuning API for DeviceScanByKey#8164

Open
griwes wants to merge 7 commits intoNVIDIA:mainfrom
griwes:feature/new-tuning-api/scan-by-key
Open

New tuning API for DeviceScanByKey#8164
griwes wants to merge 7 commits intoNVIDIA:mainfrom
griwes:feature/new-tuning-api/scan-by-key

Conversation

@griwes
Copy link
Contributor

@griwes griwes commented Mar 25, 2026

Description

Implements the new, improved, tuning API for ScanByKey.

Resolves #7640.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@griwes griwes requested review from a team as code owners March 25, 2026 03:07
@griwes griwes requested review from davebayer and shwina March 25, 2026 03:07
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 25, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 25, 2026
@griwes
Copy link
Contributor Author

griwes commented Mar 25, 2026

Note: verifying SASS is still TODO, will get to it shortly.

@griwes
Copy link
Contributor Author

griwes commented Mar 25, 2026

After the above policy realignment, there's no more SASS differences.

@griwes
Copy link
Contributor Author

griwes commented Mar 25, 2026

Note, turns out there's still more work on the policy values here.

@github-actions

This comment has been minimized.

@griwes
Copy link
Contributor Author

griwes commented Mar 25, 2026

I think that fixes the policies, currently still no SASS diffs.

@github-actions
Copy link
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 55m: Pass: 100%/249 | Total: 8d 06h | Max: 1h 32m | Hits: 72%/160202

See results here.

@griwes griwes enabled auto-merge (squash) March 25, 2026 23:13
Comment on lines 118 to 127
@@ -119,7 +125,37 @@ template <typename ChainedPolicyT,
typename OffsetT,
typename AccumT,
typename KeyT = cub::detail::it_value_t<KeysInputIteratorT>>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: those parameters are unused, let's remove them.

OffsetT,
AccumT,
KeyT>(),
1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: This 1 was not there before and could impact SASS. Maybe stick to the old arguments and remove it.

Comment on lines +447 to +448
template <typename PolicyGetter, typename PolicySelectorT>
CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t __invoke(PolicyGetter policy_getter, const PolicySelectorT&)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: I think PolicySelectorT is unused, so it can be removed again.

Suggested change
template <typename PolicyGetter, typename PolicySelectorT>
CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t __invoke(PolicyGetter policy_getter, const PolicySelectorT&)
template <typename PolicyGetter>
CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t __invoke(PolicyGetter policy_getter)

#include <nvbench_helper.cuh>

#include "../policy_selector.h"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the policy selector there handles scan by key, since it returns a scan_policy and not a scan_by_key_policy. I think we need a dedicated policy selector for this benchmark here.

Comment on lines +1119 to +1128
if (primitive_op_t == primitive_op::yes)
{
switch (key_size)
{
case 1:
switch (value_size)
{
case 1:
if (primitive_value_t == primitive_accum::yes)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: can we simplify this? Maybe it would be more readable if we did if/else clauses like:

Suggested change
if (primitive_op_t == primitive_op::yes)
{
switch (key_size)
{
case 1:
switch (value_size)
{
case 1:
if (primitive_value_t == primitive_accum::yes)
{
const bool prim_op = primitive_op_t == primitive_op::yes;
const bool prim_val = primitive_value_t == primitive_accum::yes;
if (prim_op && key_size == 1 && value_size == 1 && prim_val) {
...
}
else if (prim_op && key_size == 1 && value_size == 2 && prim_val) {
...
}
...

}
}

arch = ::cuda::arch_id::sm_80;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: I think it's very confusing if we change the arch argument during this already very long function. Let's try to avoid that.

Comment on lines +582 to +587
template <typename PolicyGetter, typename PolicySelectorT>
CUB_RUNTIME_FUNCTION _CCCL_HOST _CCCL_FORCEINLINE cudaError_t
invoke(PolicyGetter policy_getter, const PolicySelectorT& policy_selector)
{
return __invoke(policy_getter, policy_selector);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why do we need this function? Can't we just call __invoke directly?

"Dispatching DeviceScanByKey to arch %d with tuning: %s\n", static_cast<int>(arch_id), ss.str().c_str());))
#endif

return detail::dispatch_arch(policy_selector, arch_id, [&](auto policy_getter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: IIUC, the active tuning policy is never needed as a compile-time value, so we can just omit using dispatch_arch and just query the policy like and keep it a runtime value:

const scan_by_key_policy active_policy = policy_selector(arch_id);

This would also reduce template instantiations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Implement the new tuning API for DispatchScanByKey

2 participants