Skip to content

Add env overloads for DeviceSegmentedRadixSort#7999

Open
gonidelis wants to merge 2 commits intoNVIDIA:mainfrom
gonidelis:seg-radix-sort-env
Open

Add env overloads for DeviceSegmentedRadixSort#7999
gonidelis wants to merge 2 commits intoNVIDIA:mainfrom
gonidelis:seg-radix-sort-env

Conversation

@gonidelis
Copy link
Member

fixes #7549

No deterministic guarantees. It's sorting, if it's not deterministic it's just wrong.

@gonidelis gonidelis requested a review from a team as a code owner March 12, 2026 02:47
@gonidelis gonidelis requested a review from pauleonix March 12, 2026 02:47
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 12, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 12, 2026
@gonidelis gonidelis force-pushed the seg-radix-sort-env branch from 2c8d80a to 55ca027 Compare March 12, 2026 04:13
@github-actions

This comment has been minimized.

Copy link
Contributor

@NaderAlAwar NaderAlAwar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: we are missing the DoubleBuffer overloads

offsets.begin() + 1,
0,
sizeof(int) * 8,
env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do we want to add a test where we do not explicitly pass an env? Other env API tests seem to have both kinds, explicit and implicit env objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's fine. unit tests cover that case and users can see it throughout docs as soon as they start using our env overloads. don't wanna over bloat our test base for

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have that test in the other file added in this PR.

@gonidelis
Copy link
Member Author

Important: we are missing the DoubleBuffer overloads

@NaderAlAwar will be added in subsequent pr for economy of reviewing effort per pr

offsets.begin() + 1,
0,
sizeof(int) * 8,
env);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have that test in the other file added in this PR.

@github-actions
Copy link
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 10m: Pass: 100%/249 | Total: 3d 19h | Max: 55m 34s | Hits: 94%/157462

See results here.

@pauleonix
Copy link
Contributor

pauleonix commented Mar 19, 2026

No deterministic guarantees. It's sorting, if it's not deterministic it's just wrong.

Only because our radix-sort is stable by design. Unstable sort could be non-deterministic 😉

Copy link
Contributor

@pauleonix pauleonix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still some nits

//!
//! Snippet
//!
//! .. literalinclude:: ../../../cub/test/catch2_test_device_segmented_radix_sort_env_api.cu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! .. literalinclude:: ../../../cub/test/catch2_test_device_segmented_radix_sort_env_api.cu
//! .. literalinclude:: ../../test/catch2_test_device_segmented_radix_sort_env_api.cu

//! This is an environment-based API that allows customization of:
//!
//! - Stream: Query via ``cuda::get_stream``
//! - Memory resource: Query via ``cuda::mr::get_memory_resource``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like this better than the - Can use a specific stream or cuda memory resource through the ``env`` parameter I saw in the previous PR. I'm also not sure this list wont be rendered/read as part as the following list.

//! yield a corresponding performance improvement.
//! - Note, the size of any segment may not exceed ``INT_MAX``. Please consider using ``DeviceSegmentedSort`` instead,
//! if the size of at least one of your segments could exceed ``INT_MAX``.
//!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have a @env like the @devicestorage on the old APIs to avoid the repetition. Even if it needs multiple like @env-with-guarantees.

Copy link
Contributor

@bernhardmgruber bernhardmgruber Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered suggesting the same! No need to do it in this PR though.

//! - An optional bit subrange ``[begin_bit, end_bit)`` of differentiating key
//! bits can be specified. This can reduce overall sorting overhead and
//! yield a corresponding performance improvement.
//! - Note, the size of any segment may not exceed ``INT_MAX``. Please consider using ``DeviceSegmentedSort`` instead,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing

  //! - Let ``in`` be one of ``{d_keys_in, d_values_in}`` and `out` be any of
  //!   ``{d_keys_out, d_values_out}``. The range ``[out, out + num_items)`` shall
  //!   not overlap ``[in, in + num_items)``,
  //!   ``[d_begin_offsets, d_begin_offsets + num_segments)`` nor
  //!   ``[d_end_offsets, d_end_offsets + num_segments)`` in any way.

and

  //! - Segments are not required to be contiguous. For all index values ``i``
  //!   outside the specified segments ``d_keys_in[i]``, ``d_values_in[i]``,
  //!   ``d_keys_out[i]``, ``d_values_out[i]`` will not be accessed nor modified.

Maybe also

  //! - @devicestorageNP For sorting using only ``O(P)`` temporary storage, see
  //!   the sorting interface using DoubleBuffer wrappers below.

}

//! @rst
//! Overview
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "title" is new in comparison to the existing docs. If it stays, it looks like it should be underlined (rendered as a sub-title)

Suggested change
//! Overview
//! Overview
//! +++++++++++++++++++++++++++++++++++++++++++++

Same for Snippet below (which is not new but has this underline in existing docs)

typename BeginOffsetIteratorT,
typename EndOffsetIteratorT,
typename EnvT = ::cuda::std::execution::env<>>
[[nodiscard]] CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE static cudaError_t SortKeys(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is _CCCL_FORCEINLINE necessary here? Existing overload does not have it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary, please remove!

//! - An optional bit subrange ``[begin_bit, end_bit)`` of differentiating key
//! bits can be specified. This can reduce overall sorting overhead and
//! yield a corresponding performance improvement.
//! - Note, the size of any segment may not exceed ``INT_MAX``. Please consider using ``DeviceSegmentedSort`` instead,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing entries again.

thrust::raw_pointer_cast(keys_out.data()),
thrust::raw_pointer_cast(values_in.data()),
thrust::raw_pointer_cast(values_out.data()),
static_cast<int>(keys_in.size()),
Copy link
Contributor

@pauleonix pauleonix Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static_cast<int>(keys_in.size()),
static_cast<cuda::std::int64_t>(keys_in.size()),

REQUIRE(keys_out == expected_keys);
}

TEST_CASE("DeviceSegmentedRadixSort::SortPairs uses custom stream", "[segmented_radix_sort][device]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be tested for all overloads?

thrust::raw_pointer_cast(keys_out.data()),
thrust::raw_pointer_cast(values_in.data()),
thrust::raw_pointer_cast(values_out.data()),
static_cast<int>(keys_in.size()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static_cast<int>(keys_in.size()),
static_cast<cuda::std::int64_t>(keys_in.size()),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a cast at all here? To suppress the unsigned -> signed warning, right?

Copy link
Contributor

@bernhardmgruber bernhardmgruber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please apply the remaining feedback from other reviewers. Otherwise LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add env-based API for cub::DeviceSegmentedRadixSort

4 participants