Expose PSTL algorithms through <cuda/std/algorithm> and <cuda/std/numeric>#7931

Open
miscco wants to merge 3 commits into NVIDIA:main from miscco:expose_pstl

@miscco miscco commented Mar 9, 2026

We discussed this internally and we are happy with the results of the parallel CUDA backend. So we want to expose this now rather than waiting for all algorithms to be implemented.

There are certain caveats:

  • We require random access iterators for the CUDA backend

  • We only expose a CUDA backend, through cuda::execution::gpu. Standard execution policies currently trigger a static_assert reporting that a backend is missing

  • We do not provide any fallback serial implementation. This would be dangerous, because the serial implementation would naively run on host and not device.
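The caveats above can be pictured with a small host-only C++17 sketch. It is not the CCCL implementation: `sketch::gpu_policy`, `sketch::host_par_policy`, `has_backend_v`, `is_random_access_v` and `for_each_checked` are hypothetical names chosen for illustration, and the loop body is a serial stand-in for what would really be a device dispatch.

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

namespace sketch {

// Hypothetical policy tags; these are NOT the CCCL policy types.
struct gpu_policy {};       // stands in for a policy that has a CUDA backend
struct host_par_policy {};  // stands in for a standard policy with no backend

// Hypothetical trait: is a parallel backend available for this policy?
template <class Policy>
inline constexpr bool has_backend_v = false;
template <>
inline constexpr bool has_backend_v<gpu_policy> = true;

// True when Iter models (at least) a random access iterator.
template <class Iter>
inline constexpr bool is_random_access_v =
    std::is_base_of_v<std::random_access_iterator_tag,
                      typename std::iterator_traits<Iter>::iterator_category>;

// Rejects unsupported policies and iterators at compile time instead of
// silently falling back to a serial host implementation.
template <class Policy, class Iter, class Fn>
void for_each_checked(Policy, Iter first, Iter last, Fn fn) {
  static_assert(has_backend_v<Policy>,
                "no parallel backend available for this execution policy");
  static_assert(is_random_access_v<Iter>,
                "the backend requires random access iterators");
  // Serial stand-in so the sketch runs on any host; a real backend would
  // launch device work here.
  for (; first != last; ++first) {
    fn(*first);
  }
}

}  // namespace sketch

// std::vector iterators pass the check, std::list iterators do not.
static_assert(sketch::is_random_access_v<std::vector<int>::iterator>);
static_assert(!sketch::is_random_access_v<std::list<int>::iterator>);
```

Under these assumptions, calling `sketch::for_each_checked(sketch::host_par_policy{}, ...)` fails to compile with the first message, which mirrors the "missing backend" behaviour described in the second caveat.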

@miscco miscco requested review from a team as code owners March 9, 2026 10:49
@miscco miscco requested review from jrhemstad and shwina March 9, 2026 10:49
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 9, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 9, 2026

Comment on lines +134 to +141
// parallel algorithms
#if _CCCL_HAS_PSTL_BACKEND()
# include <cuda/std/__pstl/adjacent_find.h>
# include <cuda/std/__pstl/all_of.h>
# include <cuda/std/__pstl/any_of.h>
# include <cuda/std/__pstl/copy.h>
# include <cuda/std/__pstl/copy_if.h>
# include <cuda/std/__pstl/copy_n.h>
Contributor

Q: I thought many standard libraries would expose PSTL algorithms through the <execution> header and not <algorithm>. This would make the inclusion of <algorithm> cheaper.

Contributor

Discussed this with @miscco offline, and it seems the C++ standard requires the overloads to be in <algorithm>. However, this may not be observable to the typical user, since they also need to include <execution> to supply an execution policy.

Collaborator

If it's not observable, then I would like to see exposing it in the <execution> header to avoid bloating <algorithm>.

Contributor Author

I do not believe that is a correct statement.

<execution> can include it all and be fine, but then <algorithm> would not have it.

The point is that the pstl headers pull effectively all of <algorithm>

Contributor

> <execution> can include it all and be fine, but then <algorithm> would not have it.

What is the advantage of <algorithm> having an overload that cannot be called unless the user also includes <execution>?

> The point is that the pstl headers pull effectively all of <algorithm>

This is fine IMO; including a PSTL header can be more expensive.

Contributor Author

Moved the exposure to <cuda/std/execution>, in the hope of later being able to expose it via modularized access.
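The header split this thread settles on can be modelled in a single file. The following is only a sketch under assumptions: the `sketch` namespace, `SKETCH_HAS_PSTL_BACKEND()` and `gpu_policy` are hypothetical stand-ins, not the libcu++ layout, and the policy overload is implemented serially so that it runs on any host.

```cpp
// ---- stand-in for a cheap, serial-only <algorithm>-style header ----
namespace sketch {
template <class T>
constexpr const T& min(const T& a, const T& b) {
  return b < a ? b : a;
}
}  // namespace sketch

// ---- stand-in for the more expensive <execution>-style header ----
// Function-like feature macro, mirroring the shape of the guard in the diff.
#define SKETCH_HAS_PSTL_BACKEND() 1

#if SKETCH_HAS_PSTL_BACKEND()
namespace sketch::execution {
struct gpu_policy {};
inline constexpr gpu_policy gpu{};
}  // namespace sketch::execution

namespace sketch {
// Policy overload. It can only be called once the policy object above is
// visible. The body is a serial stand-in; a real backend would dispatch to
// device code instead.
template <class Iter, class T>
constexpr long count(execution::gpu_policy, Iter first, Iter last, const T& value) {
  long n = 0;
  for (; first != last; ++first) {
    if (*first == value) {
      ++n;
    }
  }
  return n;
}
}  // namespace sketch
#endif

inline constexpr int sample[] = {1, 2, 2};
static_assert(sketch::min(0, 2) == 0);
static_assert(sketch::count(sketch::execution::gpu, sample, sample + 3, 2) == 2);
```

A translation unit that needs only `sketch::min` would include just the first part and never pay for the policy overloads, which is the trade-off the reviewers argue for.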


@bernhardmgruber
Contributor

@miscco could you please measure the compile-time of

#include <cuda/std/algorithm>
int main() {
  return cuda::std::min(0, 2);
}

before and after this PR? I would be curious how much of an impact pulling in most of CUB has ;)


@miscco miscco requested a review from a team as a code owner March 17, 2026 19:09
@miscco miscco requested a review from alliepiper March 17, 2026 19:09

@miscco
Contributor Author

miscco commented Mar 23, 2026

I checked the differences in compile times for the <cuda/std/algorithm>, <cuda/std/numeric> and <cuda/std/execution> headers, compared against main.

Main:

  • <cuda/std/algorithm>: 0.709372
  • <cuda/std/numeric>: 0.545339
  • <cuda/std/execution>: 1.004296

With PSTL (percentages are relative to main):

  • <cuda/std/algorithm>: 2.118540 -> ~300%
  • <cuda/std/numeric>: 2.471399 -> ~450%
  • <cuda/std/execution>: 5.873960 -> ~585%

The reason that <cuda/std/execution> is hit so hard is that it also has to include all of the serial algorithms.
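To make the relative cost explicit, the ratios can be recomputed directly from the raw numbers in this comment. The sketch below is only arithmetic on the figures quoted above; `Timing`, `slowdown` and `print_slowdowns` are hypothetical names, and the thread does not state the unit of the measurements.

```cpp
#include <cstdio>

// One measurement pair from the comment above. The thread does not state the
// unit of the raw numbers, so the field names stay unit-free.
struct Timing {
  const char* header;
  double on_main;
  double with_pstl;
};

inline constexpr Timing timings[] = {
    {"<cuda/std/algorithm>", 0.709372, 2.118540},
    {"<cuda/std/numeric>", 0.545339, 2.471399},
    {"<cuda/std/execution>", 1.004296, 5.873960},
};

// Ratio of the PSTL build to the baseline; 3.0 would mean "300% of main".
constexpr double slowdown(const Timing& t) {
  return t.with_pstl / t.on_main;
}

// Hypothetical helper: prints one line per header with its ratio.
inline void print_slowdowns() {
  for (const Timing& t : timings) {
    std::printf("%-24s %.2fx\n", t.header, slowdown(t));
  }
}
```

Evaluated this way, the three headers come out at roughly 2.99x, 4.53x and 5.85x of their baseline cost.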

@bernhardmgruber
Contributor

Thank you for providing these numbers! I think we should go ahead with the status quo and expose the PSTL through <cuda/std/execution> to not burden existing users including <cuda/std/algorithm> and <cuda/std/numeric>.

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 3h 40m: Pass: 100%/105 | Total: 2d 19h | Max: 3h 40m | Hits: 80%/258702

@miscco
Contributor Author

miscco commented Mar 23, 2026

We discussed this today in our internal review meeting and decided to better align the customization with the other environment work.
