Expose PSTL algorithms through `<cuda/std/algorithm>` and `<cuda/std/numeric>` by miscco · Pull Request #7931 · NVIDIA/cccl

miscco · 2026-03-09T10:49:18Z

We discussed this internally and we are happy with the results of the parallel CUDA backend. So we want to expose this now rather than waiting for all algorithms to be implemented.

There are certain caveats:

We require random access iterators for the CUDA backend.
We currently expose only a CUDA backend, through cuda::execution::gpu. The standard execution policies will currently static_assert that there is a missing backend.
We do not provide a serial fallback implementation. That would be dangerous, because the serial implementation would naively run on the host rather than the device.
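
The caveats above can be pictured with a small host-only sketch in standard C++17. It is not the CCCL implementation: `gpu_policy_sketch`, `par_policy_sketch`, `has_backend_v`, `is_random_access_v`, and `all_of_sketch` are hypothetical names made up for illustration, and the loop body is a host placeholder where a real backend would launch device work.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for execution policies; these are NOT the CCCL types.
struct gpu_policy_sketch {}; // plays the role of cuda::execution::gpu
struct par_policy_sketch {}; // plays the role of a standard policy such as std::execution::par

// Records which policies have a backend. Only the GPU-like policy does.
template <class Policy>
inline constexpr bool has_backend_v = false;
template <>
inline constexpr bool has_backend_v<gpu_policy_sketch> = true;

// True when Iter is at least a random access iterator, judged by its iterator category.
template <class Iter>
inline constexpr bool is_random_access_v =
  std::is_base_of_v<std::random_access_iterator_tag,
                    typename std::iterator_traits<Iter>::iterator_category>;

template <class Policy, class Iter, class Pred>
bool all_of_sketch(Policy, Iter first, Iter last, Pred pred)
{
  // Reject policies without a backend at compile time instead of falling back to a serial loop.
  static_assert(has_backend_v<Policy>, "no backend available for this execution policy");
  // Reject iterators that do not support random access.
  static_assert(is_random_access_v<Iter>, "the backend requires random access iterators");

  // Host placeholder so the sketch runs; a real backend would dispatch to the device here.
  for (; first != last; ++first)
  {
    if (!pred(*first))
    {
      return false;
    }
  }
  return true;
}
```

Calling `all_of_sketch(gpu_policy_sketch{}, v.begin(), v.end(), pred)` on a `std::vector` compiles, while passing `par_policy_sketch{}` or `std::list` iterators fails at compile time with the corresponding message.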

bernhardmgruber · 2026-03-10T10:04:59Z

libcudacxx/include/cuda/std/algorithm

+// parallel algorithms
+#if _CCCL_HAS_PSTL_BACKEND()
+#  include <cuda/std/__pstl/adjacent_find.h>
+#  include <cuda/std/__pstl/all_of.h>
+#  include <cuda/std/__pstl/any_of.h>
+#  include <cuda/std/__pstl/copy.h>
+#  include <cuda/std/__pstl/copy_if.h>
+#  include <cuda/std/__pstl/copy_n.h>


Q: I thought many standard libraries would expose PSTL algorithms through the <execution> header and not <algorithm>. This would make the inclusion of <algorithm> cheaper.

Discussed this with @miscco offline, and it seems the C++ standard requires the overloads to be in <algorithm>. However, this may not be observable to most users, since they also need to include <execution> in order to supply an execution policy.

If it's not observable, then I would like to see exposing it in the <execution> header to avoid bloating <algorithm>.

I do not believe that is a correct statement.

<execution> can include it all and be fine, but then <algorithm> would not have it.

The point is that the pstl headers pull in effectively all of <algorithm>.

> <execution> can include it all and be fine, but then <algorithm> would not have it.

What is the advantage of <algorithm> having an overload that cannot be called unless the user also includes <execution>?

> The point is that the pstl headers pull in effectively all of <algorithm>.

This is fine IMO; including a PSTL header can be more expensive.

Moved the exposure to <cuda/std/execution> in the hope of then being able to expose via a modularized access

bernhardmgruber · 2026-03-10T16:39:30Z

@miscco could you please measure the compile-time of

#include <cuda/std/algorithm>
int main() {
  return cuda::std::min(0, 2);
}

before and after this PR? I would be curious how much of an impact pulling in most of CUB has ;)
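
One way to take such a measurement is to time the compiler invocation from a small driver program. The following is only a sketch under stated assumptions: `time_command_seconds` and `best_of` are hypothetical helpers, and the thread does not say how the measurement was actually made. They use only `std::system` and `std::chrono` from the standard library.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <string>

// Runs `command` through the system shell and returns the elapsed wall-clock time in seconds.
// The raw status returned by std::system is stored in `exit_status` when the pointer is non-null.
double time_command_seconds(const std::string& command, int* exit_status = nullptr)
{
  const auto start  = std::chrono::steady_clock::now();
  const int status  = std::system(command.c_str());
  const auto stop   = std::chrono::steady_clock::now();
  if (exit_status != nullptr)
  {
    *exit_status = status;
  }
  return std::chrono::duration<double>(stop - start).count();
}

// Repeats the command `runs` times (at least once) and returns the fastest run,
// which reduces noise from file caching and system load.
double best_of(const std::string& command, int runs)
{
  double best = time_command_seconds(command);
  for (int i = 1; i < runs; ++i)
  {
    const double t = time_command_seconds(command);
    if (t < best)
    {
      best = t;
    }
  }
  return best;
}
```

The helper would be called once per checkout with the same compile command for the snippet above, for example something like `best_of("nvcc -c min_test.cu -o min_test.o", 5)`. The file name and flags here are assumptions, and the include paths for the CCCL checkout under test would need to be added.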

miscco · 2026-03-23T09:34:47Z

I checked the differences in compile times for the <cuda/std/algorithm>, <cuda/std/numeric> and <cuda/std/execution> headers compared against main.

Main:

<cuda/std/algorithm>: 0.709372
<cuda/std/numeric>: 0.545339
<cuda/std/execution>: 1.004296

With PSTL

<cuda/std/algorithm>: 2.118540 -> 300%
<cuda/std/numeric>: 2.471399 -> 450%
<cuda/std/execution>: 5.873960 -> 587%

The reason that <cuda/std/execution> is hit so badly is that it also has to include all of the serial algorithms.
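
For reference, the percentages are the PSTL timing expressed relative to the timing on main. A minimal sketch of that arithmetic follows; `percent_of_baseline` is a hypothetical helper name, and the unit of the timings is not stated in the thread.

```cpp
#include <cassert>
#include <cmath>

// Returns `after` as a percentage of `before`; for example 2.0 versus 1.0 gives 200.
double percent_of_baseline(double before, double after)
{
  return 100.0 * after / before;
}
```

Applied to the listed timings, this gives about 299% for <cuda/std/algorithm>, 453% for <cuda/std/numeric>, and 585% for <cuda/std/execution>, in line with the rounded figures quoted above.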

bernhardmgruber · 2026-03-23T11:39:31Z

Thank you for providing these numbers! I think we should go ahead with the status quo and expose the PSTL through <cuda/std/execution>, so that existing users who include <cuda/std/algorithm> and <cuda/std/numeric> are not burdened.

github-actions · 2026-03-23T15:05:55Z

🥳 CI Workflow Results

🟩 Finished in 3h 40m: Pass: 100%/105 | Total: 2d 19h | Max: 3h 40m | Hits: 80%/258702


miscco · 2026-03-23T18:39:11Z

We discussed this today in our internal review meeting and decided to better align the customization with the other environment work.

miscco requested review from a team as code owners March 9, 2026 10:49

miscco requested review from jrhemstad and shwina March 9, 2026 10:49

davebayer reviewed Mar 9, 2026

libcudacxx/benchmarks/bench/remove_copy/basic.cu (outdated, resolved)

bernhardmgruber reviewed Mar 9, 2026

libcudacxx/include/cuda/__execution/policy.h (outdated, resolved)

bernhardmgruber approved these changes Mar 9, 2026
