Commit 65b4fde

Improvements to cuda.compute documentation (#7061)
* Updates to documentation and examples
* Lint
* Address review feedback
* Remove stray backtick

Co-authored-by: Ashwin Srinath <[email protected]>
1 parent ee32f63 commit 65b4fde

File tree

6 files changed (+253, -110 lines)

docs/python/compute.rst

Lines changed: 208 additions & 75 deletions
@@ -3,142 +3,275 @@
 ``cuda.compute``: Parallel Computing Primitives
 ===============================================
 
-The ``cuda.compute`` library provides parallel computing primitives that operate
-on entire arrays or ranges of data. These algorithms are designed to be easy to use from Python
-while delivering the performance of hand-optimized CUDA kernels, portable across different
-GPU architectures.
+The ``cuda.compute`` library provides composable primitives for building custom
+parallel algorithms on the GPU—without writing CUDA kernels directly.
 
 Algorithms
 ----------
 
-The core functionality provided by the ``cuda.compute`` library are algorithms such
-as reductions, scans, sorts, and transforms.
+Algorithms are the core of ``cuda.compute``. They operate on arrays or
+:ref:`iterators <cuda.compute.iterators>` and can be composed to build specialized
+GPU operations—reductions, scans, sorts, transforms, and more.
 
-Here's a simple example showing how to use the :func:`reduce_into <cuda.compute.algorithms.reduce_into>` algorithm to
-reduce an array of integers.
+Typical usage of an algorithm looks like this:
 
+.. code-block:: python
+
+   cuda.compute.reduce_into(
+       d_in=...,       # input array or iterator
+       d_out=...,      # output array or iterator
+       op=...,         # binary operator (built-in or user-defined)
+       num_items=...,  # number of input elements
+       h_init=...,     # initial value for the reduction
+   )
+
+API conventions
++++++++++++++++
+
+* **Naming** — The ``d_`` prefix denotes *device* memory (e.g., CuPy arrays, PyTorch tensors);
+  ``h_`` denotes *host* memory (NumPy arrays). Some scalar values must be passed as
+  host arrays.
+
+* **Output semantics** — Algorithms write results into a user-provided array or iterator
+  rather than returning them. This keeps memory ownership explicit and lifetimes under
+  your control.
+
+* **Operators** — Many algorithms accept an ``op`` parameter. This can be a built-in
+  :class:`OpKind <cuda.compute.op.OpKind>` value or a
+  :ref:`user-defined function <cuda.compute.user_defined_operations>`.
+  When possible, prefer built-in operators (e.g., ``OpKind.PLUS``) over the equivalent
+  user-defined operation (e.g., ``lambda a, b: a + b``) for better performance.
+
+* **Iterators** — Inputs and outputs can be :ref:`iterators <cuda.compute.iterators>`
+  instead of arrays, enabling lazy evaluation and operation fusion.
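+
+For instance, the conventions above combine as in the following sketch. It
+assumes CuPy and NumPy are available, and that ``OpKind`` is imported from
+``cuda.compute.op`` per the API reference:
+
+.. code-block:: python
+
+   import cupy as cp
+   import numpy as np
+
+   import cuda.compute
+   from cuda.compute.op import OpKind
+
+   d_in = cp.arange(10, dtype=np.int32)  # device input ("d_" prefix)
+   d_out = cp.empty(1, dtype=np.int32)   # device output, written in place
+   h_init = np.zeros(1, dtype=np.int32)  # host initial value ("h_" prefix)
+
+   cuda.compute.reduce_into(
+       d_in=d_in,
+       d_out=d_out,
+       op=OpKind.PLUS,  # built-in operator
+       num_items=d_in.size,
+       h_init=h_init,
+   )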
+
+Full Example
+++++++++++++
+
+The following example uses :func:`reduce_into <cuda.compute.algorithms.reduce_into>`
+to compute the sum of a sequence of integers:
 
 .. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/reduction/sum_reduction.py
    :language: python
    :start-after: # example-begin
-   :caption: Basic reduction example.
+   :caption: Sum reduction example.
+
+Controlling temporary memory
+++++++++++++++++++++++++++++
+
+Many algorithms allocate temporary device memory for intermediate results. For finer
+control over allocation—or to reuse buffers across calls—use the object-based API.
+For example, :func:`make_reduce_into <cuda.compute.algorithms.make_reduce_into>`
+returns a reusable reduction object that lets you manage memory explicitly.
+
+.. code-block:: python
+   :caption: Controlling temporary memory.
 
-Many algorithms, including reduction, require a temporary memory buffer.
-The library will allocate this buffer for you, but you can also use the
-object-based API for greater control.
+   # create a reducer object:
+   reducer = cuda.compute.make_reduce_into(d_in, d_out, op, h_init)
+   # get the temporary storage size by passing None as the first argument:
+   temp_storage_bytes = reducer(None, d_in, d_out, num_items, h_init)
+   # allocate the temporary storage as any array-like object
+   # (e.g., CuPy array, Torch tensor):
+   temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
+   # perform the reduction, passing the temporary storage as the first argument:
+   reducer(temp_storage, d_in, d_out, num_items, h_init)
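+
+Passing ``None`` as the first argument performs no work; it only queries the
+required temporary-storage size. The same reducer object can then be invoked
+repeatedly, reusing a single allocation across calls.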
 
-.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/reduction/reduce_object.py
+.. _cuda.compute.user_defined_operations:
+
+User-Defined Operations
+-----------------------
+
+A powerful feature is the ability to use algorithms with user-defined operations.
+For example, to compute the sum of only the even values in a sequence,
+we can use :func:`reduce_into <cuda.compute.algorithms.reduce_into>` with a custom binary operation:
+
+.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/reduction/sum_custom_reduction.py
    :language: python
    :start-after: # example-begin
-   :caption: Reduction with object-based API.
+   :caption: Reduction with a custom binary operation.
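+
+One possible formulation of such an operation replaces odd operands with zero
+before adding (a sketch; the bundled example may differ in detail). Since every
+partial sum it produces is even, the operation behaves consistently regardless
+of the order in which the reduction combines elements:
+
+.. code-block:: python
+
+   def sum_even(a, b):
+       # replace odd operands with 0, then add; partial sums stay even
+       a = a if a % 2 == 0 else 0
+       b = b if b % 2 == 0 else 0
+       return a + b
+
+   cuda.compute.reduce_into(d_in, d_out, sum_even, num_items, h_init)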
 
+Features and Restrictions
++++++++++++++++++++++++++
+
+User-defined operations are compiled into device code using
+`Numba CUDA <https://nvidia.github.io/numba-cuda/>`_, so they inherit many
+of the same features and restrictions as Numba CUDA functions:
+
+* `Python features <https://nvidia.github.io/numba-cuda/user/cudapysupported.html>`_
+  and `atomic operations <https://nvidia.github.io/numba-cuda/user/intrinsics.html>`_
+  supported by Numba CUDA are also supported within user-defined operations.
+* Nested functions must be decorated with ``@numba.cuda.jit``.
+* Variables captured in closures or globals follow
+  `Numba CUDA semantics <https://nvidia.github.io/numba-cuda/user/globals.html>`_:
+  scalars and host arrays are captured by value (as constants),
+  while device arrays are captured by reference.
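+
+A consequence of capture-by-value is that a scalar read from the enclosing
+scope is frozen into the compiled operation. To vary such a value between
+calls, bind it in a fresh closure each time, as in this sketch (shown with
+``unary_transform`` and positional arguments):
+
+.. code-block:: python
+
+   import numpy as np
+
+   def make_add_offset(offset):
+       # each distinct offset value yields a distinct compiled operation
+       def add_offset(x):
+           return x + offset
+       return add_offset
+
+   for i in range(3):
+       op = make_add_offset(np.int32(i))
+       cuda.compute.unary_transform(d_in, d_out, op, num_items)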
+
+.. _cuda.compute.iterators:
 
 Iterators
 ---------
 
-Algorithms can be used not just on arrays, but also on iterators. Iterators
-provide a way to represent sequences of data without needing to allocate memory
-for them.
+Iterators represent sequences whose elements are computed **on the fly**. They can
+be used in place of arrays in most algorithms, enabling lazy evaluation, operation
+fusion, and custom data access patterns.
+
+A :func:`CountingIterator <cuda.compute.iterators.CountingIterator>`, for example,
+represents an integer sequence starting from a given value:
+
+.. code-block:: python
 
-Here's an example showing how to use reduction with a :func:`CountingIterator <cuda.compute.iterators.CountingIterator>` that
-generates a sequence of numbers starting from a specified value.
+   it = CountingIterator(np.int32(1))  # represents [1, 2, 3, 4, ...]
+
+To compute the sum of the first 100 integers, we can pass a
+:func:`CountingIterator <cuda.compute.iterators.CountingIterator>` directly to
+:func:`reduce_into <cuda.compute.algorithms.reduce_into>`. No memory is allocated
+to store the input sequence—the values are generated as needed.
 
 .. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/counting_iterator_basic.py
    :language: python
    :start-after: # example-begin
    :caption: Counting iterator example.
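+
+A condensed version of the same pattern, as a sketch (assuming CuPy, NumPy,
+and the built-in plus operator):
+
+.. code-block:: python
+
+   first_100 = CountingIterator(np.int32(1))  # 1, 2, 3, ...
+   d_out = cp.empty(1, dtype=np.int32)
+   h_init = np.zeros(1, dtype=np.int32)
+
+   cuda.compute.reduce_into(first_100, d_out, OpKind.PLUS, 100, h_init)
+   # d_out[0] == 5050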
 
-Iterators also provide a way to compose operations. Here's an example showing
-how to use :func:`reduce_into <cuda.compute.algorithms.reduce_into>` with a :func:`TransformIterator <cuda.compute.iterators.TransformIterator>` to compute the sum of squares
-of a sequence of numbers.
+Iterators can also be used to *fuse* operations. In the example below, a
+:func:`TransformIterator <cuda.compute.iterators.TransformIterator>` lazily applies
+the square operation to each element of the input sequence. The resulting iterator
+is then passed to :func:`reduce_into <cuda.compute.algorithms.reduce_into>` to compute
+the sum of squares.
+
+Because the square is evaluated on demand during the reduction, there is no need
+to create or store an intermediate array of squared values. The transform and the
+reduction are fused into a single pass over the data.
 
 .. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/transform_iterator_basic.py
    :language: python
    :start-after: # example-begin
    :caption: Transform iterator example.
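+
+In outline, the fused sum of squares looks like the following sketch (see the
+API reference for the exact ``TransformIterator`` signature):
+
+.. code-block:: python
+
+   from cuda.compute.iterators import CountingIterator, TransformIterator
+
+   def square(x):
+       return x * x
+
+   squares = TransformIterator(CountingIterator(np.int32(1)), square)
+   cuda.compute.reduce_into(squares, d_out, OpKind.PLUS, 10, h_init)
+   # d_out[0] == 385, i.e. 1**2 + 2**2 + ... + 10**2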
 
-Iterators that wrap an array (or another output iterator) may be used as both input and output iterators.
-Here's an example showing how to use a
-:func:`TransformIterator <cuda.compute.iterators.TransformIterator>` to transform the output
-of a reduction before writing to the underlying array.
+Some iterators can also be used as the output of an algorithm. In the example below,
+a :func:`TransformOutputIterator <cuda.compute.iterators.TransformOutputIterator>`
+applies the square-root operation to the result of a reduction before writing
+it into the underlying array.
 
 .. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/transform_output_iterator.py
    :language: python
    :start-after: # example-begin
    :caption: Transform output iterator example.
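+
+Schematically (a sketch; see the API reference for the exact
+``TransformOutputIterator`` signature):
+
+.. code-block:: python
+
+   from cuda.compute.iterators import TransformOutputIterator
+
+   def sqrt(x):
+       return x ** 0.5
+
+   d_result = cp.empty(1, dtype=np.float64)
+   out_it = TransformOutputIterator(d_result, sqrt)
+
+   # the reduction result passes through sqrt before landing in d_result
+   cuda.compute.reduce_into(d_in, out_it, OpKind.PLUS, num_items, h_init)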
 
-Custom Types
+As another example, :func:`ZipIterator <cuda.compute.iterators.ZipIterator>` combines multiple
+arrays or iterators into a single logical sequence. In the example below, we combine
+a counting iterator and an array, creating an iterator that yields ``(index, value)``
+pairs. This combined iterator is then used as the input to
+:func:`reduce_into <cuda.compute.algorithms.reduce_into>` to compute the index of
+the maximum value in the array.
+
+.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/zip_iterator_counting.py
+   :language: python
+   :start-after: # example-begin
+   :caption: Argmax using a zip iterator.
+
+These examples illustrate a few of the patterns enabled by iterators. See the
+:ref:`API reference <cuda_compute-module>` for the full set of available iterators.
+
+.. _cuda.compute.custom_types:
+
+Struct Types
 ------------
 
-The ``cuda.compute`` library supports defining custom data types,
-using the :func:`gpu_struct <cuda.compute.struct.gpu_struct>` decorator.
-Here are some examples showing how to define and use custom types:
+The :func:`gpu_struct <cuda.compute.struct.gpu_struct>` decorator defines
+GPU-compatible struct types. These are useful when you have data laid out
+as an "array of structures", similar to `NumPy structured arrays <https://numpy.org/doc/stable/user/basics.rec.html>`_.
 
 .. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/struct/struct_reduction.py
    :language: python
    :start-after: # example-begin
-   :caption: Custom type reduction example.
+   :caption: Custom struct type in a reduction.
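+
+Defining a struct type is a matter of annotating fields on a decorated class,
+as in this sketch (following the pattern used by the bundled struct examples):
+
+.. code-block:: python
+
+   import numpy as np
+   from cuda.compute.struct import gpu_struct
+
+   @gpu_struct
+   class Point:
+       x: np.float32
+       y: np.float32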
 
-User-defined operations
------------------------
+Array of Structures vs Structure of Arrays
+++++++++++++++++++++++++++++++++++++++++++
 
-A powerful feature of ``cuda.compute`` is the ability to customized algorithms
-with user-defined operations. Below is an example of doing a custom reduction
-with a user-defined binary operation.
+When working with structured data, there are two common memory layouts:
 
-.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/reduction/sum_custom_reduction.py
-   :language: python
-   :start-after: # example-begin
-   :caption: Reduction with user-defined binary operations.
+* **Array of Structures (AoS)** — each element is a complete struct, stored
+  contiguously. For example, an array of ``Point`` structs where each point's
+  ``x`` and ``y`` are adjacent in memory.
+
+* **Structure of Arrays (SoA)** — each field is stored in its own array.
+  For example, separate ``x_coords`` and ``y_coords`` arrays.
+
+``cuda.compute`` supports both layouts:
+
+* **``gpu_struct``** — defines a true AoS type with named fields
+* **``ZipIterator``** — combines separate arrays into tuples on the fly, letting
+  you work with SoA data as if it were AoS
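+
+For example, two coordinate arrays (SoA) can be viewed as a single sequence of
+``(x, y)`` pairs without copying, as in this sketch:
+
+.. code-block:: python
+
+   from cuda.compute.iterators import ZipIterator
+
+   d_x = cp.asarray([1.0, 2.0, 3.0], dtype=np.float32)
+   d_y = cp.asarray([4.0, 5.0, 6.0], dtype=np.float32)
+
+   points = ZipIterator(d_x, d_y)  # yields (x, y) pairs on the fly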
+
+.. _cuda.compute.caching:
+
+Caching
+-------
+
+Algorithms in ``cuda.compute`` are compiled to GPU code at runtime. To avoid
+recompiling on every call, build results are cached in memory. When you invoke
+an algorithm with the same configuration—same dtypes, iterator kinds, operator,
+and compute capability—the cached build is reused.
+
+What determines the cache key
++++++++++++++++++++++++++++++
+
+Each algorithm computes a cache key from:
+
+* **Array dtypes** — the data types of input and output arrays
+* **Iterator kinds** — for iterator inputs/outputs, a descriptor of the iterator type
+* **Operator identity** — for user-defined functions, the function's bytecode,
+  constants, and closure contents (see below)
+* **Compute capability** — the GPU architecture of the current device
+* **Algorithm-specific parameters** — such as initial value dtype or determinism mode
 
-Note that user-defined operations are compiled into device code
-using `numba-cuda <https://nvidia.github.io/numba-cuda/>`_,
-so many of the same features and restrictions of `numba` and `numba-cuda` apply.
-Here are some important gotchas to be aware of:
+Note that array *contents* or *pointers* are not part of the cache key—only
+the array's dtype. This means you can reuse a cached algorithm across different
+arrays of the same type.
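+
+For example, both calls in the sketch below can share a single cached build,
+because only the dtype (not the size or the pointer) enters the key:
+
+.. code-block:: python
+
+   d_small = cp.arange(10, dtype=np.int32)
+   d_large = cp.arange(1_000_000, dtype=np.int32)
+
+   cuda.compute.reduce_into(d_small, d_out, OpKind.PLUS, d_small.size, h_init)
+   # same dtypes, operator, and device: the cached build is reused
+   cuda.compute.reduce_into(d_large, d_out, OpKind.PLUS, d_large.size, h_init)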
 
-* Lambda functions are not supported.
-* Functions may invoke other functions, but the inner functions must be
-  decorated with ``@numba.cuda.jit``.
-* Functions capturing by global reference may not work as intended.
-  Prefer using closures in these situations.
+How user-defined functions are cached
++++++++++++++++++++++++++++++++++++++
 
-Here is an example of a function that captures a global variable by reference,
-which is then used in a loop with ``unary_transform``.
+User-defined operators and predicates are hashed based on their bytecode, constants,
+and closure contents. Two functions with identical bytecode and closures produce
+the same cache key, even if defined at different source locations.
 
-.. code-block:: python
+Closure contents are recursively hashed:
 
-   for i in range(10):
-       def func(x):
-           return x + i  # i is captured from global scope
+* **Scalars and host arrays** — hashed by value
+* **Device arrays** — hashed by pointer, shape, and dtype (not contents)
+* **Nested functions** — hashed by their own bytecode and closures
 
-       cuda.compute.unary_transform(d_in, d_out, func, num_items)
+Because device arrays captured in closures are hashed by pointer, changing the
+array's contents does not invalidate the cache—only reassigning the variable to
+a different array does.
a different array does.
115250

116-
Modifications to the global variable ``i`` may not be reflected in the function
117-
when the function is called multiple times. Thus, the different calls to
118-
``unary_transform`` may not produce different results. This is true even though
119-
the function is redefined each time in the loop.
251+
Memory considerations
252+
+++++++++++++++++++++
120253

121-
To avoid this, capture the variable in a closure:
254+
The cache persists for the lifetime of the process and grows with the number of
255+
unique algorithm configurations. In long-running applications or exploratory
256+
notebooks, this can accumulate significant memory.
122257

123-
.. code-block:: python
258+
To clear all caches and free memory:
124259

125-
def make_func(i):
126-
def func(x):
127-
return x + i # i is captured as a closure variable
128-
return func
260+
.. code-block:: python
129261
130-
for i in range(10):
131-
func = make_func(i)
132-
cuda.compute.unary_transform(d_in, d_out, func, num_items)
262+
import cuda.compute
263+
cuda.compute.clear_all_caches()
133264
265+
This forces recompilation on the next algorithm invocation—useful for benchmarking
266+
compilation time or reclaiming memory.
134267

135-
Example Collections
136-
-------------------
268+
Examples
269+
--------
137270

138-
For complete runnable examples and more advanced usage patterns, see our
139-
full collection of `examples <https://github.com/NVIDIA/CCCL/tree/main/python/cuda_cccl/tests/compute/examples>`_.
271+
For complete runnable examples and additional usage patterns, see the
272+
`examples directory <https://github.com/NVIDIA/CCCL/tree/main/python/cuda_cccl/tests/compute/examples>`_.
140273

141-
External API References
142-
------------------------
274+
API Reference
275+
-------------
143276

144277
- :ref:`cuda_compute-module`

docs/python/compute_api.rst

Lines changed: 5 additions & 1 deletion
@@ -25,11 +25,15 @@ Iterators
 
 Operators
 ---------
+
+.. py:currentmodule:: cuda.compute.op
+
 .. Unfortunately, we need to manually document the OpKind enum here because
 .. the `._bindings` module, where OpKind is defined, is mocked out when building
 .. docs. The mock out is needed to avoid the need for CUDA to be installed
 .. at docs build time.
-.. py:class:: cuda.compute.op.OpKind
+
+.. py:class:: OpKind
 
    Enumeration of operator kinds for CUDA parallel algorithms.
 

docs/python/coop.rst

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@ Example Collections
 For complete runnable examples and more advanced usage patterns, see our
 full collection of `examples <https://github.com/NVIDIA/CCCL/tree/main/python/cuda_cccl/tests/coop/examples>`_.
 
-External API References
-------------------------
+API Reference
+-------------
 
 - :ref:`cuda_coop-module`
