``cuda.compute``: Parallel Computing Primitives
===============================================

The ``cuda.compute`` library provides composable primitives for building custom
parallel algorithms on the GPU—without writing CUDA kernels directly.

Algorithms
----------

Algorithms are the core of ``cuda.compute``. They operate on arrays or
:ref:`iterators <cuda.compute.iterators>` and can be composed to build specialized
GPU operations—reductions, scans, sorts, transforms, and more.

Typical usage of an algorithm looks like this:

.. code-block:: python

    cuda.compute.reduce_into(
        d_in=...,       # input array or iterator
        d_out=...,      # output array or iterator
        op=...,         # binary operator (built-in or user-defined)
        num_items=...,  # number of input elements
        h_init=...,     # initial value for the reduction
    )

API conventions
+++++++++++++++

* **Naming** — The ``d_`` prefix denotes *device* memory (e.g., CuPy arrays, PyTorch tensors);
  ``h_`` denotes *host* memory (NumPy arrays). Some scalar values must be passed as
  host arrays.

* **Output semantics** — Algorithms write results into a user-provided array or iterator
  rather than returning them. This keeps memory ownership explicit and lifetimes under
  your control.

* **Operators** — Many algorithms accept an ``op`` parameter. This can be a built-in
  :class:`OpKind <cuda.compute.op.OpKind>` value or a
  :ref:`user-defined function <cuda.compute.user_defined_operations>`.
  When possible, prefer built-in operators (e.g., ``OpKind.PLUS``) over the equivalent
  user-defined operation (e.g., ``lambda a, b: a + b``) for better performance; see
  the sketch after this list.

* **Iterators** — Inputs and outputs can be :ref:`iterators <cuda.compute.iterators>`
  instead of arrays, enabling lazy evaluation and operation fusion.

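For instance, both calls below compute the same sum; the first form, with the
built-in operator, is preferred. This is a minimal sketch that assumes ``d_in``,
``d_out``, ``num_items``, and ``h_init`` are set up as in the usage outline above,
and that ``OpKind`` is importable from the top-level package:

.. code-block:: python

    import cuda.compute
    from cuda.compute import OpKind

    # preferred: built-in operator, no user-defined function to compile
    cuda.compute.reduce_into(
        d_in=d_in, d_out=d_out, op=OpKind.PLUS,
        num_items=num_items, h_init=h_init,
    )

    # equivalent result with a user-defined operation
    def add(a, b):
        return a + b

    cuda.compute.reduce_into(
        d_in=d_in, d_out=d_out, op=add,
        num_items=num_items, h_init=h_init,
    )
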
Full Example
++++++++++++

The following example uses :func:`reduce_into <cuda.compute.algorithms.reduce_into>`
to compute the sum of a sequence of integers:

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/reduction/sum_reduction.py
    :language: python
    :start-after: # example-begin
    :caption: Sum reduction example.

Controlling temporary memory
++++++++++++++++++++++++++++

Many algorithms allocate temporary device memory for intermediate results. For finer
control over allocation—or to reuse buffers across calls—use the object-based API.
For example, :func:`make_reduce_into <cuda.compute.algorithms.make_reduce_into>`
returns a reusable reduction object that lets you manage memory explicitly.

.. code-block:: python
    :caption: Controlling temporary memory.

    # create a reducer object:
    reducer = cuda.compute.make_reduce_into(d_in, d_out, op, h_init)
    # get the temporary storage size by passing None as the first argument:
    temp_storage_bytes = reducer(None, d_in, d_out, num_items, h_init)
    # allocate the temporary storage as any array-like object
    # (e.g., CuPy array, Torch tensor):
    temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
    # perform the reduction, passing the temporary storage as the first argument:
    reducer(temp_storage, d_in, d_out, num_items, h_init)

.. _cuda.compute.user_defined_operations:

User-Defined Operations
-----------------------

A powerful feature is the ability to use algorithms with user-defined operations.
For example, to compute the sum of only the even values in a sequence,
we can use :func:`reduce_into <cuda.compute.algorithms.reduce_into>` with a custom binary operation:

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/reduction/sum_custom_reduction.py
    :language: python
    :start-after: # example-begin
    :caption: Reduction with a custom binary operation.

Features and Restrictions
+++++++++++++++++++++++++

User-defined operations are compiled into device code using
`Numba CUDA <https://nvidia.github.io/numba-cuda/>`_, so they inherit many
of the same features and restrictions as Numba CUDA functions:

* `Python features <https://nvidia.github.io/numba-cuda/user/cudapysupported.html>`_
  and `atomic operations <https://nvidia.github.io/numba-cuda/user/intrinsics.html>`_
  supported by Numba CUDA are also supported within user-defined operations.
* Nested functions must be decorated with ``@numba.cuda.jit`` (see the sketch
  after this list).
* Variables captured in closures or globals follow
  `Numba CUDA semantics <https://nvidia.github.io/numba-cuda/user/globals.html>`_:
  scalars and host arrays are captured by value (as constants),
  while device arrays are captured by reference.

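As a sketch of the nested-function rule, the helper below is decorated so that it
can be called from a user-defined operation. The names and the use of
``device=True`` are illustrative assumptions rather than a prescribed pattern:

.. code-block:: python

    import numba.cuda

    # helper invoked from inside a user-defined operation: must be jitted;
    # device=True marks it as a device function rather than a kernel
    @numba.cuda.jit(device=True)
    def clamp(x, lo, hi):
        return min(max(x, lo), hi)

    # user-defined binary operation that calls the jitted helper
    def saturating_add(a, b):
        return clamp(a + b, 0, 255)

    # `saturating_add` can now be passed as the `op` of an algorithm
    # such as reduce_into.
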
.. _cuda.compute.iterators:

Iterators
---------

Iterators represent sequences whose elements are computed **on the fly**. They can
be used in place of arrays in most algorithms, enabling lazy evaluation, operation
fusion, and custom data access patterns.

A :func:`CountingIterator <cuda.compute.iterators.CountingIterator>`, for example,
represents an integer sequence starting from a given value:

.. code-block:: python

    it = CountingIterator(np.int32(1))  # represents [1, 2, 3, 4, ...]

To compute the sum of the first 100 integers, we can pass a
:func:`CountingIterator <cuda.compute.iterators.CountingIterator>` directly to
:func:`reduce_into <cuda.compute.algorithms.reduce_into>`. No memory is allocated
to store the input sequence—the values are generated as needed.

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/counting_iterator_basic.py
    :language: python
    :start-after: # example-begin
    :caption: Counting iterator example.

Iterators can also be used to *fuse* operations. In the example below, a
:func:`TransformIterator <cuda.compute.iterators.TransformIterator>` lazily applies
the square operation to each element of the input sequence. The resulting iterator
is then passed to :func:`reduce_into <cuda.compute.algorithms.reduce_into>` to compute
the sum of squares.

Because the square is evaluated on demand during the reduction, there is no need
to create or store an intermediate array of squared values. The transform and the
reduction are fused into a single pass over the data.

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/transform_iterator_basic.py
    :language: python
    :start-after: # example-begin
    :caption: Transform iterator example.

Some iterators can also be used as the output of an algorithm. In the example below,
a :func:`TransformOutputIterator <cuda.compute.iterators.TransformOutputIterator>`
applies the square-root operation to the result of a reduction before writing
it into the underlying array.

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/transform_output_iterator.py
    :language: python
    :start-after: # example-begin
    :caption: Transform output iterator example.

As another example, :func:`ZipIterator <cuda.compute.iterators.ZipIterator>` combines multiple
arrays or iterators into a single logical sequence. In the example below, we combine
a counting iterator and an array, creating an iterator that yields ``(index, value)``
pairs. This combined iterator is then used as the input to
:func:`reduce_into <cuda.compute.algorithms.reduce_into>` to compute the index of
the maximum value in the array.

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/iterator/zip_iterator_counting.py
    :language: python
    :start-after: # example-begin
    :caption: Argmax using a zip iterator.

These examples illustrate a few of the patterns enabled by iterators. See the
:ref:`API reference <cuda_compute-module>` for the full set of available iterators.

.. _cuda.compute.custom_types:

Struct Types
------------

The :func:`gpu_struct <cuda.compute.struct.gpu_struct>` decorator defines
GPU-compatible struct types. These are useful when you have data laid out
as an "array of structures", similar to `NumPy structured arrays <https://numpy.org/doc/stable/user/basics.rec.html>`_.

.. literalinclude:: ../../python/cuda_cccl/tests/compute/examples/struct/struct_reduction.py
    :language: python
    :start-after: # example-begin
    :caption: Custom struct type in a reduction.

Array of Structures vs Structure of Arrays
++++++++++++++++++++++++++++++++++++++++++

When working with structured data, there are two common memory layouts:

* **Array of Structures (AoS)** — each element is a complete struct, stored
  contiguously. For example, an array of ``Point`` structs where each point's
  ``x`` and ``y`` are adjacent in memory.

* **Structure of Arrays (SoA)** — each field is stored in its own array.
  For example, separate ``x_coords`` and ``y_coords`` arrays.

``cuda.compute`` supports both layouts, as sketched after this list:

* ``gpu_struct`` — defines a true AoS type with named fields.
* ``ZipIterator`` — combines separate arrays into tuples on the fly, letting
  you work with SoA data as if it were AoS.

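The sketch below shows the same 2-D point data in each layout. The array names
and sizes are illustrative, and the top-level imports are assumptions (the
documented homes are ``cuda.compute.struct.gpu_struct`` and
``cuda.compute.iterators.ZipIterator``):

.. code-block:: python

    import cupy as cp
    import numpy as np
    from cuda.compute import ZipIterator, gpu_struct

    # AoS: one array whose elements are complete Point structs
    @gpu_struct
    class Point:
        x: np.float32
        y: np.float32

    # SoA: one array per field
    x_coords = cp.random.rand(1000).astype(cp.float32)
    y_coords = cp.random.rand(1000).astype(cp.float32)

    # ZipIterator presents the SoA arrays as a single sequence of
    # (x, y) pairs, so algorithms can consume them like the AoS layout
    # without materializing a combined array
    points = ZipIterator(x_coords, y_coords)
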
.. _cuda.compute.caching:

Caching
-------

Algorithms in ``cuda.compute`` are compiled to GPU code at runtime. To avoid
recompiling on every call, build results are cached in memory. When you invoke
an algorithm with the same configuration—same dtypes, iterator kinds, operator,
and compute capability—the cached build is reused.

What determines the cache key
+++++++++++++++++++++++++++++

Each algorithm computes a cache key from:

* **Array dtypes** — the data types of input and output arrays
* **Iterator kinds** — for iterator inputs/outputs, a descriptor of the iterator type
* **Operator identity** — for user-defined functions, the function's bytecode,
  constants, and closure contents (see below)
* **Compute capability** — the GPU architecture of the current device
* **Algorithm-specific parameters** — such as initial value dtype or determinism mode

Note that neither an array's *contents* nor its *pointer* is part of the cache
key—only its dtype. You can therefore reuse a cached algorithm across different
arrays of the same type, as the sketch below shows.

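A minimal sketch of this reuse; the array names and sizes are illustrative, and
``OpKind`` is assumed to be importable from the top-level package:

.. code-block:: python

    import cupy as cp
    import numpy as np

    import cuda.compute
    from cuda.compute import OpKind

    a = cp.arange(1_000, dtype=cp.int32)
    b = cp.arange(2_000, dtype=cp.int32)
    d_out = cp.empty(1, dtype=cp.int32)
    h_init = np.zeros(1, dtype=np.int32)

    # first call compiles the reduction and caches the build
    cuda.compute.reduce_into(
        d_in=a, d_out=d_out, op=OpKind.PLUS, num_items=len(a), h_init=h_init
    )

    # same dtypes, operator, and device: the cached build is reused,
    # even though `b` is a different array with a different size
    cuda.compute.reduce_into(
        d_in=b, d_out=d_out, op=OpKind.PLUS, num_items=len(b), h_init=h_init
    )
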
98 | 233 |
|
99 | | -* Lambda functions are not supported. |
100 | | -* Functions may invoke other functions, but the inner functions must be |
101 | | - decorated with ``@numba.cuda.jit``. |
102 | | -* Functions capturing by global reference may not work as intended. |
103 | | - Prefer using closures in these situations. |
| 234 | +How user-defined functions are cached |
| 235 | ++++++++++++++++++++++++++++++++++++++ |
104 | 236 |
|
105 | | - Here is an example of a function that captures a global variable by reference, |
106 | | - which is then used in a loop with ``unary_transform``. |
| 237 | +User-defined operators and predicates are hashed based on their bytecode, constants, |
| 238 | +and closure contents. Two functions with identical bytecode and closures produce |
| 239 | +the same cache key, even if defined at different source locations. |
107 | 240 |
|
Closure contents are recursively hashed:

* **Scalars and host arrays** — hashed by value
* **Device arrays** — hashed by pointer, shape, and dtype (not contents)
* **Nested functions** — hashed by their own bytecode and closures

Because device arrays captured in closures are hashed by pointer, changing the
array's contents does not invalidate the cache—only reassigning the variable to
a different array does, as the sketch below illustrates.

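This sketch assumes ``d_in``, ``d_out``, and ``num_items`` are already prepared;
the variable names are illustrative:

.. code-block:: python

    import cupy as cp

    import cuda.compute

    # device array, captured by reference in the closure below
    offset = cp.asarray([10], dtype=cp.int32)

    def add_offset(x):
        return x + offset[0]

    # first call compiles; `offset` is hashed by pointer, not contents
    cuda.compute.unary_transform(d_in, d_out, add_offset, num_items)

    # mutating contents leaves the cache entry valid, and the new value
    # is visible because device arrays are captured by reference
    offset[0] = 20
    cuda.compute.unary_transform(d_in, d_out, add_offset, num_items)

    # reassigning `offset` to a different array changes the pointer,
    # so the next call computes a new cache key
    offset = cp.asarray([30], dtype=cp.int32)
    cuda.compute.unary_transform(d_in, d_out, add_offset, num_items)
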
115 | 250 |
|
116 | | - Modifications to the global variable ``i`` may not be reflected in the function |
117 | | - when the function is called multiple times. Thus, the different calls to |
118 | | - ``unary_transform`` may not produce different results. This is true even though |
119 | | - the function is redefined each time in the loop. |
| 251 | +Memory considerations |
| 252 | ++++++++++++++++++++++ |
120 | 253 |
|
121 | | - To avoid this, capture the variable in a closure: |
| 254 | +The cache persists for the lifetime of the process and grows with the number of |
| 255 | +unique algorithm configurations. In long-running applications or exploratory |
| 256 | +notebooks, this can accumulate significant memory. |
122 | 257 |
|
123 | | - .. code-block:: python |
| 258 | +To clear all caches and free memory: |
124 | 259 |
|
.. code-block:: python

    import cuda.compute
    cuda.compute.clear_all_caches()

This forces recompilation on the next algorithm invocation—useful for benchmarking
compilation time or reclaiming memory.

Examples
--------

For complete runnable examples and additional usage patterns, see the
`examples directory <https://github.com/NVIDIA/CCCL/tree/main/python/cuda_cccl/tests/compute/examples>`_.

API Reference
-------------

- :ref:`cuda_compute-module`