.. _programming_model:
.. include:: ./ext_links.txt

Programming Model
=================

In a heterogeneous system there may be **multiple** devices a Python user may
want to engage. For example, it is common for a consumer-grade laptop to feature
an integrated or a discrete GPU alongside a CPU.

To harness their power, one needs to know how to answer the following three key
questions:

1. How does a Python program recognize available computational devices?
2. How does a Python workload specify computations to be offloaded to selected
   devices?
3. How does a Python application manage data sharing?

Recognizing available devices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Python package ``dpctl`` answers these questions. All the computational
devices known to the underlying DPC++ runtime can be accessed using
``dpctl.get_devices()``. A specific device of interest `can be selected
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/device_selection.html>`__
either by using a helper function, e.g. ``dpctl.select_gpu_device()``, or by
passing a filter selector string to the ``dpctl.SyclDevice`` constructor.

.. code:: python

    import dpctl

    # select a GPU device. If multiple GPU devices are present,
    # let the underlying runtime choose among them
    dev_gpu = dpctl.SyclDevice("gpu")
    # select a CPU device
    dev_cpu = dpctl.SyclDevice("cpu")

    # stand-alone function, equivalent to C++
    # `auto dev = sycl::gpu_selector().select_device();`
    dev_gpu_alt = dpctl.select_gpu_device()
    # stand-alone function, equivalent to C++
    # `auto dev = sycl::cpu_selector().select_device();`
    dev_cpu_alt = dpctl.select_cpu_device()

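
To enumerate everything the runtime can see, one may iterate over the list
returned by ``dpctl.get_devices()``; a minimal sketch:

.. code:: python

    import dpctl

    # print basic information about each device known to the DPC++ runtime
    for dev in dpctl.get_devices():
        print(dev)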

A `device object
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/devices.html>`__
can be used to query properties of the device, such as its name, vendor, maximum
number of compute units, memory size, etc.
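
A quick sketch of querying such properties (the attributes shown are a small,
illustrative subset of the ``dpctl.SyclDevice`` API):

.. code:: python

    # query descriptive properties of the selected device
    print(dev_gpu.name)               # device name
    print(dev_gpu.vendor)             # vendor string
    print(dev_gpu.max_compute_units)  # maximal number of compute units
    print(dev_gpu.global_mem_size)    # global memory size in bytes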

Specifying offload target
~~~~~~~~~~~~~~~~~~~~~~~~~

To answer the second question on the list, we first need a brief digression to
explain how offloading works in oneAPI DPC++.

.. note::
    In DPC++, a computation kernel can be specified using generic C++
    programming and then the kernel can be offloaded to any device that is
    supported by an underlying SYCL runtime. The device to which the kernel
    is offloaded is specified using an **execution queue** when *launching
    the kernel*.

    The oneAPI unified programming model brings portability across heterogeneous
    architectures. Another important aspect of the programming model is its
    inherent flexibility that makes it possible to go beyond portability and even
    strive for performance portability. A oneAPI library may be implemented
    using C++ techniques such as template metaprogramming or dynamic polymorphism
    to implement specializations for a generic kernel. If a kernel is implemented
    polymorphically, the specialized implementation will be dispatched based on
    the execution queue specified during kernel launch. The oneMKL library is an
    example of a performance-portable oneAPI library.

A computational task is offloaded for execution on a device by submitting it to
the DPC++ runtime, which inserts the task into a computational graph. Once the
device becomes available, the runtime selects a task whose dependencies are met
for execution. The computational graph as well as the device targeted by its
tasks are stored in a `SYCL queue
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/queues.html>`__
object. The task submission is therefore always associated with a queue.

Queues can be constructed directly from a device object, or by using a filter
selector string to indicate which device to target:

.. code:: python

    # construct queue from device object
    q1 = dpctl.SyclQueue(dev_gpu)
    # construct queue using filter selector
    q2 = dpctl.SyclQueue("gpu")

The computational tasks can be stored in a oneAPI native extension, in which
case their submission is orchestrated during Python API calls. Let’s consider a
function that offloads the evaluation of a polynomial at every point of a NumPy
array ``X``. Such a function needs to receive a queue object to indicate which
device the computation must be offloaded to:

.. code:: python

    # allocate space for the result
    Y = np.empty_like(X)
    # evaluate polynomial on the device targeted by the queue, Y[i] = p(X[i])
    oneapi_ext.offloaded_poly_evaluate(exec_q, X, Y)

A Python call to ``oneapi_ext.offloaded_poly_evaluate`` applied to NumPy arrays
of double precision floating-point numbers gets translated to the following
sample C++ code:

.. code:: cpp

    #include <sycl/sycl.hpp>

    void
    cpp_offloaded_poly_evaluate(
        sycl::queue q, const double *X, double *Y, size_t n) {
      // create buffers around the host allocations to make data accessible from device
      sycl::buffer<double, 1> buf_X(X, sycl::range<1>(n));
      sycl::buffer<double, 1> buf_Y(Y, sycl::range<1>(n));

      q.submit([&](sycl::handler &cgh) {
        // create buffer accessors indicating kernel data-flow pattern
        sycl::accessor acc_X(buf_X, cgh, sycl::read_only);
        sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init);

        cgh.parallel_for(sycl::range<1>(n),
          // lambda function that gets executed by different work-items with
          // different arguments in parallel
          [=](sycl::id<1> id) {
            auto x = acc_X[id];
            acc_Y[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x));
          });
      }).wait();

      return;
    }

We refer the interested reader to the excellent and freely available “`Data
Parallel C++ <https://link.springer.com/book/10.1007%2F978-1-4842-5574-2>`__”
book for the details of this data-parallel C++ code.

Our package ``numba_dpex`` allows one to write such kernels directly in Python.

.. code:: python

    import numba_dpex


    @numba_dpex.kernel
    def numba_dpex_poly(X, Y):
        i = numba_dpex.get_global_id(0)
        x = X[i]
        Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x))

Specifying the execution queue is done using a Python context manager:

.. code:: python

    import numpy as np

    X = np.random.randn(10**6)
    Y = np.empty_like(X)

    # q1 is the queue constructed earlier from the GPU device
    with dpctl.device_context(q1):
        # apply the kernel to elements of X, writing value into Y,
        # while executing using given queue
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

The argument to ``device_context`` can be a queue object, a device object for
which a temporary queue will be created, or a filter selector string. Thus we
could have equally used ``dpctl.device_context(dev_gpu)`` or
``dpctl.device_context("gpu")``.
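
For instance, each of the following forms targets a GPU device, reusing objects
defined above (a sketch assuming a GPU is available):

.. code:: python

    with dpctl.device_context(q1):          # an existing queue object
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

    with dpctl.device_context(dev_gpu):     # a device object; a temporary queue is created
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

    with dpctl.device_context("gpu"):       # a filter selector string
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)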

Note that in these examples data sharing was implicitly managed for us: in the
case of calling a function from a precompiled oneAPI native extension, data
sharing was managed by the DPC++ runtime, while in the case of using a
``numba_dpex`` kernel it was managed during the execution of the kernel's
``__call__`` method.

Data sharing
~~~~~~~~~~~~

Implicit management of data is surely convenient, but its use in interpreted
code comes at a performance cost. For every Python API call, the runtime must
implicitly copy data from the host to the device before kernel execution
commences and then copy some (or all) of it back after the execution completes.

``dpctl`` provides for allocating memory directly accessible to kernels
executing on a device using SYCL’s Unified Shared Memory (`USM
<https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm>`__)
feature. It also implements the USM-based ND-array object
``dpctl.tensor.usm_ndarray`` that conforms to the `array API standard
<https://data-apis.org/array-api/latest/>`__.

.. code:: python

    import dpctl.tensor as dpt

    # allocate array of doubles using USM-device allocation on GPU device
    X = dpt.arange(0.0, stop=1.0, step=1e-6, device="gpu", usm_type="device")
    # allocate array for the output
    Y = dpt.empty_like(X)

    # execution queue is inferred from the allocation queues.
    # The kernel is executed on the same device where the arrays were allocated
    numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

The execution queue can be unambiguously determined in this case since both
arguments are USM arrays with the same allocation queues and ``X.sycl_queue ==
Y.sycl_queue`` evaluates to ``True``. Should the allocation queues be different,
such an inference becomes ambiguous and ``numba_dpex`` raises
``IndeterminateExecutionQueueError``, advising the user to explicitly migrate
the data.

Migration can be accomplished either by using ``dpctl.tensor.asarray(X,
device=target_device)`` to create a copy, or by using the
``X.to_device(target_device)`` method.

A USM array can be copied back into a NumPy array using ``dpt.asnumpy(Y)`` if
needed.
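
For illustration, assuming ``X`` and ``Y`` are the USM arrays allocated above,
migration to a CPU device (an illustrative target) and the copy back to NumPy
may look as follows:

.. code:: python

    # migrate X to a CPU device by creating a copy
    X_cpu = dpt.asarray(X, device="cpu")
    # alternatively, migrate using the to_device method
    X_cpu_alt = X.to_device("cpu")

    # copy the result back into a NumPy array on the host
    Y_np = dpt.asnumpy(Y)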

Compute follows data
~~~~~~~~~~~~~~~~~~~~

Automatic deduction of the execution queue from allocation queues is consistent
with “`local control for data allocation target
<https://data-apis.org/array-api/latest/design_topics/device_support.html>`__”
in the array API standard. The user has full control over memory allocation
through three keyword arguments present in all `array creation functions
<https://data-apis.org/array-api/latest/API_specification/creation_functions.html>`__:
``device``, ``usm_type``, and ``sycl_queue``. For example, one might write the
following (the shapes and values are illustrative):

.. code:: python

    import dpctl
    import dpctl.tensor as dpt

    # allocate using a filter selector string and USM-device memory
    X = dpt.zeros(2048, dtype="float64", device="gpu", usm_type="device")

    # allocate on the device of an existing queue, using USM-shared memory
    q = dpctl.SyclQueue("cpu")
    Y = dpt.ones(2048, dtype="float64", sycl_queue=q, usm_type="shared")

The keyword ``device`` is `mandated by the array API
<https://data-apis.org/array-api/latest/design_topics/device_support.html#syntax-for-device-assignment>`__.
In ``dpctl.tensor`` the allowed values of the keyword are

- Filter selector string, e.g. ``device="gpu:0"``
- Existing ``dpctl.SyclDevice`` object, e.g. ``device=dev_gpu``
- Existing ``dpctl.SyclQueue`` object
- ``dpctl.tensor.Device`` object instance obtained from an existing USM array,
  e.g. ``device=X.device``

In all cases, an allocation queue object will be constructed as described
`earlier <#specifying-offload-target>`__ and stored in the array instance,
accessible with ``X.sycl_queue``. Instead of using the ``device`` keyword, one
can alternatively use the ``sycl_queue`` keyword to directly specify a
``dpctl.SyclQueue`` object to be used as the allocation queue, which may improve
readability.
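
As a sketch (the shapes and the ``"gpu"`` selector are illustrative), allocating
with an explicit queue and then with the ``Device`` object of the resulting
array places both arrays on the same allocation queue:

.. code:: python

    import dpctl
    import dpctl.tensor as dpt

    q = dpctl.SyclQueue("gpu")
    A = dpt.empty(1024, dtype="float64", sycl_queue=q)
    # reuse the allocation queue of A via its Device object
    B = dpt.empty(1024, dtype="float64", device=A.device)
    assert A.sycl_queue == B.sycl_queue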

The rationale for storing the allocation queue in the array is that kernels
submitted to this queue are guaranteed to be able to correctly dereference (i.e.
access) the USM pointer. Array operations that only involve this single USM
array can thus execute on the allocation queue, and the output array can be
allocated on this same allocation queue with the same USM type as the input
array.

.. note::
    Reusing the allocation queue of the input array ensures that the
    computational tasks behind the API call can access the array without making
    implicit copies, and that the output array is allocated on the same device
    as the input.

Compute follows data is the rule prescribing the deduction of the execution
queue, the allocation queue, and the USM type of the result when multiple USM
arrays are combined. It stipulates that arrays can be combined if and only if
their allocation *queues are the same* as measured by the ``==`` operator (i.e.
``X.sycl_queue == Y.sycl_queue`` must evaluate to ``True``). Equal queues refer
to the same underlying task graph and DPC++ scheduler.

An attempt to combine USM arrays with different allocation queues raises an
exception advising the user to migrate the data. Migration can be accomplished
either by using ``dpctl.tensor.asarray(X, device=Y.device)`` to create a copy,
or by using the ``X.to_device(Y.device)`` method, which can sometimes perform
the migration more efficiently.
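
A minimal sketch of this pattern, reusing the ``numba_dpex_poly`` kernel defined
earlier (the control flow shown is illustrative rather than a prescribed API):

.. code:: python

    import dpctl.tensor as dpt

    # combine X and Y only if compute follows data allows it
    if X.sycl_queue == Y.sycl_queue:
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)
    else:
        # migrate Y onto the allocation queue of X before combining
        Y = dpt.asarray(Y, device=X.device)
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)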

.. warning::
    ``dpctl`` and ``numba_dpex`` are both under heavy development. Feel free to
    file an issue on GitHub or reach out on Gitter should you encounter any
    issues.