
Commit cec6ad8

chudur-budur authored and Diptorup Deb committed
Convert programming_model.md to rst
1 parent 8a0f016

3 files changed: +274 −240 lines changed
docs/source/user_guide/index.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ User Guide
 .. toctree::
    :maxdepth: 2
 
-   programming_model.rst
+   programming_model
    kernel_programming/index
    dpnp_offload
    debugging/index

docs/source/user_guide/programming_model.rst

Lines changed: 273 additions & 2 deletions
@@ -4,5 +4,276 @@
 Programming Model
 =================
 
-.. include:: ./programming_model_markdown.md
-   :parser: myst_parser.sphinx_

In a heterogeneous system there may be **multiple** devices that a Python user
may want to engage. For example, it is common for a consumer-grade laptop to
feature an integrated or a discrete GPU alongside a CPU.

To harness their power, one needs to answer the following three key questions:

1. How does a Python program recognize available computational devices?
2. How does a Python workload specify computations to be offloaded to selected
   devices?
3. How does a Python application manage data sharing?

Recognizing available devices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Python package ``dpctl`` answers the first of these questions. All the
computational devices known to the underlying DPC++ runtime can be accessed
using ``dpctl.get_devices()``. A specific device of interest `can be selected
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/device_selection.html>`__
either using a helper function, e.g. ``dpctl.select_gpu_device()``, or by
passing a filter selector string to the ``dpctl.SyclDevice`` constructor.

.. code:: python

   import dpctl

   # select a GPU device; if multiple GPU devices are present,
   # let the underlying runtime select among them
   dev_gpu = dpctl.SyclDevice("gpu")
   # select a CPU device
   dev_cpu = dpctl.SyclDevice("cpu")

   # stand-alone function, equivalent to C++
   # `auto dev = sycl::gpu_selector().select_device();`
   dev_gpu_alt = dpctl.select_gpu_device()
   # stand-alone function, equivalent to C++
   # `auto dev = sycl::cpu_selector().select_device();`
   dev_cpu_alt = dpctl.select_cpu_device()
45+
46+
A `device object
47+
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/devices.html>`__
48+
can be used to query properies of the device, such as its name, vendor, maximal
49+
number of computational units, memory size, etc.
50+
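
For example, a few such properties can be printed for every known device. This
is a minimal sketch; the attribute names used below (``name``, ``vendor``,
``max_compute_units``, ``global_mem_size``) are assumed to mirror the
corresponding SYCL device descriptors exposed by ``dpctl.SyclDevice``.

.. code:: python

   import dpctl

   for dev in dpctl.get_devices():
       # print basic identification and capacity information for each device
       print(dev.name, dev.vendor)
       print("  compute units:", dev.max_compute_units)
       print("  global memory (bytes):", dev.global_mem_size)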

Specifying offload target
~~~~~~~~~~~~~~~~~~~~~~~~~

To answer the second question, we first need a short digression to explain how
offloading works in oneAPI DPC++.

.. note::
   In DPC++, a computation kernel can be specified using generic C++
   programming and then the kernel can be offloaded to any device that is
   supported by an underlying SYCL runtime. The device to which the kernel
   is offloaded is specified using an **execution queue** when *launching
   the kernel*.

The oneAPI unified programming model brings portability across heterogeneous
architectures. Another important aspect of the programming model is its
inherent flexibility, which makes it possible to go beyond portability and even
strive for performance portability. A oneAPI library may be implemented using
C++ techniques such as template metaprogramming or dynamic polymorphism to
implement specializations for a generic kernel. If a kernel is implemented
polymorphically, the specialized implementation will be dispatched based on
the execution queue specified during kernel launch. The oneMKL library is an
example of a performance-portable oneAPI library.

A computational task is offloaded for execution on a device by submitting it to
the DPC++ runtime, which inserts the task into a computational graph. Once the
device becomes available, the runtime selects for execution a task whose
dependencies are met. The computational graph, as well as the device targeted
by its tasks, is stored in a `SYCL queue
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/queues.html>`__
object. Task submission is therefore always associated with a queue.

Queues can be constructed directly from a device object, or by using a filter
selector string to indicate which device the queue should target:

.. code:: python

   # construct a queue from a device object
   q1 = dpctl.SyclQueue(dev_gpu)
   # construct a queue using a filter selector string
   q2 = dpctl.SyclQueue("gpu")

The computational tasks can be implemented in a oneAPI native extension, in
which case their submission is orchestrated during Python API calls. Let's
consider a function that offloads the evaluation of a polynomial for every
point of a NumPy array ``X``. Such a function needs to receive a queue object
to indicate which device the computation must be offloaded to:

.. code:: python

   import numpy as np

   # allocate space for the result
   Y = np.empty_like(X)
   # evaluate the polynomial on the device targeted by the queue, Y[i] = p(X[i])
   oneapi_ext.offloaded_poly_evaluate(exec_q, X, Y)

A Python call to ``oneapi_ext.offloaded_poly_evaluate`` applied to NumPy arrays
of double precision floating point numbers gets translated to the following
sample C++ code:

.. code:: cpp

   void
   cpp_offloaded_poly_evaluate(
       sycl::queue q, const double *X, double *Y, size_t n)
   {
       // create buffers from malloc allocations to make data accessible from the device
       sycl::buffer<double, 1> buf_X(X, sycl::range<1>(n));
       sycl::buffer<double, 1> buf_Y(Y, sycl::range<1>(n));

       q.submit([&](sycl::handler &cgh) {
           // create buffer accessors indicating the kernel data-flow pattern
           sycl::accessor acc_X(buf_X, cgh, sycl::read_only);
           sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init);

           cgh.parallel_for(sycl::range<1>(n),
               // lambda function that gets executed by different work-items with
               // different arguments in parallel
               [=](sycl::id<1> id) {
                   auto x = acc_X[id];
                   acc_Y[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x));
               });
       }).wait();
   }

We refer the interested reader to the excellent and freely available "`Data
Parallel C++ <https://link.springer.com/book/10.1007%2F978-1-4842-5574-2>`__"
book for further details of data parallel programming in C++.

Our package ``numba_dpex`` allows one to write kernels directly in Python.

.. code:: python

   import numba_dpex


   @numba_dpex.kernel
   def numba_dpex_poly(X, Y):
       i = numba_dpex.get_global_id(0)
       x = X[i]
       Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x))

Specifying the execution queue is done using a Python context manager:

.. code:: python

   import numpy as np

   X = np.random.randn(10**6)
   Y = np.empty_like(X)

   # q is one of the queues constructed earlier, e.g. dpctl.SyclQueue("gpu")
   with dpctl.device_context(q):
       # apply the kernel to elements of X, writing values into Y,
       # while executing using the given queue
       numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

The argument to ``device_context`` can be a queue object, a device object for
which a temporary queue will be created, or a filter selector string. Thus we
could have equally used ``dpctl.device_context(dev_gpu)`` or
``dpctl.device_context("gpu")``.
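
As a brief sketch, the three forms below are equivalent ways of selecting the
offload target (reusing ``q1``, ``dev_gpu``, ``X`` and ``Y`` from the earlier
examples):

.. code:: python

   # pass an existing queue object
   with dpctl.device_context(q1):
       numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

   # pass a device object; a temporary queue is created for it
   with dpctl.device_context(dev_gpu):
       numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

   # pass a filter selector string
   with dpctl.device_context("gpu"):
       numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)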

Note that in these examples data sharing was implicitly managed for us: in the
case of calling a function from a precompiled oneAPI native extension, data
sharing was managed by the DPC++ runtime, while in the case of using a
``numba_dpex`` kernel it was managed during execution of the kernel's
``__call__`` method.

Data sharing
~~~~~~~~~~~~

Implicit management of data is surely convenient, but its use in interpreted
code comes at a performance cost. A runtime must implicitly copy data from the
host to the device before the kernel execution commences, and then copy some
(or all) of it back after the execution completes, for every Python API call.

``dpctl`` provides for allocating memory directly accessible to kernels
executing on a device using SYCL's Unified Shared Memory (`USM
<https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm>`__)
feature. It also implements the USM-based ND-array object
``dpctl.tensor.usm_ndarray`` that conforms to the `array API standard
<https://data-apis.org/array-api/latest/>`__.

.. code:: python

   import dpctl.tensor as dpt

   # allocate an array of doubles using a USM-device allocation on a GPU device
   X = dpt.arange(0.0, 1.0, 1e-6, device="gpu", usm_type="device")
   # allocate an array for the output
   Y = dpt.empty_like(X)

   # the execution queue is inferred from the allocation queues;
   # the kernel is executed on the same device where the arrays were allocated
   numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

The execution queue can be unambiguously determined in this case since both
arguments are USM arrays with the same allocation queues, and ``X.sycl_queue ==
Y.sycl_queue`` evaluates to ``True``. Should the allocation queues be
different, such an inference becomes ambiguous and ``numba_dpex`` raises
``IndeterminateExecutionQueueError``, advising the user to explicitly migrate
the data.

Migration can be accomplished either by using ``dpctl.tensor.asarray(X,
device=target_device)`` to create a copy, or by using the
``X.to_device(target_device)`` method.

A USM array can be copied back into a NumPy array using ``dpt.asnumpy(Y)`` if
needed.
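
The following is a minimal sketch of both migration paths. It assumes the
array was allocated on a CPU device while the kernel should run on a GPU, and
that ``to_device`` accepts the same flexible device specifiers as the array
creation functions.

.. code:: python

   import dpctl.tensor as dpt

   # suppose X was allocated on the CPU but the kernel should run on the GPU
   X = dpt.arange(0.0, 1.0, 1e-6, device="cpu")

   # option 1: create a copy on the target device
   X_gpu = dpt.asarray(X, device="gpu")
   # option 2: migrate using the to_device method
   X_gpu_alt = X.to_device("gpu")

   # copy the result back into a NumPy array when needed
   X_np = dpt.asnumpy(X_gpu)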

Compute follows data
~~~~~~~~~~~~~~~~~~~~

Automatic deduction of the execution queue from the allocation queues is
consistent with "`local control for data allocation target
<https://data-apis.org/array-api/latest/design_topics/device_support.html>`__"
in the array API standard. The user has full control over memory allocation
through three keyword arguments present in all `array creation functions
<https://data-apis.org/array-api/latest/API_specification/creation_functions.html>`__.
For example, consider

.. code:: python

   # Use usm_type = 'device' to get a USM-device allocation (default),
   #     usm_type = 'shared' to get a USM-shared allocation,
   #     usm_type = 'host' to get a USM-host allocation
   # def dpt.empty(..., device=None, usm_type=None, sycl_queue=None) -> dpctl.tensor.usm_ndarray: ...
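
As a short illustrative sketch, the same array can be allocated with each of
the three USM types:

.. code:: python

   import dpctl.tensor as dpt

   # same shape and device, three different USM allocation types
   a_device = dpt.empty(1024, device="gpu", usm_type="device")
   a_shared = dpt.empty(1024, device="gpu", usm_type="shared")
   a_host = dpt.empty(1024, device="gpu", usm_type="host")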

The keyword ``device`` is `mandated by the array API
<https://data-apis.org/array-api/latest/design_topics/device_support.html#syntax-for-device-assignment>`__.
In ``dpctl.tensor`` the allowed values of the keyword are:

- A filter selector string, e.g. ``device="gpu:0"``
- An existing ``dpctl.SyclDevice`` object, e.g. ``device=dev_gpu``
- An existing ``dpctl.SyclQueue`` object
- A ``dpctl.tensor.Device`` object instance obtained from an existing USM
  array, e.g. ``device=X.device``

In all cases, an allocation queue object will be constructed as described
`earlier <#specifying-offload-target>`__ and stored in the array instance,
accessible with ``X.sycl_queue``. Instead of the ``device`` keyword, one can
alternatively use the ``sycl_queue`` keyword to directly specify a
``dpctl.SyclQueue`` object to be used as the allocation queue.
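
A minimal sketch of these equivalent spellings, reusing ``dev_gpu`` and the
USM array ``X`` from the earlier examples:

.. code:: python

   import dpctl
   import dpctl.tensor as dpt

   a = dpt.empty(1024, device="gpu:0")                     # filter selector string
   b = dpt.empty(1024, device=dev_gpu)                     # dpctl.SyclDevice object
   c = dpt.empty(1024, device=dpctl.SyclQueue("gpu"))      # dpctl.SyclQueue object
   d = dpt.empty(1024, device=X.device)                    # dpctl.tensor.Device of an existing array
   e = dpt.empty(1024, sycl_queue=dpctl.SyclQueue("gpu"))  # explicit allocation queue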

The rationale for storing the allocation queue in the array is that kernels
submitted to this queue are guaranteed to be able to correctly dereference
(i.e. access) the USM pointer. Array operations that only involve this single
USM array can thus execute on the allocation queue, and the output array can be
allocated on this same allocation queue with the same USM type as the input
array.

.. note::
   Reusing the allocation queue of the input array ensures that the
   computational tasks behind the API call can access the array without making
   implicit copies and that the output array is allocated on the same device as
   the input.
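
For instance, a unary operation on a single USM array can be expected to
produce an output with the same allocation queue and USM type as its input.
A minimal sketch, assuming ``dpctl.tensor`` provides the element-wise ``sin``
function:

.. code:: python

   import dpctl.tensor as dpt

   X = dpt.arange(0.0, 1.0, 1e-6, device="gpu", usm_type="device")
   # the operation executes on X's allocation queue ...
   Y = dpt.sin(X)
   # ... and the result inherits the allocation queue and USM type
   assert Y.sycl_queue == X.sycl_queue
   assert Y.usm_type == X.usm_type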

Compute follows data is the rule prescribing deduction of the execution queue,
the allocation queue, and the USM type for the result when multiple USM arrays
are combined. It stipulates that arrays can be combined if and only if their
allocation *queues are the same* as measured by the ``==`` operator (i.e.
``X.sycl_queue == Y.sycl_queue`` must evaluate to ``True``). Equal queues refer
to the same underlying task graph and DPC++ scheduler.

An attempt to combine USM arrays with different allocation queues raises an
exception advising the user to migrate the data. Migration can be accomplished
either by using ``dpctl.tensor.asarray(X, device=Y.device)`` to create a copy,
or by using the ``X.to_device(Y.device)`` method, which can sometimes perform
the migration more efficiently.

.. warning::
   ``dpctl`` and ``numba_dpex`` are both under heavy development. Feel free to
   file an issue on GitHub or reach out on Gitter should you encounter any
   issues.
