
Commit 5240312

Author: Diptorup Deb (committed)
Outline programming model
1 parent 3151d31 commit 5240312

12 files changed: +938 -413 lines changed

docs/source/bibliography.bib

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
@techreport{scott70,
  author      = {Dana Scott},
  institution = {OUCL},
  month       = {November},
  number      = {PRG02},
  pages       = {30},
  title       = {Outline of a Mathematical Theory of Computation},
  year        = {1970}
}

@article{PLOTKIN20043,
  abstract = {We review the origins of structural operational semantics. The main publication `A Structural Approach to Operational Semantics,' also known as the `Aarhus Notes,' appeared in 1981 [G.D. Plotkin, A structural approach to operational semantics, DAIMI FN-19, Computer Science Department, Aarhus University, 1981]. The development of the ideas dates back to the early 1970s, involving many people and building on previous work on programming languages and logic. The former included abstract syntax, the SECD machine, and the abstract interpreting machines of the Vienna school; the latter included the λ-calculus and formal systems. The initial development of structural operational semantics was for simple functional languages, more or less variations of the λ-calculus; after that the ideas were gradually extended to include languages with parallel features, such as Milner's CCS. This experience set the ground for a more systematic exposition, the subject of an invited course of lectures at Aarhus University; some of these appeared in print as the 1981 Notes. We discuss the content of these lectures and some related considerations such as `small state' versus `grand state,' structural versus compositional semantics, the influence of the Scott–Strachey approach to denotational semantics, the treatment of recursion and jumps, and static semantics. We next discuss relations with other work and some immediate further development. We conclude with an account of an old, previously unpublished, idea: an alternative, perhaps more readable, graphical presentation of systems of rules for operational semantics.},
  author   = {Gordon D. Plotkin},
  doi      = {10.1016/j.jlap.2004.03.009},
  issn     = {1567-8326},
  journal  = {The Journal of Logic and Algebraic Programming},
  keywords = {Semantics of programming languages, (Structural) operational semantics, Structural induction, (Labelled) transition systems, λ-calculus, Concurrency, Big-step semantics, Small-step semantics, Abstract machines, Static semantics},
  note     = {Structural Operational Semantics},
  pages    = {3--15},
  title    = {The origins of structural operational semantics},
  url      = {https://www.sciencedirect.com/science/article/pii/S1567832604000268},
  volume   = {60--61},
  year     = {2004}
}

docs/source/conf.py

Lines changed: 3 additions & 0 deletions
@@ -31,8 +31,11 @@
     "sphinxcontrib.googleanalytics",
     "myst_parser",
     "autoapi.extension",
+    "sphinxcontrib.bibtex",
 ]

+bibtex_bibfiles = ["bibliography.bib"]
+
 # Add any paths that contain templates here, relative to this directory.
 # templates_path = ['_templates']
 templates_path = []

docs/source/ext_links.txt

Lines changed: 4 additions & 1 deletion
@@ -2,7 +2,7 @@
 **********************************************************
 THESE ARE EXTERNAL PROJECT LINKS USED IN THE DOCUMENTATION
 **********************************************************
+.. _math: https://docs.python.org/3/library/math.html
 .. _NumPy*: https://numpy.org/
 .. _Numba*: https://numba.pydata.org/
 .. _numba-dpex: https://github.com/IntelPython/numba-dpex
@@ -14,6 +14,7 @@
 .. _SYCL*: https://www.khronos.org/sycl/
 .. _dpctl: https://intelpython.github.io/dpctl/latest/index.html
 .. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html
+.. _DLPack: https://dmlc.github.io/dlpack/latest/
 .. _Dpnp: https://intelpython.github.io/dpnp/
 .. _dpnp: https://intelpython.github.io/dpnp/
 .. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/
@@ -28,3 +29,5 @@
 .. _oneDPL: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-library.html#gs.5izf63
 .. _UXL: https://uxlfoundation.org/
 .. _oneAPI GPU optimization guide: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/general-purpose-computing-on-gpu.html
+.. _dpctl.tensor.usm_ndarray: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/usm_ndarray.html#dpctl.tensor.usm_ndarray
+.. _dpnp.ndarray: https://intelpython.github.io/dpnp/reference/ndarray.html

docs/source/programming_model.old.rst

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
.. _programming_model:
.. include:: ./ext_links.txt

Programming Model
=================

In a heterogeneous system there may be **multiple** devices that a Python user
may want to engage. For example, it is common for a consumer-grade laptop to
feature an integrated or a discrete GPU alongside a CPU.

To harness their power, one needs to know how to answer the following three key
questions:

1. How does a Python program recognize available computational devices?
2. How does a Python workload specify computations to be offloaded to selected
   devices?
3. How does a Python application manage data sharing?

Recognizing available devices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Python package ``dpctl`` answers this question. All the computational
devices known to the underlying DPC++ runtime can be accessed using
``dpctl.get_devices()``. A specific device of interest `can be selected
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/device_selection.html>`__
either using a helper function, e.g. ``dpctl.select_gpu_device()``, or by
passing a filter selector string to the ``dpctl.SyclDevice`` constructor.

.. code:: python

    import dpctl

    # select a GPU device. If multiple devices are present,
    # let the underlying runtime select from the GPUs
    dev_gpu = dpctl.SyclDevice("gpu")
    # select a CPU device
    dev_cpu = dpctl.SyclDevice("cpu")

    # stand-alone function, equivalent to C++
    # `auto dev = sycl::gpu_selector().select_device();`
    dev_gpu_alt = dpctl.select_gpu_device()
    # stand-alone function, equivalent to C++
    # `auto dev = sycl::cpu_selector().select_device();`
    dev_cpu_alt = dpctl.select_cpu_device()

A `device object
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/devices.html>`__
can be used to query properties of the device, such as its name, vendor,
maximum number of compute units, memory size, etc.

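For instance, a minimal sketch of enumerating the available devices and
printing a few of their attributes might look as follows (the attributes
shown, such as ``max_compute_units`` and ``global_mem_size``, are ones exposed
by ``dpctl.SyclDevice``):

.. code:: python

    import dpctl

    # enumerate every device known to the DPC++ runtime
    for dev in dpctl.get_devices():
        # each item is a dpctl.SyclDevice; print a few of its properties
        print(dev.name, dev.vendor)
        print("  compute units :", dev.max_compute_units)
        print("  global memory :", dev.global_mem_size, "bytes")
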
Specifying offload target
~~~~~~~~~~~~~~~~~~~~~~~~~

To answer the second question on the list, we first need a digression to
explain offloading in oneAPI DPC++.

.. note::
    In DPC++, a computation kernel can be specified using generic C++
    programming and then offloaded to any device that is supported by the
    underlying SYCL runtime. The device to which the kernel is offloaded is
    specified using an **execution queue** when *launching the kernel*.

The oneAPI unified programming model brings portability across heterogeneous
architectures. Another important aspect of the programming model is its
inherent flexibility, which makes it possible to go beyond portability and
even strive for performance portability. A oneAPI library may use C++
techniques such as template metaprogramming or dynamic polymorphism to
implement specializations of a generic kernel. If a kernel is implemented
polymorphically, the specialized implementation is dispatched based on the
execution queue specified during kernel launch. The oneMKL library is an
example of a performance-portable oneAPI library.

A computational task is offloaded for execution on a device by submitting it
to the DPC++ runtime, which inserts the task into a computational graph. Once
the device becomes available, the runtime selects for execution a task whose
dependencies have been met. The computational graph, as well as the device
targeted by its tasks, is stored in a `SYCL queue
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/queues.html>`__
object. Task submission is therefore always associated with a queue.

Queues can be constructed directly from a device object, or by using a filter
selector string to indicate which device to target:

.. code:: python

    # construct queue from device object
    q1 = dpctl.SyclQueue(dev_gpu)
    # construct queue using filter selector
    q2 = dpctl.SyclQueue("gpu")

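A constructed queue remembers the device it targets. As a quick sketch
(reusing ``q1`` and ``q2`` from above, and the ``sycl_device`` property of
``dpctl.SyclQueue``):

.. code:: python

    # the device targeted by a queue is available as its ``sycl_device``
    print(q1.sycl_device.name)
    print(q2.sycl_device.name)
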
The computational tasks can be implemented in a oneAPI native extension, in
which case their submission is orchestrated during Python API calls. Let’s
consider a function that offloads the evaluation of a polynomial for every
point of a NumPy array ``X``. Such a function needs to receive a queue object
to indicate which device the computation must be offloaded to:

.. code:: python

    # allocate space for the result
    Y = np.empty_like(X)
    # exec_q is a dpctl.SyclQueue indicating the offload target (e.g., q1 above)
    # evaluate polynomial on the device targeted by the queue, Y[i] = p(X[i])
    oneapi_ext.offloaded_poly_evaluate(exec_q, X, Y)

The Python call to ``oneapi_ext.offloaded_poly_evaluate`` applied to NumPy
arrays of double-precision floating-point numbers gets translated to the
following sample C++ code:

.. code:: cpp

    void
    cpp_offloaded_poly_evaluate(
        sycl::queue q, const double *X, double *Y, size_t n) {
      // create buffers from malloc allocations to make data accessible from device
      sycl::buffer<double, 1> buf_X(X, sycl::range<1>(n));
      sycl::buffer<double, 1> buf_Y(Y, sycl::range<1>(n));

      q.submit([&](sycl::handler &cgh) {
        // create buffer accessors indicating kernel data-flow pattern
        sycl::accessor acc_X(buf_X, cgh, sycl::read_only);
        sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init);

        cgh.parallel_for(sycl::range<1>(n),
          // lambda function that gets executed by different work-items with
          // different arguments in parallel
          [=](sycl::id<1> id) {
            auto x = acc_X[id];
            acc_Y[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x));
          });
      }).wait();

      return;
    }

We refer the interested reader to the excellent and freely available “`Data
Parallel C++ <https://link.springer.com/book/10.1007%2F978-1-4842-5574-2>`__”
book for the details of data-parallel programming in C++ with SYCL.

Our package ``numba_dpex`` allows one to write kernels directly in Python.

.. code:: python

    import numba_dpex


    @numba_dpex.kernel
    def numba_dpex_poly(X, Y):
        i = numba_dpex.get_global_id(0)
        x = X[i]
        Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x))

Specifying the execution queue is done using a Python context manager:

.. code:: python

    import numpy as np

    X = np.random.randn(10**6)
    Y = np.empty_like(X)

    # q is an execution queue, e.g. q1 or q2 constructed above
    with dpctl.device_context(q):
        # apply the kernel to elements of X, writing values into Y,
        # while executing using the given queue
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

The argument to ``device_context`` can be a queue object, a device object for
which a temporary queue will be created, or a filter selector string. Thus we
could have equally used ``dpctl.device_context(dev_gpu)`` or
``dpctl.device_context("gpu")``.

Note that in these examples data sharing was managed for us implicitly: in the
case of calling a function from a precompiled oneAPI native extension, data
sharing was managed by the DPC++ runtime, while in the case of using a
``numba_dpex`` kernel it was managed during the execution of the kernel’s
``__call__`` method.

Data sharing
~~~~~~~~~~~~

Implicit management of data is surely convenient, but its use in interpreted
code comes at a performance cost: a runtime must implicitly copy data from the
host to the device before kernel execution commences, and then copy some (or
all) of it back after the execution completes, for every Python API call.

``dpctl`` provides for allocating memory directly accessible to kernels
executing on a device using SYCL’s Unified Shared Memory (`USM
<https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm>`__)
feature. It also implements a USM-based ND-array object,
``dpctl.tensor.usm_ndarray``, that conforms to the `array API standard
<https://data-apis.org/array-api/latest/>`__.

.. code:: python

    import dpctl.tensor as dpt

    # allocate an array of doubles using USM-device allocation on a GPU device
    X = dpt.arange(0.0, stop=1.0, step=1e-6, device="gpu", usm_type="device")
    # allocate an array for the output
    Y = dpt.empty_like(X)

    # the execution queue is inferred from the allocation queues.
    # The kernel is executed on the same device where the arrays were allocated
    numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

The execution queue can be unambiguously determined in this case since both
arguments are USM arrays with the same allocation queues and ``X.sycl_queue ==
Y.sycl_queue`` evaluates to ``True``. Should the allocation queues be
different, such an inference becomes ambiguous and ``numba_dpex`` raises
``IndeterminateExecutionQueueError`` advising the user to explicitly migrate
the data.

Migration can be accomplished either by using ``dpctl.tensor.asarray(X,
device=target_device)`` to create a copy, or by using the
``X.to_device(target_device)`` method.

A USM array can be copied back into a NumPy array using ``dpt.asnumpy(Y)`` if
needed.

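As a small illustration, a hypothetical round trip between NumPy and USM
arrays might look as follows (the ``"gpu"`` device string is only an example):

.. code:: python

    import numpy as np
    import dpctl.tensor as dpt

    X_np = np.linspace(0.0, 1.0, num=10**6)

    # copy host data into a USM array allocated on a GPU device
    X_usm = dpt.asarray(X_np, device="gpu", usm_type="device")

    # ... run kernels on X_usm ...

    # copy the USM array back into a NumPy array on the host
    X_back = dpt.asnumpy(X_usm)
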
Compute follows data
~~~~~~~~~~~~~~~~~~~~

Automatic deduction of the execution queue from the allocation queues is
consistent with “`local control for data allocation target
<https://data-apis.org/array-api/latest/design_topics/device_support.html>`__”
in the array API standard. The user has full control over memory allocation
through three keyword arguments present in all `array creation functions
<https://data-apis.org/array-api/latest/API_specification/creation_functions.html>`__.
For example, consider

.. code:: python

    # illustrative sketch: the three allocation keywords are ``device``,
    # ``usm_type``, and ``sycl_queue`` (the values below are only examples)
    X = dpt.empty(10**4, dtype="float64", device="gpu", usm_type="device")

The keyword ``device`` is `mandated by the array API
<https://data-apis.org/array-api/latest/design_topics/device_support.html#syntax-for-device-assignment>`__.
In ``dpctl.tensor`` the allowed values of the keyword are:

- A filter selector string, e.g. ``device="gpu:0"``
- An existing ``dpctl.SyclDevice`` object, e.g. ``device=dev_gpu``
- An existing ``dpctl.SyclQueue`` object
- A ``dpctl.tensor.Device`` object instance obtained from an existing USM
  array, e.g. ``device=X.device``

In all cases, an allocation queue object will be constructed as described
`earlier <#specifying-offload-target>`__ and stored in the array instance,
accessible as ``X.sycl_queue``. Instead of using the ``device`` keyword, one
can alternatively use the ``sycl_queue`` keyword to directly specify a
``dpctl.SyclQueue`` object to be used as the allocation queue.

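For illustration, a sketch of equivalent ways to specify the allocation target
follows; the values used (``"gpu:0"``, ``dev_gpu``, ``q1``) are only examples
reusing objects constructed earlier:

.. code:: python

    # filter selector string
    A = dpt.zeros(100, device="gpu:0")
    # existing SyclDevice object
    B = dpt.zeros(100, device=dev_gpu)
    # Device descriptor taken from another USM array
    C = dpt.zeros(100, device=A.device)
    # or pass an allocation queue directly
    D = dpt.zeros(100, sycl_queue=q1)

    # every USM array records where and how it was allocated
    print(D.sycl_queue, D.device, D.usm_type)
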
The rationale for storing the allocation queue in the array is that kernels
submitted to this queue are guaranteed to be able to correctly dereference
(i.e. access) the USM pointer. Array operations that involve only this single
USM array can thus execute on the allocation queue, and the output array can
be allocated on the same allocation queue with the same USM type as the input
array.

.. note::
    Reusing the allocation queue of the input array ensures that the
    computational tasks behind the API call can access the array without
    making implicit copies, and that the output array is allocated on the
    same device as the input.

Compute follows data is the rule prescribing deduction of the execution queue,
the allocation queue, and the USM type of the result when multiple USM arrays
are combined. It stipulates that arrays can be combined if and only if their
allocation *queues are the same*, as measured by the ``==`` operator (i.e.
``X.sycl_queue == Y.sycl_queue`` must evaluate to ``True``). Equal queues
refer to the same underlying task graph and DPC++ scheduler.

An attempt to combine USM arrays with different allocation queues raises an
exception advising the user to migrate the data. Migration can be accomplished
either by using ``dpctl.tensor.asarray(X, device=Y.device)`` to create a copy,
or by using the ``X.to_device(Y.device)`` method, which can sometimes perform
the migration more efficiently.

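As an illustrative sketch, consider a hypothetical scenario combining an array
allocated on a GPU device with one allocated on a CPU device:

.. code:: python

    import dpctl.tensor as dpt

    Xg = dpt.arange(100, device="gpu")
    Yc = dpt.arange(100, device="cpu")

    # allocation queues differ, so combining the arrays raises an exception
    assert not (Xg.sycl_queue == Yc.sycl_queue)

    # migrate Yc to the device (and queue) of Xg, then combine
    Yg = Yc.to_device(Xg.device)
    Z = Xg + Yg   # executes on Xg's allocation queue
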
.. warning::
    ``dpctl`` and ``numba_dpex`` are both under heavy development. Feel free
    to file an issue on GitHub or reach out on Gitter should you encounter
    any issues.
