Commit 174b4f7
author: Diptorup Deb
Edits to dpnp user guide.
1 parent: cc747b2

2 files changed: +184 −146 lines


docs/source/api_reference/index.rst (2 additions, 0 deletions)

API Reference
=============

Coming soon
Lines changed: 182 additions & 146 deletions
.. include:: ./../ext_links.txt

Compiling and Offloading ``dpnp`` statements
============================================

Data Parallel Extension for NumPy* (``dpnp``) is a drop-in ``NumPy*``
replacement library built on top of oneMKL and SYCL. ``numba-dpex`` allows
various ``dpnp`` library function calls to be JIT-compiled using the
``numba_dpex.dpjit`` decorator. Presently, ``numba-dpex`` can compile several
``dpnp`` array constructors (``ones``, ``zeros``, ``full``, ``empty``), most
universal functions, ``prange`` loops, and vector expressions using
``dpnp.ndarray`` objects.

An example of a supported usage of ``dpnp`` statements in ``numba-dpex`` is
provided in the following code snippet:

.. ``numba-dpex`` implements its own runtime library to support offloading ``dpnp``
.. library functions to SYCL devices. For each ``dpnp`` function signature to be
.. offloaded, ``numba-dpex`` implements the corresponding direct SYCL function call
.. in the runtime and the function call is inlined in the generated LLVM IR.

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit


    @dpjit
    def foo():
        a = dpnp.ones(1024, device="gpu")
        return dpnp.sqrt(a)


    a = foo()
    print(a)
    print(type(a))

.. :samp:`dpnp.ones(10)` will be called through |ol_dpnp_ones(...)|_.


.. Design
.. -------

.. ``numba_dpex`` uses the |numba.extending.overload| decorator to create a Numba*
.. implementation of a function that can be used in `nopython mode`_ functions.
.. This is done through translation of ``dpnp`` function signatures so that they can
.. be called in ``numba_dpex.dpjit`` decorated code.

.. The specific SYCL operation for a certain ``dpnp`` function is performed by the
.. runtime interface. During compilation of a function decorated with the ``@dpjit``
.. decorator, ``numba-dpex`` generates the corresponding SYCL function call through
.. its runtime library and injects it into the LLVM IR through
.. |numba.extending.intrinsic|_. The ``@intrinsic`` decorator is used for marking a
.. ``dpnp`` function as typing and implementing the function in nopython mode using
.. the `llvmlite IRBuilder API`_. This is an escape hatch to build custom LLVM IR
.. that will be inlined into the caller.

.. The code injection logic to enable ``dpnp`` function calls in the Numba IR is
.. implemented by the :mod:`numba_dpex.core.dpnp_iface.arrayobj` module, which replaces
.. Numba*'s :mod:`numba.np.arrayobj`. Each ``dpnp`` function signature is provided
.. with a concrete implementation to generate the actual code using Numba's
.. ``overload`` function API, e.g.:

.. .. code-block:: python

..     @overload(dpnp.ones, prefer_literal=True)
..     def ol_dpnp_ones(
..         shape, dtype=None, order="C", device=None, usm_type="device", sycl_queue=None
..     ):
..         ...

.. The corresponding intrinsic implementation is in :file:`numba_dpex/dpnp_iface/_intrinsic.py`.

.. .. code-block:: python

..     @intrinsic
..     def impl_dpnp_ones(
..         ty_context,
..         ty_shape,
..         ty_dtype,
..         ty_order,
..         ty_device,
..         ty_usm_type,
..         ty_sycl_queue,
..         ty_retty_ref,
..     ):
..         ...

Parallel Range
--------------

``numba-dpex`` supports using the ``numba.prange`` statement with
``dpnp.ndarray`` objects. All such ``prange`` loops are offloaded as kernels and
executed on a device inferred using the compute-follows-data programming model.
The following example shows the use of a ``prange`` loop.

.. ``numba-dpex`` implements the ability to run loops in parallel. The language
.. construct is adapted from Numba*'s ``prange`` concept that was initially
.. designed to run OpenMP parallel for loops. In Numba*, the loop body is scheduled
.. in separate threads, and they execute in a ``nopython`` Numba* context.
.. ``prange`` automatically takes care of data privatization. ``numba-dpex``
.. employs the ``prange`` compilation mechanism to offload parallel-loop-like
.. programming constructs onto SYCL-enabled devices.

.. The ``prange`` compilation pass is delegated through Numba's
.. :file:`numba/parfor/parfor_lowering.py` module, where ``numba-dpex`` provides
.. the :file:`numba_dpex/core/parfors/parfor_lowerer.py` module to be used as the
.. *lowering* mechanism through the
.. :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl` class. This
.. provides a custom lowerer for ``prange`` nodes that generates a SYCL kernel for
.. a ``prange`` node and submits it to a queue. Here is an example of a ``prange``
.. use case in ``@dpjit`` context:

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit, prange


    @dpjit
    def foo():
        x = dpnp.ones(1024, device="gpu")
        o = dpnp.empty_like(x)
        for i in prange(x.shape[0]):
            o[i] = x[i] * x[i]
        return o


    c = foo()
    print(c)
    print(type(c))

.. Each ``prange`` instruction in Numba* has an optional *lowerer* attribute. The
.. lowerer attribute determines how the parfor instruction should be lowered to
.. LLVM IR. In addition, the lowerer attribute decides which ``prange`` instructions
.. can be fused together. At this point ``numba-dpex`` does not generate
.. device-specific code and the lowerer used is the same for all device types.
.. However, a different :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl`
.. instance is returned for every ``prange`` instruction for each corresponding CFD
.. (Compute Follows Data) inferred device, to prevent illegal ``prange`` fusion.

``prange`` loop statements can also be used to write reduction loops, as
demonstrated by the following naive pairwise distance computation.

.. code-block:: python

    import dpctl
    import dpnp
    from numba_dpex import dpjit, prange


    @dpjit
    def pairwise_distance(X1, X2, D):
        """Naive pairwise distance impl - take an array representing M points in N
        dimensions, and return the M x M matrix of Euclidean distances

        Args:
            X1 : Set of points
            X2 : Set of points
            D : Outputted distance matrix
        """
        # Size of inputs
        X1_rows = X1.shape[0]
        X2_rows = X2.shape[0]
        X1_cols = X1.shape[1]

        float0 = X1.dtype.type(0.0)

        # Outermost parallel loop over the matrix X1
        for i in prange(X1_rows):
            # Loop over the matrix X2
            for j in range(X2_rows):
                d = float0
                # Compute the Euclidean distance
                for k in range(X1_cols):
                    tmp = X1[i, k] - X2[j, k]
                    d += tmp * tmp
                # Write the computed distance to the distance matrix
                D[i, j] = dpnp.sqrt(d)


    q = dpctl.SyclQueue()
    X1 = dpnp.ones((10, 2), sycl_queue=q)
    X2 = dpnp.zeros((10, 2), sycl_queue=q)
    D = dpnp.empty((10, 10), sycl_queue=q)

    pairwise_distance(X1, X2, D)
    print(D)
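Because the kernel above only uses scalar indexing and arithmetic, its result
is easy to cross-check against a pure-Python reference. The sketch below is
for illustration only and is independent of ``numba-dpex`` and ``dpnp``;
``pairwise_distance_ref`` is a hypothetical helper, not part of either library:

.. code-block:: python

    import math


    def pairwise_distance_ref(X1, X2):
        """Pure-Python reference: entry (i, j) is the Euclidean distance
        between point X1[i] and point X2[j]."""
        return [
            [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in X2]
            for p in X1
        ]


    # Same data as the dpnp example: ten 2-D points of ones vs. ten of zeros.
    X1 = [[1.0, 1.0] for _ in range(10)]
    X2 = [[0.0, 0.0] for _ in range(10)]
    D = pairwise_distance_ref(X1, X2)
    # Every distance is sqrt(1**2 + 1**2) = sqrt(2)

Comparing the offloaded kernel's output against such a reference is a simple
way to validate a ``prange`` loop before measuring its performance.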

.. Fusion of Kernels
.. -----------------

.. ``numba-dpex`` can identify each NumPy* (or ``dpnp``) array expression as a
.. data-parallel kernel and fuse them together to generate a single SYCL kernel.
.. The kernel is automatically offloaded to the specified device where the fusion
.. operation is invoked. Here is a simple example of a Black-Scholes formula
.. computation where kernel fusion occurs at different ``dpnp`` math functions:

.. .. literalinclude:: ./../../../numba_dpex/examples/blacksholes_njit.py
..     :language: python
..     :pyobject: blackscholes
..     :caption: **EXAMPLE:** Black-Scholes computation implemented as a data-parallel kernel
..     :name: blackscholes_dpjit


.. .. |numba.extending.overload| replace:: ``numba.extending.overload``
.. .. |numba.extending.intrinsic| replace:: ``numba.extending.intrinsic``
.. .. |ol_dpnp_ones(...)| replace:: ``ol_dpnp_ones(...)``
.. .. |numba.np.arrayobj| replace:: ``numba.np.arrayobj``

.. .. _low-level API: https://github.com/IntelPython/dpnp/tree/master/dpnp/backend
.. .. _`ol_dpnp_ones(...)`: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/dpnp_iface/arrayobj.py#L358
.. .. _`numba.extending.overload`: https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-functions
.. .. _`numba.extending.intrinsic`: https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-intrinsics
.. .. _nopython mode: https://numba.pydata.org/numba-doc/latest/glossary.html#term-nopython-mode
.. .. _`numba.np.arrayobj`: https://github.com/numba/numba/blob/main/numba/np/arrayobj.py
.. .. _`llvmlite IRBuilder API`: http://llvmlite.pydata.org/en/latest/user-guide/ir/ir-builder.html

0 commit comments

Comments
 (0)