Commit cc747b2

Add prange and kernel fusion sections
1 parent 67a911e commit cc747b2

1 file changed, +73 -12 lines changed

docs/source/user_guide/dpnp_offload.rst

Lines changed: 73 additions & 12 deletions
@@ -1,7 +1,13 @@
11
.. include:: ./../ext_links.txt
22

3-
Compiling and Offloading ``dpnp`` Functions
4-
===========================================
3+
Compiling and Offloading Mechanisms
4+
====================================
5+
6+
``numba-dpex`` can directly compile and offload different data parallel
7+
programming constructs and function libraries onto SYCL based devices.
8+
9+
``dpnp`` Functions
10+
-------------------
511

612
Data Parallel Extension for NumPy* (``dpnp``) is a drop-in ``NumPy*``
713
replacement library built on top of oneMKL. ``numba-dpex`` allows various
@@ -35,8 +41,8 @@ in the runtime and the function call is inlined in the generated LLVM IR.
3541
The following sections go over aspects of the dpnp integration inside
3642
numba-dpex.
3743

38-
Repository map
39-
--------------
44+
Repository Map
45+
---------------
4046

4147
- The code for numba-dpex's ``dpnp`` integration runtime resides in the
4248
:file:`numba_dpex/core/runtime` sub-module.
@@ -48,7 +54,7 @@ Repository map
4854
- Tests reside in :file:`numba_dpex/tests/dpjit_tests/dpnp`.
4955

5056
Design
51-
------
57+
-------
5258

5359
``numba_dpex`` uses the |numba.extending.overload| decorator to create a Numba*
5460
implementation of a function that can be used in `nopython mode`_ functions.
@@ -96,17 +102,72 @@ The corresponding intrinsic implementation is in :file:`numba_dpex/dpnp_iface/_i
96102
...
97103
98104
Parallel Range
99-
--------------
105+
---------------
106+
107+
``numba-dpex`` implements the ability to run loops in parallel. The language
108+
construct is adapted from Numba*'s ``prange`` concept, which was initially
109+
designed to run OpenMP parallel for loops. In Numba*, the loop body is scheduled
110+
in separate threads, and they execute in a ``nopython`` Numba* context.
111+
``prange`` automatically takes care of data privatization. ``numba-dpex``
112+
employs the ``prange`` compilation mechanism to offload parallel-loop-like
113+
programming constructs onto SYCL-enabled devices.
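As a conceptual illustration only, the following pure-Python sketch mimics what a ``prange`` loop promises: iterations are independent, and variables assigned inside the loop body are private to each iteration. The ``parallel_for`` helper is hypothetical and is not part of Numba* or ``numba-dpex``; it uses a thread pool merely to stand in for a parallel scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(n, body, workers=4):
    """Hypothetical stand-in for a prange loop: call body(i) for i in range(n)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so that every iteration finishes before returning.
        list(pool.map(body, range(n)))


a = [1.0] * 10
b = [2.0] * 10
x = [0.0] * 10


def body(i):
    # tmp is local to this call, so each iteration gets its own private copy.
    tmp = a[i] + b[i]
    # Each iteration writes to a distinct index, so iterations do not conflict.
    x[i] = tmp


parallel_for(10, body)
print(x)
```

The sketch only models the programming contract. It says nothing about how ``numba-dpex`` actually generates or schedules the SYCL kernel.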
114+
115+
The ``prange`` compilation pass is delegated through Numba's
116+
:file:`numba/parfors/parfor_lowering.py` module, where ``numba-dpex`` provides
117+
the :file:`numba_dpex/core/parfors/parfor_lowerer.py` module to be used as the
118+
*lowering* mechanism through the
119+
:py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl` class. This
120+
class provides a custom lowerer for ``prange`` nodes that generates a SYCL kernel for
121+
each ``prange`` node and submits it to a queue. Here is an example of a ``prange``
122+
use case in a ``@dpjit`` context:
123+
124+
.. code-block:: python
125+
126+
    from numba import prange
127+
    import dpnp
128+
    from numba_dpex import dpjit
129+
130+
131+
    @dpjit
132+
    def foo(a, b):
133+
        x = dpnp.ones(10)
134+
        for i in prange(10):
135+
            x[i] = a[i] + b[i]
136+
        return x
137+
138+
139+
    a = dpnp.ones(10)
140+
    b = dpnp.ones(10)
141+
142+
    c = foo(a, b)
143+
    print(c)
144+
    print(type(c))
145+
146+
Each ``prange`` instruction in Numba* has an optional *lowerer* attribute. The
147+
lowerer attribute determines how the parfor instruction should be lowered to
148+
LLVM IR. In addition, the lowerer attribute decides which ``prange`` instructions
149+
can be fused together. At this point ``numba-dpex`` does not generate
150+
device-specific code and the lowerer used is the same for all device types. However,
151+
a different :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl`
152+
instance is returned for every ``prange`` instruction for each corresponding CFD
153+
(Compute Follows Data) inferred device to prevent illegal ``prange`` fusion.
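The role of the lowerer attribute in fusion decisions can be pictured with a small toy model. Everything in the sketch below (``ToyLowerer``, ``ToyParfor``, ``lowerer_for``, ``fuse_adjacent``) is hypothetical and written only for illustration. It assumes one lowerer object per inferred device and assumes that two adjacent loops may fuse only when they carry the same lowerer object. It is not the actual Numba* or ``numba-dpex`` implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLowerer:
    """Hypothetical lowerer tagged with the device it targets."""
    device: str


@dataclass
class ToyParfor:
    """Hypothetical parallel-loop node carrying a lowerer attribute."""
    name: str
    lowerer: ToyLowerer


_lowerers = {}


def lowerer_for(device):
    # Assumption: one lowerer instance is kept per inferred device.
    return _lowerers.setdefault(device, ToyLowerer(device))


def fuse_adjacent(parfors):
    """Group adjacent loops that share the same lowerer object."""
    groups = []
    for p in parfors:
        if groups and groups[-1][-1].lowerer is p.lowerer:
            groups[-1].append(p)  # same device: fusion is allowed
        else:
            groups.append([p])    # different device: start a new kernel
    return groups


parfors = [
    ToyParfor("p0", lowerer_for("gpu")),
    ToyParfor("p1", lowerer_for("gpu")),
    ToyParfor("p2", lowerer_for("cpu")),
]
groups = fuse_adjacent(parfors)
print([[p.name for p in g] for g in groups])
```

In this toy model the two loops targeting ``"gpu"`` end up in one group, while the loop targeting ``"cpu"`` stays separate, which mirrors the idea of preventing fusion across devices.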
154+
100155

101-
``numba-dpex`` implements the ability to run loops in parallel,
102-
similar to OpenMP parallel for loops and Numba*’s ``prange``. The loop-
103-
body is scheduled in seperate threads, and they execute in a ``nopython`` numba
104-
context. ``prange`` automatically takes care of data privatization:
156+
Fusion of Kernels
157+
------------------
105158

159+
``numba-dpex`` can identify each NumPy* (or ``dpnp``) array expression as a
160+
data-parallel kernel and fuse them together to generate a single SYCL kernel.
161+
The kernel is automatically offloaded to the specified device where the fusion
162+
operation is invoked. Here is a simple example of a Black-Scholes formula
163+
computation where kernel fusion occurs across different ``dpnp`` math functions:
106164

165+
.. literalinclude:: ./../../../numba_dpex/examples/blacksholes_njit.py
166+
    :language: python
167+
    :pyobject: blackscholes
168+
    :caption: **EXAMPLE:** Black-Scholes formula computation using ``dpnp`` math functions
169+
    :name: blackscholes_dpjit
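To make the idea of fusion concrete, here is a minimal pure-Python sketch that does not use ``dpnp`` or ``numba-dpex``. The function names ``unfused`` and ``fused`` and the arithmetic are illustrative assumptions. The unfused version treats every elementwise expression as its own pass with a temporary array, while the fused version computes the same result in a single pass, which is the effect kernel fusion aims for.

```python
import math


def unfused(s, k):
    # Three separate elementwise "kernels", each producing a temporary list.
    t1 = [si / ki for si, ki in zip(s, k)]
    t2 = [math.log(v) for v in t1]
    return [v * 2.0 for v in t2]


def fused(s, k):
    # One "kernel": the same operations applied per element in a single pass,
    # without intermediate temporaries.
    return [math.log(si / ki) * 2.0 for si, ki in zip(s, k)]


s = [10.0, 20.0, 30.0]
k = [5.0, 40.0, 30.0]
print(unfused(s, k) == fused(s, k))
```

Both versions apply the same floating-point operations in the same order to each element, so they produce identical results. The fused form simply avoids the intermediate arrays and the extra passes over the data.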
107170

108-
- prange, reduction prange
109-
- blackscholes, math example
110171

111172
.. |numba.extending.overload| replace:: ``numba.extending.overload``
112173
.. |numba.extending.intrinsic| replace:: ``numba.extending.intrinsic``
