.. include:: ./../ext_links.txt

Compiling and Offloading ``dpnp`` Statements
=============================================

Data Parallel Extension for NumPy* (``dpnp``) is a drop-in ``NumPy*``
replacement library built on top of oneMKL and SYCL. ``numba-dpex`` allows
various ``dpnp`` library function calls to be JIT-compiled using the
``numba_dpex.dpjit`` decorator. Presently, ``numba-dpex`` can compile several
``dpnp`` array constructors (``ones``, ``zeros``, ``full``, ``empty``), most
universal functions, ``prange`` loops, and vector expressions using
``dpnp.ndarray`` objects.

An example of a supported usage of ``dpnp`` statements in ``numba-dpex`` is
provided in the following code snippet:

.. ``numba-dpex`` implements its own runtime library to support offloading ``dpnp``
.. library functions to SYCL devices. For each ``dpnp`` function signature to be
.. offloaded, ``numba-dpex`` implements the corresponding direct SYCL function call
.. in the runtime and the function call is inlined in the generated LLVM IR.

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit


    @dpjit
    def foo():
        a = dpnp.ones(1024, device="gpu")
        return dpnp.sqrt(a)


    a = foo()
    print(a)
    print(type(a))

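Since the snippet above requires ``numba-dpex`` and a SYCL device to run, the
following plain NumPy sketch (an illustration only, not ``numba-dpex`` API; the
``foo_reference`` name is hypothetical) mirrors what ``foo`` computes and can be
used to sanity-check results:

```python
# Illustration only: what foo() computes, expressed in plain NumPy.
# No JIT compilation or device offload happens here.
import numpy as np


def foo_reference():
    a = np.ones(1024)  # mirrors dpnp.ones(1024, device="gpu")
    return np.sqrt(a)  # mirrors dpnp.sqrt(a)


result = foo_reference()
print(result.shape)  # (1024,)
```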
.. :samp:`dpnp.ones(10)` will be called through |ol_dpnp_ones(...)|_.


.. Design
.. -------

.. ``numba_dpex`` uses the |numba.extending.overload| decorator to create a Numba*
.. implementation of a function that can be used in `nopython mode`_ functions.
.. This is done through translation of ``dpnp`` function signatures so that they
.. can be called in ``numba_dpex.dpjit`` decorated code.

.. The specific SYCL operation for a certain ``dpnp`` function is performed by the
.. runtime interface. While compiling a function decorated with ``@dpjit``,
.. ``numba-dpex`` generates the corresponding SYCL function call through its
.. runtime library and injects it into the LLVM IR through
.. |numba.extending.intrinsic|_. The ``@intrinsic`` decorator is used for marking a
.. ``dpnp`` function as typing and implementing the function in nopython mode using
.. the `llvmlite IRBuilder API`_. This is an escape hatch to build custom LLVM IR
.. that will be inlined into the caller.

.. The code injection logic that enables ``dpnp`` function calls in the Numba IR is
.. implemented by the :mod:`numba_dpex.core.dpnp_iface.arrayobj` module, which
.. replaces Numba*'s :mod:`numba.np.arrayobj`. Each ``dpnp`` function signature is
.. provided with a concrete implementation that generates the actual code using
.. Numba's ``overload`` function API, e.g.:

.. .. code-block:: python

..     @overload(dpnp.ones, prefer_literal=True)
..     def ol_dpnp_ones(
..         shape, dtype=None, order="C", device=None, usm_type="device", sycl_queue=None
..     ):
..         ...

.. The corresponding intrinsic implementation is in :file:`numba_dpex/dpnp_iface/_intrinsic.py`.

.. .. code-block:: python

..     @intrinsic
..     def impl_dpnp_ones(
..         ty_context,
..         ty_shape,
..         ty_dtype,
..         ty_order,
..         ty_device,
..         ty_usm_type,
..         ty_sycl_queue,
..         ty_retty_ref,
..     ):
..         ...

Parallel Range
---------------

``numba-dpex`` supports using ``numba.prange`` statements with
``dpnp.ndarray`` objects. All such ``prange`` loops are offloaded as kernels and
executed on a device inferred using the compute-follows-data programming model.
The next example shows the use of a ``prange`` loop.

.. ``numba-dpex`` implements the ability to run loops in parallel. The language
.. construct is adapted from Numba*'s ``prange`` concept that was initially
.. designed to run OpenMP parallel for loops. In Numba*, the loop body is scheduled
.. in separate threads, and they execute in a ``nopython`` Numba* context.
.. ``prange`` automatically takes care of data privatization. ``numba-dpex``
.. employs the ``prange`` compilation mechanism to offload parallel-loop-like
.. programming constructs onto SYCL enabled devices.

.. The ``prange`` compilation pass is delegated through Numba's
.. :file:`numba/parfor/parfor_lowering.py` module, where ``numba-dpex`` provides
.. the :file:`numba_dpex/core/parfors/parfor_lowerer.py` module to be used as the
.. *lowering* mechanism through the
.. :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl` class. This
.. provides a custom lowerer for ``prange`` nodes that generates a SYCL kernel for
.. a ``prange`` node and submits it to a queue. Here is an example of a ``prange``
.. use case in a ``@dpjit`` context:

.. code-block:: python

    import dpnp
    from numba_dpex import dpjit, prange


    @dpjit
    def foo():
        x = dpnp.ones(1024, device="gpu")
        o = dpnp.empty_like(x)
        for i in prange(x.shape[0]):
            o[i] = x[i] * x[i]
        return o


    c = foo()
    print(c)
    print(type(c))

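The ``prange`` loop above is an element-wise vector expression; for reference,
an equivalent computation in plain NumPy (illustration only, no offload; the
``foo_reference`` name is hypothetical):

```python
# Illustration only: the prange loop computes o[i] = x[i] * x[i],
# which is the element-wise product x * x in plain NumPy.
import numpy as np


def foo_reference():
    x = np.ones(1024)
    return x * x


c_ref = foo_reference()
print(c_ref.sum())  # 1024.0
```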
.. Each ``prange`` instruction in Numba* has an optional *lowerer* attribute. The
.. lowerer attribute determines how the parfor instruction should be lowered to
.. LLVM IR. In addition, the lowerer attribute decides which ``prange``
.. instructions can be fused together. At this point ``numba-dpex`` does not
.. generate device-specific code and the lowerer used is the same for all device
.. types. However, a different
.. :py:class:`numba_dpex.core.parfors.parfor_lowerer.ParforLowerImpl`
.. instance is returned for every ``prange`` instruction for each corresponding
.. CFD (Compute Follows Data) inferred device to prevent illegal ``prange``
.. fusion.

``prange`` loop statements can also be used to write reduction loops, as
demonstrated by the following naive pairwise distance computation.

.. code-block:: python

    import dpnp
    import dpctl
    from numba_dpex import dpjit, prange


    @dpjit
    def pairwise_distance(X1, X2, D):
        """Naïve pairwise distance impl - take an array representing M points in N
        dimensions, and return the M x M matrix of Euclidean distances

        Args:
            X1 : Set of points
            X2 : Set of points
            D : Outputted distance matrix
        """
        # Size of inputs
        X1_rows = X1.shape[0]
        X2_rows = X2.shape[0]
        X1_cols = X1.shape[1]

        float0 = X1.dtype.type(0.0)

        # Outermost parallel loop over the matrix X1
        for i in prange(X1_rows):
            # Loop over the matrix X2
            for j in range(X2_rows):
                d = float0
                # Compute Euclidean distance
                for k in range(X1_cols):
                    tmp = X1[i, k] - X2[j, k]
                    d += tmp * tmp
                # Write computed distance to distance matrix
                D[i, j] = dpnp.sqrt(d)


    q = dpctl.SyclQueue()
    X1 = dpnp.ones((10, 2), sycl_queue=q)
    X2 = dpnp.zeros((10, 2), sycl_queue=q)
    D = dpnp.empty((10, 10), sycl_queue=q)

    pairwise_distance(X1, X2, D)
    print(D)

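To validate the kernel's output, the same distance matrix can be computed with
NumPy broadcasting (a reference sketch only; ``pairwise_distance_reference`` is
a hypothetical helper, not part of ``numba-dpex``):

```python
# Illustration only: pairwise Euclidean distances via NumPy broadcasting.
import numpy as np


def pairwise_distance_reference(X1, X2):
    # (M, 1, N) - (1, M, N) -> (M, M, N); sum squares over the coordinate axis
    diff = X1[:, None, :] - X2[None, :, :]
    return np.sqrt((diff * diff).sum(axis=-1))


X1 = np.ones((10, 2))
X2 = np.zeros((10, 2))
D_ref = pairwise_distance_reference(X1, X2)
print(D_ref.shape)  # (10, 10)
```

Each entry of ``D_ref`` here is ``sqrt(2)``, the distance between a point of
ones and a point of zeros in two dimensions.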

.. Fusion of Kernels
.. ------------------

.. ``numba-dpex`` can identify each NumPy* (or ``dpnp``) array expression as a
.. data-parallel kernel and fuse them together to generate a single SYCL kernel.
.. The kernel is automatically offloaded to the specified device where the fusion
.. operation is invoked. Here is a simple example of a Black-Scholes formula
.. computation where kernel fusion occurs at different ``dpnp`` math functions:

.. .. literalinclude:: ./../../../numba_dpex/examples/blacksholes_njit.py
..     :language: python
..     :pyobject: blackscholes
..     :caption: **EXAMPLE:** Data parallel kernel implementing the vector sum a+b
..     :name: blackscholes_dpjit


.. .. |numba.extending.overload| replace:: ``numba.extending.overload``
.. .. |numba.extending.intrinsic| replace:: ``numba.extending.intrinsic``
.. .. |ol_dpnp_ones(...)| replace:: ``ol_dpnp_ones(...)``
.. .. |numba.np.arrayobj| replace:: ``numba.np.arrayobj``

.. .. _low-level API: https://github.com/IntelPython/dpnp/tree/master/dpnp/backend
.. .. _`ol_dpnp_ones(...)`: https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/dpnp_iface/arrayobj.py#L358
.. .. _`numba.extending.overload`: https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-functions
.. .. _`numba.extending.intrinsic`: https://numba.pydata.org/numba-doc/latest/extending/high-level.html#implementing-intrinsics
.. .. _nopython mode: https://numba.pydata.org/numba-doc/latest/glossary.html#term-nopython-mode
.. .. _`numba.np.arrayobj`: https://github.com/numba/numba/blob/main/numba/np/arrayobj.py
.. .. _`llvmlite IRBuilder API`: http://llvmlite.pydata.org/en/latest/user-guide/ir/ir-builder.html