Overview
========

Data Parallel Extension for Numba* (`numba-dpex`_) is a free and open-source
LLVM-based code generator for portable accelerator programming in Python. The
code generator implements a new pseudo-kernel programming domain-specific
language (DSL) called KAPI that is modeled after the C++ DSL `SYCL*`_. The
SYCL language is an open standard developed under the Unified Acceleration
Foundation (`UXL`_) as a vendor-agnostic way of programming different types of
data-parallel hardware such as multi-core CPUs, GPUs, and FPGAs. Numba-dpex
and KAPI aim to bring the same vendor-agnostic, standards-compliant
programming model to Python.

Numba-dpex is built on top of the open-source `Numba*`_ JIT compiler, which
implements a CPython bytecode parser and code generator to lower the bytecode
to LLVM IR. The Numba* compiler is able to compile a large subset of Python
and most of the NumPy library. Numba-dpex uses Numba*'s tooling to implement
parsing and typing support for the data types and functions defined in the
KAPI DSL. A custom code generator then lowers KAPI to a form of LLVM IR that
includes special LLVM instructions defining a low-level data-parallel kernel
API. Thus, a function defined in KAPI is compiled to a data-parallel kernel
that can run on different types of hardware. Currently, compilation of KAPI is
possible for x86 CPUs, Intel Gen9 integrated GPUs, Intel UHD integrated GPUs,
and Intel discrete GPUs.
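
As background, the following minimal sketch uses plain Numba* (no numba-dpex
involved, and the ``axpy`` function is purely illustrative) to show the
bytecode-to-LLVM pipeline that numba-dpex builds on:

.. code-block:: python

    import numba


    @numba.njit
    def axpy(a, x, y):
        # Numba parses this function's CPython bytecode and lowers it to
        # LLVM IR the first time it is called with concrete argument types.
        return a * x + y


    axpy(2.0, 3.0, 4.0)  # triggers JIT compilation for float64 arguments

    # The dispatcher retains the generated LLVM IR per compiled signature.
    for sig, llvm_ir in axpy.inspect_llvm().items():
        print(sig, "->", len(llvm_ir), "characters of LLVM IR")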

The following example shows a pairwise distance matrix computation in KAPI.

.. code-block:: python

    from numba_dpex import kernel_api as kapi
    import math


    def pairwise_distance_kernel(item: kapi.Item, data, distance):
        # Each work item computes the distance between one pair of rows
        # (i, j) of the input data array.
        i = item.get_id(0)
        j = item.get_id(1)

        data_dims = data.shape[1]

        d = data.dtype.type(0.0)
        for k in range(data_dims):
            tmp = data[i, k] - data[j, k]
            d += tmp * tmp

        distance[j, i] = math.sqrt(d)

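As a plain-Python point of comparison (hypothetical reference code, not part
of the numba-dpex API), the same computation written sequentially looks as
follows; the two outer loops enumerate what KAPI expresses as a
two-dimensional set of work items:

.. code-block:: python

    import math

    import numpy as np


    def pairwise_distance_sequential(data, distance):
        # One (i, j) iteration here corresponds to one work item in KAPI.
        n = data.shape[0]
        for i in range(n):
            for j in range(n):
                d = 0.0
                for k in range(data.shape[1]):
                    tmp = data[i, k] - data[j, k]
                    d += tmp * tmp
                distance[j, i] = math.sqrt(d)


    data = np.random.rand(100, 3)
    distance = np.empty((100, 100))
    pairwise_distance_sequential(data, distance)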

Skipping over much of the language detail, at a high level the
``pairwise_distance_kernel`` can be viewed as a data-parallel function that
gets executed individually by a set of "work items". That is, each work item
runs the same function for a subset of the elements of the input ``data`` and
``distance`` arrays. For programmers familiar with the CUDA or OpenCL
languages, this is the same programming model that is referred to as Single
Program Multiple Data (SPMD). As Python has no concept of a work item, a KAPI
function itself is sequential and needs to be compiled to convert it into a
parallel version. The next example shows the changes to the original script
needed to compile and run the ``pairwise_distance_kernel`` in parallel.

.. code-block:: python

    from numba_dpex import kernel, call_kernel
    from numba_dpex import kernel_api as kapi
    import dpnp

    # Allocate the kernel arguments on the GPU; the device on which the
    # arguments reside determines where the kernel executes.
    data = dpnp.random.ranf((10000, 3), device="gpu")
    distance = dpnp.empty(shape=(data.shape[0], data.shape[0]), device="gpu")

    # Launch one work item per (i, j) pair over a two-dimensional range.
    exec_range = kapi.Range(data.shape[0], data.shape[0])
    call_kernel(kernel(pairwise_distance_kernel), exec_range, data, distance)

To compile a KAPI function into a data-parallel kernel and run it on a device,
three things need to be done: allocate the arguments to the function on the
device where the function is to execute, compile the function by applying a
numba-dpex decorator, and `launch` or execute the compiled kernel on the
device.

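Once ``call_kernel`` returns, the results reside in device memory. As a
follow-up to the example above (a sketch assuming the ``dpnp.asnumpy``
utility), the result can be copied back into a host NumPy array for
inspection:

.. code-block:: python

    import numpy as np

    # Copy the device-resident result into host memory.
    distance_host = dpnp.asnumpy(distance)
    assert isinstance(distance_host, np.ndarray)
    print(distance_host[:3, :3])
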
Allocating arrays or scalars to be passed to a compiled KAPI function is not
done directly in numba-dpex. Instead, numba-dpex supports passing in
tensors/ndarrays created using either the `dpnp`_ NumPy drop-in replacement
library or the `dpctl`_ SYCL-based Python Array API library. To trigger
compilation, the ``numba_dpex.kernel`` decorator has to be used, and finally,
to launch a compiled kernel, the ``numba_dpex.call_kernel`` function should be
invoked.

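For instance, a minimal sketch of allocating kernel arguments on a specific
device with either library (the available device names depend on the installed
hardware and SYCL runtime):

.. code-block:: python

    import dpnp
    import dpctl.tensor as dpt

    # dpnp mirrors the NumPy API and adds a device keyword argument.
    x = dpnp.ones(1024, device="gpu")

    # dpctl.tensor implements the Python Array API standard.
    y = dpt.zeros(1024, device="gpu")
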
For a more detailed description of programming with numba-dpex, refer to the
:doc:`programming_model`, :doc:`user_guide/index`, and :doc:`autoapi/index`
sections of the documentation. To set up numba-dpex and try it out, refer to
the :doc:`getting_started` section.