 .. _index:
 .. include:: ./../../ext_links.txt

-Kernel Programming Basics
-=========================
-
-`Data Parallel Extensions for Python*`_ introduce a concept of an *offload kernel*, defined as
-a part of a Python program being submitted for execution to the device queue.
-
-.. image:: ./../../../asset/images/kernel-queue-device.png
-    :scale: 50%
-    :align: center
-    :alt: Offload Kernel
-
-There are multiple ways how to write offload kernels. CUDA*, OpenCl*, and SYCL* offer similar programming model
-known as the *data parallel kernel programming*. In this model you express the work in terms of *work items*.
-You split data into small pieces, and each piece will be a unit of work, or a *work item*. The total number of
-work items is called *global size*. You can also group work items into bigger chunks called *work groups*.
-The number of work items in the work group is called the *local size*.
-
-.. image:: ./../../../asset/images/kernel_prog_model.png
-    :scale: 50%
-    :align: center
-    :alt: Offload Kernel
-
-In this example there are 48 *work items* (8 in dimension 0, and 6 in dimension 1), that is the *global size* is 48.
-Work items are grouped in *work groups* with the *local size* 8 (4 in dimension 0, and 2 in dimension 1). There are
-total 48/8 = 6 work groups.
-
-In the *data parallel kernel programming* model you write a function that processes a given work item.
-Such a function is called the *data parallel kernel*.
-
-**Data Parallel Extension for Numba** offers a way to write data parallel kernels directly using Python using
-``numba_dpex.kernel``. It bears similarities with ``numba.cuda`` and ``numba.roc``, but unlike these proprietary
-programming models ``numba_dpex`` is built on top of `SYCL*`_ , which is hardware agnostic, meaning
-that with ``numba_dpex.kernel`` programming model you will be able to write a portable code targeting different
-hardware vendors.
-
-.. note::
-    The current version of ``numba-dpex`` supports Intel SYCL devices only
-
-.. toctree::
-    :caption: This document will cover the following chapters:
-    :maxdepth: 2
-
-    writing_kernels
-    synchronization
-    device-functions
-    atomic-operations
-    memory_allocation_address_space
-    reduction
-    ufunc
-    supported-python-features
+Kernel Programming
+==================
+
+This tutorial covers the most important features of the KAPI kernel programming
+API and introduces the concepts needed to express data-parallel kernels in
+numba-dpex.
+
+
+Preliminary concepts
+--------------------
+
+Data parallelism
+++++++++++++++++
+
+Single Program Multiple Data
+++++++++++++++++++++++++++++
+
+Range vs. Nd-Range Kernels
+++++++++++++++++++++++++++
+
+Work items and Work groups
+++++++++++++++++++++++++++
+
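+As a purely illustrative preview of these concepts (assuming the
+:py:class:`numba_dpex.kernel_api.Range` and
+:py:class:`numba_dpex.kernel_api.NdRange` classes discussed in the following
+sections), a global size of 48 work items (8 in dimension 0, 6 in dimension 1)
+split into work groups of local size 8 (4 by 2) may be written as:
+
+.. code-block:: python
+
+    from numba_dpex import kernel_api as kapi
+
+    # 48 work items in total: 8 in dimension 0 and 6 in dimension 1.
+    global_size = kapi.Range(8, 6)
+
+    # Work groups of 8 work items each: 4 in dimension 0 and 2 in
+    # dimension 1, giving 48 / 8 = 6 work groups.
+    local_size = kapi.Range(4, 2)
+
+    nd_range = kapi.NdRange(global_size, local_size)
+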
+Basic concepts
+--------------
+
+
+Writing a *range* kernel
+++++++++++++++++++++++++
+
+A *range* kernel is the simplest form of parallelism that can be expressed in
+KAPI: a data-parallel execution of the same function by a set of work items. In
+KAPI, an instance of the :py:class:`numba_dpex.kernel_api.Range` class
+represents the set of work items, and each work item in the ``Range`` is
+represented by an instance of the :py:class:`numba_dpex.kernel_api.Item` class.
+As such, these two classes are essential to writing a range kernel in KAPI.
+
+.. literalinclude:: ./../../../../numba_dpex/examples/kernel/vector_sum.py
+    :language: python
+    :lines: 8-9, 11-15
+    :caption: **EXAMPLE:** A KAPI range kernel
+    :name: ex_kernel_declaration_vector_sum
+
+:ref:`ex_kernel_declaration_vector_sum` shows an example of a range kernel.
+Every range kernel requires its first argument to be an ``Item`` instance and
+needs to be launched via :py:func:`numba_dpex.experimental.launcher.call_kernel`
+by passing in a ``Range`` instance.
+
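+The example above shows only the kernel function. A minimal, self-contained
+sketch of writing *and* launching such a kernel may look as follows; it assumes
+the experimental ``kernel`` decorator, the ``call_kernel`` launcher referenced
+above, ``dpnp`` arrays as kernel arguments, and an illustrative array size:
+
+.. code-block:: python
+
+    import dpnp
+
+    from numba_dpex import experimental as dpex_exp
+    from numba_dpex import kernel_api as kapi
+
+
+    @dpex_exp.kernel
+    def vector_sum(item: kapi.Item, a, b, c):
+        # Each work item reads one element of ``a`` and ``b`` and writes
+        # their sum into the corresponding element of ``c``.
+        i = item.get_id(0)
+        c[i] = a[i] + b[i]
+
+
+    N = 1024
+    a = dpnp.ones(N, dtype=dpnp.float32)
+    b = dpnp.ones(N, dtype=dpnp.float32)
+    c = dpnp.zeros(N, dtype=dpnp.float32)
+
+    # Launch one work item per array element.
+    dpex_exp.call_kernel(vector_sum, kapi.Range(N), a, b, c)
+
+Here ``kapi.Range(N)`` creates one work item per array element;
+multi-dimensional ranges are created by passing additional extents, for
+example ``kapi.Range(8, 6)``.
+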
+Do note that a ``Range`` object only controls the creation of work items; the
+distribution of work and data over a ``Range`` still needs to be defined by the
+user-written function. In the example, each work item accesses a single element
+of each of the three arrays and performs a single addition operation. It is
+possible to write the kernel differently so that each work item accesses
+multiple data elements (as sketched below) or conditionally performs a
+different amount of work. The data access patterns in a work item can have
+performance implications, and programmers should refer to more specialized
+material such as the `oneAPI GPU optimization guide`_ to learn more.
+
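+A hypothetical chunked variant, reusing the imports and arrays from the
+previous sketch; the chunk size and the kernel name are illustrative
+assumptions, not part of KAPI:
+
+.. code-block:: python
+
+    @dpex_exp.kernel
+    def chunked_vector_sum(item: kapi.Item, a, b, c):
+        # Each work item now processes four consecutive elements, so the
+        # kernel is launched over a Range four times smaller than the
+        # array length (N is assumed to be divisible by four).
+        start = item.get_id(0) * 4
+        for i in range(start, start + 4):
+            c[i] = a[i] + b[i]
+
+    dpex_exp.call_kernel(chunked_vector_sum, kapi.Range(N // 4), a, b, c)
+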
+A range kernel is meant to express a basic *parallel-for* calculation that is
+ideally suited for embarrassingly parallel computations such as elementwise
+operations over ndarrays. The API for expressing a range kernel does not
+allow advanced features such as synchronization of work items or fine-grained
+control over memory allocation on a device.
+
+Writing an *nd-range* kernel
+++++++++++++++++++++++++++++
+
+The ``device_func`` decorator
++++++++++++++++++++++++++++++
+
+Supported mathematical operations
++++++++++++++++++++++++++++++++++
+
+Supported Python operators
+++++++++++++++++++++++++++
+
+Supported kernel arguments
+++++++++++++++++++++++++++
+
+Launching a kernel
+++++++++++++++++++
+
+Advanced topics
+---------------
+
+Local memory allocation
++++++++++++++++++++++++
+
+Private memory allocation
++++++++++++++++++++++++++
+
+Group barrier synchronization
++++++++++++++++++++++++++++++
+
+Atomic operations
++++++++++++++++++
+
+Async kernel execution
+++++++++++++++++++++++
+
+Specializing a kernel or a device_func
+++++++++++++++++++++++++++++++++++++++