Commit ea1d569

Author: Diptorup Deb

Edits and updates the outline of kernel programming tutorial

1 parent a6e0194 · commit ea1d569

File tree

7 files changed: +104 −320 lines


docs/source/ext_links.txt

Lines changed: 1 addition & 0 deletions

@@ -26,3 +26,4 @@
 .. _Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html
 .. _oneMKL: https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2023-2/intel-oneapi-math-kernel-library-onemkl.html
 .. _UXL: https://uxlfoundation.org/
+.. _oneAPI GPU optimization guide: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/general-purpose-computing-on-gpu.html

docs/source/user_guide/kernel_programming/atomic-operations.rst

Lines changed: 0 additions & 27 deletions
This file was deleted.

Lines changed: 103 additions & 50 deletions
@@ -1,53 +1,106 @@
 .. _index:
 .. include:: ./../../ext_links.txt

-Kernel Programming Basics
-=========================
-
-
-`Data Parallel Extensions for Python*`_ introduce a concept of an *offload kernel*, defined as
-a part of a Python program being submitted for execution to the device queue.
-
-.. image:: ./../../../asset/images/kernel-queue-device.png
-    :scale: 50%
-    :align: center
-    :alt: Offload Kernel
-
-There are multiple ways how to write offload kernels. CUDA*, OpenCl*, and SYCL* offer similar programming model
-known as the *data parallel kernel programming*. In this model you express the work in terms of *work items*.
-You split data into small pieces, and each piece will be a unit of work, or a *work item*. The total number of
-work items is called *global size*. You can also group work items into bigger chunks called *work groups*.
-The number of work items in the work group is called the *local size*.
-
-.. image:: ./../../../asset/images/kernel_prog_model.png
-    :scale: 50%
-    :align: center
-    :alt: Offload Kernel
-
-In this example there are 48 *work items* (8 in dimension 0, and 6 in dimension 1), that is the *global size* is 48.
-Work items are grouped in *work groups* with the *local size* 8 (4 in dimension 0, and 2 in dimension 1). There are
-total 48/8 = 6 work groups.
-
-In the *data parallel kernel programming* model you write a function that processes a given work item.
-Such a function is called the *data parallel kernel*.
-
-**Data Parallel Extension for Numba** offers a way to write data parallel kernels directly using Python using
-``numba_dpex.kernel``. It bears similarities with ``numba.cuda`` and ``numba.roc``, but unlike these proprietary
-programming models ``numba_dpex`` is built on top of `SYCL*`_ , which is hardware agnostic, meaning
-that with ``numba_dpex.kernel`` programming model you will be able to write a portable code targeting different
-hardware vendors.
-
-.. note::
-    The current version of ``numba-dpex`` supports Intel SYCL devices only
-
-.. toctree::
-    :caption: This document will cover the following chapters:
-    :maxdepth: 2
-
-    writing_kernels
-    synchronization
-    device-functions
-    atomic-operations
-    memory_allocation_address_space
-    reduction
-    ufunc
-    supported-python-features
+Kernel Programming
+==================
+
+This tutorial covers the most important features of the KAPI kernel programming
+API and introduces the concepts needed to express data-parallel kernels in
+numba-dpex.
+
+
+Preliminary concepts
+--------------------
+
+Data parallelism
+++++++++++++++++
+
+Single Program Multiple Data
+++++++++++++++++++++++++++++
+
+Range vs. Nd-Range Kernels
+++++++++++++++++++++++++++
+
+Work items and Work groups
+++++++++++++++++++++++++++
+
+Basic concepts
+--------------
+
+
+Writing a *range* kernel
+++++++++++++++++++++++++
+
+A *range* kernel expresses the simplest form of parallelism available in KAPI:
+a data-parallel execution of the same function by a set of work items. In
+KAPI, an instance of the :py:class:`numba_dpex.kernel_api.Range` class
+represents the set of work items, and each work item in the ``Range`` is
+represented by an instance of the :py:class:`numba_dpex.kernel_api.Item`
+class. As such, these two classes are essential to writing a range kernel in
+KAPI.
+
+.. literalinclude:: ./../../../../numba_dpex/examples/kernel/vector_sum.py
+    :language: python
+    :lines: 8-9, 11-15
+    :caption: **EXAMPLE:** A KAPI range kernel
+    :name: ex_kernel_declaration_vector_sum
+
+:ref:`ex_kernel_declaration_vector_sum` shows an example of a range kernel.
+Every range kernel requires its first argument to be an ``Item`` instance and
+needs to be launched via :py:func:`numba_dpex.experimental.launcher.call_kernel`
+by passing an instance of a ``Range`` object.
+
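+The next listing is a minimal sketch of that launch sequence. It is an
+illustration rather than a copy of the example file: the kernel body, the
+``vecadd`` name, and the use of ``dpnp`` arrays are assumptions made for the
+sketch, and the decorator and launcher are assumed to be importable from
+``numba_dpex.experimental``.
+
+.. code-block:: python
+
+    import dpnp
+    from numba_dpex import experimental as dpex_exp
+    from numba_dpex import kernel_api as kapi
+
+    @dpex_exp.kernel
+    def vecadd(item: kapi.Item, a, b, c):
+        # Each work item reads one element of a and b and writes one
+        # element of c.
+        i = item.get_id(0)
+        c[i] = a[i] + b[i]
+
+    a = dpnp.arange(1024, dtype=dpnp.float32)
+    b = dpnp.arange(1024, dtype=dpnp.float32)
+    c = dpnp.zeros(1024, dtype=dpnp.float32)
+
+    # Create one work item per array element and submit the kernel.
+    dpex_exp.call_kernel(vecadd, kapi.Range(1024), a, b, c)
+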
+Do note that a ``Range`` object only controls the creation of work items; the
+distribution of work and data over the ``Range`` still needs to be defined by
+the user-written function. In the example, each work item accesses a single
+element of each of the three arrays and performs a single addition operation.
+It is possible to write the kernel differently so that each work item accesses
+multiple data elements or conditionally performs different amounts of work. The
+data access patterns in a work item can have performance implications and
+programmers should refer to more specialized material, such as the
+`oneAPI GPU optimization guide`_, to learn more.
+
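+As an illustration of that flexibility, the following hypothetical variant of
+the kernel makes each work item process a contiguous chunk of elements. The
+chunk size and the launch configuration are assumptions chosen for the sketch.
+
+.. code-block:: python
+
+    @dpex_exp.kernel
+    def vecadd_chunked(item: kapi.Item, a, b, c):
+        # Each work item handles a chunk of four consecutive elements
+        # instead of a single one.
+        chunk = 4
+        start = item.get_id(0) * chunk
+        for i in range(start, start + chunk):
+            if i < c.shape[0]:  # guard against out-of-bounds accesses
+                c[i] = a[i] + b[i]
+
+    # A quarter as many work items now cover the same 1024 elements.
+    dpex_exp.call_kernel(vecadd_chunked, kapi.Range(256), a, b, c)
+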
+A range kernel is meant to express a basic *parallel-for* computation that is
+ideally suited for embarrassingly parallel operations such as elementwise
+computations over ndarrays. The API for expressing a range kernel does not
+allow advanced features such as synchronization of work items or fine-grained
+control over memory allocation on a device.
+
+Writing an *nd-range* kernel
+++++++++++++++++++++++++++++
+
+The ``device_func`` decorator
++++++++++++++++++++++++++++++
+
+Supported mathematical operations
++++++++++++++++++++++++++++++++++
+
+Supported Python operators
+++++++++++++++++++++++++++
+
+Supported kernel arguments
+++++++++++++++++++++++++++
+
+Launching a kernel
+++++++++++++++++++
+
+Advanced topics
+---------------
+
+Local memory allocation
++++++++++++++++++++++++
+
+Private memory allocation
++++++++++++++++++++++++++
+
+Group barrier synchronization
++++++++++++++++++++++++++++++
+
+Atomic operations
++++++++++++++++++
+
+Async kernel execution
+++++++++++++++++++++++
+
+Specializing a kernel or a device_func
+++++++++++++++++++++++++++++++++++++++

docs/source/user_guide/kernel_programming/memory_allocation_address_space.rst

Lines changed: 0 additions & 36 deletions
This file was deleted.

docs/source/user_guide/kernel_programming/random.rst

Lines changed: 0 additions & 58 deletions
This file was deleted.

docs/source/user_guide/kernel_programming/ufunc.rst

Lines changed: 0 additions & 52 deletions
This file was deleted.
