docs/examples/basic.rst (+5, -5)
@@ -30,8 +30,8 @@ Code Explanation
    :lineno-start: 8

 First, we need to define a ``KernelBuilder`` instance.
-A ``KernelBuilder`` is essentially a `blueprint` that describes the information required to compile the CUDA kernel.
-The constructor takes the name of the kernel function and the `.cu` file where the code is located.
+A ``KernelBuilder`` is essentially a ``blueprint`` that describes the information required to compile the CUDA kernel.
+The constructor takes the name of the kernel function and the ``.cu`` file where the code is located.
 Optionally, we can also provide the kernel source as the third parameter.

@@ -40,15 +40,15 @@ Optionally, we can also provide the kernel source as the third parameter.
    :lineno-start: 11

 CUDA kernels often have tunable parameters that can impact their performance, such as block size, thread granularity, register usage, and the use of shared memory.
-Here, we define two tunable parameters: the number of threads per blocks and the number of elements processed per thread.
+Here, we define two tunable parameters: the number of threads per block and the number of elements processed per thread.


 .. literalinclude:: basic.cpp
    :lines: 15-16
    :lineno-start: 15

-The values returned by ``tune`` are placeholder objecs.
+The values returned by ``tune`` are placeholder objects.
 These objects can be combined using C++ operators to create new expressions objects.
 Note that ``elements_per_block`` does not actually contain a specific value;
 instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of ``threads_per_block`` and ``elements_per_thread``.
@@ -64,7 +64,7 @@ The following properties are supported:

 * ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
 * ``block_size``: A triplet ``(x, y, z)`` representing the block dimensions.
-* ``grid_divsor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
+* ``grid_divisor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
 * ``template_args``: This property specifies template arguments, which can be type names and integral values.
 * ``define``: Define preprocessor constants.
 * ``shared_memory``: Specify the amount of shared memory required, in bytes.
docs/examples/pragma.rst (+4, -4)
@@ -2,7 +2,7 @@ Pragma Kernels
 ===========================

 In the previous examples, we demonstrated how a tunable kernel can be specified by defining a ``KernelBuilder`` instance in the host-side code.
-While this API offers flexiblity, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.
+While this API offers flexibility, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.

 Kernel Launcher also provides a way to define kernel specifications directly in the CUDA code by using pragma directives to annotate the kernel code.
 Although this method is less flexible than the ``KernelBuilder`` API, it is much more convenient and suitable for most CUDA kernels.
@@ -30,7 +30,7 @@ The kernel contains the following ``pragma`` directives:
    :lineno-start: 1

 The tune directives specify the tunable parameters: ``threads_per_block`` and ``items_per_thread``.
-Since ``items_per_thread`` is also the name of the template parameter, so it is passed to the kernel as a compile-time constant via this parameter.
+Since ``items_per_thread`` is also the name of the template parameter, it is passed to the kernel as a compile-time constant via this parameter.
 The value of ``threads_per_block`` is not passed to the kernel but is used by subsequent pragmas.

 .. literalinclude:: vector_add_annotated.cu
@@ -44,7 +44,7 @@ In this case, the constant ``items_per_block`` is defined as the product of ``th
    :lines: 4-6
    :lineno-start: 4

-The ``problem_size`` directive defines the problem size (as discussed in as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies how the problem size should be divided to obtain the thread grid size.
+The ``problem_size`` directive defines the problem size (as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies how the problem size should be divided to obtain the thread grid size.
 Alternatively, ``grid_size`` can be used to specify the grid size directly.

@@ -67,7 +67,7 @@ In this example, the tuning key is ``"vector_add_" + T``, where ``T`` is the nam
 Host Code
 ---------

-The below code shows how to call the kernel from the host in C++::
+The code below shows how to call the kernel from the host in C++::
docs/examples/wisdom.rst (+2, -1)
@@ -6,6 +6,7 @@ Wisdom Files

 In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the `blueprint` for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).

+
 However, determining the optimal configuration can often be challenging, as it depends on both the problem size and the specific type of GPU being used.
 To address this problem, Kernel Launcher provides a solution in the form of **wisdom files** (terminology borrowed from `FFTW <http://www.fftw.org/>`_).

@@ -86,7 +87,7 @@ To do so, we need to run the program with the environment variable ``KERNEL_LAUN
 This generates a file called ``vector_add_1000000.json`` in the directory set by ``set_global_capture_directory``.

 Alternatively, it is possible to capture several kernels at once by using the wildcard ``*``.
-For example, the following command export all kernels that are start with ``vector_``::
+For example, the following command exports all kernels that start with ``vector_``::
docs/index.rst (+12, -9)
@@ -19,9 +19,9 @@ Kernel Launcher

 .. image:: /logo.png
    :width:670
-   :alt:kernel launcher
+   :alt:Kernel Launcher logo

-**Kernel Launcher** is a C++ library that makes it easy to dynamically compile *CUDA* kernels at runtime (using `NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_) and launching them in a type-safe manner using C++ magic. There are two main reasons for using runtime compilation:
+**Kernel Launcher** is a C++ library designed to dynamically compile *CUDA* kernels at runtime (using `NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_) and to launch them in a type-safe manner using C++ magic. Runtime compilation offers two significant advantages:

 * Kernels that have tunable parameters (block size, elements per thread, loop unroll factors, etc.) where the optimal configuration depends on dynamic factors such as the GPU type and problem size.

@@ -33,12 +33,14 @@ Kernel Tuner Integration

 .. image:: /kernel_tuner_integration.png
    :width:670
-   :alt:kernel launcher integration
+   :alt:Kernel Launcher and Kernel Tuner integration


-Kernel Launcher's tight integration with `Kernel Tuner <https://kerneltuner.github.io/>`_ results in highly-tuned kernels, as visualized above.
-Kernel Launcher **captures** kernel launches within your application, which are then **tuned** by Kernel Tuner and saved as **wisdom** files.
-These files are processed by Kernel Launcher during execution to **compile** the tuned kernel at runtime.
+The tight integration of **Kernel Launcher** with `Kernel Tuner <https://kerneltuner.github.io/>`_ ensures that kernels are highly optimized, as illustrated in the image above.
+Kernel Launcher can **capture** kernel launches within your application at runtime.
+These captured kernels can then be **tuned** by Kernel Tuner and the tuning results are saved as **wisdom** files.
+These wisdom files are used by Kernel Launcher during execution to **compile** the tuned kernel at runtime.
+

 See :doc:`examples/wisdom` for an example of how this works in practise.

@@ -48,21 +50,22 @@ See :doc:`examples/wisdom` for an example of how this works in practise.
 Basic Example
 =============

-This sections hows a basic code example. See :ref:`example` for a more advance example.
+This section presents a simple code example illustrating how to use the Kernel Launcher.
+For a more detailed example, refer to :ref:`example`.

 Consider the following CUDA kernel for vector addition.
 This kernel has a template parameter ``T`` and a tunable parameter ``ELEMENTS_PER_THREAD``.

 .. literalinclude:: examples/vector_add.cu


-The following C++ snippet shows how to use *Kernel Launcher* in host code:
+The following C++ snippet demonstrates how to use the Kernel Launcher in the host code:
0 commit comments