
Commit d5d188e

Changed title, fixed references, added intro borrowed from README
1 parent d62efb4 commit d5d188e

File tree

6 files changed: +66 -9 lines changed

docs/sphinx-combined/cli_overview.rst

Lines changed: 3 additions & 0 deletions

@@ -4,9 +4,12 @@ CLI Options
 Every benchmark created with NVBench supports command-line interface,
 with a variety of options.
 
+.. _cli-overview:
+
 .. include:: ../cli_help.md
    :parser: myst_parser.sphinx_
 
+.. _cli-overview-axes:
 
 .. include:: ../cli_help_axis.md
    :parser: myst_parser.sphinx_

docs/sphinx-combined/conf.py

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 import os
 
-project = "NVBench API"
+project = "NVBench: CUDA Kernel Benchmarking Library"
 author = "NVIDIA Corporation"
 
 extensions = [

docs/sphinx-combined/cpp_benchmarks.md

Lines changed: 3 additions & 0 deletions

@@ -74,6 +74,7 @@ NVBENCH_BENCH(my_benchmark);
 
 A full example can be found in [examples/stream.cu][CppExample_Stream].
 
+(parameter-axes)=
 ## Parameter Axes
 
 Some kernels will be used with a variety of options, input data types/sizes, and

@@ -166,6 +167,7 @@ NVBENCH_BENCH(benchmark).add_string_axis("RNG Distribution", {"Uniform", "Gaussi
 A common use for string axes is to encode enum values, as shown in
 [examples/enums.cu][CppExample_Enums].
 
+(type-axes)=
 ### Type Axes
 
 Another common situation involves benchmarking a templated kernel with multiple

@@ -244,6 +246,7 @@ times. Keep the rapid growth of these combinations in mind when choosing the
 number of values in an axis. See the section about combinatorial explosion for
 more examples and information.
 
+(throughput-measurements)=
 ## Throughput Measurements
 
 In additional to raw timing information, NVBench can track a kernel's
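The combinatorial growth that the hunk above warns about can be sketched in plain Python: NVBench runs one measurement per point in the cartesian product of all axes, so the configuration count is the product of the axis sizes. Axis names and values below are illustrative, not taken from the documented benchmarks.

```python
from itertools import product

# Hypothetical axes: one string axis, one type axis, one int64 axis.
axes = {
    "RNG Distribution": ["Uniform", "Gaussian"],
    "ValueType": ["int32", "float64"],
    "Elements": [1000, 10000, 100000],
}

# One benchmark configuration per point in the cartesian product.
configs = list(product(*axes.values()))
print(len(configs))  # 2 * 2 * 3 = 12 combinations
```

Adding a fourth axis with four values would quadruple this to 48 runs, which is why the docs advise keeping axis sizes in check.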

docs/sphinx-combined/index.rst

Lines changed: 41 additions & 3 deletions

@@ -1,7 +1,45 @@
-NVBench: CUDA Kernel Benchmarking Library
-=========================================
+CUDA Kernel Benchmarking Library
+================================
 
-The library presently supports kernel benchmarking in C++ and in Python.
+The library, NVBench, presently supports writing benchmarks in C++ and in Python.
+It is designed to simplify CUDA kernel benchmarking. It features:
+
+* :ref:`Parameter sweeps <parameter-axes>`: a powerful and
+  flexible "axis" system explores a kernel's configuration space. Parameters may
+  be dynamic numbers/strings or :ref:`static types <type-axes>`.
+* :ref:`Runtime customization <cli-overview>`: A rich command-line interface
+  allows :ref:`redefinition of parameter axes <cli-overview-axes>`, CUDA device
+  selection, locking GPU clocks (Volta+), changing output formats, and more.
+* :ref:`Throughput calculations <throughput-measurements>`: Compute
+  and report:
+
+  * Item throughput (elements/second)
+  * Global memory bandwidth usage (bytes/second and per-device %-of-peak-bw)
+
+* Multiple output formats: Currently supports markdown (default) and CSV output.
+* :ref:`Manual timer mode <explicit-timer-mode>`:
+  (optional) Explicitly start/stop timing in a benchmark implementation.
+* Multiple measurement types:
+
+  * Cold Measurements:
+
+    * Each sample runs the benchmark once with a clean device L2 cache.
+    * GPU and CPU times are reported.
+
+  * Batch Measurements:
+
+    * Executes the benchmark multiple times back-to-back and records total time.
+    * Reports the average execution time (total time / number of executions).
+
+  * :ref:`CPU-only Measurements <cpu-only-benchmarks>`:
+
+    * Measures the host-side execution time of a non-GPU benchmark.
+    * Not suitable for microbenchmarking.
+
+Check out `GPU Mode talk #56 <https://www.youtube.com/watch?v=CtrqBmYtSEki>`_ for an overview
+of the challenges inherent to CUDA kernel benchmarking and how NVBench solves them for you!
+
+-------
 
 .. toctree::
    :maxdepth: 2
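The metrics this intro lists reduce to simple arithmetic, sketched below with made-up numbers (none of this is NVBench output; the element count, kernel time, and peak bandwidth are assumptions for illustration):

```python
# Illustrative inputs, not NVBench measurements.
elements = 16 * 1024 * 1024      # items processed per kernel launch
bytes_moved = elements * 8       # e.g. one 4-byte read + one 4-byte write per item
gpu_time_s = 0.002               # measured kernel time, in seconds
peak_bw = 900e9                  # assumed device peak bandwidth, bytes/second

item_throughput = elements / gpu_time_s          # elements/second
achieved_bw = bytes_moved / gpu_time_s           # bytes/second
percent_of_peak = 100.0 * achieved_bw / peak_bw  # per-device %-of-peak-bw

# Batch measurement: average = total time / number of executions
total_time_s, n_execs = 0.5, 250
batch_avg_s = total_time_s / n_execs

print(item_throughput, percent_of_peak, batch_avg_s)
```

The batch average amortizes launch overhead across runs, while a cold measurement (clean L2 per sample) isolates single-launch behavior.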

docs/sphinx-combined/py_benchmarks.md

Lines changed: 7 additions & 3 deletions

@@ -21,14 +21,18 @@ def benchmark_impl(state: State) -> None:
     data = generate(n, state.get_stream())
 
     # body that is being timed. Must execute
-    # on the stream handed over by NVBench
-    launchable_fn : Callable[[Launch], None] =
+    # on the stream handed over by NVBench.
+    # Typically launches a kernel of interest
+    launch_fn : Callable[[Launch], None] =
         lambda launch: impl(data, launch.get_stream())
 
-    state.exec(launchable_fn)
+    state.exec(launch_fn)
 
 
 bench = register(benchmark_impl)
+# provide kernel a name
+bench.set_name("my_package_kernel")
+# specify default values of parameter to run benchmark with
 bench.add_int64_axis("Elements", [1000, 10000, 100000])
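The exec-a-callable control flow documented in the hunk above can be imitated on the CPU with minimal stand-ins. `State` and `Launch` below are stubs written for this sketch only; they show the shape of the pattern, not the real `cuda.bench` API:

```python
import time
from typing import Callable

class Launch:
    """Stub: the real Launch carries the CUDA stream NVBench times on."""
    def get_stream(self):
        return None

class State:
    """Stub: the real State.exec times fn on a CUDA stream; here we just call it."""
    def exec(self, fn: Callable[[Launch], None]) -> None:
        start = time.perf_counter()
        fn(Launch())
        self.elapsed = time.perf_counter() - start

def benchmark_impl(state: State) -> None:
    data = list(range(1000))
    # body being timed is handed to exec as a callable taking a Launch
    launch_fn = lambda launch: sum(data)
    state.exec(launch_fn)

s = State()
benchmark_impl(s)
print(s.elapsed >= 0.0)  # True
```

The real library invokes the callable once per sample; the key constraint from the docs is that all timed work must run on the stream obtained from the `Launch` object.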

docs/sphinx-combined/python_api.rst

Lines changed: 11 additions & 2 deletions

@@ -1,5 +1,14 @@
-cuda.bench Python API Reference
-===============================
+`cuda.bench` Python API Reference
+=================================
+
+Python package ``cuda.bench`` is designed to empower
+users to write CUDA kernel benchmarks in Python.
+
+Alignment with behavior of benchmarks written in C++
+allows for meaningful comparison between them.
+
+Classes and functions
+---------------------
 
 .. automodule:: cuda.bench
    :members:
