Commit b2363ac

nnethercote authored and LegNeato committed

Various guide improvements.

- Fix various typos, badly expressed sentences, etc.
- Streamline the "The CUDA Pipeline" section, which is repetitive and contains a broken link to a non-existent image.
- Remove the reference to LLVM 12/13, which are now very old.
- Tweak the text about supported CUDA versions; 12.x support is no longer experimental.
- It's now `rust-toolchain.toml`, not `rust-toolchain`. And there is no need to include an out-of-date copy of it in the docs.
- Remove a reference to `spirv_builder` which isn't that helpful.
- Fix a broken `Dockerfile` link.
1 parent 7535fb8 commit b2363ac

7 files changed (+54 −87 lines)

guide/src/cuda/gpu_computing.md
Lines changed: 1 addition & 1 deletion

@@ -44,7 +44,7 @@ In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory a
 frees users from having to think about that, which yields safer, more reliable code.
 
 Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result.
-Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unrealiable code. For this purpose,
+Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose,
 both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes
 dependency issues down the line.

guide/src/cuda/pipeline.md
Lines changed: 22 additions & 42 deletions

@@ -1,53 +1,33 @@
 # The CUDA Pipeline
 
-As you may already know, "traditional" cuda is usually in the form of CUDA C/C++ files which use `.cu` extension. These files
-can be compiled using NVCC (NVIDIA CUDA Compiler) into an executable.
+CUDA is traditionally used via CUDA C/C++ files which have a `.cu` extension. These files can be
+compiled using NVCC (NVIDIA CUDA Compiler) into an executable.
 
-CUDA files consist of **device** and **host** functions. **device** functions are functions that run on the GPU, also called kernels.
-**host** functions run on the CPU and usually include logic on how to allocate GPU memory and call device functions.
+CUDA files consist of **device** and **host** functions. **Device** functions run on the GPU, and
+are also called kernels. **Host** functions run on the CPU and usually include logic on how to
+allocate GPU memory and call device functions.
 
-However, a lot goes on behind the scenes that most people don't know about, a lot of it is integral to how rustc_codegen_nvvm works
-so we will briefly go over it.
+Behind the scenes, NVCC has several stages of compilation.
 
-# Stages
-
-The NVIDIA CUDA Compiler consists of distinct stages of compilation:
-
-[![NVCC]](graphics/cuda-compilation-from-cu-to-executable.png)
-
-NVCC separates device and host functions and compiles them separately.
-Most importantly, device functions are compiled to LLVM IR, and then the LLVM IR is fed to a library
-called `libnvvm`.
-
-`libnvvm` is a closed source library which takes in a subset of LLVM IR, it optimizes it further, then it
-turns it into the next and most important stage of compilation, the PTX ISA.
-
-PTX is a low level, assembly-like format with an open specification which can be targeted by any language.
-
-We won't dig deep into what happens after PTX, but in essence, it is turned into a final format called SASS
-which is register allocated and is finally sent to the GPU to execute.
-
-# libnvvm
-
-The stage/library we are most interested in is `libnvvm`. libnvvm is a closed source library that is
-distributed in every download of the CUDA SDK. Libnvvm takes a format called NVVM IR, it optimizes it, and
-converts it to a single PTX file you can run on NVIDIA GPUs using the driver or runtime API.
-
-NVVM IR is a subset of LLVM IR, that is to say, it is a version of LLVM IR with restrictions. A couple
-of examples being:
-- Many intrinsics are unsupported
-- "Irregular" integer types such as `i4` or `i111` are unsupported and will segfault (however in theory they should be supported)
+First, NVCC separates device and host functions and compiles them separately. Device functions are
+compiled to [NVVM IR](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html), a subset of LLVM IR
+with additional restrictions including the following.
+- Many intrinsics are unsupported.
+- "Irregular" integer types such as `i4` or `i111` are unsupported and will segfault (however in
+  theory they should be supported).
 - Global names cannot include `.`.
 - Some linkage types are not supported.
-- Function ABIs are ignored, everything uses the PTX calling convention.
+- Function ABIs are ignored; everything uses the PTX calling convention.
 
-You can find the full specification of the NVVM IR [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html) if you are interested.
-
-# Special PTX features
-
-As far as an assembly format goes, PTX is fairly user friendly for a couple of reasons:
+libnvvm is a closed source library which takes NVVM IR, optimizes it further, then converts it to
+PTX. PTX is a low level, assembly-like format with an open specification which can be targeted by
+any language. For an assembly format, PTX is fairly user-friendly.
 - It is well formatted.
 - It is mostly fully specified (other than the iffy grammar specification).
-- It uses named registers/parameters
-- It uses virtual registers (since gpus have thousands of registers, listing all of them out would be unrealistic).
+- It uses named registers/parameters.
+- It uses virtual registers. (Because gpus have thousands of registers, listing all of them out
+  would be unrealistic.)
 - It uses ASCII as a file encoding.
+
+PTX can be run on NVIDIA GPUs using the driver API or runtime API. Those APIs will convert the PTX
+into a final format called SASS which is register allocated and executed on the GPU.
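One NVVM IR restriction listed in this diff (global names cannot contain `.`) is worth a concrete sketch: a codegen targeting libnvvm has to rewrite Rust symbol names before emitting globals. The `sanitize_global_name` helper below is hypothetical, not part of rustc_codegen_nvvm's actual API, and it assumes `$` is an acceptable replacement character; it only illustrates the kind of renaming required.

```rust
// Hypothetical illustration only: NVVM IR rejects global names containing
// '.', so a symbol like "core.ops.add" must be rewritten before emission.
fn sanitize_global_name(name: &str) -> String {
    // '$' is assumed here to be a character libnvvm accepts in identifiers.
    name.replace('.', "$")
}

fn main() {
    assert_eq!(sanitize_global_name("core.ops.add"), "core$ops$add");
    // Names without '.' pass through unchanged.
    assert_eq!(sanitize_global_name("kernel"), "kernel");
}
```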

guide/src/faq.md
Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ over CUDA C/C++ with the same (or better!) performance and features, therefore,
 Short answer, no.
 
 Long answer, there are a couple of things that make this impossible:
-- At the time of writing, libnvvm expects LLVM 7 bitcode, giving it LLVM 12/13 bitcode (which is what rustc uses) does not work.
+- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM versions (which is what rustc uses) does not work.
 - NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed.
  This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention
  many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.

guide/src/guide/README.md
Lines changed: 2 additions & 0 deletions

@@ -1 +1,3 @@
 # Guide
+
+This section covers some of the basics.

guide/src/guide/getting_started.md
Lines changed: 15 additions & 31 deletions

@@ -6,11 +6,7 @@ This section covers how to get started writing GPU crates with `cuda_std` and `c
 
 Before you can use the project to write GPU crates, you will need a couple of prerequisites:
 
-- [The CUDA SDK](https://developer.nvidia.com/cuda-downloads), version `11.2-11.8` (and the appropriate driver - [see cuda release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)).
-
-- We recently [added experimental support for the `12.x`
-  SDK](https://github.com/Rust-GPU/rust-cuda/issues/100), please file any issues you
-  see
+- [The CUDA SDK](https://developer.nvidia.com/cuda-downloads), version 11.2 or later (and the appropriate driver - [see CUDA release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)).
 
 This is only for building GPU crates, to execute built PTX you only need CUDA `9+`.

@@ -27,13 +23,14 @@ Before you can use the project to write GPU crates, you will need a couple of pr
 
 - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config
 
-## rust-toolchain
+## rust-toolchain.toml
 
-Currently, the Codegen only works on nightly (because it uses rustc internals), and it only works on a specific version of nightly.
-This is why you must copy the `rust-toolchain` file in the project repository to your own project. This will ensure
-you are on the correct nightly version so the codegen builds.
+NVVM codegen currently requires a specific version of Rust nightly, because it uses rustc internals
+that are subject to change. Therefore, you must copy the `rust-toolchain.toml` file in the project
+repository so that your own project uses the correct nightly version.
 
-Only the codegen requires nightly, `cust` and other CPU-side libraries work perfectly fine on stable.
+Note: `cust` and other CPU-side libraries work with stable Rust, but they will end up being
+compiled with the version of nightly specified in `rust-toolchain.toml`.
 
 ## Cargo.toml
@@ -111,9 +108,9 @@ thread, with the number of threads being decided by the caller (the CPU).
 We call these parameters the launch dimensions of the kernel. Launch dimensions are split
 up into two basic concepts:
 
-- Threads, a single thread executes the GPU kernel **once**, and it makes the index
+- **Threads:** A single thread executes the GPU kernel **once**, and it makes the index
   of itself available to the kernel through special registers (functions in our case).
-- Blocks, Blocks house multiple threads that they execute on their own. Thread indices
+- **Blocks:** A single block houses multiple threads that it executes on its own. Thread indices
   are only unique across the thread's block, therefore CUDA also exposes the index
   of the current block.
@@ -150,8 +147,8 @@ If you have used CUDA C++ before, this should seem fairly familiar, with a few o
 is unsound. The reason being that rustc assumes `&mut` does not alias. However, because every thread gets a copy of the arguments, this would cause it to alias, thereby violating
 this invariant and yielding technically unsound code. Pointers do not have such an invariant on the other hand. Therefore, we use a pointer and only make a mutable reference once we
 are sure the elements are disjoint: `let elem = &mut *c.add(idx);`.
-- We check that the index is not out of bounds before doing anything, this is because it is
-  common to launch kernels with thread amounts that are not exactly divisible by the length for optimization.
+- We check that the index is not out of bounds before doing anything, because it is common to
+  launch kernels with thread counts that are not exactly divisible by the length for optimization.
 
 Internally what this does is it first checks that a couple of things are right in the kernel:
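The bounds-check idiom from the hunk above can be sketched on the CPU. This is an illustrative stand-in, not the guide's actual kernel: the index is passed as a plain argument rather than derived from thread/block registers, and slices stand in for the raw pointer arguments a kernel would receive.

```rust
// Illustrative CPU-side stand-in for the kernel's bounds-check idiom.
// `idx` plays the role of the global thread index.
fn vec_add_at(a: &[f32], b: &[f32], c: &mut [f32], idx: usize) {
    // Threads whose index is past the end of the buffers do nothing.
    if idx < c.len() {
        c[idx] = a[idx] + b[idx];
    }
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [10.0f32, 20.0];
    let mut c = [0.0f32; 2];
    // Pretend 4 "threads" were launched for only 2 elements; the extra
    // two indices are safely ignored by the bounds check.
    for idx in 0..4 {
        vec_add_at(&a, &b, &mut c, idx);
    }
    assert_eq!(c, [11.0, 22.0]);
}
```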
@@ -165,8 +162,7 @@ It also applies `#[no_mangle]` so the name of the kernel is the same as it is de
 ## Building the GPU crate
 
 Now that you have some kernels defined in a crate, you can build them easily using `cuda_builder`.
-`cuda_builder` is a helper crate similar to `spirv_builder` (if you have used rust-gpu before), it builds
-GPU crates while passing everything needed by rustc.
+which builds GPU crates while passing everything needed by rustc.
 
 To use it you can simply add it as a build dependency in your CPU crate (the crate running the GPU kernels):

@@ -216,23 +212,11 @@ static PTX: &str = include_str!("some/path.ptx");
 
 Then execute it using cust.
 
-Don't forget to include the current `rust-toolchain` in the top of your project:
-
-```toml
-# If you see this, run `rustup self update` to get rustup 1.23 or newer.
-
-# NOTE: above comment is for older `rustup` (before TOML support was added),
-# which will treat the first line as the toolchain name, and therefore show it
-# to the user in the error, instead of "error: invalid channel name '[toolchain]'".
-
-[toolchain]
-channel = "nightly-2021-12-04"
-components = ["rust-src", "rustc-dev", "llvm-tools-preview"]
-```
+Don't forget to include the current `rust-toolchain.toml` at the top of your project.
 
 ## Docker
 
-There is also a [Dockerfile](Dockerfile) prepared as a quickstart with all the necessary libraries for base cuda development.
+There are also some [Dockerfiles](https://github.com/Rust-GPU/rust-cuda/tree/main/container) prepared as a quickstart with all the necessary libraries for base CUDA development.
 
 You can use it as follows (assuming your clone of Rust CUDA is at the absolute path `RUST_CUDA`):

@@ -244,7 +228,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p
 
 **Notes:**
 
-1. refer to [rust-toolchain](#rust-toolchain) to ensure you are using the correct toolchain in your project.
+1. refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project.
 2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
 3. if you have issues within the container, it can help to start ensuring your gpu is recognized
  - ensure `nvidia-smi` provides meaningful output in the container

guide/src/guide/kernel_abi.md
Lines changed: 3 additions & 3 deletions

@@ -82,12 +82,12 @@ unsafe {
 
 ## Arrays
 
-Arrays are passed the same as if they were structs, they are always passed by value as byte arrays.
+Like structs, arrays are always passed by value as byte arrays.
 
 ## Slices
 
-Slices are passed as **two parameters**, both 32-bit on `nvptx` or 64-bit on `nvptx64`. The first parameter is the pointer
-to the beginning of the data, and the second parameter is the length of the slice.
+Slices are passed as **two word-sized parameters**: a pointer to the beginning of the data, and an
+integer giving the length of the slice.
 
 For example:
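The two-word slice ABI described in this hunk can be observed on the CPU with safe Rust plus one unsafe read. This sketch is only illustrative: it decomposes a `&[u32]` into the pointer and length pair that, per the guide text, a kernel receives as two separate parameters.

```rust
fn main() {
    let data = [1u32, 2, 3, 4];
    let slice: &[u32] = &data;

    // A &[T] is represented as a (pointer, length) pair; these are the
    // two word-sized values a kernel receives for a slice parameter.
    let ptr: *const u32 = slice.as_ptr();
    let len: usize = slice.len();

    assert_eq!(len, 4);
    // Reading through the raw pointer recovers the slice's elements.
    assert_eq!(unsafe { *ptr.add(2) }, 3);
}
```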

guide/src/guide/safety.md
Lines changed: 10 additions & 9 deletions

@@ -20,9 +20,10 @@ Undefined behavior on the GPU is defined as potentially being able to cause the
 - Causing LLVM/NVVM to optimize the code into unknown code.
 
 Behavior considered undefined inside of GPU kernels:
-- Most importantly, any behavior that is considered undefined on the CPU, is considered undefined
-  on the GPU too. See: https://doc.rust-lang.org/reference/behavior-considered-undefined.html.
-  The only exception being invalid sizes for buffers given to a GPU kernel.
+- Most importantly, any behavior that is [considered undefined on the
+  CPU](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) is considered
+  undefined on the GPU too. The only exception being invalid sizes for buffers given to a GPU
+  kernel.
 
 Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the
 kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future.

@@ -42,7 +43,7 @@ the CPU.
 
 ### Streams
 
-Streams will always execute concurrently with eachother. That is to say, kernels launched
+Streams will always execute concurrently with each other. That is to say, kernels launched
 inside of a single stream guarantee that they will be executed one after the other, in order.
 
 However, kernels launched in different streams have no guarantee of execution order, their execution

@@ -54,9 +55,9 @@ Therefore, it is undefined behavior to write to the same memory location in kern
 streams without synchronization.
 
 For example:
-1: `Foo` is allocated as a buffer of memory on the GPU.
-2: Stream `1` launches kernel `bar` which writes to `Foo`.
-3: Stream `2` launches kernel `bar` which also writes to `Foo`.
+1. `Foo` is allocated as a buffer of memory on the GPU.
+2. Stream `1` launches kernel `bar` which writes to `Foo`.
+3. Stream `2` launches kernel `bar` which also writes to `Foo`.
 
 This is undefined behavior because the kernels are likely to be executed concurrently, causing a data
 race when multiple kernels try to write to the same memory.

@@ -100,8 +101,8 @@ The following invariants must be upheld by the caller of a kernel, failure to do
 behavior to launch the kernel with 3d thread indices (which would cause a data race). However, it is not undefined behavior
 to launch the kernel with a dimensionality lower than expected, e.g. launching a 2d kernel with a 1d dimensionality.
 - The types expected by the kernel must match:
-  - If the kernel expects a struct, if the struct is repr(Rust), the struct must be the actual struct from the kernel library,
-    otherwise, if it is repr(C) (which is reccomended), the fields must all match, including alignment and order of fields.
+  - If the kernel expects a struct, if the struct is `repr(Rust)`, the struct must be the actual struct from the kernel library,
+    otherwise, if it is `repr(C)` (which is recommended), the fields must all match, including alignment and order of fields.
 - Reference aliasing rules must not be violated, including:
   - Immutable references are allowed to be aliased, e.g. if a kernel expects `&T` and `&T`, it is sound to pass the same pointer for both.
   - Data behind an immutable reference must not be modified, meaning, it is undefined behavior to pass the same pointer to `&T` and `*mut T`,
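The `repr(C)` point in the last hunk can be made concrete. The struct below is a hypothetical example, not from the guide: with `repr(C)`, field order, alignment, and size are defined, so a CPU-side copy and a GPU-side copy with the same definition are guaranteed to agree on layout, whereas `repr(Rust)` layout is unspecified.

```rust
// Hypothetical example of a kernel-parameter struct. repr(C) gives it a
// defined layout (field order and alignment), so separately compiled CPU
// and GPU crates that both declare it this way agree on how it is passed.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Params {
    scale: f32, // 4 bytes, align 4
    count: u32, // 4 bytes, align 4
}

fn main() {
    // With repr(C) the layout is predictable: two 4-byte fields in order.
    assert_eq!(std::mem::size_of::<Params>(), 8);
    assert_eq!(std::mem::align_of::<Params>(), 4);
}
```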
