Commit b2363ac

nnethercote authored and LegNeato committed

Various guide improvements.

- Fix various typos, badly expressed sentences, etc.
- Streamline the "The CUDA Pipeline" section, which is repetitive and contains a broken link to a non-existent image.
- Remove the reference to LLVM 12/13, which are now very old.
- Tweak the text about supported CUDA versions; 12.x support is no longer experimental.
- It's now `rust-toolchain.toml`, not `rust-toolchain`. And there is no need to include an out-of-date copy of it in the docs.
- Remove a reference to `spirv_builder` which isn't that helpful.
- Fix a broken `Dockerfile` link.
1 parent 7535fb8 commit b2363ac

7 files changed (+54 −87 lines)

guide/src/cuda/gpu_computing.md
Lines changed: 1 addition & 1 deletion

@@ -44,7 +44,7 @@ In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory a
 frees users from having to think about that, which yields safer, more reliable code.
 
 Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result.
-Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unrealiable code. For this purpose,
+Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose,
 both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes
 dependency issues down the line.

guide/src/cuda/pipeline.md
Lines changed: 22 additions & 42 deletions

@@ -1,53 +1,33 @@
 # The CUDA Pipeline
 
-As you may already know, "traditional" cuda is usually in the form of CUDA C/C++ files which use `.cu` extension. These files
-can be compiled using NVCC (NVIDIA CUDA Compiler) into an executable.
+CUDA is traditionally used via CUDA C/C++ files which have a `.cu` extension. These files can be
+compiled using NVCC (NVIDIA CUDA Compiler) into an executable.
 
-CUDA files consist of **device** and **host** functions. **device** functions are functions that run on the GPU, also called kernels.
-**host** functions run on the CPU and usually include logic on how to allocate GPU memory and call device functions.
+CUDA files consist of **device** and **host** functions. **Device** functions run on the GPU, and
+are also called kernels. **Host** functions run on the CPU and usually include logic on how to
+allocate GPU memory and call device functions.
 
-However, a lot goes on behind the scenes that most people don't know about, a lot of it is integral to how rustc_codegen_nvvm works
-so we will briefly go over it.
+Behind the scenes, NVCC has several stages of compilation.
 
-# Stages
-
-The NVIDIA CUDA Compiler consists of distinct stages of compilation:
-
-[![NVCC]](graphics/cuda-compilation-from-cu-to-executable.png)
-
-NVCC separates device and host functions and compiles them separately.
-Most importantly, device functions are compiled to LLVM IR, and then the LLVM IR is fed to a library
-called `libnvvm`.
-
-`libnvvm` is a closed source library which takes in a subset of LLVM IR, it optimizes it further, then it
-turns it into the next and most important stage of compilation, the PTX ISA.
-
-PTX is a low level, assembly-like format with an open specification which can be targeted by any language.
-
-We won't dig deep into what happens after PTX, but in essence, it is turned into a final format called SASS
-which is register allocated and is finally sent to the GPU to execute.
-
-# libnvvm
-
-The stage/library we are most interested in is `libnvvm`. libnvvm is a closed source library that is
-distributed in every download of the CUDA SDK. Libnvvm takes a format called NVVM IR, it optimizes it, and
-converts it to a single PTX file you can run on NVIDIA GPUs using the driver or runtime API.
-
-NVVM IR is a subset of LLVM IR, that is to say, it is a version of LLVM IR with restrictions. A couple
-of examples being:
-- Many intrinsics are unsupported
-- "Irregular" integer types such as `i4` or `i111` are unsupported and will segfault (however in theory they should be supported)
+First, NVCC separates device and host functions and compiles them separately. Device functions are
+compiled to [NVVM IR](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html), a subset of LLVM IR
+with additional restrictions including the following.
+- Many intrinsics are unsupported.
+- "Irregular" integer types such as `i4` or `i111` are unsupported and will segfault (however in
+  theory they should be supported).
 - Global names cannot include `.`.
 - Some linkage types are not supported.
-- Function ABIs are ignored, everything uses the PTX calling convention.
+- Function ABIs are ignored; everything uses the PTX calling convention.
 
-You can find the full specification of the NVVM IR [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html) if you are interested.
-
-# Special PTX features
-
-As far as an assembly format goes, PTX is fairly user friendly for a couple of reasons:
+libnvvm is a closed source library which takes NVVM IR, optimizes it further, then converts it to
+PTX. PTX is a low level, assembly-like format with an open specification which can be targeted by
+any language. For an assembly format, PTX is fairly user-friendly.
 - It is well formatted.
 - It is mostly fully specified (other than the iffy grammar specification).
-- It uses named registers/parameters
-- It uses virtual registers (since gpus have thousands of registers, listing all of them out would be unrealistic).
+- It uses named registers/parameters.
+- It uses virtual registers. (Because gpus have thousands of registers, listing all of them out
+  would be unrealistic.)
 - It uses ASCII as a file encoding.
+
+PTX can be run on NVIDIA GPUs using the driver API or runtime API. Those APIs will convert the PTX
+into a final format called SASS which is register allocated and executed on the GPU.
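One NVVM IR restriction listed in this diff (global names cannot contain `.`) is worth a concrete sketch: a codegen targeting libnvvm has to rewrite Rust symbol names before emitting globals. The `sanitize_global_name` helper below is hypothetical, not part of rustc_codegen_nvvm's actual API, and it assumes `$` is an acceptable replacement character; it only illustrates the kind of renaming required.

```rust
// Hypothetical illustration only: NVVM IR rejects global names containing
// '.', so a symbol like "core.ops.add" must be rewritten before emission.
fn sanitize_global_name(name: &str) -> String {
    // '$' is assumed here to be a character libnvvm accepts in identifiers.
    name.replace('.', "$")
}

fn main() {
    assert_eq!(sanitize_global_name("core.ops.add"), "core$ops$add");
    // Names without '.' pass through unchanged.
    assert_eq!(sanitize_global_name("kernel"), "kernel");
}
```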

guide/src/faq.md
Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ over CUDA C/C++ with the same (or better!) performance and features, therefore,
 Short answer, no.
 
 Long answer, there are a couple of things that make this impossible:
-- At the time of writing, libnvvm expects LLVM 7 bitcode, giving it LLVM 12/13 bitcode (which is what rustc uses) does not work.
+- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM versions (which is what rustc uses) does not work.
 - NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed.
  This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention
  many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.

guide/src/guide/README.md
Lines changed: 2 additions & 0 deletions

@@ -1 +1,3 @@
 # Guide
+
+This section covers some of the basics.

guide/src/guide/getting_started.md
Lines changed: 15 additions & 31 deletions

@@ -6,11 +6,7 @@ This section covers how to get started writing GPU crates with `cuda_std` and `c
 
 Before you can use the project to write GPU crates, you will need a couple of prerequisites:
 
-- [The CUDA SDK](https://developer.nvidia.com/cuda-downloads), version `11.2-11.8` (and the appropriate driver - [see cuda release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)).
-
-- We recently [added experimental support for the `12.x`
-  SDK](https://github.com/Rust-GPU/rust-cuda/issues/100), please file any issues you
-  see
+- [The CUDA SDK](https://developer.nvidia.com/cuda-downloads), version 11.2 or later (and the appropriate driver - [see CUDA release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)).
 
 This is only for building GPU crates, to execute built PTX you only need CUDA `9+`.

@@ -27,13 +23,14 @@ Before you can use the project to write GPU crates, you will need a couple of pr
 
 - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config
 
-## rust-toolchain
+## rust-toolchain.toml
 
-Currently, the Codegen only works on nightly (because it uses rustc internals), and it only works on a specific version of nightly.
-This is why you must copy the `rust-toolchain` file in the project repository to your own project. This will ensure
-you are on the correct nightly version so the codegen builds.
+NVVM codegen currently requires a specific version of Rust nightly, because it uses rustc internals
+that are subject to change. Therefore, you must copy the `rust-toolchain.toml` file in the project
+repository so that your own project uses the correct nightly version.
 
-Only the codegen requires nightly, `cust` and other CPU-side libraries work perfectly fine on stable.
+Note: `cust` and other CPU-side libraries work with stable Rust, but they will end up being
+compiled with the version of nightly specified in `rust-toolchain.toml`.
 
 ## Cargo.toml
@@ -111,9 +108,9 @@ thread, with the number of threads being decided by the caller (the CPU).
 We call these parameters the launch dimensions of the kernel. Launch dimensions are split
 up into two basic concepts:
 
-- Threads, a single thread executes the GPU kernel **once**, and it makes the index
+- **Threads:** A single thread executes the GPU kernel **once**, and it makes the index
   of itself available to the kernel through special registers (functions in our case).
-- Blocks, Blocks house multiple threads that they execute on their own. Thread indices
+- **Blocks:** A single block houses multiple threads that it executes on its own. Thread indices
   are only unique across the thread's block, therefore CUDA also exposes the index
   of the current block.
@@ -150,8 +147,8 @@ If you have used CUDA C++ before, this should seem fairly familiar, with a few o
 is unsound. The reason being that rustc assumes `&mut` does not alias. However, because every thread gets a copy of the arguments, this would cause it to alias, thereby violating
 this invariant and yielding technically unsound code. Pointers do not have such an invariant on the other hand. Therefore, we use a pointer and only make a mutable reference once we
 are sure the elements are disjoint: `let elem = &mut *c.add(idx);`.
-- We check that the index is not out of bounds before doing anything, this is because it is
-  common to launch kernels with thread amounts that are not exactly divisible by the length for optimization.
+- We check that the index is not out of bounds before doing anything, because it is common to
+  launch kernels with thread counts that are not exactly divisible by the length for optimization.
 
 Internally what this does is it first checks that a couple of things are right in the kernel:
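The bounds-check idiom from the hunk above can be sketched on the CPU. This is an illustrative stand-in, not the guide's actual kernel: the index is passed as a plain argument rather than derived from thread/block registers, and slices stand in for the raw pointer arguments a kernel would receive.

```rust
// Illustrative CPU-side stand-in for the kernel's bounds-check idiom.
// `idx` plays the role of the global thread index.
fn vec_add_at(a: &[f32], b: &[f32], c: &mut [f32], idx: usize) {
    // Threads whose index is past the end of the buffers do nothing.
    if idx < c.len() {
        c[idx] = a[idx] + b[idx];
    }
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [10.0f32, 20.0];
    let mut c = [0.0f32; 2];
    // Pretend 4 "threads" were launched for only 2 elements; the extra
    // two indices are safely ignored by the bounds check.
    for idx in 0..4 {
        vec_add_at(&a, &b, &mut c, idx);
    }
    assert_eq!(c, [11.0, 22.0]);
}
```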
@@ -165,8 +162,7 @@ It also applies `#[no_mangle]` so the name of the kernel is the same as it is de
 ## Building the GPU crate
 
 Now that you have some kernels defined in a crate, you can build them easily using `cuda_builder`.
-`cuda_builder` is a helper crate similar to `spirv_builder` (if you have used rust-gpu before), it builds
-GPU crates while passing everything needed by rustc.
+which builds GPU crates while passing everything needed by rustc.
 
 To use it you can simply add it as a build dependency in your CPU crate (the crate running the GPU kernels):

@@ -216,23 +212,11 @@ static PTX: &str = include_str!("some/path.ptx");
 
 Then execute it using cust.
 
-Don't forget to include the current `rust-toolchain` in the top of your project:
-
-```toml
-# If you see this, run `rustup self update` to get rustup 1.23 or newer.
-
-# NOTE: above comment is for older `rustup` (before TOML support was added),
-# which will treat the first line as the toolchain name, and therefore show it
-# to the user in the error, instead of "error: invalid channel name '[toolchain]'".
-
-[toolchain]
-channel = "nightly-2021-12-04"
-components = ["rust-src", "rustc-dev", "llvm-tools-preview"]
-```
+Don't forget to include the current `rust-toolchain.toml` at the top of your project.
 
 ## Docker
 
-There is also a [Dockerfile](Dockerfile) prepared as a quickstart with all the necessary libraries for base cuda development.
+There are also some [Dockerfiles](https://github.com/Rust-GPU/rust-cuda/tree/main/container) prepared as a quickstart with all the necessary libraries for base CUDA development.
 
 You can use it as follows (assuming your clone of Rust CUDA is at the absolute path `RUST_CUDA`):

@@ -244,7 +228,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p
 
 **Notes:**
 
-1. refer to [rust-toolchain](#rust-toolchain) to ensure you are using the correct toolchain in your project.
+1. refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project.
 2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
 3. if you have issues within the container, it can help to start ensuring your gpu is recognized
  - ensure `nvidia-smi` provides meaningful output in the container

guide/src/guide/kernel_abi.md
Lines changed: 3 additions & 3 deletions

@@ -82,12 +82,12 @@ unsafe {
 
 ## Arrays
 
-Arrays are passed the same as if they were structs, they are always passed by value as byte arrays.
+Like structs, arrays are always passed by value as byte arrays.
 
 ## Slices
 
-Slices are passed as **two parameters**, both 32-bit on `nvptx` or 64-bit on `nvptx64`. The first parameter is the pointer
-to the beginning of the data, and the second parameter is the length of the slice.
+Slices are passed as **two word-sized parameters**: a pointer to the beginning of the data, and an
+integer giving the length of the slice.
 
 For example:
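The two-word slice ABI described in this hunk can be observed on the CPU with safe Rust plus one unsafe read. This sketch is only illustrative: it decomposes a `&[u32]` into the pointer and length pair that, per the guide text, a kernel receives as two separate parameters.

```rust
fn main() {
    let data = [1u32, 2, 3, 4];
    let slice: &[u32] = &data;

    // A &[T] is represented as a (pointer, length) pair; these are the
    // two word-sized values a kernel receives for a slice parameter.
    let ptr: *const u32 = slice.as_ptr();
    let len: usize = slice.len();

    assert_eq!(len, 4);
    // Reading through the raw pointer recovers the slice's elements.
    assert_eq!(unsafe { *ptr.add(2) }, 3);
}
```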

guide/src/guide/safety.md
Lines changed: 10 additions & 9 deletions

@@ -20,9 +20,10 @@ Undefined behavior on the GPU is defined as potentially being able to cause the
 - Causing LLVM/NVVM to optimize the code into unknown code.
 
 Behavior considered undefined inside of GPU kernels:
-- Most importantly, any behavior that is considered undefined on the CPU, is considered undefined
-  on the GPU too. See: https://doc.rust-lang.org/reference/behavior-considered-undefined.html.
-  The only exception being invalid sizes for buffers given to a GPU kernel.
+- Most importantly, any behavior that is [considered undefined on the
+  CPU](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) is considered
+  undefined on the GPU too. The only exception being invalid sizes for buffers given to a GPU
+  kernel.
 
 Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the
 kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future.

@@ -42,7 +43,7 @@ the CPU.
 
 ### Streams
 
-Streams will always execute concurrently with eachother. That is to say, kernels launched
+Streams will always execute concurrently with each other. That is to say, kernels launched
 inside of a single stream guarantee that they will be executed one after the other, in order.
 
 However, kernels launched in different streams have no guarantee of execution order, their execution

@@ -54,9 +55,9 @@ Therefore, it is undefined behavior to write to the same memory location in kern
 streams without synchronization.
 
 For example:
-1: `Foo` is allocated as a buffer of memory on the GPU.
-2: Stream `1` launches kernel `bar` which writes to `Foo`.
-3: Stream `2` launches kernel `bar` which also writes to `Foo`.
+1. `Foo` is allocated as a buffer of memory on the GPU.
+2. Stream `1` launches kernel `bar` which writes to `Foo`.
+3. Stream `2` launches kernel `bar` which also writes to `Foo`.
 
 This is undefined behavior because the kernels are likely to be executed concurrently, causing a data
 race when multiple kernels try to write to the same memory.

@@ -100,8 +101,8 @@ The following invariants must be upheld by the caller of a kernel, failure to do
 behavior to launch the kernel with 3d thread indices (which would cause a data race). However, it is not undefined behavior
 to launch the kernel with a dimensionality lower than expected, e.g. launching a 2d kernel with a 1d dimensionality.
 - The types expected by the kernel must match:
-  - If the kernel expects a struct, if the struct is repr(Rust), the struct must be the actual struct from the kernel library,
-    otherwise, if it is repr(C) (which is reccomended), the fields must all match, including alignment and order of fields.
+  - If the kernel expects a struct, if the struct is `repr(Rust)`, the struct must be the actual struct from the kernel library,
+    otherwise, if it is `repr(C)` (which is recommended), the fields must all match, including alignment and order of fields.
 - Reference aliasing rules must not be violated, including:
   - Immutable references are allowed to be aliased, e.g. if a kernel expects `&T` and `&T`, it is sound to pass the same pointer for both.
   - Data behind an immutable reference must not be modified, meaning, it is undefined behavior to pass the same pointer to `&T` and `*mut T`,
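The `repr(C)` point in the last hunk can be made concrete. The struct below is a hypothetical example, not from the guide: with `repr(C)`, field order, alignment, and size are defined, so a CPU-side copy and a GPU-side copy with the same definition are guaranteed to agree on layout, whereas `repr(Rust)` layout is unspecified.

```rust
// Hypothetical example of a kernel-parameter struct. repr(C) gives it a
// defined layout (field order and alignment), so separately compiled CPU
// and GPU crates that both declare it this way agree on how it is passed.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct Params {
    scale: f32, // 4 bytes, align 4
    count: u32, // 4 bytes, align 4
}

fn main() {
    // With repr(C) the layout is predictable: two 4-byte fields in order.
    assert_eq!(std::mem::size_of::<Params>(), 8);
    assert_eq!(std::mem::align_of::<Params>(), 4);
}
```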
