
Commit 45d560c

Use consistent capitalization.
The Guide is currently very inconsistent with capitalization of abbreviations. The general trend is towards lower-case for informal English, but for formal English (such as documentation) I think upper-case is still preferable.

- gpu -> GPU
- cuda/Cuda -> CUDA
- rustacuda -> RustaCUDA
- llvm -> LLVM
- nvvm -> NVVM
- ir -> IR
- ptx -> PTX
- libnvvm -> libNVVM
- Optix/optix -> OptiX
- SPIRV -> SPIR-V
- cuBlas/cuRand -> cuBLAS/cuRAND
- i (the pronoun!) -> I
- TLDR -> TL;DR
1 parent 7185fcc commit 45d560c

File tree

14 files changed: +77 −77 lines changed


guide/src/cuda/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -2,7 +2,7 @@

 The CUDA Toolkit is an ecosystem for executing extremely fast code on NVIDIA GPUs for the purpose of general computing.

-CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libnvvm, etc. CUDA
+CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libNVVM, etc. CUDA
 is currently the best option for computing in terms of libraries and control available, however, it unfortunately only works
 on NVIDIA GPUs.
```

guide/src/cuda/gpu_computing.md

Lines changed: 8 additions & 8 deletions
```diff
@@ -13,13 +13,13 @@ of time and/or take different code paths.

 CUDA is currently one of the best choices for fast GPU computing for multiple reasons:
 - It offers deep control over how kernels are dispatched and how memory is managed.
-- It has a rich ecosystem of tutorials, guides, and libraries such as cuRand, cuBlas, libnvvm, optix, the PTX ISA, etc.
+- It has a rich ecosystem of tutorials, guides, and libraries such as cuRAND, cuBLAS, libNVVM, OptiX, the PTX ISA, etc.
 - It is mostly unmatched in performance because it is solely meant for computing and offers rich control.
 And more...

-However, CUDA can only run on NVIDIA GPUs, which precludes AMD gpus from tools that use it. However, this is a drawback that
-is acceptable by many because of the significant developer cost of supporting both NVIDIA gpus with CUDA and
-AMD gpus with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.
+However, CUDA can only run on NVIDIA GPUs, which precludes AMD GPUs from tools that use it. However, this is a drawback that
+is acceptable by many because of the significant developer cost of supporting both NVIDIA GPUs with CUDA and
+AMD GPUs with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.

 # Why Rust?

@@ -28,22 +28,22 @@ accomplish; The initial hurdle of getting Rust to compile to something CUDA can
 polish part.

 On top of its rich language features (macros, enums, traits, proc macros, great errors, etc), Rust's safety guarantees
-can be applied in gpu programming too; A field that has historically been full of implied invariants and unsafety, such
+can be applied in GPU programming too; A field that has historically been full of implied invariants and unsafety, such
 as (but not limited to):
 - Expecting some amount of dynamic shared memory from the caller.
 - Expecting a certain layout for thread blocks/threads.
 - Manually handling the indexing of data, leaving code prone to data races if not managed correctly.
 - Forgetting to free memory, using uninitialized memory, etc.

-Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of gpu kernel libraries easily possible.
+Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of GPU kernel libraries easily possible.
 Most of the reasons for using rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so
-i will not repeat them here.
+I will not repeat them here.

 A couple of particular rust features make writing CUDA code much easier: RAII and Results.
 In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory and returning handles, which
 frees users from having to think about that, which yields safer, more reliable code.

-Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result.
+Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a CUDA result.
 Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose,
 both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes
 dependency issues down the line.
```
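For context on the RAII and Results discussion above, a minimal host-side sketch with cust might look like this (illustrative only; it assumes cust's `quick_init`, `DeviceBuffer`, and `CudaResult` APIs as of this commit):

```rust
use cust::prelude::*;
use cust::error::CudaResult;

fn main() -> CudaResult<()> {
    // quick_init creates a CUDA context; dropping `_ctx` releases it (RAII),
    // so no manual cleanup call is needed.
    let _ctx = cust::quick_init()?;

    // Device memory allocated here is freed by DeviceBuffer's Drop impl.
    let buf = DeviceBuffer::from_slice(&[1.0f32, 2.0, 3.0])?;

    // Every fallible call returns a CudaResult, so status codes must be
    // handled or propagated with `?` rather than silently ignored.
    let mut host = [0.0f32; 3];
    buf.copy_to(&mut host[..])?;
    println!("{:?}", host);
    Ok(())
}
```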

guide/src/cuda/pipeline.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -19,13 +19,13 @@ with additional restrictions including the following.
 - Some linkage types are not supported.
 - Function ABIs are ignored; everything uses the PTX calling convention.

-libnvvm is a closed source library which takes NVVM IR, optimizes it further, then converts it to
+libNVVM is a closed source library which takes NVVM IR, optimizes it further, then converts it to
 PTX. PTX is a low level, assembly-like format with an open specification which can be targeted by
 any language. For an assembly format, PTX is fairly user-friendly.
 - It is well formatted.
 - It is mostly fully specified (other than the iffy grammar specification).
 - It uses named registers/parameters.
-- It uses virtual registers. (Because gpus have thousands of registers, listing all of them out
+- It uses virtual registers. (Because GPUs have thousands of registers, listing all of them out
 would be unrealistic.)
 - It uses ASCII as a file encoding.
```

guide/src/faq.md

Lines changed: 21 additions & 21 deletions
```diff
@@ -29,10 +29,10 @@ over CUDA C/C++ with the same (or better!) performance and features, therefore,
 Short answer, no.

 Long answer, there are a couple of things that make this impossible:
-- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
-- NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed.
+- At the time of writing, libNVVM expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
+- NVVM IR is a __subset__ of LLVM IR, there are tons of things that NVVM will not accept. Such as a lot of function attrs not being allowed.
 This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention
-many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
+many bugs in libNVVM that I have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
 This required special handling in the codegen to convert these "irregular" types into vector types.

 ## What is the point of using Rust if a lot of things in kernels are unsafe?
@@ -153,13 +153,13 @@ things to gain in terms of safety using Rust.
 The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders:
 - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal.
 - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
-- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by llvm and libnvvm are needed.
-- SPIRV is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed.
-While libnvvm (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
+- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by LLVM and libNVVM are needed.
+- SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed.
+While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
 - rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also
 needs a project 100% focused on computing, and computing only.
-- SPIRV cannot access many useful CUDA libraries such as Optix, cuDNN, cuBLAS, etc.
-- SPIRV debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used
+- SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc.
+- SPIR-V debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used
 for profiling kernels in something like nsight compute.

 Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore
@@ -190,17 +190,17 @@ when it is finished, which causes further uses of CUDA to fail.

 Modules are the second big difference in the driver API. Modules are similar to shared libraries, they
 contain all of the globals and functions (kernels) inside of a PTX/cubin file. The driver API
-is language-agnostic, it purely works off of ptx/cubin files. To answer why this is important we
-need to cover what cubins and ptx files are briefly.
+is language-agnostic, it purely works off PTX/cubin files. To answer why this is important we
+need to cover what cubins and PTX files are briefly.

 PTX is a low level assembly-like language which is the penultimate step before what the GPU actually
 executes. It is human-readable and you can dump it from a CUDA C++ program with `nvcc ./file.cu --ptx`.
 This PTX is then optimized and lowered into a final format called SASS (Source and Assembly) and
 turned into a cubin (CUDA binary) file.

-Driver API modules can be loaded as either ptx, cubin, or fatbin files. If they are loaded as
-ptx then the driver API will JIT compile the PTX to cubin then cache it. You can also
-compile ptx to cubin yourself using ptx-compiler and cache it.
+Driver API modules can be loaded as either PTX, cubin, or fatbin files. If they are loaded as
+PTX then the driver API will JIT compile the PTX to cubin then cache it. You can also
+compile PTX to cubin yourself using ptx-compiler and cache it.

 This pipeline provides much better control over what functions you actually need to load and cache.
 You can separate different functions into different modules you can load dynamically (and even dynamically reload).
```
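As a concrete sketch of the module pipeline described above (the PTX path and kernel name are hypothetical, and `Module::from_ptx` reflects cust's API as of this commit):

```rust
use cust::error::CudaResult;
use cust::module::Module;

// Assumes a CUDA context already exists (e.g. created via cust::quick_init()).
fn load_kernels() -> CudaResult<()> {
    // Loading PTX text: the driver JIT compiles it to SASS for the current
    // GPU and caches the result. Loading a cubin/fatbin instead skips the JIT.
    let ptx = std::fs::read_to_string("kernels.ptx").expect("missing PTX file");
    let module = Module::from_ptx(&ptx, &[])?;

    // Kernels are looked up by name, much like symbols in a shared library.
    let _add = module.get_function("add")?;
    Ok(())
}
```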
```diff
@@ -217,7 +217,7 @@ need to manage many kernels being dispatched at the same time as efficiently as

 ## Why target NVIDIA GPUs only instead of using something that can work on AMD?

-This is a complex issue with many arguments for both sides, so i will give you
+This is a complex issue with many arguments for both sides, so I will give you
 both sides as well as my opinion.

 Pros for using OpenCL over CUDA:
@@ -235,7 +235,7 @@ new features cannot be reliably relied upon because they are unlikely to work on
 - OpenCL can only be written in OpenCL C (based on C99), OpenCL C++ is a thing, but again, not everything
 supports it. This makes complex programs more difficult to create.
 - OpenCL has less tools and libraries.
-- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (ptx)
+- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (PTX)
 and debug info. Essentially how CPU code works. This makes writing language-agnostic things in OpenCL near impossible and
 locks you into using OpenCL C.
 - OpenCL is plagued with serious driver bugs which have not been fixed, or that occur only on certain vendors.
@@ -245,10 +245,10 @@ Pros for using CUDA over OpenCL:
 VFX computing.
 - CUDA is a proprietary tool, meaning that NVIDIA is able to push out bug fixes and features much faster
 than releasing a new spec and waiting for vendors to implement it. This allows for more features being added,
-such as cooperative kernels, cuda graphs, unified memory, new profilers, etc.
+such as cooperative kernels, CUDA graphs, unified memory, new profilers, etc.
 - CUDA is a single entity, meaning that if something does or does not work on one system it is unlikely
 that that will be different on another system. Assuming you are not using different architectures, where
-one gpu may be lacking a feature.
+one GPU may be lacking a feature.
 - CUDA is usually 10-30% faster than OpenCL overall, this is likely due to subpar OpenCL drivers by NVIDIA,
 but it is unlikely this performance gap will change in the near future.
 - CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc.
@@ -264,8 +264,8 @@ Cons for using CUDA over OpenCL:

 # What makes cust and RustaCUDA different?

-Cust is a fork of rustacuda which changes a lot of things inside of it, as well as adds new features that
-are not inside of rustacuda.
+Cust is a fork of RustaCUDA which changes a lot of things inside of it, as well as adds new features that
+are not inside of RustaCUDA.

 The most significant changes (This list is not complete!!) are:
 - Drop code no longer panics on failure to drop raw CUDA handles, this is so that InvalidAddress errors, which cause
@@ -286,8 +286,8 @@ Changes that are currently in progress but not done/experimental:
 - Graphs
 - PTX validation

-Just like rustacuda, cust makes no assumptions of what language was used to generate the ptx/cubin. It could be
+Just like RustaCUDA, cust makes no assumptions of what language was used to generate the PTX/cubin. It could be
 C, C++, futhark, or best of all, Rust!

-Cust's name is literally just rust + cuda mashed together in a horrible way.
+Cust's name is literally just rust + CUDA mashed together in a horrible way.
 Or you can pretend it stands for custard if you really like custard.
```
guide/src/features.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -18,9 +18,9 @@ around to adding it yet.

 | Feature Name | Support Level | Notes |
 | ------------ | ------------- | ----- |
-| Opt-Levels | ✔️ | behaves mostly the same (because llvm is still used for optimizations). Except that libnvvm opts are run on anything except no-opts because nvvm only has -O0 and -O3 |
+| Opt-Levels | ✔️ | behaves mostly the same (because LLVM is still used for optimizations). Except that libNVVM opts are run on anything except no-opts because NVVM only has -O0 and -O3 |
 | codegen-units | ✔️ |
-| LTO || we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libnvvm, so all the benefits of LTO are on without pre-libnvvm LTO being needed. |
+| LTO || we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libNVVM, so all the benefits of LTO are on without pre-libNVVM LTO being needed. |
 | Closures | ✔️ |
 | Enums | ✔️ |
 | Loops | ✔️ |
```

guide/src/guide/compute_capabilities.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -142,7 +142,7 @@ Note: While the 'a' variant enables all these features during compilation (allow

 For more details on suffixes, see [NVIDIA's blog post on family-specific architecture features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/).

-### Manual Compilation (Without CudaBuilder)
+### Manual Compilation (Without `cuda_builder`)

 If you're invoking `rustc` directly instead of using `cuda_builder`, you only need to specify the architecture through LLVM args:
```
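For contrast with manual invocation, the `cuda_builder` route is normally driven from a build script. A minimal sketch (the crate and output paths here are hypothetical):

```rust
// build.rs of the CPU-side crate: compiles the GPU crate to PTX at build time.
fn main() {
    cuda_builder::CudaBuilder::new("../gpu_kernels")
        .copy_to("../resources/gpu_kernels.ptx")
        .build()
        .unwrap();
}
```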

guide/src/guide/getting_started.md

Lines changed: 7 additions & 7 deletions
```diff
@@ -17,9 +17,9 @@ Before you can use the project to write GPU crates, you will need a couple of pr
 - Finally, if neither are present or unusable, it will attempt to download and use prebuilt LLVM. This currently only
 works on Windows however.

-- The OptiX SDK if using the optix library (the pathtracer example uses it for denoising).
+- The OptiX SDK if using the OptiX library (the pathtracer example uses it for denoising).

-- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add libnvvm to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,
+- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add `libnvvm` to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,

 - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config

@@ -102,7 +102,7 @@ Now we can finally start writing an actual GPU kernel.
 Firstly, we must explain a couple of things about GPU kernels, specifically, how they are executed. GPU Kernels (functions) are the entry point for executing anything on the GPU, they are the functions which will be executed from the CPU. GPU kernels do not return anything, they write their data to buffers passed into them.

 CUDA's execution model is very very complex and it is unrealistic to explain all of it in
-this section, but the TLDR of it is that CUDA will execute the GPU kernel once on every
+this section, but the TL;DR of it is that CUDA will execute the GPU kernel once on every
 thread, with the number of threads being decided by the caller (the CPU).

 We call these parameters the launch dimensions of the kernel. Launch dimensions are split
@@ -115,7 +115,7 @@ up into two basic concepts:
 of the current block.

 One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d.
-That is to say, i can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
+That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
 also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it makes
 the 2d/3d index calculations much simpler. CUDA exposes thread and block indices
 for each dimension through special registers. We expose thread index queries through
```
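To ground the execution-model discussion above, a minimal 1-D kernel using cuda_std's index queries looks roughly like this (a sketch in the spirit of the example this chapter builds up to):

```rust
use cuda_std::*;

#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    // Global thread index = block index * block dimension + thread index,
    // computed for us by thread::index_1d().
    let idx = thread::index_1d() as usize;
    // Guard against out-of-range threads; the launch usually rounds the
    // grid size up, so some threads may fall past the end of the data.
    if idx < a.len() {
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}
```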
```diff
@@ -229,7 +229,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p
 **Notes:**

 1. refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project.
-2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
-3. if you have issues within the container, it can help to start ensuring your gpu is recognized
+2. despite using Docker, your machine will still need to be running a compatible driver, in this case for CUDA 11.4.1 it is >=470.57.02
+3. if you have issues within the container, it can help to start ensuring your GPU is recognized
 - ensure `nvidia-smi` provides meaningful output in the container
-- NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your gpu
+- NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your GPU
```

guide/src/guide/safety.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@ Behavior considered undefined inside of GPU kernels:
 undefined on the GPU too. The only exception being invalid sizes for buffers given to a GPU
 kernel.

-Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the
+Currently we declare that the invariant that a buffer given to a GPU kernel must be large enough for any access the
 kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future.

 - Any kind of data race, this has the same semantics as data races in CPU code. Such as:
```
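To illustrate who upholds that buffer-size invariant, here is a hedged host-side sketch using cust's `launch!` macro; the kernel name `add` and the launch shape are hypothetical, and the point is that the caller sizes the output buffer to cover every index the kernel may write:

```rust
use cust::prelude::*;
use cust::error::CudaResult;

fn add_on_gpu(module: &Module, stream: &Stream, a: &[f32], b: &[f32]) -> CudaResult<Vec<f32>> {
    let a_gpu = DeviceBuffer::from_slice(a)?;
    let b_gpu = DeviceBuffer::from_slice(b)?;
    // The caller upholds the invariant: `out_gpu` is as long as the inputs,
    // so every in-bounds access the kernel makes is valid.
    let out_gpu = DeviceBuffer::from_slice(&vec![0.0f32; a.len()])?;

    let block = 128u32;
    let grid = (a.len() as u32 + block - 1) / block; // round up

    let kernel = module.get_function("add")?;
    unsafe {
        launch!(kernel<<<grid, block, 0, stream>>>(
            a_gpu.as_device_ptr(), a_gpu.len(),
            b_gpu.as_device_ptr(), b_gpu.len(),
            out_gpu.as_device_ptr()
        ))?;
    }
    stream.synchronize()?;

    let mut out = vec![0.0f32; a.len()];
    out_gpu.copy_to(&mut out[..])?;
    Ok(out)
}
```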

guide/src/guide/tips.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,5 +10,5 @@ will get much better in the future but currently it will cause some undesirable

 - Don't use recursion, CUDA allows it but threads have very limited stacks (local memory) and stack overflows
 yield confusing `InvalidAddress` errors. If you are getting such an error, run the executable in cuda-memcheck,
-it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the ptx file through
+it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the PTX file through
 `cuobjdump` and it should yield ptxas warnings for functions without a statically known stack usage.
```
