The Guide is currently very inconsistent in its capitalization of abbreviations.
The general trend is towards lower-case in informal English, but for formal
English (such as documentation) I think upper-case is still preferable.
- gpu -> GPU
- cuda/Cuda -> CUDA
- rustacuda -> RustaCUDA
- llvm -> LLVM
- nvvm -> NVVM
- ir -> IR
- ptx -> PTX
- libnvvm -> libNVVM
- Optix/optix -> OptiX
- SPIRV -> SPIR-V
- cuBlas/cuRand -> cuBLAS/cuRAND
- i (the pronoun!) -> I
- TLDR -> TL;DR
guide/src/faq.md (21 additions & 21 deletions)
```diff
@@ -29,10 +29,10 @@ over CUDA C/C++ with the same (or better!) performance and features, therefore,
 Short answer, no.
 
 Long answer, there are a couple of things that make this impossible:
-- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
-- NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed.
+- At the time of writing, libNVVM expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
+- NVVM IR is a __subset__ of LLVM IR, there are tons of things that NVVM will not accept. Such as a lot of function attrs not being allowed.
 This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention
-many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
+many bugs in libNVVM that I have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
 This required special handling in the codegen to convert these "irregular" types into vector types.
 
 ## What is the point of using Rust if a lot of things in kernels are unsafe?
```
```diff
@@ -153,13 +153,13 @@ things to gain in terms of safety using Rust.
 The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders:
 - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal.
 - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
-- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by llvm and libnvvm are needed.
-- SPIRV is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed.
-While libnvvm (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
+- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by LLVM and libNVVM are needed.
+- SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed.
+While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
 - rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also
 needs a project 100% focused on computing, and computing only.
-- SPIRV cannot access many useful CUDA libraries such as Optix, cuDNN, cuBLAS, etc.
-- SPIRV debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used
+- SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc.
+- SPIR-V debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used
 for profiling kernels in something like nsight compute.
 
 Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore
```
```diff
@@ -190,17 +190,17 @@ when it is finished, which causes further uses of CUDA to fail.
 
 Modules are the second big difference in the driver API. Modules are similar to shared libraries, they
 contain all of the globals and functions (kernels) inside of a PTX/cubin file. The driver API
-is language-agnostic, it purely works off of ptx/cubin files. To answer why this is important we
-need to cover what cubins and ptx files are briefly.
+is language-agnostic, it purely works off PTX/cubin files. To answer why this is important we
+need to cover what cubins and PTX files are briefly.
 
 PTX is a low level assembly-like language which is the penultimate step before what the GPU actually
 executes. It is human-readable and you can dump it from a CUDA C++ program with `nvcc ./file.cu --ptx`.
 This PTX is then optimized and lowered into a final format called SASS (Source and Assembly) and
 turned into a cubin (CUDA binary) file.
 
-Driver API modules can be loaded as either ptx, cubin, or fatbin files. If they are loaded as
-ptx then the driver API will JIT compile the PTX to cubin then cache it. You can also
-compile ptx to cubin yourself using ptx-compiler and cache it.
+Driver API modules can be loaded as either PTX, cubin, or fatbin files. If they are loaded as
+PTX then the driver API will JIT compile the PTX to cubin then cache it. You can also
+compile PTX to cubin yourself using ptx-compiler and cache it.
 
 This pipeline provides much better control over what functions you actually need to load and cache.
 You can separate different functions into different modules you can load dynamically (and even dynamically reload).
```
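For a concrete picture of that module workflow, here is a minimal host-side sketch using cust's driver API wrappers. This assumes cust's `Module::from_ptx` constructor; the PTX path and kernel name are made up for illustration:

```rust
use cust::module::Module;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Initialize CUDA and create a context on the first device.
    let _ctx = cust::quick_init()?;

    // Load a module from PTX text; the driver JIT compiles it to cubin
    // and caches the result. "kernels.ptx" is a hypothetical artifact
    // produced by building a GPU crate.
    let module = Module::from_ptx(include_str!("../resources/kernels.ptx"), &[])?;

    // Look up a kernel (function) inside the module by name.
    let _kernel = module.get_function("add")?;

    Ok(())
}
```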
```diff
@@ -217,7 +217,7 @@ need to manage many kernels being dispatched at the same time as efficiently as
 
 ## Why target NVIDIA GPUs only instead of using something that can work on AMD?
 
-This is a complex issue with many arguments for both sides, so i will give you
+This is a complex issue with many arguments for both sides, so I will give you
 both sides as well as my opinion.
 
 Pros for using OpenCL over CUDA:
```
```diff
@@ -235,7 +235,7 @@ new features cannot be reliably relied upon because they are unlikely to work on
 - OpenCL can only be written in OpenCL C (based on C99), OpenCL C++ is a thing, but again, not everything
 supports it. This makes complex programs more difficult to create.
 - OpenCL has less tools and libraries.
-- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (ptx)
+- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (PTX)
 and debug info. Essentially how CPU code works. This makes writing language-agnostic things in OpenCL near impossible and
 locks you into using OpenCL C.
 - OpenCL is plagued with serious driver bugs which have not been fixed, or that occur only on certain vendors.
```
```diff
@@ -245,10 +245,10 @@ Pros for using CUDA over OpenCL:
 VFX computing.
 - CUDA is a proprietary tool, meaning that NVIDIA is able to push out bug fixes and features much faster
 than releasing a new spec and waiting for vendors to implement it. This allows for more features being added,
-such as cooperative kernels, cuda graphs, unified memory, new profilers, etc.
+such as cooperative kernels, CUDA graphs, unified memory, new profilers, etc.
 - CUDA is a single entity, meaning that if something does or does not work on one system it is unlikely
 that that will be different on another system. Assuming you are not using different architectures, where
-one gpu may be lacking a feature.
+one GPU may be lacking a feature.
 - CUDA is usually 10-30% faster than OpenCL overall, this is likely due to subpar OpenCL drivers by NVIDIA,
 but it is unlikely this performance gap will change in the near future.
 - CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc.
```
```diff
@@ -264,8 +264,8 @@ Cons for using CUDA over OpenCL:
 
 # What makes cust and RustaCUDA different?
 
-Cust is a fork of rustacuda which changes a lot of things inside of it, as well as adds new features that
-are not inside of rustacuda.
+Cust is a fork of RustaCUDA which changes a lot of things inside of it, as well as adds new features that
+are not inside of RustaCUDA.
 
 The most significant changes (This list is not complete!!) are:
 - Drop code no longer panics on failure to drop raw CUDA handles, this is so that InvalidAddress errors, which cause
```
```diff
@@ -286,8 +286,8 @@ Changes that are currently in progress but not done/experimental:
 - Graphs
 - PTX validation
 
-Just like rustacuda, cust makes no assumptions of what language was used to generate the ptx/cubin. It could be
+Just like RustaCUDA, cust makes no assumptions of what language was used to generate the PTX/cubin. It could be
 C, C++, futhark, or best of all, Rust!
 
-Cust's name is literally just rust + cuda mashed together in a horrible way.
+Cust's name is literally just rust + CUDA mashed together in a horrible way.
 Or you can pretend it stands for custard if you really like custard.
```
guide/src/features.md (2 additions & 2 deletions)
```diff
@@ -18,9 +18,9 @@ around to adding it yet.
 
 | Feature Name | Support Level | Notes |
 | ------------ | ------------- | ----- |
-| Opt-Levels | ✔️ | behaves mostly the same (because llvm is still used for optimizations). Except that libnvvm opts are run on anything except no-opts because nvvm only has -O0 and -O3 |
+| Opt-Levels | ✔️ | behaves mostly the same (because LLVM is still used for optimizations). Except that libNVVM opts are run on anything except no-opts because NVVM only has -O0 and -O3 |
 | codegen-units | ✔️ |
-| LTO | ➖ | we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libnvvm, so all the benefits of LTO are on without pre-libnvvm LTO being needed. |
+| LTO | ➖ | we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libNVVM, so all the benefits of LTO are on without pre-libNVVM LTO being needed. |
```
guide/src/guide/compute_capabilities.md (1 addition & 1 deletion)
```diff
@@ -142,7 +142,7 @@ Note: While the 'a' variant enables all these features during compilation (allow
 
 For more details on suffixes, see [NVIDIA's blog post on family-specific architecture features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/).
 
-### Manual Compilation (Without CudaBuilder)
+### Manual Compilation (Without `cuda_builder`)
 
 If you're invoking `rustc` directly instead of using `cuda_builder`, you only need to specify the architecture through LLVM args:
```
guide/src/guide/getting_started.md (7 additions & 7 deletions)
```diff
@@ -17,9 +17,9 @@ Before you can use the project to write GPU crates, you will need a couple of pr
 - Finally, if neither are present or unusable, it will attempt to download and use prebuilt LLVM. This currently only
 works on Windows however.
 
-- The OptiX SDK if using the optix library (the pathtracer example uses it for denoising).
+- The OptiX SDK if using the OptiX library (the pathtracer example uses it for denoising).
 
-- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add libnvvm to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,
+- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add `libnvvm` to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,
 
 - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config
 
```
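For reference on how these prerequisites come together, GPU crates are typically compiled to PTX from a `build.rs` using `cuda_builder`, in the shape the project's own examples use; the crate and output paths below are placeholders:

```rust
use cuda_builder::CudaBuilder;

fn main() {
    // Compile the GPU crate at the given path to PTX and copy the
    // resulting file to where the CPU crate can embed or load it.
    CudaBuilder::new("../gpu_kernels")
        .copy_to("../resources/kernels.ptx")
        .build()
        .unwrap();
}
```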
```diff
@@ -102,7 +102,7 @@ Now we can finally start writing an actual GPU kernel.
 Firstly, we must explain a couple of things about GPU kernels, specifically, how they are executed. GPU Kernels (functions) are the entry point for executing anything on the GPU, they are the functions which will be executed from the CPU. GPU kernels do not return anything, they write their data to buffers passed into them.
 
 CUDA's execution model is very very complex and it is unrealistic to explain all of it in
-this section, but the TLDR of it is that CUDA will execute the GPU kernel once on every
+this section, but the TL;DR of it is that CUDA will execute the GPU kernel once on every
 thread, with the number of threads being decided by the caller (the CPU).
 
 We call these parameters the launch dimensions of the kernel. Launch dimensions are split
```
```diff
@@ -115,7 +115,7 @@ up into two basic concepts:
 of the current block.
 
 One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d.
-That is to say, i can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
+That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
 also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it makes
 the 2d/3d index calculations much simpler. CUDA exposes thread and block indices
 for each dimension through special registers. We expose thread index queries through
```
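To make the indexing above concrete, here is a minimal 1d element-wise kernel in the style of the project's examples, assuming `cuda_std`'s `thread::index_1d` helper:

```rust
use cuda_std::*;

// Each thread computes one output element; index_1d combines the block
// and thread indices into a flat global index.
#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    let idx = thread::index_1d() as usize;
    // Guard against threads past the end of the buffers, since the grid
    // is usually rounded up to a multiple of the block size.
    if idx < a.len() {
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}
```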
```diff
@@ -229,7 +229,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p
 **Notes:**
 
 1. refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project.
-2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
-3. if you have issues within the container, it can help to start ensuring your gpu is recognized
+2. despite using Docker, your machine will still need to be running a compatible driver, in this case for CUDA 11.4.1 it is >=470.57.02
+3. if you have issues within the container, it can help to start ensuring your GPU is recognized
    - ensure `nvidia-smi` provides meaningful output in the container
-   - NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your gpu
+   - NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your GPU
```