
Commit 45d560c

Use consistent capitalization.
The Guide is currently very inconsistent with capitalization of abbreviations. The general trend is towards lower-case for informal English, but for formal English (such as documentation) I think upper-case is still preferable.

- gpu -> GPU
- cuda/Cuda -> CUDA
- rustacuda -> RustaCUDA
- llvm -> LLVM
- nvvm -> NVVM
- ir -> IR
- ptx -> PTX
- libnvvm -> libNVVM
- Optix/optix -> OptiX
- SPIRV -> SPIR-V
- cuBlas/cuRand -> cuBLAS/cuRAND
- i (the pronoun!) -> I
- TLDR -> TL;DR
1 parent 7185fcc commit 45d560c

File tree

14 files changed: +77 −77 lines changed


guide/src/cuda/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -2,7 +2,7 @@

 The CUDA Toolkit is an ecosystem for executing extremely fast code on NVIDIA GPUs for the purpose of general computing.

-CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libnvvm, etc. CUDA
+CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libNVVM, etc. CUDA
 is currently the best option for computing in terms of libraries and control available, however, it unfortunately only works
 on NVIDIA GPUs.
```

guide/src/cuda/gpu_computing.md

Lines changed: 8 additions & 8 deletions
```diff
@@ -13,13 +13,13 @@ of time and/or take different code paths.

 CUDA is currently one of the best choices for fast GPU computing for multiple reasons:
 - It offers deep control over how kernels are dispatched and how memory is managed.
-- It has a rich ecosystem of tutorials, guides, and libraries such as cuRand, cuBlas, libnvvm, optix, the PTX ISA, etc.
+- It has a rich ecosystem of tutorials, guides, and libraries such as cuRAND, cuBLAS, libNVVM, OptiX, the PTX ISA, etc.
 - It is mostly unmatched in performance because it is solely meant for computing and offers rich control.
 And more...

-However, CUDA can only run on NVIDIA GPUs, which precludes AMD gpus from tools that use it. However, this is a drawback that
-is acceptable by many because of the significant developer cost of supporting both NVIDIA gpus with CUDA and
-AMD gpus with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.
+However, CUDA can only run on NVIDIA GPUs, which precludes AMD GPUs from tools that use it. However, this is a drawback that
+is acceptable by many because of the significant developer cost of supporting both NVIDIA GPUs with CUDA and
+AMD GPUs with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.

 # Why Rust?

@@ -28,22 +28,22 @@ accomplish; The initial hurdle of getting Rust to compile to something CUDA can
 polish part.

 On top of its rich language features (macros, enums, traits, proc macros, great errors, etc), Rust's safety guarantees
-can be applied in gpu programming too; A field that has historically been full of implied invariants and unsafety, such
+can be applied in GPU programming too; A field that has historically been full of implied invariants and unsafety, such
 as (but not limited to):
 - Expecting some amount of dynamic shared memory from the caller.
 - Expecting a certain layout for thread blocks/threads.
 - Manually handling the indexing of data, leaving code prone to data races if not managed correctly.
 - Forgetting to free memory, using uninitialized memory, etc.

-Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of gpu kernel libraries easily possible.
+Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of GPU kernel libraries easily possible.
 Most of the reasons for using rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so
-i will not repeat them here.
+I will not repeat them here.

 A couple of particular rust features make writing CUDA code much easier: RAII and Results.
 In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory and returning handles, which
 frees users from having to think about that, which yields safer, more reliable code.

-Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result.
+Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a CUDA result.
 Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose,
 both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes
 dependency issues down the line.
```
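For context on the RAII and Results discussion above, a minimal host-side sketch with cust might look like this (illustrative only; it assumes cust's `quick_init`, `DeviceBuffer`, and `CudaResult` APIs as of this commit):

```rust
use cust::prelude::*;
use cust::error::CudaResult;

fn main() -> CudaResult<()> {
    // quick_init creates a CUDA context; dropping `_ctx` releases it (RAII),
    // so no manual cleanup call is needed.
    let _ctx = cust::quick_init()?;

    // Device memory allocated here is freed by DeviceBuffer's Drop impl.
    let buf = DeviceBuffer::from_slice(&[1.0f32, 2.0, 3.0])?;

    // Every fallible call returns a CudaResult, so status codes must be
    // handled or propagated with `?` rather than silently ignored.
    let mut host = [0.0f32; 3];
    buf.copy_to(&mut host[..])?;
    println!("{:?}", host);
    Ok(())
}
```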

guide/src/cuda/pipeline.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -19,13 +19,13 @@ with additional restrictions including the following.
 - Some linkage types are not supported.
 - Function ABIs are ignored; everything uses the PTX calling convention.

-libnvvm is a closed source library which takes NVVM IR, optimizes it further, then converts it to
+libNVVM is a closed source library which takes NVVM IR, optimizes it further, then converts it to
 PTX. PTX is a low level, assembly-like format with an open specification which can be targeted by
 any language. For an assembly format, PTX is fairly user-friendly.
 - It is well formatted.
 - It is mostly fully specified (other than the iffy grammar specification).
 - It uses named registers/parameters.
-- It uses virtual registers. (Because gpus have thousands of registers, listing all of them out
+- It uses virtual registers. (Because GPUs have thousands of registers, listing all of them out
 would be unrealistic.)
 - It uses ASCII as a file encoding.
```

guide/src/faq.md

Lines changed: 21 additions & 21 deletions
```diff
@@ -29,10 +29,10 @@ over CUDA C/C++ with the same (or better!) performance and features, therefore,
 Short answer, no.

 Long answer, there are a couple of things that make this impossible:
-- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
-- NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed.
+- At the time of writing, libNVVM expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
+- NVVM IR is a __subset__ of LLVM IR, there are tons of things that NVVM will not accept. Such as a lot of function attrs not being allowed.
 This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention
-many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
+many bugs in libNVVM that I have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
 This required special handling in the codegen to convert these "irregular" types into vector types.

 ## What is the point of using Rust if a lot of things in kernels are unsafe?
@@ -153,13 +153,13 @@ things to gain in terms of safety using Rust.
 The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders:
 - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal.
 - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
-- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by llvm and libnvvm are needed.
-- SPIRV is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed.
-While libnvvm (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
+- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by LLVM and libNVVM are needed.
+- SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed.
+While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified.
 - rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also
 needs a project 100% focused on computing, and computing only.
-- SPIRV cannot access many useful CUDA libraries such as Optix, cuDNN, cuBLAS, etc.
-- SPIRV debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used
+- SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc.
+- SPIR-V debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used
 for profiling kernels in something like nsight compute.

 Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore
@@ -190,17 +190,17 @@ when it is finished, which causes further uses of CUDA to fail.

 Modules are the second big difference in the driver API. Modules are similar to shared libraries, they
 contain all of the globals and functions (kernels) inside of a PTX/cubin file. The driver API
-is language-agnostic, it purely works off of ptx/cubin files. To answer why this is important we
-need to cover what cubins and ptx files are briefly.
+is language-agnostic, it purely works off PTX/cubin files. To answer why this is important we
+need to cover what cubins and PTX files are briefly.

 PTX is a low level assembly-like language which is the penultimate step before what the GPU actually
 executes. It is human-readable and you can dump it from a CUDA C++ program with `nvcc ./file.cu --ptx`.
 This PTX is then optimized and lowered into a final format called SASS (Source and Assembly) and
 turned into a cubin (CUDA binary) file.

-Driver API modules can be loaded as either ptx, cubin, or fatbin files. If they are loaded as
-ptx then the driver API will JIT compile the PTX to cubin then cache it. You can also
-compile ptx to cubin yourself using ptx-compiler and cache it.
+Driver API modules can be loaded as either PTX, cubin, or fatbin files. If they are loaded as
+PTX then the driver API will JIT compile the PTX to cubin then cache it. You can also
+compile PTX to cubin yourself using ptx-compiler and cache it.

 This pipeline provides much better control over what functions you actually need to load and cache.
 You can separate different functions into different modules you can load dynamically (and even dynamically reload).
```
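As a concrete sketch of the module pipeline described above (the PTX path and kernel name are hypothetical, and `Module::from_ptx` reflects cust's API as of this commit):

```rust
use cust::error::CudaResult;
use cust::module::Module;

// Assumes a CUDA context already exists (e.g. created via cust::quick_init()).
fn load_kernels() -> CudaResult<()> {
    // Loading PTX text: the driver JIT compiles it to SASS for the current
    // GPU and caches the result. Loading a cubin/fatbin instead skips the JIT.
    let ptx = std::fs::read_to_string("kernels.ptx").expect("missing PTX file");
    let module = Module::from_ptx(&ptx, &[])?;

    // Kernels are looked up by name, much like symbols in a shared library.
    let _add = module.get_function("add")?;
    Ok(())
}
```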
```diff
@@ -217,7 +217,7 @@ need to manage many kernels being dispatched at the same time as efficiently as

 ## Why target NVIDIA GPUs only instead of using something that can work on AMD?

-This is a complex issue with many arguments for both sides, so i will give you
+This is a complex issue with many arguments for both sides, so I will give you
 both sides as well as my opinion.

 Pros for using OpenCL over CUDA:
@@ -235,7 +235,7 @@ new features cannot be reliably relied upon because they are unlikely to work on
 - OpenCL can only be written in OpenCL C (based on C99), OpenCL C++ is a thing, but again, not everything
 supports it. This makes complex programs more difficult to create.
 - OpenCL has less tools and libraries.
-- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (ptx)
+- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (PTX)
 and debug info. Essentially how CPU code works. This makes writing language-agnostic things in OpenCL near impossible and
 locks you into using OpenCL C.
 - OpenCL is plagued with serious driver bugs which have not been fixed, or that occur only on certain vendors.
@@ -245,10 +245,10 @@ Pros for using CUDA over OpenCL:
 VFX computing.
 - CUDA is a proprietary tool, meaning that NVIDIA is able to push out bug fixes and features much faster
 than releasing a new spec and waiting for vendors to implement it. This allows for more features being added,
-such as cooperative kernels, cuda graphs, unified memory, new profilers, etc.
+such as cooperative kernels, CUDA graphs, unified memory, new profilers, etc.
 - CUDA is a single entity, meaning that if something does or does not work on one system it is unlikely
 that that will be different on another system. Assuming you are not using different architectures, where
-one gpu may be lacking a feature.
+one GPU may be lacking a feature.
 - CUDA is usually 10-30% faster than OpenCL overall, this is likely due to subpar OpenCL drivers by NVIDIA,
 but it is unlikely this performance gap will change in the near future.
 - CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc.
@@ -264,8 +264,8 @@ Cons for using CUDA over OpenCL:

 # What makes cust and RustaCUDA different?

-Cust is a fork of rustacuda which changes a lot of things inside of it, as well as adds new features that
-are not inside of rustacuda.
+Cust is a fork of RustaCUDA which changes a lot of things inside of it, as well as adds new features that
+are not inside of RustaCUDA.

 The most significant changes (This list is not complete!!) are:
 - Drop code no longer panics on failure to drop raw CUDA handles, this is so that InvalidAddress errors, which cause
@@ -286,8 +286,8 @@ Changes that are currently in progress but not done/experimental:
 - Graphs
 - PTX validation

-Just like rustacuda, cust makes no assumptions of what language was used to generate the ptx/cubin. It could be
+Just like RustaCUDA, cust makes no assumptions of what language was used to generate the PTX/cubin. It could be
 C, C++, futhark, or best of all, Rust!

-Cust's name is literally just rust + cuda mashed together in a horrible way.
+Cust's name is literally just rust + CUDA mashed together in a horrible way.
 Or you can pretend it stands for custard if you really like custard.
```
guide/src/features.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -18,9 +18,9 @@ around to adding it yet.

 | Feature Name | Support Level | Notes |
 | ------------ | ------------- | ----- |
-| Opt-Levels | ✔️ | behaves mostly the same (because llvm is still used for optimizations). Except that libnvvm opts are run on anything except no-opts because nvvm only has -O0 and -O3 |
+| Opt-Levels | ✔️ | behaves mostly the same (because LLVM is still used for optimizations). Except that libNVVM opts are run on anything except no-opts because NVVM only has -O0 and -O3 |
 | codegen-units | ✔️ |
-| LTO || we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libnvvm, so all the benefits of LTO are on without pre-libnvvm LTO being needed. |
+| LTO || we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libNVVM, so all the benefits of LTO are on without pre-libNVVM LTO being needed. |
 | Closures | ✔️ |
 | Enums | ✔️ |
 | Loops | ✔️ |
```

guide/src/guide/compute_capabilities.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -142,7 +142,7 @@ Note: While the 'a' variant enables all these features during compilation (allow

 For more details on suffixes, see [NVIDIA's blog post on family-specific architecture features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/).

-### Manual Compilation (Without CudaBuilder)
+### Manual Compilation (Without `cuda_builder`)

 If you're invoking `rustc` directly instead of using `cuda_builder`, you only need to specify the architecture through LLVM args:
```
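For contrast with manual invocation, the `cuda_builder` route is normally driven from a build script. A minimal sketch (the crate and output paths here are hypothetical):

```rust
// build.rs of the CPU-side crate: compiles the GPU crate to PTX at build time.
fn main() {
    cuda_builder::CudaBuilder::new("../gpu_kernels")
        .copy_to("../resources/gpu_kernels.ptx")
        .build()
        .unwrap();
}
```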

guide/src/guide/getting_started.md

Lines changed: 7 additions & 7 deletions
```diff
@@ -17,9 +17,9 @@ Before you can use the project to write GPU crates, you will need a couple of pr
 - Finally, if neither are present or unusable, it will attempt to download and use prebuilt LLVM. This currently only
 works on Windows however.

-- The OptiX SDK if using the optix library (the pathtracer example uses it for denoising).
+- The OptiX SDK if using the OptiX library (the pathtracer example uses it for denoising).

-- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add libnvvm to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,
+- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add `libnvvm` to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,

 - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config

@@ -102,7 +102,7 @@ Now we can finally start writing an actual GPU kernel.
 Firstly, we must explain a couple of things about GPU kernels, specifically, how they are executed. GPU Kernels (functions) are the entry point for executing anything on the GPU, they are the functions which will be executed from the CPU. GPU kernels do not return anything, they write their data to buffers passed into them.

 CUDA's execution model is very very complex and it is unrealistic to explain all of it in
-this section, but the TLDR of it is that CUDA will execute the GPU kernel once on every
+this section, but the TL;DR of it is that CUDA will execute the GPU kernel once on every
 thread, with the number of threads being decided by the caller (the CPU).

 We call these parameters the launch dimensions of the kernel. Launch dimensions are split
@@ -115,7 +115,7 @@ up into two basic concepts:
 of the current block.

 One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d.
-That is to say, i can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
+That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
 also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it makes
 the 2d/3d index calculations much simpler. CUDA exposes thread and block indices
 for each dimension through special registers. We expose thread index queries through
```
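To ground the execution-model discussion above, a minimal 1-D kernel using cuda_std's index queries looks roughly like this (a sketch in the spirit of the example this chapter builds up to):

```rust
use cuda_std::*;

#[kernel]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    // Global thread index = block index * block dimension + thread index,
    // computed for us by thread::index_1d().
    let idx = thread::index_1d() as usize;
    // Guard against out-of-range threads; the launch usually rounds the
    // grid size up, so some threads may fall past the end of the data.
    if idx < a.len() {
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];
    }
}
```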
```diff
@@ -229,7 +229,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p
 **Notes:**

 1. refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project.
-2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
-3. if you have issues within the container, it can help to start ensuring your gpu is recognized
+2. despite using Docker, your machine will still need to be running a compatible driver, in this case for CUDA 11.4.1 it is >=470.57.02
+3. if you have issues within the container, it can help to start ensuring your GPU is recognized
 - ensure `nvidia-smi` provides meaningful output in the container
-- NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your gpu
+- NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your GPU
```

guide/src/guide/safety.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@ Behavior considered undefined inside of GPU kernels:
 undefined on the GPU too. The only exception being invalid sizes for buffers given to a GPU
 kernel.

-Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the
+Currently we declare that the invariant that a buffer given to a GPU kernel must be large enough for any access the
 kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future.

 - Any kind of data race, this has the same semantics as data races in CPU code. Such as:
```
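To illustrate who upholds that buffer-size invariant, here is a hedged host-side sketch using cust's `launch!` macro; the kernel name `add` and the launch shape are hypothetical, and the point is that the caller sizes the output buffer to cover every index the kernel may write:

```rust
use cust::prelude::*;
use cust::error::CudaResult;

fn add_on_gpu(module: &Module, stream: &Stream, a: &[f32], b: &[f32]) -> CudaResult<Vec<f32>> {
    let a_gpu = DeviceBuffer::from_slice(a)?;
    let b_gpu = DeviceBuffer::from_slice(b)?;
    // The caller upholds the invariant: `out_gpu` is as long as the inputs,
    // so every in-bounds access the kernel makes is valid.
    let out_gpu = DeviceBuffer::from_slice(&vec![0.0f32; a.len()])?;

    let block = 128u32;
    let grid = (a.len() as u32 + block - 1) / block; // round up

    let kernel = module.get_function("add")?;
    unsafe {
        launch!(kernel<<<grid, block, 0, stream>>>(
            a_gpu.as_device_ptr(), a_gpu.len(),
            b_gpu.as_device_ptr(), b_gpu.len(),
            out_gpu.as_device_ptr()
        ))?;
    }
    stream.synchronize()?;

    let mut out = vec![0.0f32; a.len()];
    out_gpu.copy_to(&mut out[..])?;
    Ok(out)
}
```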

guide/src/guide/tips.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,5 +10,5 @@ will get much better in the future but currently it will cause some undesirable

 - Don't use recursion, CUDA allows it but threads have very limited stacks (local memory) and stack overflows
 yield confusing `InvalidAddress` errors. If you are getting such an error, run the executable in cuda-memcheck,
-it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the ptx file through
+it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the PTX file through
 `cuobjdump` and it should yield ptxas warnings for functions without a statically known stack usage.
```
