diff --git a/guide/book.toml b/guide/book.toml index 2bca2762..f45759fc 100644 --- a/guide/book.toml +++ b/guide/book.toml @@ -2,5 +2,5 @@ authors = ["Riccardo D'Ambrosio"] language = "en" src = "src" -title = "GPU Computing with Rust using CUDA" -description = "Writing extremely fast GPU Computing code with rust using rustc_codegen_nvvm and CUDA" +title = "The Rust CUDA Guide" +description = "How to write GPU compute code with Rust using rustc_codegen_nvvm and CUDA" diff --git a/guide/src/README.md b/guide/src/README.md index a74826c2..c2bce261 100644 --- a/guide/src/README.md +++ b/guide/src/README.md @@ -1,3 +1,3 @@ # Introduction -Welcome to the rust-cuda guide! Let's dive right in. +Welcome to the Rust CUDA guide! Let's dive right in. diff --git a/guide/src/SUMMARY.md b/guide/src/SUMMARY.md index a93a363d..21cd8d26 100644 --- a/guide/src/SUMMARY.md +++ b/guide/src/SUMMARY.md @@ -12,9 +12,9 @@ - [The CUDA Toolkit](cuda/README.md) - [GPU Computing](cuda/gpu_computing.md) - [The CUDA Pipeline](cuda/pipeline.md) -- [rustc_codegen_nvvm](nvvm/README.md) - - [Custom Rustc Backends](nvvm/backends.md) - - [rustc_codegen_nvvm](nvvm/nvvm.md) +- [`rustc_codegen_nvvm`](nvvm/README.md) + - [Custom rustc Backends](nvvm/backends.md) + - [`rustc_codegen_nvvm`](nvvm/nvvm.md) - [Types](nvvm/types.md) - [PTX Generation](nvvm/ptxgen.md) - [Debugging](nvvm/debugging.md) diff --git a/guide/src/cuda/README.md b/guide/src/cuda/README.md index f8608e1c..4a8243d7 100644 --- a/guide/src/cuda/README.md +++ b/guide/src/cuda/README.md @@ -2,7 +2,7 @@ The CUDA Toolkit is an ecosystem for executing extremely fast code on NVIDIA GPUs for the purpose of general computing. -CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libnvvm, etc. CUDA +CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libNVVM, etc. 
CUDA is currently the best option for computing in terms of libraries and control available, however, it unfortunately only works on NVIDIA GPUs. diff --git a/guide/src/cuda/gpu_computing.md b/guide/src/cuda/gpu_computing.md index eba5c508..9079dde1 100644 --- a/guide/src/cuda/gpu_computing.md +++ b/guide/src/cuda/gpu_computing.md @@ -1,4 +1,4 @@ -# GPU Computing +# GPU computing You probably already know what GPU computing is, but if you don't, it is utilizing the extremely parallel nature of GPUs for purposes other than rendering. It is widely used in many scientific and consumer fields. @@ -13,41 +13,41 @@ of time and/or take different code paths. CUDA is currently one of the best choices for fast GPU computing for multiple reasons: - It offers deep control over how kernels are dispatched and how memory is managed. -- It has a rich ecosystem of tutorials, guides, and libraries such as cuRand, cuBlas, libnvvm, optix, the PTX ISA, etc. +- It has a rich ecosystem of tutorials, guides, and libraries such as cuRAND, cuBLAS, libNVVM, OptiX, the PTX ISA, etc. - It is mostly unmatched in performance because it is solely meant for computing and offers rich control. And more... -However, CUDA can only run on NVIDIA GPUs, which precludes AMD gpus from tools that use it. However, this is a drawback that -is acceptable by many because of the significant developer cost of supporting both NVIDIA gpus with CUDA and -AMD gpus with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA. +However, CUDA can only run on NVIDIA GPUs, which precludes AMD GPUs from tools that use it. However, this is a drawback that +is acceptable by many because of the significant developer cost of supporting both NVIDIA GPUs with CUDA and +AMD GPUs with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA. # Why Rust? 
-Rust is a great choice for GPU programming, however, it has needed a kickstart, which is what rustc_codegen_nvvm tries to +Rust is a great choice for GPU programming, however, it has needed a kickstart, which is what `rustc_codegen_nvvm` tries to accomplish; The initial hurdle of getting Rust to compile to something CUDA can run is over, now comes the design and polish part. On top of its rich language features (macros, enums, traits, proc macros, great errors, etc), Rust's safety guarantees -can be applied in gpu programming too; A field that has historically been full of implied invariants and unsafety, such +can be applied in GPU programming too, a field that has historically been full of implied invariants and unsafety, such as (but not limited to): - Expecting some amount of dynamic shared memory from the caller. - Expecting a certain layout for thread blocks/threads. - Manually handling the indexing of data, leaving code prone to data races if not managed correctly. - Forgetting to free memory, using uninitialized memory, etc. -Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of gpu kernel libraries easily possible. -Most of the reasons for using rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so -i will not repeat them here. +Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of GPU kernel libraries easily possible. +Most of the reasons for using Rust on the CPU apply to using Rust for the GPU; these reasons have been stated countless times so +I will not repeat them here. -A couple of particular rust features make writing CUDA code much easier: RAII and Results. +A couple of particular Rust features make writing CUDA code much easier: RAII and Results.
In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory and returning handles, which frees users from having to think about that, which yields safer, more reliable code. -Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result. +Results are particularly helpful: almost every single call in every CUDA library returns a status code in the form of a CUDA result. Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose, both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes dependency issues down the line. -Instead of an unreliable system of macros, we can leverage rust results for this. In cust we return special `CudaResult` -results that can be bubbled up using rust's `?` operator, or, similar to `CUDA_SAFE_CALL` can be unwrapped or expected if +Instead of an unreliable system of macros, we can leverage Rust results for this. In `cust` we return special `CudaResult` +results that can be bubbled up using Rust's `?` operator, or, similar to `CUDA_SAFE_CALL` can be unwrapped or expected if proper error handling is not needed. diff --git a/guide/src/cuda/pipeline.md b/guide/src/cuda/pipeline.md index 0e3481b1..785fd7bd 100644 --- a/guide/src/cuda/pipeline.md +++ b/guide/src/cuda/pipeline.md @@ -1,4 +1,4 @@ -# The CUDA Pipeline +# The CUDA pipeline CUDA is traditionally used via CUDA C/C++ files which have a `.cu` extension. These files can be compiled using NVCC (NVIDIA CUDA Compiler) into an executable. @@ -19,13 +19,13 @@ with additional restrictions including the following. - Some linkage types are not supported. - Function ABIs are ignored; everything uses the PTX calling convention.
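The `CudaResult` pattern described in the previous section can be sketched in plain Rust. This is a self-contained stand-in for illustration only: the two-variant error enum and the fake allocation call are hypothetical, while cust's real `CudaError` mirrors CUDA's status codes and has many more variants.

```rust
// Simplified stand-in for cust's error handling; `CudaError` here is a
// hypothetical two-variant enum, not cust's real (much larger) one.
#[derive(Debug, PartialEq)]
enum CudaError {
    InvalidValue,
}

type CudaResult<T> = Result<T, CudaError>;

// Stand-in for an API call that, like every CUDA call, returns a status.
fn device_alloc(bytes: usize) -> CudaResult<u64> {
    if bytes == 0 {
        return Err(CudaError::InvalidValue);
    }
    Ok(0x1000) // pretend device pointer
}

fn launch_pipeline() -> CudaResult<()> {
    // `?` bubbles the status up to the caller instead of silently dropping
    // it, the Rust analogue of wrapping every call in CUDA_SAFE_CALL.
    let _buf = device_alloc(1024)?;
    Ok(())
}

fn main() {
    assert!(launch_pipeline().is_ok());
    assert_eq!(device_alloc(0), Err(CudaError::InvalidValue));
}
```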
-libnvvm is a closed source library which takes NVVM IR, optimizes it further, then converts it to +libNVVM is a closed source library which takes NVVM IR, optimizes it further, then converts it to PTX. PTX is a low level, assembly-like format with an open specification which can be targeted by any language. For an assembly format, PTX is fairly user-friendly. - It is well formatted. - It is mostly fully specified (other than the iffy grammar specification). - It uses named registers/parameters. -- It uses virtual registers. (Because gpus have thousands of registers, listing all of them out +- It uses virtual registers. (Because GPUs have thousands of registers, listing all of them out would be unrealistic.) - It uses ASCII as a file encoding. diff --git a/guide/src/faq.md b/guide/src/faq.md index 5197fd01..dfd94780 100644 --- a/guide/src/faq.md +++ b/guide/src/faq.md @@ -1,4 +1,4 @@ -# Frequently Asked Questions +# Frequently asked questions This page will cover a lot of the questions people often have when they encounter this project, so they are addressed all at once. @@ -14,14 +14,14 @@ This can be circumvented by building LLVM in a special way, but this is far beyo which yield considerable performance differences (especially on more complex kernels with more information in the IR). - For some reason (either rustc giving weird LLVM IR or the LLVM PTX backend being broken) the LLVM PTX backend often generates completely invalid PTX for trivial programs, so it is not an acceptable workflow for a production pipeline. 
-- GPU and CPU codegen is fundamentally different, creating a codegen that is only for the GPU allows us to -seamlessly implement features which would have been impossible or very difficult to implement in the existing codegen, such as: +- GPU and CPU codegen is fundamentally different; creating a codegen backend that is only for the GPU allows us to +seamlessly implement features which would have been impossible or very difficult to implement in the existing codegen backend, such as: - Shared memory, this requires some special generation of globals with custom addrspaces, its just not possible to do without backend explicit handling. - Custom linking logic to do dead code elimination so as to not end up with large PTX files full of dead functions/globals. - Stripping away everything we do not need, no complex ABI handling, no shared lib handling, control over how function calls are generated, etc. So overall, the LLVM PTX backend is fit for smaller kernels/projects/proofs of concept. -It is however not fit for compiling an entire language (core is __very__ big) with dependencies and more. The end goal is for rust to be able to be used +It is however not fit for compiling an entire language (core is __very__ big) with dependencies and more. The end goal is for Rust to be used over CUDA C/C++ with the same (or better!) performance and features, therefore, we must take advantage of all optimizations NVCC has over us. ## If NVVM IR is a subset of LLVM IR, can we not give rustc-generated LLVM IR to NVVM? Short answer, no. Long answer, there are a couple of things that make this impossible: -- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work. -- NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept.
Such as a lot of function attrs not being allowed. +- At the time of writing, libNVVM expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from a later LLVM version (which is what rustc uses) does not work. +- NVVM IR is a __subset__ of LLVM IR, there are tons of things that NVVM will not accept, such as a lot of function attrs not being allowed. This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention +many bugs in libNVVM that I have found along the way, the most infuriating of which is NVVM not accepting integer types that aren't `i1, i8, i16, i32, or i64`. +This required special handling in the codegen backend to convert these "irregular" types into vector types. ## What is the point of using Rust if a lot of things in kernels are unsafe? @@ -117,22 +117,22 @@ no control over it and no 100% reliable way to fix it, therefore we must shift t Moreover, the CUDA GPU kernel model is entirely based on trust, trusting each thread to index into the correct place in buffers, trusting the caller of the kernel to uphold some dimension invariants, etc. This is once again, completely incompatible with how -rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but +Rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but indexing in complex ways is a core operation in CUDA and it is impossible for us to prove that whatever the developer is doing is correct. Finally, We would love to be able to use mut refs in kernel parameters, but this is would be unsound.
Because each kernel function is *technically* called multiple times in parallel with the same parameters, we would be -aliasing the mutable ref, which Rustc declares as unsound (aliasing mechanics). So raw pointers or slightly-less-unsafe +aliasing the mutable ref, which rustc declares as unsound (aliasing mechanics). So raw pointers or slightly-less-unsafe wrappers need to be used. However, they are usually only used for the initial buffer indexing, after which you can turn them into a mutable reference just fine (because you indexed in a way where no other thread will index that element). Also note that shared refs can be used as parameters just fine. -Now that we outlined why this is a thing, why is using rust a benefit if we still need to use unsafe? +Now that we outlined why this is a thing, why is using Rust a benefit if we still need to use unsafe? Well it's simple, eliminating most of the things that a developer needs to think about to have a safe program is still exponentially safer than leaving __everything__ to the developer to think about. -By using rust, we eliminate: +By using Rust, we eliminate: - The forgotten/unhandled CUDA errors problem (yay results!). - The uninitialized memory problem. - The forgetting to dealloc memory problem. @@ -148,23 +148,23 @@ a lot of them, and ease the burden of correctness from the developer. Besides, using Rust only adds to safety, it does not make CUDA *more* unsafe. This means there are only things to gain in terms of safety using Rust. -## Why not use rust-gpu with compute shaders? +## Why not use Rust GPU with compute shaders? The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders: - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal. - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
-- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by llvm and libnvvm are needed. -- SPIRV is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. -While libnvvm (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified. -- rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also +- Rust GPU does not perform many optimizations, and with `rustc_codegen_ssa`'s less than ideal codegen, the optimizations by LLVM and libNVVM are needed. +- SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. +While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified. +- Rust GPU is primarily focused on graphical shaders, with compute shaders secondary. The Rust ecosystem needs that, but it also needs a project 100% focused on computing, and computing only. -- SPIRV cannot access many useful CUDA libraries such as Optix, cuDNN, cuBLAS, etc. -- SPIRV debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used +- SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc. +- SPIR-V debug info is still very young and Rust GPU cannot generate it, while `rustc_codegen_nvvm` can; this can be used for profiling kernels in something like nsight compute. Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore -it is much easier for CUDA C++ users to use rust for GPU computing if most of the concepts are still the same.
Plus, -we can interface with existing CUDA code by compiling it to PTX then linking it with our rust code using the CUDA linker +it is much easier for CUDA C++ users to use Rust for GPU computing if most of the concepts are still the same. Plus, +we can interface with existing CUDA code by compiling it to PTX then linking it with our Rust code using the CUDA linker API (which is exposed in a high level wrapper in cust). ## Why use the CUDA Driver API over the Runtime API? @@ -190,17 +190,17 @@ when it is finished, which causes further uses of CUDA to fail. Modules are the second big difference in the driver API. Modules are similar to shared libraries, they contain all of the globals and functions (kernels) inside of a PTX/cubin file. The driver API -is language-agnostic, it purely works off of ptx/cubin files. To answer why this is important we -need to cover what cubins and ptx files are briefly. +is language-agnostic, it purely works off PTX/cubin files. To answer why this is important we +need to cover what cubins and PTX files are briefly. PTX is a low level assembly-like language which is the penultimate step before what the GPU actually executes. It is human-readable and you can dump it from a CUDA C++ program with `nvcc ./file.cu --ptx`. This PTX is then optimized and lowered into a final format called SASS (Source and Assembly) and turned into a cubin (CUDA binary) file. -Driver API modules can be loaded as either ptx, cubin, or fatbin files. If they are loaded as -ptx then the driver API will JIT compile the PTX to cubin then cache it. You can also -compile ptx to cubin yourself using ptx-compiler and cache it. +Driver API modules can be loaded as either PTX, cubin, or fatbin files. If they are loaded as +PTX then the driver API will JIT compile the PTX to cubin then cache it. You can also +compile PTX to cubin yourself using ptx-compiler and cache it. This pipeline provides much better control over what functions you actually need to load and cache. 
You can separate different functions into different modules you can load dynamically (and even dynamically reload). @@ -217,7 +217,7 @@ need to manage many kernels being dispatched at the same time as efficiently as ## Why target NVIDIA GPUs only instead of using something that can work on AMD? -This is a complex issue with many arguments for both sides, so i will give you +This is a complex issue with many arguments for both sides, so I will give you both sides as well as my opinion. Pros for using OpenCL over CUDA: @@ -235,7 +235,7 @@ new features cannot be reliably relied upon because they are unlikely to work on - OpenCL can only be written in OpenCL C (based on C99), OpenCL C++ is a thing, but again, not everything supports it. This makes complex programs more difficult to create. - OpenCL has less tools and libraries. -- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (ptx) +- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (PTX) and debug info. Essentially how CPU code works. This makes writing language-agnostic things in OpenCL near impossible and locks you into using OpenCL C. - OpenCL is plagued with serious driver bugs which have not been fixed, or that occur only on certain vendors. @@ -245,10 +245,10 @@ Pros for using CUDA over OpenCL: VFX computing. - CUDA is a proprietary tool, meaning that NVIDIA is able to push out bug fixes and features much faster than releasing a new spec and waiting for vendors to implement it. This allows for more features being added, -such as cooperative kernels, cuda graphs, unified memory, new profilers, etc. +such as cooperative kernels, CUDA graphs, unified memory, new profilers, etc. - CUDA is a single entity, meaning that if something does or does not work on one system it is unlikely that that will be different on another system. 
Assuming you are not using different architectures, where -one gpu may be lacking a feature. +one GPU may be lacking a feature. - CUDA is usually 10-30% faster than OpenCL overall, this is likely due to subpar OpenCL drivers by NVIDIA, but it is unlikely this performance gap will change in the near future. - CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc. @@ -264,8 +264,8 @@ Cons for using CUDA over OpenCL: # What makes cust and RustaCUDA different? -Cust is a fork of rustacuda which changes a lot of things inside of it, as well as adds new features that -are not inside of rustacuda. +Cust is a fork of RustaCUDA which changes a lot of things inside of it, as well as adds new features that +are not inside of RustaCUDA. The most significant changes (This list is not complete!!) are: - Drop code no longer panics on failure to drop raw CUDA handles, this is so that InvalidAddress errors, which cause @@ -286,8 +286,8 @@ Changes that are currently in progress but not done/experimental: - Graphs - PTX validation -Just like rustacuda, cust makes no assumptions of what language was used to generate the ptx/cubin. It could be +Just like RustaCUDA, cust makes no assumptions of what language was used to generate the PTX/cubin. It could be C, C++, futhark, or best of all, Rust! -Cust's name is literally just rust + cuda mashed together in a horrible way. +Cust's name is literally just Rust + CUDA mashed together in a horrible way. Or you can pretend it stands for custard if you really like custard. diff --git a/guide/src/features.md b/guide/src/features.md index 33c8128b..0977c480 100644 --- a/guide/src/features.md +++ b/guide/src/features.md @@ -1,4 +1,4 @@ -# Supported Features +# Supported features This page is used for tracking Cargo/Rust and CUDA features that are currently supported or planned to be supported in the future. 
As well as tracking some information about how they could @@ -14,13 +14,13 @@ around to adding it yet. | ✔️ | Fully Supported | | 🟨 | Partially Supported | -# Rust Features +# Rust features | Feature Name | Support Level | Notes | | ------------ | ------------- | ----- | -| Opt-Levels | ✔️ | behaves mostly the same (because llvm is still used for optimizations). Except that libnvvm opts are run on anything except no-opts because nvvm only has -O0 and -O3 | +| Opt-Levels | ✔️ | behaves mostly the same (because LLVM is still used for optimizations). Except that libNVVM opts are run on anything except no-opts because NVVM only has -O0 and -O3 | | codegen-units | ✔️ | -| LTO | ➖ | we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libnvvm, so all the benefits of LTO are on without pre-libnvvm LTO being needed. | +| LTO | ➖ | we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libNVVM, so all the benefits of LTO are on without pre-libNVVM LTO being needed. | | Closures | ✔️ | | Enums | ✔️ | | Loops | ✔️ | @@ -40,7 +40,7 @@ around to adding it yet. | Float Ops | ✔️ | Maps to libdevice intrinsics, calls to libm are not intercepted though, which we may want to do in the future | | Atomics | ❌ | -# CUDA Libraries +# CUDA libraries | Library Name | Support Level | Notes | | ------------ | ------------- | ----- | @@ -54,9 +54,9 @@ around to adding it yet. | cuSPARSE | ❌ | | AmgX | ❌ | | cuTENSOR | ❌ | -| OptiX | 🟨 | CPU OptiX is mostly complete, GPU OptiX is still heavily in-progress because it needs support from the codegen | +| OptiX | 🟨 | CPU OptiX is mostly complete, GPU OptiX is still heavily in-progress because it needs support from the codegen backend | -# GPU-side Features +# GPU-side features Note: Most of these categories are used __very__ rarely in CUDA code, therefore do not be alarmed that it seems like many things are not supported. 
We just focus on things used by the wide majority of users. @@ -105,4 +105,4 @@ | Stream Ordered Memory | ✔️ | | Graph Memory Nodes | ❌ | | Unified Memory | ✔️ | -| `__restrict__` | ➖ | Not needed, you get that performance boost automatically through rust's noalias :) | +| `__restrict__` | ➖ | Not needed, you get that performance boost automatically through Rust's noalias :) | diff --git a/guide/src/guide/compute_capabilities.md b/guide/src/guide/compute_capabilities.md index 617169fb..562be2f0 100644 --- a/guide/src/guide/compute_capabilities.md +++ b/guide/src/guide/compute_capabilities.md @@ -1,9 +1,9 @@ -# Compute Capability Gating +# Compute capability gating This section covers how to write code that adapts to different CUDA compute capabilities using conditional compilation. -## What are Compute Capabilities? +## What are compute capabilities? CUDA GPUs have different "compute capabilities" that determine which features they support. Each capability is identified by a version number like `3.5`, `5.0`, `6.1`, @@ -17,7 +17,7 @@ For example: For comprehensive details, see [NVIDIA's CUDA documentation on GPU architectures](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-compilation). -## Virtual vs Real Architectures +## Virtual vs real architectures In CUDA terminology: @@ -25,10 +25,10 @@ In CUDA terminology: features - **Real architectures** (`sm_XX`) represent actual GPU hardware -rust-cuda works exclusively with virtual architectures since it only generates PTX. The +Rust CUDA works exclusively with virtual architectures since it only generates PTX. The `NvvmArch::ComputeXX` enum values correspond to CUDA's virtual architectures. -## Using Target Features +## Using target features When building your kernel, the `NvvmArch::ComputeXX` variant you choose enables specific `target_feature` flags.
These can be used with `#[cfg(...)]` to conditionally compile @@ -51,12 +51,12 @@ which `NvvmArch::ComputeXX` is used to build the kernel, there is a different an These features let you write optimized code paths for specific GPU generations while still supporting older ones. -## Specifying Compute Capabilites +## Specifying compute capabilities Starting with CUDA 12.9, NVIDIA introduced architecture suffixes that affect compatibility. -### Base Architecture (No Suffix) +### Base architecture (no suffix) Example: `NvvmArch::Compute70` @@ -79,7 +79,7 @@ CudaBuilder::new("kernels") #[cfg(target_feature = "compute_80")] // ✗ Fail (higher base variant) ``` -### Family Suffix ('f') +### Family suffix ('f') Example: `NvvmArch::Compute101f` @@ -108,7 +108,7 @@ CudaBuilder::new("kernels") #[cfg(target_feature = "compute_110")] // ✗ Fail (higher base variant) ``` -### Architecture Suffix ('a') +### Architecture suffix ('a') Example: `NvvmArch::Compute100a` @@ -142,7 +142,7 @@ Note: While the 'a' variant enables all these features during compilation (allow For more details on suffixes, see [NVIDIA's blog post on family-specific architecture features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/). -### Manual Compilation (Without CudaBuilder) +### Manual compilation (without `cuda_builder`) If you're invoking `rustc` directly instead of using `cuda_builder`, you only need to specify the architecture through LLVM args: @@ -162,11 +162,11 @@ cargo build --target nvptx64-nvidia-cuda The codegen backend automatically synthesizes target features based on the architecture type as described above.
-### Common Patterns for Base Architectures +### Common patterns for base architectures These patterns work when using base architectures (no suffix), which enable all lower capabilities: -#### At Least a Capability (Default) +#### At least a capability (default) ```rust,no_run // Code that requires compute 6.0 or higher @@ -176,7 +176,7 @@ These patterns work when using base architectures (no suffix), which enable all } ``` -#### Exactly One Capability +#### Exactly one capability ```rust,no_run // Code that targets exactly compute 6.1 (not 6.2+) @@ -186,7 +186,7 @@ These patterns work when using base architectures (no suffix), which enable all } ``` -#### Up To a Maximum Capability +#### Up to a maximum capability ```rust,no_run // Code that works up to compute 6.0 (not 6.1+) @@ -196,7 +196,7 @@ These patterns work when using base architectures (no suffix), which enable all } ``` -#### Targeting Specific Architecture Ranges +#### Targeting specific architecture ranges ```rust,no_run // This block compiles when building for architectures >= 6.0 but < 8.0 @@ -206,7 +206,7 @@ These patterns work when using base architectures (no suffix), which enable all } ``` -## Debugging Capability Issues +## Debugging capability issues If you encounter errors about missing functions or features: @@ -215,9 +215,9 @@ If you encounter errors about missing functions or features: 3. Use `nvidia-smi` to check your GPU's compute capability 4. Add appropriate `#[cfg]` guards or increase the target architecture -## Runtime Behavior +## Runtime behavior -Again, rust-cuda **only generates PTX**, not pre-compiled GPU binaries +Again, Rust CUDA **only generates PTX**, not pre-compiled GPU binaries ("[fatbinaries](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#fatbinaries)"). This PTX is then JIT-compiled by the CUDA driver at _runtime_. 
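Because the driver JIT-compiles whatever PTX you ship against the GPU it actually finds, a host program that bundles PTX for several virtual architectures usually picks the highest one the device supports. A plain-Rust sketch of that selection follows; `best_arch` is a hypothetical helper for illustration, not part of cust or `cuda_builder`.

```rust
/// Pick the highest virtual architecture we shipped PTX for that the
/// device can JIT: PTX built for compute X.Y runs on capability >= X.Y.
/// `(major, minor)` tuples compare lexicographically, which matches how
/// compute capabilities are ordered. Hypothetical helper for illustration.
fn best_arch(compiled: &[(u32, u32)], device: (u32, u32)) -> Option<(u32, u32)> {
    compiled.iter().copied().filter(|&arch| arch <= device).max()
}

fn main() {
    let shipped = [(5, 0), (6, 0), (7, 0)];
    // A compute 6.1 device can JIT the 6.0 PTX, but not the 7.0 PTX.
    assert_eq!(best_arch(&shipped, (6, 1)), Some((6, 0)));
    // A compute 3.5 device predates everything we shipped.
    assert_eq!(best_arch(&shipped, (3, 5)), None);
}
```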
diff --git a/guide/src/guide/getting_started.md b/guide/src/guide/getting_started.md index be30c946..d6a781d9 100644 --- a/guide/src/guide/getting_started.md +++ b/guide/src/guide/getting_started.md @@ -1,8 +1,8 @@ -# Getting Started +# Getting started This section covers how to get started writing GPU crates with `cuda_std` and `cuda_builder`. -## Required Libraries +## Required libraries Before you can use the project to write GPU crates, you will need a couple of prerequisites: @@ -10,16 +10,16 @@ Before you can use the project to write GPU crates, you will need a couple of pr This is only for building GPU crates, to execute built PTX you only need CUDA `9+`. -- LLVM 7.x (7.0 to 7.4), The codegen searches multiple places for LLVM: +- LLVM 7.x (7.0 to 7.4). The codegen backend searches multiple places for LLVM: - If `LLVM_CONFIG` is present, it will use that path as `llvm-config`. - Or, if `llvm-config` is present as a binary, it will use that, assuming that `llvm-config --version` returns `7.x.x`. - Finally, if neither are present or unusable, it will attempt to download and use prebuilt LLVM. This currently only works on Windows however. -- The OptiX SDK if using the optix library (the pathtracer example uses it for denoising). +- The OptiX SDK if using the OptiX library (the pathtracer example uses it for denoising). -- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add libnvvm to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`, +- You may also need to add `libnvvm` to PATH; the builder should do this for you, but if it does not, add it manually. It should be somewhere like `CUDA_ROOT/nvvm/bin`. - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config @@ -58,9 +58,9 @@ Where `XX` is the latest version of `cuda_std`. We changed our crate's crate types to `cdylib` and `rlib`.
We specified `cdylib` because the nvptx targets do not support binary crate types. `rlib` is so that we will be able to use the crate as a dependency, such as if we would like to use it on the CPU. -## lib.rs +## `lib.rs` -Before we can write any GPU kernels, we must add a few directives to our `lib.rs` which are required by the codegen: +Before we can write any GPU kernels, we must add a few directives to our `lib.rs` which are required by the codegen backend: ```rs #![cfg_attr( @@ -76,7 +76,7 @@ This does a couple of things: - It only applies the attributes if we are compiling the crate for the GPU (target_os = "cuda"). - It declares the crate to be `no_std` on CUDA targets. -- It registers a special attribute required by the codegen for things like figuring out +- It registers a special attribute required by the codegen backend for things like figuring out what functions are GPU kernels. - It explicitly includes `kernel` macro and `thread` @@ -86,7 +86,7 @@ If you would like to use `alloc` or things like printing from GPU kernels (which extern crate alloc; ``` -Finally, if you would like to use types such as slices or arrays inside of GPU kernels you must allow `improper_cytypes_definitions` either on the whole crate or the individual GPU kernels. This is because on the CPU, such types are not guaranteed to be passed a certain way, so they should not be used in `extern "C"` functions (which is what kernels are implicitly declared as). However, `rustc_codegen_nvvm` guarantees the way in which things like structs, slices, and arrays are passed. See [Kernel ABI](./kernel_abi.md). +Finally, if you would like to use types such as slices or arrays inside of GPU kernels you must allow `improper_ctypes_definitions` either on the whole crate or the individual GPU kernels. This is because on the CPU, such types are not guaranteed to be passed a certain way, so they should not be used in `extern "C"` functions (which is what kernels are implicitly declared as). 
However, `rustc_codegen_nvvm` guarantees the way in which things like structs, slices, and arrays are passed. See [Kernel ABI](./kernel_abi.md). ```rs #![allow(improper_ctypes_definitions)] @@ -102,7 +102,7 @@ Now we can finally start writing an actual GPU kernel. Firstly, we must explain a couple of things about GPU kernels, specifically, how they are executed. GPU Kernels (functions) are the entry point for executing anything on the GPU, they are the functions which will be executed from the CPU. GPU kernels do not return anything, they write their data to buffers passed into them. CUDA's execution model is very very complex and it is unrealistic to explain all of it in -this section, but the TLDR of it is that CUDA will execute the GPU kernel once on every +this section, but the TL;DR of it is that CUDA will execute the GPU kernel once on every thread, with the number of threads being decided by the caller (the CPU). We call these parameters the launch dimensions of the kernel. Launch dimensions are split @@ -115,7 +115,7 @@ up into two basic concepts: of the current block. One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d. -That is to say, i can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could +That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it makes the 2d/3d index calculations much simpler. CUDA exposes thread and block indices for each dimension through special registers. We expose thread index queries through @@ -156,7 +156,7 @@ Internally what this does is it first checks that a couple of things are right i - The function is `unsafe`. - The function does not return anything. -Then it declares this kernel to the codegen so that the codegen can tell CUDA this is a GPU kernel. +Then it declares this kernel to the codegen backend so it can tell CUDA this is a GPU kernel. 
It also applies `#[no_mangle]` so the name of the kernel is the same as it is declared in the code. ## Building the GPU crate @@ -171,7 +171,7 @@ To use it you can simply add it as a build dependency in your CPU crate (the cra +cuda_builder = "XX" ``` -Where `XX` is the current version of cuda_builder. +Where `XX` is the current version of `cuda_builder`. Then, you can simply invoke it in the build.rs of your CPU crate: @@ -229,7 +229,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p **Notes:** 1. refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project. -2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02 -3. if you have issues within the container, it can help to start ensuring your gpu is recognized +2. despite using Docker, your machine will still need to be running a compatible driver, in this case for CUDA 11.4.1 it is >=470.57.02 +3. if you have issues within the container, it can help to start by ensuring your GPU is recognized - ensure `nvidia-smi` provides meaningful output in the container - - NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your gpu + - NVIDIA provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample.
If all is well you should see many details about your GPU diff --git a/guide/src/guide/kernel_abi.md b/guide/src/guide/kernel_abi.md index c4a9fe3d..5330d207 100644 --- a/guide/src/guide/kernel_abi.md +++ b/guide/src/guide/kernel_abi.md @@ -1,21 +1,21 @@ # Kernel ABI -This section details how parameters are passed to GPU kernels by the Codegen at the current time. -In other words, how the codegen expects you to pass different types to GPU kernels from the CPU. +This section details how parameters are passed to GPU kernels by the codegen backend. In other +words, how the codegen backend expects you to pass different types to GPU kernels from the CPU. ⚠️ If you find any bugs in the ABI please report them. ⚠️ ## Preface -Please note that the following __only__ applies to non-rust call conventions, we make zero guarantees -about the rust call convention, just like rustc. +Please note that the following __only__ applies to non-Rust call conventions; we make zero guarantees +about the Rust call convention, just like rustc. -While we currently override every ABI except rust, you should generally only use `"C"`, any +While we currently override every ABI except Rust, you should generally only use `"C"`; any other ABI we override purely to avoid footguns. Functions marked as `#[kernel]` are enforced to be `extern "C"` by the kernel macro, and it is expected that __all__ GPU kernels be `extern "C"`, not that you should be declaring any kernels without the `#[kernel]` macro, -because the codegen/cuda_std is allowed to rely on the behavior of `#[kernel]` for correctness. +because the codegen backend/`cuda_std` is allowed to rely on the behavior of `#[kernel]` for correctness.
## Structs @@ -119,7 +119,7 @@ unsafe { } ``` -You may get warnings about slices being an improper C-type, but the warnings are safe to ignore, the codegen guarantees +You may get warnings about slices being an improper C-type, but the warnings are safe to ignore; the codegen backend guarantees that slices are passed as pairs of params. You cannot however pass mutable slices, this is because it would violate aliasing rules, each thread receiving a copy of the mutable @@ -135,7 +135,7 @@ ZSTs (zero-sized types) are ignored and become nothing in the final PTX. Primitive types are passed directly by value, same as structs. They map to the special PTX types `.s8`, `.s16`, `.s32`, `.s64`, `.u8`, `.u16`, `.u32`, `.u64`, `.f32`, and `.f64`. With the exception that `u128` and `i128` are passed as byte arrays (but this has no impact on how they are passed from the CPU). -## References And Pointers +## References and pointers References and Pointers are both passed as expected, as pointers. It is therefore expected that you pass such parameters using device memory: diff --git a/guide/src/guide/safety.md b/guide/src/guide/safety.md index bad3dfaf..1a8bc4ec 100644 --- a/guide/src/guide/safety.md +++ b/guide/src/guide/safety.md @@ -25,7 +25,7 @@ Behavior considered undefined inside of GPU kernels: undefined on the GPU too. The only exception being invalid sizes for buffers given to a GPU kernel. -Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the +Currently, we declare that the invariant that a buffer given to a GPU kernel must be large enough for any access the kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future. - Any kind of data race, this has the same semantics as data races in CPU code.
Such as: @@ -90,7 +90,7 @@ Note however, that unified memory can be accessed by multiple GPUs and multiple takes care of copying and moving data automatically from GPUs/CPU when a page fault occurs. For this reason as well as general ease of use, we suggest that unified memory generally be used over regular device memory. -### Kernel Launches +### Kernel launches Kernel Launches are the most unsafe part of CUDA, many things must be checked by the developer to soundly launch a kernel. It is fundamentally impossible for us to verify a large portion of the invariants expected by the kernel/CUDA. diff --git a/guide/src/guide/tips.md b/guide/src/guide/tips.md index 98ddb1a0..616b7807 100644 --- a/guide/src/guide/tips.md +++ b/guide/src/guide/tips.md @@ -4,11 +4,11 @@ This section contains some tips on what to do and what not to do using the proje ## GPU kernels -- Generally don't derive `Debug` for structs in GPU crates. The codegen currently does not do much global +- Generally don't derive `Debug` for structs in GPU crates. The codegen backend currently does not do much global DCE (dead code elimination) so debug can really slow down compile times and make the PTX gigantic. This will get much better in the future but currently it will cause some undesirable effects. - Don't use recursion, CUDA allows it but threads have very limited stacks (local memory) and stack overflows yield confusing `InvalidAddress` errors. If you are getting such an error, run the executable in cuda-memcheck, -it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the ptx file through +it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the PTX file through `cuobjdump` and it should yield ptxas warnings for functions without a statically known stack usage. 
diff --git a/guide/src/nvvm/README.md b/guide/src/nvvm/README.md index 69efa4fe..701cdf1c 100644 --- a/guide/src/nvvm/README.md +++ b/guide/src/nvvm/README.md @@ -1,10 +1,10 @@ -# rustc_codegen_nvvm +# `rustc_codegen_nvvm` -This section will cover the more technical details of how rustc_codegen_nvvm works +This section will cover the more technical details of how `rustc_codegen_nvvm` works as well as the issues that came with it. It will also explain some technical details about CUDA/PTX/etc, it is not necessarily -limited to rustc_codegen_nvvm. +limited to `rustc_codegen_nvvm`. Basic knowledge of how rustc and LLVM work and what they do is assumed. You can find info about rustc in the [rustc dev guide](https://rustc-dev-guide.rust-lang.org/). diff --git a/guide/src/nvvm/backends.md b/guide/src/nvvm/backends.md index c4de04ca..1249f4ec 100644 --- a/guide/src/nvvm/backends.md +++ b/guide/src/nvvm/backends.md @@ -1,40 +1,41 @@ -# Custom Rustc Backends +# Custom rustc backends -Before we get into the details of rustc_codegen_nvvm, we obviously need to explain what a codegen is! +Before we get into the details of `rustc_codegen_nvvm`, we obviously need to explain what a codegen +backend is! -Custom codegens are rustc's answer to "well what if i want rust to compile to X?". This is a problem +Custom codegen backends are rustc's answer to "well what if I want Rust to compile to X?". This is a problem that comes up in many situations, especially conversations of "well LLVM cannot target this, so we are screwed". To solve this problem, rustc decided to incrementally decouple itself from being attached/reliant on LLVM exclusively. -Previously, rustc only had a single codegen, the LLVM codegen. The LLVM codegen translated MIR directly to LLVM IR. +Previously, rustc only had a single codegen backend, the LLVM codegen backend. This translated MIR directly to LLVM IR.
This is great if you just want to support LLVM, but LLVM is not perfect, and inevitably you will hit limits to what LLVM is able to do. Or, you may just want to stop using LLVM, LLVM is not without problems (it is often slow, clunky to deal with, and does not support a lot of targets). -Nowadays, Rustc is almost fully decoupled from LLVM and it is instead generic over the "codegen" backend used. -Rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. -This allows rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are -five publicly known codegens that exist: -- rustc_codegen_cranelift -- rustc_codegen_llvm -- rustc_codegen_gcc -- rustc_codegen_spirv -- rustc_codegen_nvvm, obviously the best codegen ;) - -`rustc_codegen_cranelift` targets the cranelift backend, which is a codegen backend written in rust that is faster than LLVM but does not have many optimizations +Nowadays, rustc is almost fully decoupled from LLVM and it is instead generic over the codegen backend used. +rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. +This allows Rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are +five publicly known codegen backends that exist: +- `rustc_codegen_cranelift` +- `rustc_codegen_llvm` +- `rustc_codegen_gcc` +- `rustc_codegen_spirv` +- `rustc_codegen_nvvm`, obviously the best codegen backend ;) + +`rustc_codegen_cranelift` targets Cranelift, a code generator written in Rust that is faster than LLVM but does not have many optimizations compared to LLVM. `rustc_codegen_llvm` is obvious, it is the backend almost everybody uses which targets LLVM. `rustc_codegen_gcc` targets GCC (GNU Compiler Collection) which is able to target more exotic targets than LLVM, especially for embedded.
`rustc_codegen_spirv` targets the SPIR-V (Standard Portable Intermediate Representation 5) format, which is a format mostly used for compiling shader languages such as GLSL or WGSL to a standard representation that Vulkan/OpenGL can use, the reasons -why SPIR-V is not an alternative to CUDA/rustc_codegen_nvvm have been covered in the [FAQ](../../faq.md). +why SPIR-V is not an alternative to CUDA/`rustc_codegen_nvvm` have been covered in the [FAQ](../../faq.md). -Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling rust to gpu kernels that can be run by CUDA. -What NVVM IR/libnvvm are has been covered in the [CUDA section](../../cuda/pipeline.md). +Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling Rust to GPU kernels that can be run by CUDA. +What NVVM IR/libNVVM are has been covered in the [CUDA section](../../cuda/pipeline.md). -# rustc_codegen_ssa +# `rustc_codegen_ssa` -`rustc_codegen_ssa` is the central crate behind every single codegen and does much of the hard work. -It abstracts away the MIR lowering logic so that custom codegens only have to implement some -traits and the SSA codegen does everything else. For example: +`rustc_codegen_ssa` is the central crate behind every single codegen backend and does much of the +hard work. It abstracts away the MIR lowering logic so that custom codegen backends only have to +implement some traits and the SSA codegen does everything else. For example: - A trait for getting a type like an integer type. - A trait for optimizing a module. - A trait for linking everything. 
diff --git a/guide/src/nvvm/debugging.md b/guide/src/nvvm/debugging.md index b5ab7224..6c0491a9 100644 --- a/guide/src/nvvm/debugging.md +++ b/guide/src/nvvm/debugging.md @@ -1,4 +1,4 @@ -# Debugging The Codegen +# Debugging the codegen backend When you try to compile an entire language for a completely different type of hardware, stuff is bound to break. In this section we will cover how to debug 🧊, segfaults, and more. @@ -10,33 +10,33 @@ Segfaults are usually caused in one of two ways: - From NVVM when linking (generating PTX). (more common) The first case can be debugged in two ways: -- Building the codegen in debug mode and using `RUSTC_LOG="rustc_codegen_nvvm=trace"` (`$env:RUSTC_LOG = "rustc_codegen_nvvm=trace";` if using powershell). -Note that this will dump a LOT of output, and when i say a LOT, i am not joking, so please, pipe this to a file. -This will give you a detailed summary of almost every action the codegen has done, you can examine the final few logs to -check what the last action the codegen was doing before segfaulting was. This is usually straightforward because the logs are detailed. +- Building the codegen backend in debug mode and using `RUSTC_LOG="rustc_codegen_nvvm=trace"` (`$env:RUSTC_LOG = "rustc_codegen_nvvm=trace";` if using PowerShell). +Note that this will dump a LOT of output, and when I say a LOT, I am not joking, so please pipe this to a file. +This will give you a detailed summary of almost every action the codegen backend has done; you can examine the final few logs to +check what the codegen backend was doing just before the segfault. This is usually straightforward because the logs are detailed. - Building LLVM 7 with debug assertions. This, coupled with logging should give all the info needed to debug a segfault. It should get LLVM to throw an exception whenever something bad happens. The latter case is a bit worse.
-Segfaults in libnvvm are generally because we gave something to libnvvm which it did not expect. In an ideal world, libnvvm would -just throw a validation error, but it wouldn't be an llvm-based library if it threw friendly errors ;). Libnvvm has been known to segfault +Segfaults in libNVVM are generally because we gave libNVVM something it did not expect. In an ideal world, libNVVM would +just throw a validation error, but it wouldn't be an LLVM-based library if it threw friendly errors ;). libNVVM has been known to segfault on things like: - using int types that arent `i1`, `i8`, `i16`, `i32`, or `i64` in functions signatures. (see int_replace.rs). - having debug info on multiple modules (this is technically disallowed per the spec but it still shouldn't segfault). -Generally there is no good way to debug these failures other than hoping libnvvm throws a validation error (which will cause an ICE). -I have created a tiny tool to run `llvm-extract` on an llvm ir file to attempt to isolate segfaulting functions which works to some degree -which i will add to the project soon. +Generally there is no good way to debug these failures other than hoping libNVVM throws a validation error (which will cause an ICE). +I have created a tiny tool to run `llvm-extract` on an LLVM IR file to attempt to isolate segfaulting functions, which works to some degree, +and which I will add to the project soon. ## Miscompilations Miscompilations are rare but annoying. They usually result in one of two things happening: - CUDA rejecting the PTX as a whole (throwing an InvalidPtx error). This is rare but the most common cause is declaring invalid -extern functions (just grep for `extern` in the ptx file and check if it's odd functions that aren't cuda syscalls like vprintf, malloc, free, etc). +extern functions (just grep for `extern` in the PTX file and check whether there are odd functions that aren't CUDA syscalls like vprintf, malloc, free, etc). - The PTX containing invalid behavior.
This is very specific and rare but if you find this, the best way to debug it is: - - Try to get a minimal working example so we don't have to search through megabytes of llvm ir/ptx. + - Try to get a minimal working example so we don't have to search through megabytes of LLVM IR/PTX. - Use `RUSTFLAGS="--emit=llvm-ir"` and find `crate_name.ll` in `target/nvptx64-nvidia-cuda//deps/` and attach it in any bug report. - Attach the final PTX file. @@ -47,7 +47,7 @@ If that doesn't work, then it might be a bug inside of CUDA itself, but that sho is to set up the crate for debug (and see if it still happens in debug). Then you can run your executable under NSight Compute, go to the source tab, and examine the SASS (basically an assembly lower than PTX) to see if ptxas miscompiled it. -If you set up the codegen for debug, it should give you a mapping from rust code to SASS which should hopefully help to see what exactly is breaking. +If you set up the codegen backend for debug, it should give you a mapping from Rust code to SASS which should hopefully help to see what exactly is breaking. 
Here is an example of the screen you should see: diff --git a/guide/src/nvvm/nvvm.md b/guide/src/nvvm/nvvm.md index 4e15aa7d..ed2f4651 100644 --- a/guide/src/nvvm/nvvm.md +++ b/guide/src/nvvm/nvvm.md @@ -1,20 +1,20 @@ -# rustc_codegen_nvvm +# `rustc_codegen_nvvm` At the highest level, our codegen workflow goes like this: ``` Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX -> PTX opts/function DCE -> Final PTX | | | | ^ - | | libnvvm +------+ | + | | libNVVM +------+ | | | | | rustc_codegen_nvvm +------------------------------------------------------------| - Rustc +--------------------------------------------------------------------------------------------------- + rustc +--------------------------------------------------------------------------------------------------- ``` Before we do anything, rustc does its normal job, it typechecks, converts everything to MIR, etc. Then, -rustc loads our codegen shared lib and invokes it to codegen the MIR. It creates an instance of +rustc loads our codegen backend shared lib and invokes it to codegen the MIR. It creates an instance of `NvvmCodegenBackend` and it invokes `codegen_crate`. You could do anything inside `codegen_crate` but -we just defer back to rustc_codegen_ssa and tell it to do the job for us: +we just defer back to `rustc_codegen_ssa` and tell it to do the job for us: ```rs fn codegen_crate<'tcx>( @@ -34,30 +34,30 @@ fn codegen_crate<'tcx>( ``` After that, the codegen logic is kind of abstracted away from us, which is a good thing! -We just need to provide the SSA codegen whatever it needs to do its thing. This is +We just need to provide the SSA codegen crate whatever it needs to do its thing. This is done in the form of traits, lots and lots and lots of traits, more traits than you've ever seen, traits -your subconscious has warned you of in nightmares, anyways. Because talking about how the SSA codegen +your subconscious has warned you of in nightmares, anyways. 
Because talking about how the SSA codegen crate works is kind of useless, we will instead talk first about general concepts and terminology, then dive into each trait. But first, let's talk about the end of the codegen, it is pretty simple, we do a couple of things: *after codegen is done and LLVM has been run to optimize each module* 1. We gather every LLVM bitcode module we created. -2. We create a new libnvvm program. -3. We add every bitcode module to the libnvvm program. +2. We create a new libNVVM program. +3. We add every bitcode module to the libNVVM program. 4. We try to find libdevice and add it to the program (see [nvidia docs](https://docs.nvidia.com/cuda/libdevice-users-guide/introduction.html#what-is-libdevice) on what libdevice is). -5. We run the verifier on the nvvm program just to check that we did not create any invalid NVVM IR. +5. We run the verifier on the NVVM program just to check that we did not create any invalid NVVM IR. 6. We run the compiler which gives us a final PTX string, hooray! 7. Finally, the PTX goes through a small stage where its parsed and function DCE is run to eliminate most of the bloat in the file. Traditionally this is done by the linker but there's no linker to be found for miles here. 8. We write this PTX file to wherever rustc tells us to write the final file. -We will cover the libnvvm steps in more detail later on. +We will cover the libNVVM steps in more detail later on. -# Codegen Units (CGUs) +# Codegen units (CGUs) Ah codegen units, the thing everyone just tells you to set to `1` in Cargo.toml, but what are they? Well, to put it simply, codegen units are rustc splitting up a crate into different modules to then @@ -65,7 +65,7 @@ run LLVM in parallel over. For example, rustc can run LLVM over two different mo save time. This gets a little bit more complex with generics, because MIR is not monomorphized and monomorphized MIR is not a thing, -the codegen monomorphizes instances on the fly. 
Therefore rustc needs to put any generic functions that one CGU relies on +the compiler monomorphizes instances on the fly. Therefore rustc needs to put any generic functions that one CGU relies on inside of the same CGU because it needs to monomorphize them. # Rlibs diff --git a/guide/src/nvvm/ptxgen.md b/guide/src/nvvm/ptxgen.md index 260f6612..3ec0d2c0 100644 --- a/guide/src/nvvm/ptxgen.md +++ b/guide/src/nvvm/ptxgen.md @@ -1,20 +1,20 @@ -# PTX Generation +# PTX generation -This is the final and most fun part of codegen, taking our LLVM bitcode and giving it to libnvvm. -It is in theory as simple as just giving nvvm every single bitcode module, but in practice, we do a couple -of things before and after to reduce ptx size and speed things up. +This is the final and most fun part of codegen, taking our LLVM bitcode and giving it to libNVVM. +It is in theory as simple as just giving NVVM every single bitcode module, but in practice, we do a couple +of things before and after to reduce PTX size and speed things up. # The NVVM API -libnvvm is a dynamically linked library which is distributed in every download of the CUDA SDK. +libNVVM is a dynamically linked library which is distributed in every download of the CUDA SDK. If you are on windows, it should be somewhere around `C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.3/nvvm/bin` -where `v11.3` is the version of cuda you have downloaded. On Windows it's usually called `nvvm64_40_0.dll`. If you are +where `v11.3` is the version of CUDA you have downloaded. On Windows it's usually called `nvvm64_40_0.dll`. If you are on linux it should be somewhere around `/opt/cuda/nvvm-prev/lib64/libnvvm.so`. You can see its API either in the [API docs](https://docs.nvidia.com/cuda/libnvvm-api/group__compilation.html) or in its header file in the `include` folder. We have our own high level bindings to it published as a crate called `nvvm`. 
-The libnvvm API could not be simpler, it is just a couple of functions: +The libNVVM API could not be simpler, it is just a couple of functions: - Make new program - Add bitcode module - Lazy add bitcode module @@ -31,16 +31,16 @@ should be a very simple thing that would involve no calls to random functions in haystack, ...right? Why of course not, you didn't seriously think we would make this straight-forward, right? -So, in theory it is very simple, just load the bitcode from the rlib and tell nvvm to load it. +So, in theory it is very simple, just load the bitcode from the rlib and tell NVVM to load it. While this is easy and it works, it has its own very visible issues. Traditionally, if you never use a function, either the compiler destroys it when using LTO, or the linker destroys it in its own dead code pass. The issue is that LTO is not always run, -and we do not have a linker, nvvm *is* our linker. However, nvvm does not eliminate dead functions. -I think you can guess why that is a problem, so unless we want `11mb` ptx files (yes this is actually +and we do not have a linker, NVVM *is* our linker. However, NVVM does not eliminate dead functions. +I think you can guess why that is a problem, so unless we want `11mb` PTX files (yes this is actually how big it was) we need to do something about it. -# Module Merging and DCE +# Module merging and DCE To solve our dead code issue, we take a pretty simple approach. We merge every module (one crate maybe be multiple modules because of codegen units) into a single module to start. Then, we do the following: @@ -58,10 +58,10 @@ into the module if they are used, doing so using dependency graphs. There are a couple of special modules we need to load before we are done, `libdevice` and `libintrinsics`. The first and most important one is libdevice, libdevice is essentially a bitcode module containing hyper-optimized math intrinsics -that nvidia provides for us. 
You can find it as a `.bc` file in the libdevice folder inside your nvvm install location. +that NVIDIA provides for us. You can find it as a `.bc` file in the libdevice folder inside your NVVM install location. Every function inside of it is prefixed with `__nv_`, you can find docs for it [here](https://docs.nvidia.com/cuda/libdevice-users-guide/index.html). -We declare these intrinsics inside of `ctx_intrinsics.rs` and link to them inside cuda_std. We also use them to codegen +We declare these intrinsics inside of `ctx_intrinsics.rs` and link to them inside `cuda_std`. We also use them to codegen a lot of intrinsics inside `intrinsic.rs`, such as `sqrtf32`. libdevice is also lazy loaded so we do not import useless intrinsics. @@ -69,15 +69,15 @@ libdevice is also lazy loaded so we do not import useless intrinsics. # libintrinsics This is the last special module we load, it is simple, it is just a dumping ground for random wrapper functions -we need to define that `cuda_std` or the codegen needs. You can find the llvm ir definition for it in the codegen directory +we need to define that `cuda_std` or the codegen backend needs. You can find the LLVM IR definition for it in the codegen directory called `libintrinsics.ll`. All of its functions should be declared with the `__nvvm_` prefix. # Compilation Finally, we have everything loaded and we can compile our program. We do one last thing however. -Nvvm has a function for verifying our program to make sure we did not add anything nvvm does not like. We run this -before compilation just to be safe. Although annoyingly this does not catch all errors, nvvm just segfaults sometimes which is unfortunate. +NVVM has a function for verifying our program to make sure we did not add anything nvvm does not like. We run this +before compilation just to be safe. Although annoyingly this does not catch all errors, NVVM just segfaults sometimes which is unfortunate. 
-Compiling is simple, we just call nvvm's program compile function and panic if it fails, if it doesn't, we get a final PTX string. We +Compiling is simple, we just call NVVM's program compile function and panic if it fails; if it doesn't, we get a final PTX string. We can then just write that to the file that rustc wants us to put the final item in. diff --git a/guide/src/nvvm/types.md b/guide/src/nvvm/types.md index 845a7cb3..23d3933e 100644 --- a/guide/src/nvvm/types.md +++ b/guide/src/nvvm/types.md @@ -1,27 +1,27 @@ # Types -Types! who doesn't love types, especially those that cause libnvvm to randomly segfault or loop forever! -Anyways, types are an integral part of the codegen and everything revolves around them and you will see them everywhere. +Types! Who doesn't love types, especially those that cause libNVVM to randomly segfault or loop forever! +Anyway, types are an integral part of the codegen backend; everything revolves around them and you will see them everywhere. `rustc_codegen_ssa` does not actually tell you what your type representation should be, it allows you to decide. For -example, `rust-gpu` represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as -opaque llvm types: +example, Rust GPU represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen backend represent it as +opaque LLVM types: ```rs type Type = &'ll llvm::Type; ``` `llvm::Type` is an opaque type that comes from llvm-c. `'ll` is one of the main lifetimes you will see -throughout the whole codegen, it is used for anything that lasts as long as the current usage of llvm. +throughout the whole codegen backend; it is used for anything that lasts as long as the current usage of LLVM.
+LLVM gives you back pointers when you ask for a type or value; some time ago `rustc_codegen_llvm` fully switched to using references over pointers, and we follow in their footsteps. One important fact about types is that they are opaque, you cannot take a type and ask "is this X struct?", this is like asking "which chickens were responsible for this omelette?". You can ask if its a number type, a vector type, a void type, etc. -The SSA codegen needs to ask the backend for types for everything it needs to codegen MIR. It does -this using a trait called `BaseTypeMethods`: +The SSA codegen crate needs to ask the backend for types for everything it needs to codegen MIR. It +does this using a trait called `BaseTypeMethods`: ```rs pub trait BaseTypeMethods<'tcx>: Backend<'tcx> { @@ -55,8 +55,9 @@ } ``` -Every codegen implements this some way or another, you can find our implementation in `ty.rs`. Our -implementation is pretty straightforward, LLVM has functions that we link to which get us the types we need: +Every codegen backend implements this one way or another; you can find our implementation in +`ty.rs`. Our implementation is pretty straightforward: LLVM has functions that we link to which get +us the types we need: ```rs impl<'ll, 'tcx> BaseTypeMethods<'tcx> for CodegenCx<'ll, 'tcx> {