diff --git a/blog/2025-07-25-rust-on-every-gpu.mdx b/blog/2025-07-25-rust-on-every-gpu.mdx
new file mode 100644
index 0000000..95a278a
--- /dev/null
+++ b/blog/2025-07-25-rust-on-every-gpu.mdx
@@ -0,0 +1,658 @@
+---
+title: "Rust running on every GPU"
+authors: [LegNeato]
+tags: ["demo", "cuda"]
+---
+
+I've built a [demo](https://github.com/LegNeato/rust-gpu-chimera-demo) of a single
+shared Rust codebase that runs on every major GPU platform:
+
+- **CUDA** for NVIDIA GPUs
+- **SPIR-V** for Vulkan-compatible GPUs from AMD, Intel, NVIDIA, and Android devices
+- **Metal** for Apple devices
+- **DirectX 12** for Windows
+- **WebGPU** for browsers
+- **CPU** fallback for non-GPU systems
+
+The same compute logic runs on all targets, written entirely in regular Rust. No shader
+or kernel languages are used.
+
+
+
+The code is available on [GitHub](https://github.com/LegNeato/rust-gpu-chimera-demo).
+
+## Background
+
+GPUs are typically programmed using specialized languages like
+[WGSL](https://www.w3.org/TR/WGSL/),
+[GLSL](https://developer.mozilla.org/en-US/docs/Games/Techniques/3D_on_the_web/GLSL_Shaders),
+[MSL](https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu),
+[HLSL](https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl),
+[Slang](https://github.com/shader-slang/slang), or
+[Triton](https://github.com/openai/triton). These languages are GPU-specific and
+separate from the host application's language and tooling, increasing complexity and
+duplicating logic across CPU and GPU code.
+
+Some of us in the Rust community are taking a different approach and want to program
+GPUs by compiling standard Rust code directly to GPU targets. Three main projects make
+this possible[^1]:
+
+
+1. **[Rust GPU](https://rust-gpu.github.io/):** Compiles Rust code to
+ [SPIR-V](https://www.khronos.org/spir/), the binary format used by
+ [Vulkan](https://www.vulkan.org/) and other modern GPU APIs. This allows GPU code to
+ be written entirely in Rust and used in any Vulkan-compatible workflow.
+
+2. **[Rust CUDA](https://github.com/rust-gpu/rust-cuda):** Compiles Rust code to [NVVM
+ IR](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html), enabling execution on
+ NVIDIA GPUs through the CUDA runtime.
+
+3. **[Naga](https://github.com/gfx-rs/naga):** A GPU language translation layer
+ developed by the [wgpu](https://github.com/gfx-rs/wgpu) team. It provides a common
+ intermediate representation and supports translating between WGSL, SPIR-V, GLSL, MSL,
+ and HLSL. Naga focuses on portability and is used to bridge different graphics
+ backends.
+
+
+
+Previously, these projects were siloed as they were started by different people at
+different times with different goals. As a maintainer of Rust GPU and Rust CUDA and a
+contributor to `naga`, I have been working to bring them closer together.
+
+This [demo](https://github.com/LegNeato/rust-gpu-chimera-demo) is the first time all
+major GPU backends run from a single Rust codebase without a ton of hacks. It's the
+culmination of hard work from many contributors and shows that cross-platform GPU
+compute in Rust is now possible.
+
+There are still many rough edges and a lot of work to do, but this is an exciting
+milestone!
+
+## How it works
+
+The demo implements a simple [bitonic
+sort](https://en.wikipedia.org/wiki/Bitonic_sorter). The compute logic is fully generic
+and shared across all targets, running the same code on both CPU and GPU.
+
+GPU terminology is a bit of a mess and it is easy to get turned around. To avoid
+confusion, this is how I will refer to different parts of the system:
+
+- **Host**: The Rust code running on the CPU that launches GPU work.
+- **Device**: The GPU or CPU where the compute kernel runs.
+- **Driver API**: The low-level interface used to communicate with the device (e.g., [CUDA](https://developer.nvidia.com/cuda-zone), [Vulkan](https://www.khronos.org/vulkan/), [Metal](https://developer.apple.com/metal/), [DirectX 12](https://learn.microsoft.com/en-us/windows/win32/direct3d12/direct3d-12-graphics)).
+- **Backend**: The Rust-side abstraction layer or library used to target a driver API
+ (e.g., [`cust`](https://docs.rs/cust), [`ash`](https://docs.rs/ash),
+ [`wgpu`](https://github.com/gfx-rs/wgpu)).
+
+### Backend selection
+
+Backends and driver APIs are selected using a combination of Rust feature flags and compilation targets.
+
+- `cargo build --features wgpu` uses [`wgpu`](https://github.com/gfx-rs/wgpu), which selects the system default driver API.
+- `cargo build --features wgpu,vulkan` forces `wgpu` to use Vulkan, even on platforms where it isn’t the default (for example, to use [MoltenVK](https://github.com/KhronosGroup/MoltenVK) on macOS).
+- `cargo build --features ash` enables a Vulkan-only backend using the [`ash`](https://docs.rs/ash) crate. This is useful when you want direct control over Vulkan without the overhead or abstraction of `wgpu`, and demonstrates that the project is not tied to a single graphics abstraction.
+- `cargo build --features cuda` enables the NVIDIA-specific CUDA backend. This uses the [`cust`](https://docs.rs/cust) crate internally, but support could be added for [`cudarc`](https://github.com/coreylowman/cudarc) as well.
+
+Though this demo doesn't do so, multiple backends could be compiled into a single binary
+and platform-specific code paths could then be selected at runtime. This would allow a
+single Rust binary to dynamically choose the best GPU technology for a user's device.
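+
+As a rough illustration (this is not code from the demo, and the probe functions are
+made up), runtime selection could look something like this when several backends are
+compiled in behind feature flags:
+
+```rust
+// Hypothetical sketch: pick a backend at runtime from whatever was compiled in.
+// `cuda_device_present` and `wgpu_adapter_present` are illustrative stand-ins
+// for real driver queries (e.g. the CUDA probe might wrap `cust::quick_init()`).
+#[derive(Debug, Clone, Copy)]
+enum Backend {
+    Cuda,
+    Wgpu,
+    Cpu,
+}
+
+fn select_backend() -> Backend {
+    if cfg!(feature = "cuda") && cuda_device_present() {
+        Backend::Cuda
+    } else if cfg!(feature = "wgpu") && wgpu_adapter_present() {
+        Backend::Wgpu
+    } else {
+        Backend::Cpu
+    }
+}
+
+fn cuda_device_present() -> bool {
+    false // placeholder probe for illustration
+}
+
+fn wgpu_adapter_present() -> bool {
+    false // placeholder probe for illustration
+}
+```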
+
+There is a huge matrix of supported hosts, driver APIs, and backends. For the full list,
+see the
+[README](https://github.com/LegNeato/rust-gpu-chimera?tab=readme-ov-file#supported-configurations).
+
+
+### Kernel compilation
+
+Kernels are compiled to the appropriate device format for the given Rust features,
+target OS, and driver API. These compiled kernels are embedded into the binary at build
+time. While this demo does not support runtime loading, the underlying tools and
+ecosystem do.
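+
+The embedding itself is ordinary Rust. A minimal sketch (the output path and constant
+name are illustrative, not the demo's actual values) looks like this:
+
+```rust
+// Illustrative only: embed the SPIR-V emitted by build.rs as static data in the
+// host binary. The file name and location depend on how the build script is set up.
+pub const KERNEL_SPIRV: &[u8] =
+    include_bytes!(concat!(env!("OUT_DIR"), "/kernel.spv"));
+```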
+
+The demo uses a single compute kernel and entry point for simplicity. The underlying
+tools and ecosystem support multiple kernels and multiple entry points if your use case
+requires it.
+
+## A look under the hood
+
+Many pieces work together behind the scenes to enable such a simple developer
+experience. Here is roughly what happens for a given command:
+
+- `cargo run --release`:
+
+ _During the build_, `rustc` compiles both the host code and the compute kernel to
+ native CPU code.
+ No external toolchains or code generation steps are involved.
+ The kernel runs directly on the CPU as standard Rust code.
+
+- `cargo build --features wgpu`:
+
+ _During the build_, `build.rs` uses [`rustc_codegen_spirv`](https://github.com/rust-gpu/rust-gpu) to compile the GPU kernel to SPIR-V.
+ The resulting SPIR-V is embedded into the CPU binary as static data.
+ The host code is compiled normally.
+
+ _At runtime_, the CPU loads the embedded SPIR-V and passes it to
+ [`naga`](https://github.com/gfx-rs/wgpu/tree/trunk/naga), which translates it to the
+ shading language required by the platform.
+ `wgpu` then forwards the translated code to the appropriate driver API to execute the
+ kernel on the GPU:
+
+ - macOS:
+ - SPIR-V is translated to MSL
+ - MSL is passed to Metal
+ - Metal executes the kernel on the GPU
+
+ - Windows:
+ - SPIR-V is translated to HLSL
+ - HLSL is passed to DirectX 12
+ - DX12 executes the kernel on the GPU
+
+ - Linux / Android:
+ - SPIR-V is used directly
+ - SPIR-V is passed to Vulkan
+ - Vulkan executes the kernel on the GPU
+
+- `cargo build --features cuda`:
+
+ _During the build_, `build.rs` uses
+ [`rustc_codegen_nvvm`](https://github.com/Rust-GPU/Rust-CUDA) to compile the GPU
+ kernel to PTX.
+ The resulting PTX is embedded into the CPU binary as static data.
+ The host code is compiled normally.
+
+ _At runtime_, the CPU loads the embedded PTX and passes it to the CUDA driver via the
+ [`cust`](https://docs.rs/cust) crate.
+ CUDA compiles the PTX to SASS (GPU machine code), uploads it to the GPU, and executes the kernel.
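+
+  For a feel of that CUDA runtime path, here is a simplified sketch using
+  [`cust`](https://docs.rs/cust). It is not code from the demo: the PTX file name,
+  kernel name, and launch dimensions are illustrative.
+
+  ```rust
+  // Simplified sketch of the CUDA host path (names and sizes are illustrative).
+  use cust::prelude::*;
+
+  static PTX: &str = include_str!(concat!(env!("OUT_DIR"), "/kernel.ptx"));
+
+  fn run() -> Result<(), cust::error::CudaError> {
+      let _ctx = cust::quick_init()?; // initialize the driver and create a context
+      let module = Module::from_ptx(PTX, &[])?; // the driver JITs PTX to SASS
+      let kernel = module.get_function("bitonic_kernel")?; // hypothetical entry point
+      let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;
+
+      let mut data = vec![5u32, 2, 8, 1, 9, 3, 7, 4];
+      let buf = DeviceBuffer::from_slice(&data)?; // upload input to the GPU
+
+      unsafe {
+          // A single launch; a real driver loops over sort stages and passes.
+          cust::launch!(kernel<<<1, 256, 0, stream>>>(
+              buf.as_device_ptr(),
+              data.len() as u32
+          ))?;
+      }
+      stream.synchronize()?;
+      buf.copy_to(&mut data[..])?; // read the results back to the host
+      Ok(())
+  }
+  ```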
+
+## A Rust-native GPU project
+
+In previous demos, we simply ported GLSL
+[shadertoys](https://rust-gpu.github.io/blog/2025/04/10/shadertoys) and [Vulkan
+examples](https://rust-gpu.github.io/blog/2025/06/24/vulkan-shader-port) to Rust as-is.
+This project instead focuses on demonstrating some of Rust's strengths for GPU
+programming directly.
+
+### `no_std` support
+
+Code that runs on the GPU uses
+[`#![no_std]`](https://doc.rust-lang.org/reference/names/preludes.html#the-no_std-attribute),
+as the standard library isn't available on GPUs:
+
+```rust
+#![cfg_attr(target_arch = "spirv", no_std)]
+#![cfg_attr(target_os = "cuda", no_std)]
+```
+
+Rust's `no_std` ecosystem was designed from day one to support embedded, firmware, and
+kernel development. These environments, like GPUs, lack operating system services. This
+first-class support creates a clear separation between
+[`core`](https://doc.rust-lang.org/core/) (always available) and
+[`std`](https://doc.rust-lang.org/std/) (needs an OS), with compiler-enforced boundaries
+that guarantee that if your code compiles, it will link.
+
+Existing `no_std` + no [`alloc`](https://doc.rust-lang.org/alloc/) crates written for
+other purposes can generally run on the GPU without modification. `no_std` is not an
+afterthought or a special "GPU mode"; it's how Rust was designed to work.
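+
+For example (an illustrative sketch, not code from the demo), a `no_std` math crate like
+[`libm`](https://crates.io/crates/libm) could, in principle, be called from kernel code
+the same way it is from firmware:
+
+```rust
+// Sketch only: reuse a general-purpose no_std crate inside kernel code.
+#![cfg_attr(target_arch = "spirv", no_std)]
+#![cfg_attr(target_os = "cuda", no_std)]
+
+/// Normalize `value` into [0, 1] relative to `max`; runs on CPU and GPU alike.
+pub fn normalized(value: f32, max: f32) -> f32 {
+    libm::fabsf(value) / libm::fmaxf(max, f32::EPSILON)
+}
+```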
+
+### Conditional compilation
+
+The project uses sophisticated [conditional compilation](https://doc.rust-lang.org/reference/conditional-compilation.html) patterns:
+
+```rust
+// Platform-specific type availability
+#[cfg(any(target_arch = "spirv", target_os = "cuda"))]
+use shared::BitonicParams;
+
+// Conditional trait implementations
+#[cfg(feature = "cuda")]
+unsafe impl DeviceCopy for BitonicParams {}
+
+// Platform-specific constants with unified naming
+// CUDA terminology alias only appears when using CUDA.
+pub const WORKGROUP_SIZE: u32 = 256;
+#[cfg(feature = "cuda")]
+pub const BLOCK_SIZE: u32 = WORKGROUP_SIZE;
+```
+
+Rust's conditional compilation is a first-class language feature that keeps platform
+code in one place while maintaining clarity. The compiler understands all code paths and
+provides full IDE support across all configurations. This is in stark contrast to traditional GPU tooling.
+
+### Newtypes
+
+There is extensive use of
+[newtypes](https://doc.rust-lang.org/rust-by-example/generics/new_types.html) to prevent
+logical errors at compile time:
+
+```rust
+#[derive(Copy, Clone, Debug)]
+pub struct ThreadId(u32);
+
+#[derive(Copy, Clone, Debug)]
+pub struct ComparisonDistance(u32);
+
+#[derive(Copy, Clone, Debug, PartialEq, Eq, Pod, Zeroable)]
+#[repr(transparent)] // Same memory layout as u32
+pub struct Stage(u32);
+```
+
+Newtypes turn runtime errors into compile-time errors by making domain concepts part of
+the type system. With
+[`#[repr(transparent)]`](https://doc.rust-lang.org/nomicon/other-reprs.html#reprtransparent),
+these abstractions have zero runtime cost—they compile to the same machine code as raw
+integers. The result is self-documenting APIs that prevent logical errors using the same
+patterns Rust developers use daily on the CPU. Newtypes are especially valuable for GPU
+programming, where debugging is difficult and errors can manifest as silent corruption.
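+
+As a small, hypothetical illustration (not from the demo), two values that are both a
+`u32` under the hood can no longer be swapped by accident:
+
+```rust
+// Hypothetical example: the wrapped values are interchangeable u32s, the newtypes
+// are not, so passing arguments in the wrong order is a compile error.
+fn partner_for(thread: ThreadId, distance: ComparisonDistance) -> ThreadId {
+    ThreadId(thread.0 ^ distance.0)
+}
+
+// partner_for(ComparisonDistance(4), ThreadId(7));
+// ^ error[E0308]: expected `ThreadId`, found `ComparisonDistance`
+```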
+
+### Enums
+
+[Enums](https://doc.rust-lang.org/book/ch06-01-defining-an-enum.html) replace magic numbers with compile-time checked values:
+
+```rust
+#[derive(Copy, Clone, Debug, PartialEq)]
+#[repr(u32)] // Guaranteed memory layout for GPU
+pub enum SortOrder {
+ Ascending = 0,
+ Descending = 1,
+}
+```
+
+Enums provide type-safe configuration that can compile to raw integers. With
+`#[repr(u32)]`, you get predictable layout across platforms, while pattern matching
+ensures exhaustive handling of all cases. The result is self-documenting code that reads
+naturally while eliminating entire classes of bugs.
+
+:::warning
+
+Care must still be taken when passing enums _between_ the CPU host and GPU kernel code. Even with a fixed layout using `#[repr(u32)]`, Rust does not guarantee that every possible `u32` value is a valid instance of the enum.
+
+With the above enum, if the host or kernel writes the value `3` into a buffer shared across the device boundary and the other side interprets it as a `SortOrder`, the result is undefined behavior. Rust assumes an enum only ever holds one of its declared discriminants, and violating this assumption can break pattern matching, control flow, and optimization correctness.
+
+We hope to improve the safety of this boundary in the future by making Rust more "GPU
+aware", but there is language design work to do.
+
+:::
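+
+One way to guard that boundary today (a minimal sketch, not code from the demo) is to
+keep the raw `u32` in the shared struct and decode it explicitly on the other side, so
+an unexpected value becomes a handled case instead of undefined behavior:
+
+```rust
+// Hypothetical helper: turn a raw u32 from a shared buffer into a SortOrder
+// without assuming it is a valid discriminant.
+impl SortOrder {
+    #[inline]
+    pub fn from_raw(raw: u32) -> Option<Self> {
+        match raw {
+            0 => Some(SortOrder::Ascending),
+            1 => Some(SortOrder::Descending),
+            _ => None, // unexpected value: the caller decides how to recover
+        }
+    }
+}
+```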
+
+### Traits
+
+[Traits](https://doc.rust-lang.org/book/ch10-02-traits.html) enable generic algorithms
+that work across types without runtime dispatch:
+
+```rust
+pub trait SortableKey: Copy + Pod + Zeroable + PartialOrd {
+ fn to_sortable_u32(&self) -> u32;
+ fn from_sortable_u32(val: u32) -> Self;
+ fn should_swap(&self, other: &Self, order: SortOrder) -> bool;
+ fn max_value() -> Self;
+ fn min_value() -> Self;
+}
+
+// Implementations handle type-specific details
+impl SortableKey for f32 {
+ fn to_sortable_u32(&self) -> u32 {
+ let bits = self.to_bits();
+ // Bit manipulation to handle IEEE 754 float ordering correctly
+ if bits & (1 << 31) != 0 {
+ !bits // Negative floats: flip all bits
+ } else {
+ bits | (1 << 31) // Positive floats: flip sign bit
+ }
+ }
+}
+```
+
+Traits provide zero-cost abstraction for generic GPU algorithms. You write the algorithm
+once and use it with any type that meets the requirements, with trait bounds documenting
+exactly what the GPU needs from a type. Monomorphization generates specialized code with
+the same performance as hand-written versions, while complex type-specific logic stays
+encapsulated in implementations.
+
+This ecosystem composability is extremely valuable and unmatched. Any Rust crate can
+implement your traits for its types, enabling third-party numeric types to work
+seamlessly with your GPU kernels while maintaining clear contracts between GPU code and
+data types.
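+
+For instance, an implementation for `u32` is nearly trivial (a sketch based on the trait
+shown above; the actual implementation in the repo may differ):
+
+```rust
+// Sketch: u32 sorts by its own bits, so the conversions are identity functions.
+impl SortableKey for u32 {
+    fn to_sortable_u32(&self) -> u32 {
+        *self
+    }
+    fn from_sortable_u32(val: u32) -> Self {
+        val
+    }
+    fn should_swap(&self, other: &Self, order: SortOrder) -> bool {
+        match order {
+            SortOrder::Ascending => self > other,
+            SortOrder::Descending => self < other,
+        }
+    }
+    fn max_value() -> Self {
+        u32::MAX
+    }
+    fn min_value() -> Self {
+        u32::MIN
+    }
+}
+```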
+
+### Inline
+
+The project uses [`#[inline]`](https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute) to ensure abstractions disappear at compile time:
+
+```rust
+impl ComparisonDistance {
+ #[inline]
+ fn from_stage_pass(stage: Stage, pass: Pass) -> Self {
+ Self(1u32 << (stage.as_u32() - pass.as_u32()))
+ }
+
+ #[inline]
+ fn find_partner(&self, thread_id: ThreadId) -> ThreadId {
+ ThreadId::new(thread_id.as_u32() ^ self.0)
+ }
+}
+```
+
+This matters for GPUs because function calls are expensive due to register pressure and
+the lack of a traditional stack. These example methods compile to single instructions,
+and bit manipulation operations like `XOR` and `shift` map directly to GPU hardware
+instructions. The same zero-cost principle that makes Rust suitable for systems
+programming makes it perfect for GPUs.
+
+### Struct composition
+
+Complex programs usually benefit from [semantic
+grouping](https://doc.rust-lang.org/book/ch05-01-defining-structs.html):
+
+```rust
+/// Represents a comparison pair in the bitonic network
+#[derive(Copy, Clone, Debug)]
+pub struct ComparisonPair {
+ lower: ThreadId,
+ upper: ThreadId,
+}
+
+impl ComparisonPair {
+ #[inline]
+ fn try_new(thread_id: ThreadId, partner: ThreadId) -> (bool, Self) {
+ let is_valid = partner.as_u32() > thread_id.as_u32();
+ let pair = Self {
+ lower: thread_id,
+ upper: partner,
+ };
+ (is_valid, pair)
+ }
+
+ #[inline]
+ fn is_in_bounds(&self, num_elements: u32) -> bool {
+ self.upper.as_u32() < num_elements
+ }
+}
+```
+
+Traditional GPU kernels often suffer from parameter explosion, with 10 or more arguments
+passed to a single function. Rust's struct composition provides a cleaner approach. This
+snippet shows:
+
+- *Semantic grouping:* The two thread IDs that form a comparison pair live together as a
+ logical unit.
+- *Encapsulation:* Private fields ensure lower and upper maintain their relationship.
+- *Smart constructors:* `try_new()` returns both validity and the pair, preventing
+ invalid states.
+
+Standard Rust practices transform what would be error-prone index manipulations into
+type-safe, self-documenting GPU code.
+
+### Memory layout control
+
+Rust provides fine-grained [representation control](https://doc.rust-lang.org/reference/type-layout.html), enabling explicit and verifiable memory layouts that are essential for GPU interoperability:
+
+```rust
+#[repr(C)] // C-compatible layout, field order preserved
+#[derive(Copy, Clone, Debug, Pod, Zeroable)]
+pub struct BitonicParams {
+ pub num_elements: u32,
+ pub stage: Stage, // Newtypes with #[repr(transparent)]
+ pub pass_of_stage: Pass, // Compiles to u32 in memory
+ pub sort_order: u32,
+}
+```
+
+The
+[`#[repr(C)]`](https://doc.rust-lang.org/reference/type-layout.html#the-c-representation)
+attribute provides the layout guarantees required for GPU data transfer.
+
+:::warning
+
+Care must still be taken to ensure data is padded correctly. We hope in the future we
+can better integrate the padding requirements of each GPU target into Rust, but for now
+you must handle padding manually.
+
+:::
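+
+One cheap guard (shown as a sketch; the demo may handle this differently) is a
+compile-time size check, so the build fails if the Rust-side layout ever drifts from
+what the GPU-side code expects:
+
+```rust
+// Hypothetical guard: the expected size (16 bytes here, four u32 fields) must match
+// the layout that the kernel and any host-side bindings assume for BitonicParams.
+const _: () = assert!(core::mem::size_of::<BitonicParams>() == 16);
+```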
+
+### Pattern matching
+
+[Pattern matching](https://doc.rust-lang.org/book/ch06-02-match.html) provides exhaustive case handling and clear intent:
+
+```rust
+/// Compare two values for sorting
+fn should_swap(&self, other: &Self, order: SortOrder) -> bool {
+ match order {
+ SortOrder::Ascending => self > other,
+ SortOrder::Descending => self < other,
+ }
+}
+```
+
+Pattern matching is particularly valuable for GPU programming. It makes all code paths
+explicit, helping the compiler generate efficient code. When dealing with
+platform-specific values or error conditions, pattern matching provides clear, type-safe
+handling that prevents invalid states from reaching GPU kernels.
+
+### Generics
+
+[Generic functions](https://doc.rust-lang.org/book/ch10-01-syntax.html) are leveraged to
+reuse the same logic for multiple types:
+
+```rust
+#[inline]
+fn compare_and_swap<T>(data: &mut [T], pair: ComparisonPair, direction: BitonicDirection)
+where
+ T: Copy + PartialOrd, // Constraints are explicit
+{
+ let val_i = data[pair.lower.as_usize()];
+ let val_j = data[pair.upper.as_usize()];
+
+ if direction.should_swap(val_i, val_j) {
+ data.swap(pair.lower.as_usize(), pair.upper.as_usize());
+ }
+}
+```
+
+Rust enables generic programming with clear constraints and (mostly) excellent error
+messages. [Trait
+bounds](https://doc.rust-lang.org/book/ch10-02-traits.html#trait-bound-syntax)
+explicitly state what operations are needed, and error messages point to your code
+rather than expanded template instantiations.
+[Monomorphization](https://doc.rust-lang.org/book/ch10-01-syntax.html#performance-of-code-using-generics)
+generates optimal code for each type, and the same generic algorithms work seamlessly on
+both CPU and GPU. This makes it practical to write complex generic GPU algorithms that
+are both reusable and maintainable.
+
+### Derive macros
+
+[Derive macros](https://doc.rust-lang.org/reference/attributes/derive.html) are used to automatically implement traits:
+
+```rust
+#[derive(Copy, Clone, Debug, PartialEq, Eq, Pod, Zeroable)]
+#[repr(transparent)]
+pub struct Stage(pub u32);
+```
+
+Each derive provides specific guarantees:
+
+- [`Copy`](https://doc.rust-lang.org/std/marker/trait.Copy.html) enables bitwise copying (required for GPU types).
+- [`Clone`](https://doc.rust-lang.org/std/clone/trait.Clone.html) provides explicit cloning support.
+- [`Debug`](https://doc.rust-lang.org/std/fmt/trait.Debug.html) makes types printable for CPU-side debugging.
+- [`PartialEq`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) and [`Eq`](https://doc.rust-lang.org/std/cmp/trait.Eq.html) enable comparison operators.
+- `Pod` ("Plain Old Data") ensures the type is safe to transmit to GPU memory.
+- `Zeroable` confirms it's safe to zero-initialize.
+
+One line of derive macros generates all the boilerplate that GPU programming traditionally requires you to write manually, with the compiler verifying these traits are actually valid for your type (unless you use `unsafe`).
+
+### Module system
+
+Rust's [module
+system](https://doc.rust-lang.org/book/ch07-00-managing-growing-projects-with-packages-crates-and-modules.html)
+keeps the code organized. GPU projects quickly become complex with host-side setup code,
+multiple kernels, shared type definitions, and platform-specific implementations. This
+complexity is organized using the same patterns developers use for any large CPU-based
+Rust project.
+
+### Workspaces and workspace dependencies
+
+The project uses [Cargo
+workspaces](https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html) to organize
+code. For large GPU projects with multiple kernels and utilities, workspaces make it
+easy to share common code while maintaining clear boundaries between components.
+
+
+[Workspace
+dependencies](https://doc.rust-lang.org/cargo/reference/workspaces.html#the-dependencies-table)
+ensure consistency in dependencies across crates:
+
+```toml
+[dependencies]
+
+# Platform-specific dependency inherited from workspace
+[target.'cfg(target_arch = "spirv")'.dependencies]
+spirv-std.workspace = true
+
+# Inherit workspace lints
+[lints]
+workspace = true
+```
+
+### Formatting
+
+Rust GPU code is formatted with `rustfmt`, following the same standards as all Rust
+code. This not only ensures my GPU code looks identical to my CPU code but also makes my
+GPU code consistent with the _entire Rust ecosystem_. Leveraging standard tools like
+`rustfmt` minimizes cognitive overhead and avoids the hassle of configuring third-party
+formatters of varying quality.
+
+### Linting
+
+Linting GPU code in Rust works the same way as for CPU code. Running `cargo clippy`
+highlights issues and enforces consistent code quality. As shown above, custom lint
+configurations apply to Rust GPU kernels as well and interact with workspaces as you
+would expect.
+
+### Documentation
+
+Writing doc comments and running `cargo doc` generates documentation for GPU kernels,
+exactly as it does in regular Rust. The code in doc comments is only run on the CPU,
+though.
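+
+For example (hypothetical code, not from the demo), a documented helper in the shared
+kernel crate gets rendered docs and a doctest that compiles and runs on the host:
+
+```rust
+/// Returns the largest power of two less than or equal to `n` (and 1 for `n == 0`).
+///
+/// ```
+/// # use shared::largest_pow2_le; // hypothetical helper; doctests run on the CPU
+/// assert_eq!(largest_pow2_le(10), 8);
+/// ```
+#[inline]
+pub fn largest_pow2_le(n: u32) -> u32 {
+    1 << (31 - n.max(1).leading_zeros())
+}
+```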
+
+While some ecosystems offer similar tools, Rust's integration is built-in and works
+seamlessly for both CPU and GPU code. There's no special setup required.
+
+### Build scripts
+
+The project uses a Cargo [build
+script](https://doc.rust-lang.org/cargo/reference/build-scripts.html) to compile and
+embed GPU kernels as part of the normal `cargo build` process:
+
+```rust
+// build.rs
+fn main() {
+ #[cfg(any(feature = "vulkan", feature = "wgpu"))]
+ build_spirv_kernel();
+
+ #[cfg(feature = "cuda")]
+ build_cuda_kernel();
+}
+```
+
+The build script provides seamless integration and good DX by invoking specialized
+compilers (`rustc_codegen_nvvm`, `rustc_codegen_spirv`) to generate GPU binaries during
+`cargo build`. This leverages Rust's dependency resolution and caching while keeping
+everything in the normal Rust ecosystem. The same kernel source code is compiled to
+different targets based on feature flags with no manual build steps or shell scripts
+required.
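+
+As a rough idea of what such a helper might contain (a sketch using
+[`spirv_builder`](https://rust-gpu.github.io/rust-gpu/api/spirv_builder/); the kernel
+path and target string are illustrative, not the demo's actual values):
+
+```rust
+// Hypothetical build.rs helper: compile the kernel crate to SPIR-V with Rust GPU.
+use spirv_builder::{MetadataPrintout, SpirvBuilder};
+
+fn build_spirv_kernel() {
+    SpirvBuilder::new("../kernel", "spirv-unknown-vulkan1.1")
+        .print_metadata(MetadataPrintout::Full)
+        .build()
+        .expect("failed to compile the SPIR-V kernel");
+}
+```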
+
+### Unit tests
+
+One of the most powerful aspects of using Rust is that GPU kernel code can be tested
+on the CPU using standard Rust testing tools:
+
+```rust
+#[cfg(test)]
+mod tests {
+ #[test]
+ fn test_bitonic_sort_correctness() {
+ let mut data = vec![5, 2, 8, 1, 9];
+ // Test the algorithm logic without GPU complexity
+ sort_on_cpu(&mut data);
+ assert_eq!(data, vec![1, 2, 5, 8, 9]);
+ }
+}
+```
+
+:::warning
+
+Care must still be taken to run the kernel code on the CPU in a way that matches its GPU
+behavior.
+The same invariants—such as workgroup size, memory access patterns, and
+synchronization—must be upheld to ensure correctness across backends.
+
+We hope in the future to have built-in simulators or libraries that handle this for you,
+but they do not exist yet.
+
+:::
+
+CPU testing capability transforms GPU development. You can use standard debugging tools
+like [`gdb`](https://www.gnu.org/software/gdb/) or [`lldb`](https://lldb.llvm.org/) to
+step through your kernel logic. Print debugging with
+[`println!`](https://doc.rust-lang.org/std/macro.println.html) works during development.
+Property-based testing with crates like
+[`proptest`](https://github.com/proptest-rs/proptest) can verify algorithm correctness.
+Code coverage tools like [`cargo-llvm-cov`](https://github.com/taiki-e/cargo-llvm-cov)
+show which paths your tests exercise.
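+
+A property test, for example, might look like this (a sketch that assumes a CPU-side
+`sort_on_cpu` helper like the one in the unit test above):
+
+```rust
+use proptest::prelude::*;
+
+proptest! {
+    // Sorting any input of up to 1024 elements must produce non-decreasing output.
+    #[test]
+    fn sorts_arbitrary_input(mut data in proptest::collection::vec(any::<u32>(), 0..1024)) {
+        sort_on_cpu(&mut data);
+        prop_assert!(data.windows(2).all(|w| w[0] <= w[1]));
+    }
+}
+```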
+
+The development cycle becomes: write the algorithm, test it thoroughly on CPU with
+familiar tools, then run it on GPU knowing the logic is likely correct. This eliminates
+much of the painful "compile, upload, run, crash, guess why" cycle that plagues GPU
+development. When something goes wrong on the GPU, you can usually reproduce it on the
+CPU and debug it properly. Open source projects like
+[renderling](https://github.com/schell/renderling) use this testing capability to great
+effect.
+
+Furthermore, because the kernel code is standard Rust, no GPU hardware is needed in CI
+to test the logic. This is important for open source projects as the [GitHub Actions
+runners](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners)
+do not have GPUs. One can even test the Vulkan kernel code using a software driver like
+[SwiftShader](https://github.com/google/swiftshader) or
+[lavapipe](https://docs.mesa3d.org/drivers/lavapipe.html) and get some signal that the
+bulk of the kernel logic shared with the CUDA build is correct (modulo any
+platform-gated code). This has the potential to save on expensive NVIDIA GPU time.
+
+## Rough edges
+
+This demo shows that Rust can target all major GPU platforms, but the developer
+experience is still pretty rough and bolted together.
+
+First, the compiler backends are not integrated into `rustc`. To build GPU code,
+developers must use `rustc_codegen_spirv` or `rustc_codegen_nvvm` directly. Helper
+crates like [`spirv_builder`](https://rust-gpu.github.io/rust-gpu/api/spirv_builder/)
+and
+[`cuda_builder`](https://github.com/Rust-GPU/Rust-CUDA/tree/main/crates/cuda_builder)
+hide some complexity, but it's still more involved than using standard Rust. Furthermore,
+a very specific version of nightly Rust is necessary for everything to work. Ideally
+these backends would be built into the main compiler and work with things like standard
+`rustup` tooling. There is no timeline for this, but it is a goal.
+
+Rust CUDA also depends on NVIDIA's toolchain, which is effectively tied to LLVM 7.1.
+Most modern Linux distributions no longer provide packages for this version, so it must
+be built manually. This is a significant burden. The project maintains [Docker
+images](https://github.com/orgs/Rust-GPU/packages?repo_name=Rust-CUDA) with the full
+toolchain, but ideally those would only be needed by compiler developers, not end users
+as they are today.
+
+Debugging the compilation process is also difficult. Tracing Rust code through the
+various layers and toolchains is challenging. Some parts support debug info, others do
+not. This often leads to opaque errors with no clear indication of which code caused the
+failure. Improving the debugging experience is a high priority and clearly needed.
+
+Finally, Rust GPU and Rust CUDA evolved independently and their APIs have diverged. For
+example, accessing thread indices is done through a function call in Rust CUDA
+(`thread::thread_idx_x()`), while in Rust GPU it requires annotating entry point
+arguments. Even the standard library names differ (`cuda_std` vs `spirv_std`), so
+`cfg()` directives are required in GPU-aware code even when the APIs are otherwise the
+same. These inconsistencies make the user experience harder than it should be, and
+unifying the APIs where possible is an obvious next step. Unifying the codebases might
+make sense too.
+
+
+## Come join us!
+
+We can finally write GPU code in Rust and run it on all major platforms across all major
+GPUs. The next step is to improve the experience. We need to add support for more Rust
+language constructs and APIs. Everything needs to be made more ergonomic, more
+consistent, and fully integrated into the Rust ecosystem. Plus, we haven't really
+started optimizing performance yet. Come help; we're eager to add more users and
+contributors!
+
+To follow along or get involved, check out the [`rust-gpu` repo on
+GitHub](https://github.com/rust-gpu/rust-gpu) or the [`rust-cuda` repo on
+GitHub](https://github.com/rust-gpu/rust-cuda).
+
+
+[^1]: There are other projects and approaches, check out the [GPU ecosystem in Rust
+ overview page](https://rust-gpu.github.io/ecosystem/).