|
| 1 | +# Ray Tracing on the GPU |
| 2 | + |
| 3 | +In this tutorial, we'll take the ray tracer from the previous tutorial and port it to the GPU using **KernelAbstractions.jl** and a GPU backend of choice (CUDA.jl, AMDGPU.jl, OpenCL.jl, OneApi.jl, or Metal.jl). We'll explore three different kernel implementations, each with different optimization strategies, and benchmark their performance against each other. |
| 4 | + |
| 5 | +By the end, you'll understand how to write efficient GPU kernels for ray tracing and the tradeoffs between different approaches! |
| 6 | + |
| 7 | +## Setup |
| 8 | + |
| 9 | +```julia (editor=true, logging=false, output=true) |
| 10 | +using Raycore, GeometryBasics, LinearAlgebra |
| 11 | +using Colors, ImageShow |
| 12 | +using WGLMakie |
| 13 | +using KernelAbstractions |
| 14 | +using BenchmarkTools |
| 15 | +``` |
| 16 | +To run things on the GPU with KernelAbstractions, you need to choose the correct package for your GPU and set the array type we use from there on. |
| 17 | + |
| 18 | +```julia (editor=true, logging=false, output=true) |
| 19 | +#using CUDA; GArray = CuArray; # For NVIDIA GPUS |
| 20 | +#using AMDGPU; GArray = ROCArray; # for AMD GPUs |
| 21 | +#using Metal; GArray = MtlArray; # for Apple hardware |
| 22 | +#using oneAPI; GArray = oneArray; # for intel |
| 23 | +# OpenCL with the pocl backend should work for most CPUs and some GPUs, but might not be as fast. |
| 24 | +# using pocl_jll, OpenCL; GArray = CLArray; |
| 25 | +GArray = Array # For the tutorial to run on CI we just use the CPU |
| 26 | +``` |
| 27 | +**Ready for GPU!** We have: |
| 28 | + |
| 29 | + * `Raycore` for fast ray-triangle intersections |
| 30 | + * `KernelAbstractions` for portable GPU kernels (works with CUDA, AMD, Metal, oneAPI, and OpenCL) |
| 31 | + * `BenchmarkTools` for performance comparison |
| 32 | + |
| 33 | +## Part 1: Scene Setup (Same as CPU Tutorial) |
| 34 | + |
| 35 | +Let's use the exact same scene as the CPU tutorial - the Makie cat with room geometry: |
| 36 | + |
| 37 | +```julia (editor=true, logging=false, output=true) |
| 38 | +# Load and prepare the cat model |
| 39 | +include("raytracing-core.jl") |
| 40 | +bvh, ctx = example_scene() |
| 41 | +# We have a Makie extension for plotting the scene graph |
| 42 | +f, ax, pl = plot(bvh; axis=(; show_axis=false)) |
| 43 | +f |
| 44 | +``` |
| 45 | +```julia (editor=true, logging=false, output=true) |
| 46 | +cam = cameracontrols(ax.scene) |
| 47 | +cam.eyeposition[] = [0, 1.0, -4] |
| 48 | +cam.lookat[] = [0, 0, 2] |
| 49 | +cam.upvector[] = [0.0, 1, 0.0] |
| 50 | +cam.fov[] = 45.0 |
| 51 | +``` |
| 52 | +## Part 2: GPU Kernel Version 1 - Basic Naive Approach |
| 53 | + |
| 54 | +The simplest GPU kernel - one thread per pixel: |
| 55 | + |
| 56 | +```julia (editor=true, logging=false, output=true) |
| 57 | +import KernelAbstractions as KA |
| 58 | + |
| 59 | +# Basic kernel: one thread per pixel, straightforward implementation |
| 60 | +@kernel function raytrace_kernel_v1!( |
| 61 | + img, @Const(bvh), @Const(ctx), |
| 62 | + camera_pos, focal_length, aspect, sky_color, ::Val{NSamples} |
| 63 | +) where {NSamples} |
| 64 | + # Get pixel coordinates |
| 65 | + idx = @index(Global, Linear) |
| 66 | + height, width = size(img) |
| 67 | + # Convert linear index to 2D coordinates |
| 68 | + x = ((idx - 1) % width) + 1 |
| 69 | + y = ((idx - 1) ÷ width) + 1 |
| 70 | + if x <= width && y <= height |
| 71 | + # Generate camera ray and calculate a simple light model |
| 72 | + color = Vec3f(0) |
| 73 | + for i in 1:NSamples |
| 74 | + color = color .+ sample_light(bvh, ctx, width, height, camera_pos, focal_length, aspect, x, y, sky_color) |
| 75 | + end |
| 76 | + @inbounds img[y, x] = to_rgb(color ./ NSamples) |
| 77 | + end |
| 78 | +end |
| 79 | +``` |
| 80 | +The `trace_gpu` function is a universal launcher that works with any of our kernels. It handles the backend-specific setup automatically using **KernelAbstractions.jl**: |
| 81 | + |
| 82 | +```julia (editor=true, logging=false, output=true) |
| 83 | +function trace_gpu(kernel, img, bvh, ctx; |
| 84 | + camera_pos=Point3f(0, -0.9, -2.5), fov=45.0f0, |
| 85 | + sky_color=RGB{Float32}(0.5f0,0.7f0,1.0f0), |
| 86 | + samples_per_pixel=4, |
| 87 | + ndrange=length(img), tilesize=nothing |
| 88 | + ) |
| 89 | + height, width = size(img) |
| 90 | + aspect = Float32(width / height) |
| 91 | + focal_length = 1.0f0 / tan(deg2rad(fov / 2)) |
| 92 | + |
| 93 | + # KernelAbstractions automatically detects the backend (CPU/GPU) from the array type |
| 94 | + backend = KA.get_backend(img) |
| 95 | + |
| 96 | + # Create the kernel with or without tilesize (for workgroup configuration) |
| 97 | + kernel! = isnothing(tilesize) ? kernel(backend) : kernel(backend, tilesize) |
| 98 | + |
| 99 | + kernel!(img, bvh, ctx, camera_pos, focal_length, aspect, sky_color, Val(samples_per_pixel), ndrange=ndrange) |
| 100 | + |
| 101 | + # Ensure GPU computation completes before returning |
| 102 | + KA.synchronize(backend) |
| 103 | + return img |
| 104 | +end |
| 105 | +``` |
| 106 | +**Key KernelAbstractions.jl concepts:** |
| 107 | + |
| 108 | + * **Backend detection**: `get_backend(array)` automatically determines if we're using CPU, AMD GPU, NVIDIA GPU, etc. |
| 109 | + * **Kernel compilation**: `kernel(backend)` compiles the kernel for the specific backend |
| 110 | + * **Workgroup configuration**: Optional `tilesize` parameter controls thread organization |
| 111 | + * **Thread indexing**: Inside kernels, use `@index(Global, Linear)` or `@index(Global, Cartesian)` to get thread IDs |
| 112 | + * **Synchronization**: `synchronize(backend)` ensures all GPU work completes before continuing |
| 113 | + |
| 114 | +Let's test kernel v1 on the CPU (yes, they always work with normal Arrays): |
| 115 | + |
| 116 | +```julia (editor=true, logging=false, output=true) |
| 117 | +img = fill(RGBf(0, 0, 0), 400, 720) |
| 118 | +bench_kernel_cpu_v1 = @benchmark trace_gpu(raytrace_kernel_v1!, img, bvh, ctx) |
| 119 | +img |
| 120 | +``` |
| 121 | +To run things on the GPU, we simply convert the arrays to the GPU backend array type. `to_gpu` is a helper in Raycore to convert nested structs correctly for the kernel. It's not doing anything special, besides that struct of arrays need to be converted to device arrays and for pure arrays `GPUArray(array)` is enough. |
| 122 | + |
| 123 | +```julia (editor=true, logging=false, output=true) |
| 124 | +using Raycore: to_gpu |
| 125 | +img = fill(RGBf(0, 0, 0), 400, 720) |
| 126 | +img_gpu = GArray(img); |
| 127 | +bvh_gpu = to_gpu(GArray, bvh); |
| 128 | +ctx_gpu = to_gpu(GArray, ctx); |
| 129 | +bench_kernel_v1 = @benchmark trace_gpu(raytrace_kernel_v1!, img_gpu, bvh_gpu, ctx_gpu) |
| 130 | +# Bring back to CPU to display image |
| 131 | +Array(img_gpu) |
| 132 | +``` |
| 133 | +**First GPU render!** This is the simplest approach - one thread per pixel with no optimization. |
| 134 | + |
| 135 | +## Part 3: Optimized Kernel - Loop Unrolling |
| 136 | + |
| 137 | +Loop overhead is significant on GPUs! Manually unrolling the sampling loop eliminates this overhead: |
| 138 | + |
| 139 | +```julia (editor=true, logging=false, output=true) |
| 140 | +# Optimized kernel: Unrolled sampling loop |
| 141 | +@kernel function raytrace_kernel_unrolled!( |
| 142 | + img, @Const(bvh), @Const(ctx), |
| 143 | + camera_pos, focal_length, aspect, sky_color, ::Val{NSamples} |
| 144 | + ) where {NSamples} |
| 145 | + idx = @index(Global, Linear) |
| 146 | + height, width = size(img) |
| 147 | + x = ((idx - 1) % width) + 1 |
| 148 | + y = ((idx - 1) ÷ width) + 1 |
| 149 | + if x <= width && y <= height |
| 150 | + # ntuple with compile-time constant for unrolling |
| 151 | + samples = ntuple(NSamples) do i |
| 152 | + sample_light(bvh, ctx, width, height, |
| 153 | + camera_pos, focal_length, aspect, |
| 154 | + x, y, sky_color |
| 155 | + ) |
| 156 | + end |
| 157 | + color = mean(samples) |
| 158 | + @inbounds img[y, x] = to_rgb(color) |
| 159 | + end |
| 160 | +end |
| 161 | + |
| 162 | +bench_kernel_unrolled = @benchmark trace_gpu(raytrace_kernel_unrolled!, img_gpu, bvh_gpu, ctx_gpu) |
| 163 | +Array(img_gpu) |
| 164 | +``` |
| 165 | + * This eliminates branch overhead from loop conditions |
| 166 | + * Reduces register pressure |
| 167 | + * Better instruction-level parallelism |
| 168 | + * **1.39x faster than baseline!** |
| 169 | + |
| 170 | +## Part 4: Tiled Kernel with Optimized Tile Size |
| 171 | + |
| 172 | +The tile size dramatically affects performance. Let's use the optimal size discovered through benchmarking: |
| 173 | + |
| 174 | +```julia (editor=true, logging=false, output=true) |
| 175 | +# Tiled kernel with optimized tile size |
| 176 | +@kernel function raytrace_kernel_tiled!( |
| 177 | + img, bvh, ctx, |
| 178 | + camera_pos, focal_length, aspect, sky_color, ::Val{NSamples} |
| 179 | +) where {NSamples} |
| 180 | + # Get tile and local coordinates |
| 181 | + _tile_xy = @index(Group, Cartesian) |
| 182 | + _local_xy = @index(Local, Cartesian) |
| 183 | + _groupsize = @groupsize() |
| 184 | + |
| 185 | + # Direct tuple unpacking is faster than Vec construction |
| 186 | + tile_x, tile_y = Tuple(_tile_xy) |
| 187 | + local_x, local_y = Tuple(_local_xy) |
| 188 | + group_w, group_h = _groupsize |
| 189 | + |
| 190 | + # Compute global pixel coordinates |
| 191 | + x = (tile_x - 1) * group_w + local_x |
| 192 | + y = (tile_y - 1) * group_h + local_y |
| 193 | + |
| 194 | + height, width = size(img) |
| 195 | + if x <= width && y <= height |
| 196 | + samples = ntuple(NSamples) do i |
| 197 | + sample_light(bvh, ctx, width, height, |
| 198 | + camera_pos, focal_length, aspect, |
| 199 | + x, y, sky_color |
| 200 | + ) |
| 201 | + end |
| 202 | + color = mean(samples) |
| 203 | + @inbounds img[y, x] = to_rgb(color) |
| 204 | + end |
| 205 | +end |
| 206 | +bench_kernel_tiled_32_16 = @benchmark trace_gpu( |
| 207 | + $raytrace_kernel_tiled!, $img_gpu, $bvh_gpu, $ctx_gpu; |
| 208 | + ndrange=size($img_gpu), tilesize=(32,16)) |
| 209 | + |
| 210 | +# Benchmark two more important tile sizes for comparison |
| 211 | +bench_kernel_tiled_32_32 = @benchmark trace_gpu( |
| 212 | + $raytrace_kernel_tiled!, $img_gpu, $bvh_gpu, $ctx_gpu; |
| 213 | + ndrange=size($img_gpu), tilesize=(32,32)) |
| 214 | + |
| 215 | +bench_kernel_tiled_8_8 = @benchmark trace_gpu( |
| 216 | + $raytrace_kernel_tiled!, $img_gpu, $bvh_gpu, $ctx_gpu; |
| 217 | + ndrange=size($img_gpu), tilesize=(8,8)) |
| 218 | + |
| 219 | +# Use optimal tile size: (32, 16) - discovered through benchmarking! |
| 220 | +Array(img_gpu) |
| 221 | +``` |
| 222 | +**Tile size matters!** With `(32, 16)` tiles, this kernel is **1.22x faster** than baseline. With poor tile sizes like `(8, 8)`, it can be **2.5x slower**! |
| 223 | + |
| 224 | +## Part 5: Wavefront Path Tracing |
| 225 | + |
| 226 | +The wavefront approach reorganizes ray tracing to minimize thread divergence by grouping similar work together. Instead of each thread handling an entire pixel's path, we separate the work into stages. Discussing the exact implementation is outside the scope of this tutorial, so we only include the finished renderer here: |
| 227 | + |
| 228 | +```julia (editor=true, logging=false, output=true) |
| 229 | +include("wavefront-renderer.jl") |
| 230 | +``` |
| 231 | +Let's benchmark the wavefront renderer on both CPU and GPU: |
| 232 | + |
| 233 | +```julia (editor=true, logging=false, output=true) |
| 234 | +# CPU benchmark |
| 235 | +renderer_cpu = WavefrontRenderer(img, bvh, ctx) |
| 236 | +bench_wavefront_cpu = @benchmark render!($renderer_cpu) |
| 237 | + |
| 238 | +# GPU benchmark |
| 239 | +renderer_gpu = to_gpu(GArray, renderer_cpu) |
| 240 | +bench_wavefront_gpu = @benchmark render!($renderer_gpu) |
| 241 | + |
| 242 | +renderer_gpu = to_gpu(GArray, WavefrontRenderer(img, bvh, ctx; samples_per_pixel=16)) |
| 243 | +render!(renderer_gpu) |
| 244 | +# Display result |
| 245 | +Array(renderer_gpu.framebuffer) |
| 246 | +``` |
| 247 | +**Wavefront benefits:** |
| 248 | + |
| 249 | + * Reduces thread divergence by grouping similar work |
| 250 | + * Better memory access patterns |
| 251 | + * Scales well with scene complexity |
| 252 | + * Enables advanced features like path tracing |
| 253 | + |
| 254 | +## Part 6: Comprehensive Performance Benchmarks |
| 255 | + |
| 256 | +Now let's compare all kernels including the wavefront renderer: |
| 257 | + |
| 258 | +```julia (editor=true, logging=false, output=true) |
| 259 | +benchmarks = [ |
| 260 | + bench_kernel_v1, bench_kernel_cpu_v1, bench_kernel_unrolled, |
| 261 | + bench_kernel_tiled_8_8, bench_kernel_tiled_32_16, bench_kernel_tiled_32_32, |
| 262 | + bench_wavefront_cpu, bench_wavefront_gpu |
| 263 | +] |
| 264 | +labels = [ |
| 265 | + "Baseline\n(gpu)", "Baseline\n(cpu)", "Unrolled", |
| 266 | + "Tiled\n(8×8)", "Tiled\n(32×16)", "Tiled\n(32×32)", |
| 267 | + "Wavefront\n(cpu)", "Wavefront\n(gpu)" |
| 268 | +] |
| 269 | + |
| 270 | +fig, times, speedups = plot_kernel_benchmarks(benchmarks, labels) |
| 271 | +# save(data"gpu-benchmarks.png", fig; backend=Main.GLMakie) |
| 272 | +# We use the Fig from my local run, since on CI this won't show the differences |
| 273 | +DOM.img(src=Asset(data"gpu-benchmarks.png"), width="700px") |
| 274 | +``` |
| 275 | +### Next Steps |
| 276 | + |
| 277 | + * Add **adaptive sampling** (more samples only where needed) |
| 278 | + * Explore **shared memory** optimizations for BVH traversal |
| 279 | + * Implement **streaming multisampling** across frames |
| 280 | + |
0 commit comments