Skip to content

Commit 1d6582c

Browse files
authored
Add GPU tutorial and performance improvements for GPU (#5)
* add gpubvh * gpu tutorial * refactore to use one BVH * only use one BVH type and add wavefront renderer * improve ray tracing tutorial * fixes and include wavefront * remove structarrays * dont overwrite benchmark * check in data * updates * remove versions * stricter soa and clean up * backwards compat and some small fixes
1 parent 37a68df commit 1d6582c

26 files changed

+2134
-399
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,3 +6,4 @@ docs/build
66

77
Manifest*.toml
88
.*-bbook/
9+
!.gpu_raytracing_tutorial-bbook

Project.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ RaycoreMakieExt = "Makie"
2020
[compat]
2121
julia = "1.10"
2222
GeometryBasics = "0.5"
23-
KernelAbstractions = "0.9"
23+
KernelAbstractions = "0.9, 0.10"
2424
LinearAlgebra = "1"
2525
Random = "1"
2626
StaticArrays = "1.9.7"

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,7 @@ using Raycore, GeometryBasics, LinearAlgebra
2929
sphere = Tesselation(Sphere(Point3f(0, 0, 2), 1.0f0), 20)
3030

3131
# Build BVH acceleration structure
32-
bvh = BVHAccel([sphere])
32+
bvh = BVH([sphere])
3333

3434
# Cast rays and find intersections
3535
ray = Ray(o=Point3f(0, 0, 0), d=Vec3f(0, 0, 1))

docs/examples.jl

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ begin
1414
l = 0.5
1515
floor = Rect3f(-l, -l, -0.01, 2l, 2l, 0.01)
1616
cat = load(Makie.assetpath("cat.obj"))
17-
bvh = Raycore.BVHAccel([s1, s2, s3, s4, cat]);
17+
bvh = Raycore.BVH([s1, s2, s3, s4, cat]);
1818
world_mesh = GeometryBasics.Mesh(bvh)
1919
f, ax, pl = Makie.mesh(world_mesh; color=:teal)
2020
center!(ax.scene)

docs/make.jl

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ makedocs(;
1818
"Examples" => [
1919
"BVH Hit Tests" => "bvh_hit_tests.md",
2020
"Ray Tracing Tutorial" => "raytracing_tutorial.md",
21+
"GPU Ray Tracing Tutorial" => "gpu_raytracing.md",
2122
"View Factors and More" => "viewfactors.md",
2223
],
2324
],
64.4 KB
Loading
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
version = "0.1.0"

docs/src/bvh_hit_tests_content.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@ function create_test_scene()
1717
sphere2 = Tesselation(Sphere(Point3f(0, 0, 3), 1.0f0), 20) # Middle
1818
sphere3 = Tesselation(Sphere(Point3f(0, 0, 1), 1.0f0), 20) # Closest
1919

20-
bvh = Raycore.BVHAccel([sphere1, sphere2, sphere3])
20+
bvh = Raycore.BVH([sphere1, sphere2, sphere3])
2121
return bvh
2222
end
2323

@@ -89,7 +89,7 @@ for i in 1:30
8989
push!(complex_spheres, Tesselation(Sphere(Point3f(x, y, z), r), 8))
9090
end
9191

92-
complex_bvh = Raycore.BVHAccel(complex_spheres)
92+
complex_bvh = Raycore.BVH(complex_spheres)
9393
# Test rays to find cases where any_hit differs from closest_hit
9494
test_rays = map(rand(Point2f, 200)) do p
9595
p = (p .* 14f0) .- 8f0

docs/src/gpu_raytracing.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
# GPU Ray Tracing with Raycore
2+
3+
```@setup gpu_raytracing
4+
using Bonito, BonitoBook, Raycore
5+
book_app = App() do
6+
path = normpath(joinpath(dirname(pathof(Raycore)), "..", "docs", "src", "gpu_raytracing_tutorial.md"))
7+
BonitoBook.InlineBook(path)
8+
end
9+
Bonito.Page()
10+
```
11+
12+
```@example gpu_raytracing
13+
book_app # hide
14+
```
Lines changed: 280 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,280 @@
1+
# Ray Tracing on the GPU
2+
3+
In this tutorial, we'll take the ray tracer from the previous tutorial and port it to the GPU using **KernelAbstractions.jl** and a GPU backend of choice (CUDA.jl, AMDGPU.jl, OpenCL.jl, OneApi.jl, or Metal.jl). We'll explore three different kernel implementations, each with different optimization strategies, and benchmark their performance against each other.
4+
5+
By the end, you'll understand how to write efficient GPU kernels for ray tracing and the tradeoffs between different approaches!
6+
7+
## Setup
8+
9+
```julia (editor=true, logging=false, output=true)
10+
using Raycore, GeometryBasics, LinearAlgebra
11+
using Colors, ImageShow
12+
using WGLMakie
13+
using KernelAbstractions
14+
using BenchmarkTools
15+
```
16+
To run things on the GPU with KernelAbstractions, you need to choose the correct package for your GPU and set the array type we use from there on.
17+
18+
```julia (editor=true, logging=false, output=true)
19+
#using CUDA; GArray = CuArray; # For NVIDIA GPUS
20+
#using AMDGPU; GArray = ROCArray; # for AMD GPUs
21+
#using Metal; GArray = MtlArray; # for Apple hardware
22+
#using oneAPI; GArray = oneArray; # for intel
23+
# OpenCL with the pocl backend should work for most CPUs and some GPUs, but might not be as fast.
24+
# using pocl_jll, OpenCL; GArray = CLArray;
25+
GArray = Array # For the tutorial to run on CI we just use the CPU
26+
```
27+
**Ready for GPU!** We have:
28+
29+
* `Raycore` for fast ray-triangle intersections
30+
* `KernelAbstractions` for portable GPU kernels (works with CUDA, AMD, Metal, oneAPI, and OpenCL)
31+
* `BenchmarkTools` for performance comparison
32+
33+
## Part 1: Scene Setup (Same as CPU Tutorial)
34+
35+
Let's use the exact same scene as the CPU tutorial - the Makie cat with room geometry:
36+
37+
```julia (editor=true, logging=false, output=true)
38+
# Load and prepare the cat model
39+
include("raytracing-core.jl")
40+
bvh, ctx = example_scene()
41+
# We have a Makie extension for plotting the scene graph
42+
f, ax, pl = plot(bvh; axis=(; show_axis=false))
43+
f
44+
```
45+
```julia (editor=true, logging=false, output=true)
46+
cam = cameracontrols(ax.scene)
47+
cam.eyeposition[] = [0, 1.0, -4]
48+
cam.lookat[] = [0, 0, 2]
49+
cam.upvector[] = [0.0, 1, 0.0]
50+
cam.fov[] = 45.0
51+
```
52+
## Part 2: GPU Kernel Version 1 - Basic Naive Approach
53+
54+
The simplest GPU kernel - one thread per pixel:
55+
56+
```julia (editor=true, logging=false, output=true)
57+
import KernelAbstractions as KA
58+
59+
# Basic kernel: one thread per pixel, straightforward implementation
60+
@kernel function raytrace_kernel_v1!(
61+
img, @Const(bvh), @Const(ctx),
62+
camera_pos, focal_length, aspect, sky_color, ::Val{NSamples}
63+
) where {NSamples}
64+
# Get pixel coordinates
65+
idx = @index(Global, Linear)
66+
height, width = size(img)
67+
# Convert linear index to 2D coordinates
68+
x = ((idx - 1) % width) + 1
69+
y = ((idx - 1) ÷ width) + 1
70+
if x <= width && y <= height
71+
# Generate camera ray and calculate a simple light model
72+
color = Vec3f(0)
73+
for i in 1:NSamples
74+
color = color .+ sample_light(bvh, ctx, width, height, camera_pos, focal_length, aspect, x, y, sky_color)
75+
end
76+
@inbounds img[y, x] = to_rgb(color ./ NSamples)
77+
end
78+
end
79+
```
80+
The `trace_gpu` function is a universal launcher that works with any of our kernels. It handles the backend-specific setup automatically using **KernelAbstractions.jl**:
81+
82+
```julia (editor=true, logging=false, output=true)
83+
function trace_gpu(kernel, img, bvh, ctx;
84+
camera_pos=Point3f(0, -0.9, -2.5), fov=45.0f0,
85+
sky_color=RGB{Float32}(0.5f0,0.7f0,1.0f0),
86+
samples_per_pixel=4,
87+
ndrange=length(img), tilesize=nothing
88+
)
89+
height, width = size(img)
90+
aspect = Float32(width / height)
91+
focal_length = 1.0f0 / tan(deg2rad(fov / 2))
92+
93+
# KernelAbstractions automatically detects the backend (CPU/GPU) from the array type
94+
backend = KA.get_backend(img)
95+
96+
# Create the kernel with or without tilesize (for workgroup configuration)
97+
kernel! = isnothing(tilesize) ? kernel(backend) : kernel(backend, tilesize)
98+
99+
kernel!(img, bvh, ctx, camera_pos, focal_length, aspect, sky_color, Val(samples_per_pixel), ndrange=ndrange)
100+
101+
# Ensure GPU computation completes before returning
102+
KA.synchronize(backend)
103+
return img
104+
end
105+
```
106+
**Key KernelAbstractions.jl concepts:**
107+
108+
* **Backend detection**: `get_backend(array)` automatically determines if we're using CPU, AMD GPU, NVIDIA GPU, etc.
109+
* **Kernel compilation**: `kernel(backend)` compiles the kernel for the specific backend
110+
* **Workgroup configuration**: Optional `tilesize` parameter controls thread organization
111+
* **Thread indexing**: Inside kernels, use `@index(Global, Linear)` or `@index(Global, Cartesian)` to get thread IDs
112+
* **Synchronization**: `synchronize(backend)` ensures all GPU work completes before continuing
113+
114+
Let's test kernel v1 on the CPU (yes, they always work with normal Arrays):
115+
116+
```julia (editor=true, logging=false, output=true)
117+
img = fill(RGBf(0, 0, 0), 400, 720)
118+
bench_kernel_cpu_v1 = @benchmark trace_gpu(raytrace_kernel_v1!, img, bvh, ctx)
119+
img
120+
```
121+
To run things on the GPU, we simply convert the arrays to the GPU backend array type. `to_gpu` is a helper in Raycore to convert nested structs correctly for the kernel. It's not doing anything special, besides that struct of arrays need to be converted to device arrays and for pure arrays `GPUArray(array)` is enough.
122+
123+
```julia (editor=true, logging=false, output=true)
124+
using Raycore: to_gpu
125+
img = fill(RGBf(0, 0, 0), 400, 720)
126+
img_gpu = GArray(img);
127+
bvh_gpu = to_gpu(GArray, bvh);
128+
ctx_gpu = to_gpu(GArray, ctx);
129+
bench_kernel_v1 = @benchmark trace_gpu(raytrace_kernel_v1!, img_gpu, bvh_gpu, ctx_gpu)
130+
# Bring back to CPU to display image
131+
Array(img_gpu)
132+
```
133+
**First GPU render!** This is the simplest approach - one thread per pixel with no optimization.
134+
135+
## Part 3: Optimized Kernel - Loop Unrolling
136+
137+
Loop overhead is significant on GPUs! Manually unrolling the sampling loop eliminates this overhead:
138+
139+
```julia (editor=true, logging=false, output=true)
140+
# Optimized kernel: Unrolled sampling loop
141+
@kernel function raytrace_kernel_unrolled!(
142+
img, @Const(bvh), @Const(ctx),
143+
camera_pos, focal_length, aspect, sky_color, ::Val{NSamples}
144+
) where {NSamples}
145+
idx = @index(Global, Linear)
146+
height, width = size(img)
147+
x = ((idx - 1) % width) + 1
148+
y = ((idx - 1) ÷ width) + 1
149+
if x <= width && y <= height
150+
# ntuple with compile-time constant for unrolling
151+
samples = ntuple(NSamples) do i
152+
sample_light(bvh, ctx, width, height,
153+
camera_pos, focal_length, aspect,
154+
x, y, sky_color
155+
)
156+
end
157+
color = mean(samples)
158+
@inbounds img[y, x] = to_rgb(color)
159+
end
160+
end
161+
162+
bench_kernel_unrolled = @benchmark trace_gpu(raytrace_kernel_unrolled!, img_gpu, bvh_gpu, ctx_gpu)
163+
Array(img_gpu)
164+
```
165+
* This eliminates branch overhead from loop conditions
166+
* Reduces register pressure
167+
* Better instruction-level parallelism
168+
* **1.39x faster than baseline!**
169+
170+
## Part 4: Tiled Kernel with Optimized Tile Size
171+
172+
The tile size dramatically affects performance. Let's use the optimal size discovered through benchmarking:
173+
174+
```julia (editor=true, logging=false, output=true)
175+
# Tiled kernel with optimized tile size
176+
@kernel function raytrace_kernel_tiled!(
177+
img, bvh, ctx,
178+
camera_pos, focal_length, aspect, sky_color, ::Val{NSamples}
179+
) where {NSamples}
180+
# Get tile and local coordinates
181+
_tile_xy = @index(Group, Cartesian)
182+
_local_xy = @index(Local, Cartesian)
183+
_groupsize = @groupsize()
184+
185+
# Direct tuple unpacking is faster than Vec construction
186+
tile_x, tile_y = Tuple(_tile_xy)
187+
local_x, local_y = Tuple(_local_xy)
188+
group_w, group_h = _groupsize
189+
190+
# Compute global pixel coordinates
191+
x = (tile_x - 1) * group_w + local_x
192+
y = (tile_y - 1) * group_h + local_y
193+
194+
height, width = size(img)
195+
if x <= width && y <= height
196+
samples = ntuple(NSamples) do i
197+
sample_light(bvh, ctx, width, height,
198+
camera_pos, focal_length, aspect,
199+
x, y, sky_color
200+
)
201+
end
202+
color = mean(samples)
203+
@inbounds img[y, x] = to_rgb(color)
204+
end
205+
end
206+
bench_kernel_tiled_32_16 = @benchmark trace_gpu(
207+
$raytrace_kernel_tiled!, $img_gpu, $bvh_gpu, $ctx_gpu;
208+
ndrange=size($img_gpu), tilesize=(32,16))
209+
210+
# Benchmark two more important tile sizes for comparison
211+
bench_kernel_tiled_32_32 = @benchmark trace_gpu(
212+
$raytrace_kernel_tiled!, $img_gpu, $bvh_gpu, $ctx_gpu;
213+
ndrange=size($img_gpu), tilesize=(32,32))
214+
215+
bench_kernel_tiled_8_8 = @benchmark trace_gpu(
216+
$raytrace_kernel_tiled!, $img_gpu, $bvh_gpu, $ctx_gpu;
217+
ndrange=size($img_gpu), tilesize=(8,8))
218+
219+
# Use optimal tile size: (32, 16) - discovered through benchmarking!
220+
Array(img_gpu)
221+
```
222+
**Tile size matters!** With `(32, 16)` tiles, this kernel is **1.22x faster** than baseline. With poor tile sizes like `(8, 8)`, it can be **2.5x slower**!
223+
224+
## Part 5: Wavefront Path Tracing
225+
226+
The wavefront approach reorganizes ray tracing to minimize thread divergence by grouping similar work together. Instead of each thread handling an entire pixel's path, we separate the work into stages. Discussing the exact implementation is outside the scope of this tutorial, so we only include the finished renderer here:
227+
228+
```julia (editor=true, logging=false, output=true)
229+
include("wavefront-renderer.jl")
230+
```
231+
Let's benchmark the wavefront renderer on both CPU and GPU:
232+
233+
```julia (editor=true, logging=false, output=true)
234+
# CPU benchmark
235+
renderer_cpu = WavefrontRenderer(img, bvh, ctx)
236+
bench_wavefront_cpu = @benchmark render!($renderer_cpu)
237+
238+
# GPU benchmark
239+
renderer_gpu = to_gpu(GArray, renderer_cpu)
240+
bench_wavefront_gpu = @benchmark render!($renderer_gpu)
241+
242+
renderer_gpu = to_gpu(GArray, WavefrontRenderer(img, bvh, ctx; samples_per_pixel=16))
243+
render!(renderer_gpu)
244+
# Display result
245+
Array(renderer_gpu.framebuffer)
246+
```
247+
**Wavefront benefits:**
248+
249+
* Reduces thread divergence by grouping similar work
250+
* Better memory access patterns
251+
* Scales well with scene complexity
252+
* Enables advanced features like path tracing
253+
254+
## Part 6: Comprehensive Performance Benchmarks
255+
256+
Now let's compare all kernels including the wavefront renderer:
257+
258+
```julia (editor=true, logging=false, output=true)
259+
benchmarks = [
260+
bench_kernel_v1, bench_kernel_cpu_v1, bench_kernel_unrolled,
261+
bench_kernel_tiled_8_8, bench_kernel_tiled_32_16, bench_kernel_tiled_32_32,
262+
bench_wavefront_cpu, bench_wavefront_gpu
263+
]
264+
labels = [
265+
"Baseline\n(gpu)", "Baseline\n(cpu)", "Unrolled",
266+
"Tiled\n(8×8)", "Tiled\n(32×16)", "Tiled\n(32×32)",
267+
"Wavefront\n(cpu)", "Wavefront\n(gpu)"
268+
]
269+
270+
fig, times, speedups = plot_kernel_benchmarks(benchmarks, labels)
271+
# save(data"gpu-benchmarks.png", fig; backend=Main.GLMakie)
272+
# We use the Fig from my local run, since on CI this won't show the differences
273+
DOM.img(src=Asset(data"gpu-benchmarks.png"), width="700px")
274+
```
275+
### Next Steps
276+
277+
* Add **adaptive sampling** (more samples only where needed)
278+
* Explore **shared memory** optimizations for BVH traversal
279+
* Implement **streaming multisampling** across frames
280+

0 commit comments

Comments
 (0)