
Commit 5f858fd

Update Open Graph and Twitter image metadata, enhance article introduction, adjust section header font size for improved readability, and add a screenshot of the SASS code with a comment
1 parent f9cf8e4 commit 5f858fd

4 files changed: +89 −34 lines


RTIOW.html

Lines changed: 88 additions & 33 deletions
@@ -12,15 +12,17 @@
     <!-- Open Graph Meta (for Facebook, LinkedIn, etc.) -->
     <meta property="og:title" content="Karim Sayed - Rendering Engineer">
     <meta property="og:description" content="A showcase of my projects and portfolio.">
-    <meta property="og:image" content="https://karimsayedre.github.io/images/Pathtracing/0.jpg">
+    <meta property="og:image"
+        content="https://karimsayedre.github.io/images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">
     <meta property="og:url" content="https://karimsayedre.github.io/">
     <meta property="og:type" content="website">

     <!-- Twitter Card Meta -->
     <meta name="twitter:card" content="summary_large_image">
     <meta name="twitter:title" content="Karim Sayed - Rendering Engineer">
     <meta name="twitter:description" content="A showcase of my projects and portfolio.">
-    <meta name="twitter:image" content="https://karimsayedre.github.io/images/Pathtracing/0.jpg">
+    <meta name="twitter:image"
+        content="https://karimsayedre.github.io/images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">

     <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/js/bootstrap.bundle.min.js"></script>
     <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet"
@@ -68,8 +70,8 @@
     <article>
         <div class="collapsible">
             <h1>CUDA Ray Tracing 3.6x Faster Than RTX: My CUDA Ray Tracing Journey</h1>
-            <p><em>Note: This is a draft version. Final edits are still in progress. Feedback is welcome while final
-                edits are underway.</em></p>
+            <!-- <p><em>Note: This is a draft version. Final edits are still in progress. Feedback is welcome while final
+                edits are underway.</em></p> -->

         </div>

@@ -79,10 +81,23 @@ <h1>CUDA Ray Tracing 3.6x Faster Than RTX: My CUDA Ray Tracing Journey</h1>
     <h2>Introduction</h2>

     <p>
-        Alright, this headline is a bit <strong>click-baity</strong>, but it's actually true… kind of.
-        I'm comparing my CUDA path tracer against the
-        <a href="https://github.com/GPSnoopy/RayTracingInVulkan" target="_blank">RayTracingInVulkan</a>
-        repo by GPSnoopy:
+        Welcome! This article is a deep dive into how I made a CUDA-based ray tracer that outperforms a
+        Vulkan/RTX implementation—sometimes by more than 3x—on the same hardware. If you're interested in
+        GPU programming, performance optimization, or just want to see how far you can push a path tracer,
+        you're in the right place.
+    </p>
+    <p>
+        The comparison is with <a href="https://github.com/GPSnoopy/RayTracingInVulkan"
+            target="_blank">RayTracingInVulkan</a> by GPSnoopy, a well-known Vulkan/RTX renderer. My goal
+        wasn't just to port <em>Ray Tracing in One Weekend</em> to CUDA, but to squeeze every last
+        millisecond out of it—profiling, analyzing, and optimizing until the numbers surprised even me.
+        This is actually how I learned CUDA.
+    </p>
+    <p>
+        In this write-up, I'll walk you through the journey: what worked, what didn't, and the key
+        tricks that made the biggest difference. Whether you're a graphics programmer, a CUDA
+        enthusiast, or just curious about real-world GPU optimization, I hope you'll find something
+        useful here.
     </p>

     <table class="perf-table">
@@ -99,7 +114,8 @@ <h2>Introduction</h2>
     </thead>
     <tbody>
         <tr>
-            <td class="spec-value">GPSnoopy's repository</td>
+            <td class="spec-value"><a href="https://github.com/GPSnoopy/RayTracingInVulkan"
+                    target="_blank">RayTracingInVulkan</a></td>
             <td class="spec-value">Vulkan</td>
             <td class="spec-value">RTX acceleration</td>
             <td class="spec-value">Procedural sphere tracing + triangle modes</td>
@@ -115,7 +131,9 @@ <h2>Introduction</h2>
         </td>
     </tr>
     <tr>
-        <td class="spec-value">Mine</td>
+        <td class="spec-value"><a
+                href="https://github.com/karimsayedre/CUDA-Ray-Tracing-In-One-Weekend"
+                target="_blank">CUDA-Ray-Tracing-In-One-Weekend</a> (Mine)</td>
         <td class="spec-value">CUDA</td>
         <td class="spec-value">No hardware RT cores</td>
         <td class="spec-value">Procedural spheres only</td>
@@ -124,21 +142,34 @@ <h2>Introduction</h2>
     <td class="spec-value">
         <ul>
             <li>Same resolution and settings</li>
+            <li>Different sphere locations and materials</li>
+            <li>Implements what we call "inline ray tracing" (without hardware RT pipeline,
+                though)</li>
         </ul>
     </td>
     </tr>
     </tbody>
     </table>

     <p>
-        Why is the Vulkan/RTX version slower? While this is probably not the only reason, one key issue is
-        that procedural shaders are actually <em>frowned upon</em> for performance because RT cores perform
-        best with real triangle geometry. Even with triangle support, all scenes in GPSnoopy's repo stay
-        below 33 FPS—clearly showing the limitations of a mixed procedural + hardware-centric pipeline.
+        Why is the Vulkan/RTX version slower? While there are many contributing factors, one probable
+        reason—pointed out to me by GPSnoopy—is that procedural geometry in hardware-accelerated ray tracing
+        is typically slower on NVIDIA GPUs. Unlike triangle meshes, procedural primitives (like spheres or
+        AABBs) rely on intersection shaders that must run in software on the SMs, rather than using the
+        fixed-function RT core pipeline optimized for triangle traversal and intersection. This introduces
+        extra scheduling overhead and limits how much the GPU can offload to dedicated hardware.
+    </p>
+
+    <p>
+        Supporting this theory, <strong>RayTracingInVulkan</strong> consistently benchmarks better on AMD
+        cards, which treat triangle and procedural geometry more uniformly in their ray tracing pipeline.
+        The disparity suggests that procedural shader performance is a weaker point for NVIDIA's RT core
+        architecture—at least in the current generation.
     </p>
+
     <p>
         Another reason might be the ray tracing pipeline itself. While powerful and flexible, the hardware RT
@@ -388,7 +419,7 @@ <h5>Use Nsight Compute's built-in occupancy calculator!</h5>

     <section class="section-header">

-        <h2 class="optimization-title">Optimization #1 — Aggressive Inlining via Header-Only CUDA Design</h2>
+        <h2 class="optimization-title">Opt #1 — Aggressive Inlining via Header-Only CUDA Design</h2>

         <p>
             In CUDA, performance often hinges on inlining. Unlike traditional C++, CUDA's
@@ -483,7 +514,7 @@ <h3>Before vs After</h3>
     </section>

     <section class="section-header">
-        <h2>Optimization #2 — Killing Recursion with an Explicit Stack</h2>
+        <h2>Opt #2 — Killing Recursion with an Explicit Stack</h2>

         <p>To eliminate recursion and cut down register pressure, I rewrote the BVH traversal to use an
             <strong>explicit stack in registers</strong>. The old code relied on a clean recursive structure
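As an aside for readers following the diff: the explicit-stack idea described in this hunk can be sketched in plain host-side C++. This is an illustrative reconstruction, not the repo's actual kernel; the node layout, names, and the depth bound of 32 are all assumptions.

```cpp
#include <vector>

// Hypothetical flattened BVH node: a node is a leaf when 'left' < 0.
struct FlatNode {
    int left;   // index of left child, or negative for a leaf
    int right;  // index of right child (unused for leaves)
};

// Iterative traversal with an explicit stack in a fixed-size local array.
// In a CUDA kernel this array lives in registers/local memory instead of a
// growing call stack, which is the register-pressure win the article describes.
inline int countVisitedLeaves(const std::vector<FlatNode>& nodes, int root) {
    int stack[32];  // fixed depth bound instead of recursion depth
    int top = 0;
    int leaves = 0;
    stack[top++] = root;
    while (top > 0) {
        int idx = stack[--top];
        const FlatNode& n = nodes[idx];
        if (n.left < 0) {       // leaf: a real tracer would test the primitive
            ++leaves;
            continue;
        }
        stack[top++] = n.left;  // push children instead of recursing
        stack[top++] = n.right;
    }
    return leaves;
}
```

The same loop shape carries over to the device code, with the stack array declared inside the kernel.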
@@ -624,7 +655,7 @@ <h3>Comparison</h3>
     </section>

     <section class="section-header">
-        <h2 class="optimization-title">Optimization #3 — Don't Recompute What You Already Know</h2>
+        <h2 class="optimization-title">Opt #3 — Don't Recompute What You Already Know</h2>
        <p>
            Here's a simple but powerful axiom in real-time ray tracing:
            <strong>Precompute what doesn't change.</strong> If you know you're going to need a value frequently
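A classic instance of this axiom, caching the reciprocal ray direction so the traversal loop multiplies instead of divides, can be sketched as follows. The struct and names are hypothetical, not necessarily what the repo does.

```cpp
#include <array>

// Hypothetical ray with a precomputed reciprocal direction. Computing 1/d
// once per ray replaces three divisions per AABB slab test with multiplies.
struct Ray {
    std::array<float, 3> origin;
    std::array<float, 3> dir;
    std::array<float, 3> invDir;  // cached 1/dir, computed once at creation
};

inline Ray makeRay(std::array<float, 3> o, std::array<float, 3> d) {
    return Ray{o, d, {1.0f / d[0], 1.0f / d[1], 1.0f / d[2]}};
}

// Per-axis slab distance using the cached reciprocal: a multiply, not a
// divide, inside the hot traversal loop.
inline float slabT(const Ray& r, int axis, float plane) {
    return (plane - r.origin[axis]) * r.invDir[axis];
}
```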
@@ -734,7 +765,7 @@ <h4>Gotcha: Moving Spheres and Dynamic AABBs</h4>
     </section>

     <section class="section-header">
-        <h2>Optimization #4 — Early Termination for Low Contributing Rays</h2>
+        <h2>Opt #4 — Early Termination for Low Contributing Rays</h2>
     <p>
         This one's simple but powerful. If a ray's contribution becomes negligible, we just stop tracing it.
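The termination test itself is tiny. A minimal sketch, with a hypothetical throughput struct and an assumed cutoff of 1e-3:

```cpp
#include <algorithm>

// Accumulated path throughput; names are illustrative, not the repo's.
struct Throughput { float r, g, b; };

// Kill the ray once even its strongest channel is below 'eps': any further
// bounces would contribute an invisible amount to the final pixel.
inline bool shouldTerminate(const Throughput& t, float eps = 1e-3f) {
    float strongest = std::max(t.r, std::max(t.g, t.b));
    return strongest < eps;
}
```

In the kernel this check runs once per bounce, before the next ray is spawned.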
@@ -779,7 +810,7 @@ <h2>Optimization #4 — Early Termination for Low Contributing Rays</h2>
     </section>

     <section class="section-header">
-        <h2>Optimization #5 — Russian Roulette</h2>
+        <h2>Opt #5 — Russian Roulette</h2>
     <p>
         Early termination is good — but we can go further with <strong>Russian Roulette</strong>. After a
         few bounces, we probabilistically decide whether a ray should continue or not, based on its current
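The standard form of this technique can be sketched as below: survive with probability p tied to throughput, and divide the survivor's throughput by p so the estimator stays unbiased. The clamp range and minimum bounce count are assumptions, not the repo's exact constants.

```cpp
#include <algorithm>

struct RGB { float r, g, b; };

// 'u' is a uniform random number in [0,1). Returns false if the ray dies.
inline bool russianRoulette(RGB& throughput, int bounce, float u,
                            int minBounces = 3) {
    if (bounce < minBounces) return true;  // always survive early bounces
    float p = std::max(throughput.r, std::max(throughput.g, throughput.b));
    p = std::clamp(p, 0.05f, 0.95f);       // avoid zero-probability and no-op
    if (u >= p) return false;              // terminated
    throughput.r /= p;                     // energy compensation keeps the
    throughput.g /= p;                     // expected contribution unchanged
    throughput.b /= p;
    return true;
}
```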
@@ -846,7 +877,7 @@ <h2>Optimization #5 — Russian Roulette</h2>


     <section class="section-header">
-        <h2>Optimization #6 — Structure of Arrays (SoA)</h2>
+        <h2>Opt #6 — Structure of Arrays (SoA)</h2>
     <p>
         Our original implementation leaned on inheritance and virtual dispatch, with every
         object—spheres, BVH nodes,
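The AoS-to-SoA move this section describes looks roughly like the sketch below (field names hypothetical): each attribute gets its own contiguous array, so neighboring GPU threads reading, say, `centerX[i]` and `centerX[i+1]` hit one coalesced cache line instead of striding across whole objects.

```cpp
#include <cstddef>
#include <vector>

// Structure-of-Arrays sphere storage: one contiguous array per attribute,
// in place of a single array of polymorphic sphere objects.
struct SpheresSoA {
    std::vector<float> centerX, centerY, centerZ;
    std::vector<float> radius;

    void add(float x, float y, float z, float r) {
        centerX.push_back(x);
        centerY.push_back(y);
        centerZ.push_back(z);
        radius.push_back(r);
    }
    std::size_t size() const { return radius.size(); }
};
```

On the GPU these vectors become flat device buffers, and the virtual-dispatch hierarchy disappears entirely.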
@@ -1004,7 +1035,7 @@ <h2>Optimization #6 — Structure of Arrays (SoA)</h2>


     <section class="section-header">
-        <h2>Optimization #7 — Four Levels of Ray-Slab Intersection Refinement</h2>
+        <h2>Opt #7 — Four Levels of Ray-Slab Intersection Refinement</h2>
     <p>We improved BVH slab testing in four progressive steps, each trading complexity for fewer operations
         inside the hot loop:</p>
     <ol>
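For context on what those refinement steps start from, here is a generic branchless slab test in plain C++. It is a textbook baseline, not necessarily any of the article's four variants: per axis it costs two multiplies plus min/max, with no divisions (the reciprocal direction is precomputed) and no branches inside the loop.

```cpp
#include <algorithm>
#include <array>

// Branchless ray-AABB slab test. 'invDir' is the precomputed 1/direction;
// the min/max pair handles negative directions without an explicit swap.
inline bool hitAABB(const std::array<float, 3>& orig,
                    const std::array<float, 3>& invDir,
                    const std::array<float, 3>& bmin,
                    const std::array<float, 3>& bmax,
                    float tMin, float tMax) {
    for (int a = 0; a < 3; ++a) {
        float t0 = (bmin[a] - orig[a]) * invDir[a];
        float t1 = (bmax[a] - orig[a]) * invDir[a];
        tMin = std::max(tMin, std::min(t0, t1));  // narrow the entry time
        tMax = std::min(tMax, std::max(t0, t1));  // narrow the exit time
    }
    return tMin <= tMax;  // non-empty overlap of all three slabs = hit
}
```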
@@ -1080,7 +1111,7 @@ <h4>Gotcha: FMA throughput</h4>
     </section>

     <section class="section-header">
-        <h2>Optimization #8 — Surface Area Heuristic (SAH) BVH Construction</h2>
+        <h2>Opt #8 — Surface Area Heuristic (SAH) BVH Construction</h2>
     <p>
         Constructing a BVH by simply splitting primitives in half along an axis is easy—but not optimal. The
         <strong>Surface Area Heuristic (SAH)</strong> chooses split planes based on minimizing the expected
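The standard SAH cost model being minimized can be written as a one-liner. The traversal and intersection cost constants here are illustrative placeholders, not the repo's tuned values:

```cpp
#include <cstddef>

// Expected cost of a candidate split: traversal cost plus each child's
// intersection work, weighted by the probability a ray entering the parent
// also enters the child (approximated by the surface-area ratio).
inline float sahCost(float parentArea,
                     float leftArea, std::size_t leftCount,
                     float rightArea, std::size_t rightCount,
                     float traverseCost = 1.0f, float isectCost = 2.0f) {
    return traverseCost +
           isectCost * ((leftArea / parentArea) * leftCount +
                        (rightArea / parentArea) * rightCount);
}
```

The builder evaluates this for each candidate plane and keeps the cheapest split; note how a large child box holding many primitives is penalized twice, via both its area and its count.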
@@ -1122,7 +1153,7 @@ <h2>Optimization #8 — Surface Area Heuristic (SAH) BVH Construction</h2>
     </section>

     <section class="section-header">
-        <h2>Optimization #9 — Alignment and Cacheline Efficiency</h2>
+        <h2>Opt #9 — Alignment and Cacheline Efficiency</h2>
     <p>
         Closely related to our Structure of Arrays (SoA) optimization, I found that <strong>data
         alignment</strong> plays a massive role in
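The general shape of such an alignment fix can be sketched in portable C++: `alignas(16)` mirrors the 16-byte loads CUDA issues for `float4`-sized data. The exact field layout below is illustrative, not the article's node:

```cpp
#include <cstddef>

// Pack a hot BVH node into two 16-byte halves so each half can be fetched
// with one vectorized load and the struct never straddles a cache line.
struct alignas(16) AlignedNode {
    float minX, minY, minZ;  // AABB min
    int   leftChild;         // index packed into the same 16 bytes
    float maxX, maxY, maxZ;  // AABB max
    int   rightChild;
};

// Compile-time layout guarantees: 32 bytes total, 16-byte aligned.
static_assert(sizeof(AlignedNode) == 32, "node should pack into 32 bytes");
static_assert(alignof(AlignedNode) == 16, "16-byte alignment expected");
```

The `static_assert`s are worth keeping in real code: a stray field addition that silently grows the node past a cache-line-friendly size becomes a compile error instead of a profiler mystery.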
@@ -1301,7 +1332,7 @@ <h4>Gotcha: Float16 and Half Precision</h4>


     <section class="section-header" id="optimization-10">
-        <h2>Optimization #10 — Using Constant Memory Instead of Global Memory</h2>
+        <h2>Opt #10 — Using Constant Memory Instead of Global Memory</h2>

     <p>
         One of the most effective register-saving optimizations in CUDA is proper use of the
@@ -1486,7 +1517,7 @@ <h3>Shared vs Constant vs Global Memory</h3>
     </section>

     <section class="section-header" id="optimization-11">
-        <h2>Optimization #11 — Prefer <code>&lt;cmath&gt;</code> Intrinsics Over <code>&lt;algorithm&gt;</code>
+        <h2>Opt #11 — Prefer <code>&lt;cmath&gt;</code> Intrinsics Over <code>&lt;algorithm&gt;</code>
            in CUDA</h2>

     <p>
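To make the heading concrete: `fminf`/`fmaxf` are C math functions that compile down to single min/max instructions on NVIDIA GPUs, while `std::min`/`std::max` from `<algorithm>` are comparison templates whose ternary can become a branch and whose two arguments must have identical types. A host-side sketch of the preferred call shape:

```cpp
#include <cmath>

// Branch-free saturate to [0, 1] using <cmath> intrinsics. In device code
// the same fminf/fmaxf calls map to single-instruction FMNMX operations.
inline float clampedDot(float d) {
    return fminf(fmaxf(d, 0.0f), 1.0f);
}
```

A bonus of the `<cmath>` forms is that mixed `float`/`double` arguments promote cleanly instead of failing template deduction.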
@@ -1609,7 +1640,7 @@ <h3>Best Practices</h3>
     </section>

     <section class="section-header" id="optimization-12">
-        <h2>Optimization #12 — Roll Your Own RNG (LCG + Hash) Instead of <code>curand</code></h2>
+        <h2>Opt #12 — Roll Your Own RNG (LCG + Hash) Instead of <code>curand</code></h2>

     <p>
         When working with real-time GPU workloads like path tracing, CUDA's
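An LCG-plus-hash combination in the spirit of this heading can be sketched as follows: hash the pixel and frame indices into a well-mixed seed, then step a 32-bit LCG per sample. The constants are standard published choices (Wang hash, Numerical Recipes LCG), not necessarily the repo's:

```cpp
#include <cstdint>

// Wang hash: a cheap scrambler that turns correlated integer seeds
// (neighboring pixel indices) into well-mixed starting states.
inline uint32_t wangHash(uint32_t s) {
    s = (s ^ 61u) ^ (s >> 16);
    s *= 9u;
    s ^= s >> 4;
    s *= 0x27d4eb2du;
    s ^= s >> 15;
    return s;
}

// Per-thread RNG: a couple of integer ops per sample, no curand state
// object to initialize or store in global memory.
struct Lcg {
    uint32_t state;
    Lcg(uint32_t pixelIndex, uint32_t frame)
        : state(wangHash(pixelIndex * 1973u + frame * 9277u + 1u)) {}

    float next() {  // uniform float in [0, 1)
        state = state * 1664525u + 1013904223u;  // Numerical Recipes LCG
        return (state >> 8) * (1.0f / 16777216.0f);
    }
};
```

In a kernel, each thread constructs its `Lcg` from its pixel index and the frame number, so sequences are deterministic per pixel yet decorrelated across the image.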
@@ -1726,7 +1757,7 @@ <h3>When to Use</h3>


     <section class="section-header" id="optimization-13">
-        <h2>Optimization #13 — Branchless Material Sampling &amp; Evaluation</h2>
+        <h2>Opt #13 — Branchless Material Sampling &amp; Evaluation</h2>

     <p>
         My old implementation uses a <code>switch</code> over material types (Lambert, Metal,
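The core trick behind replacing that `switch` can be sketched as a mask-based select: evaluate the candidate results, then blend with a 0/1 weight so every lane in a warp executes the same instructions regardless of its material. This is a simplified two-material illustration, not the repo's full evaluator:

```cpp
struct Vec3 { float x, y, z; };

// mask is 0.0 or 1.0: returns 'a' when 0, 'b' when 1, with no branch.
// Compilers lower this to select/FMA instructions rather than a jump.
inline Vec3 lerpSelect(float mask, Vec3 a, Vec3 b) {
    return {a.x + mask * (b.x - a.x),
            a.y + mask * (b.y - a.y),
            a.z + mask * (b.z - a.z)};
}

enum MaterialType { LAMBERT = 0, METAL = 1 };

// Pick between the two precomputed scatter directions without diverging.
inline Vec3 shade(MaterialType type, Vec3 lambertDir, Vec3 metalDir) {
    float isMetal = (type == METAL) ? 1.0f : 0.0f;  // compiles to a select
    return lerpSelect(isMetal, lambertDir, metalDir);
}
```

The cost is computing both candidate results; on a GPU that is usually cheaper than letting a warp serialize over divergent material branches.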
@@ -1825,7 +1856,7 @@ <h3>Why This Matters</h3>
     </section>

     <section class="section-header" id="optimization-14">
-        <h2>Optimization #14 — Bypass CPU Staging with CUDA↔OpenGL Interop</h2>
+        <h2>Opt #14 — Bypass CPU Staging with CUDA↔OpenGL Interop</h2>

     <p>
         Early on, I rendered each frame by copying the CUDA output into an <code>sf::Image</code> (CPU
@@ -2079,7 +2110,6 @@ <h4>Conclusion</h4>
     </p>


-    <!-- Your existing image with preview functionality -->
     <div class="image-container" data-preview="true">

         <img src="images/RTIOW/Screenshot 2025-06-12 104519.png" class="preview-image"
@@ -2286,6 +2316,7 @@ <h3>Nsight Compute Results</h3>
     <p>
         These are from my RTX 3080 run if you'd like to explore the kernel analysis yourself.
     </p>
+
     <table class="perf-table">
         <thead>
             <tr>
@@ -2302,14 +2333,38 @@ <h3>Nsight Compute Results</h3>
         </tr>
         <tr>
             <td>Latest/Improved</td>
-            <td>9 ms</td>
+            <td>8 ms</td>
             <td><a href="Reports/Latest.ncu-rep" download>Download Latest Report (.ncu-rep)</a></td>
         </tr>
     </tbody>
     </table>


     <h3>Benchmark Results</h3>
+
+
+    <!-- Your existing image with preview functionality -->
+    <div class="image-container" data-preview="true">
+
+        <img src="images/RTIOW/Screenshot 2025-06-13 192959.png" class="preview-image"
+            alt="CUDA Scheduler Performance">
+
+        <div class="gotcha-card">
+            <div class="gotcha-marker pro-tip-marker"></div>
+
+            <div class="image-comments">
+                <p>
+                    The SASS instructions in the image are mostly FFMA, FMUL, and FMNMX—fast math operations
+                    ideal for compute-bound CUDA kernels. This shows that the ray tracing inner loops are
+                    efficiently mapped to hardware, with minimal branching or memory stalls. Most stalls are
+                    not memory-related but due to scheduler choices or instruction dependencies. This
+                    efficient math mapping is a direct result of the optimizations described above and
+                    explains the significant speedup in the final benchmarks.
+                </p>
+            </div>
+        </div>
+    </div>
+
     <p>
         Below are the measured frame times in milliseconds. Lower is better:
     </p>
Binary image files changed: +65.7 KB, -57.8 KB (previews not shown)

style/style.css

Lines changed: 1 addition & 1 deletion
@@ -599,7 +599,7 @@ code {
 .section-header h2 {
     color: var(--accent-100);
     text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.5);
-    font-size: 1.8rem;
+    font-size: 2.5rem;
     margin: 0;
     text-align: center;
     margin-bottom: 1rem;
