<!-- Open Graph Meta (for Facebook, LinkedIn, etc.) -->
<meta property="og:title" content="Karim Sayed - Rendering Engineer">
<meta property="og:description" content="A showcase of my projects and portfolio.">
<meta property="og:image"
    content="https://karimsayedre.github.io/images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">
<meta property="og:url" content="https://karimsayedre.github.io/">
<meta property="og:type" content="website">

<!-- Twitter Card Meta -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Karim Sayed - Rendering Engineer">
<meta name="twitter:description" content="A showcase of my projects and portfolio.">
<meta name="twitter:image"
    content="https://karimsayedre.github.io/images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">

<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/js/bootstrap.bundle.min.js"></script>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet"
<article>
    <div class="collapsible">
        <h1>CUDA Ray Tracing 3.6x Faster Than RTX: My CUDA Ray Tracing Journey</h1>
        <!-- <p><em>Note: This is a draft version. Final edits are still in progress. Feedback is welcome
            while final edits are underway.</em></p> -->

    </div>

<h2>Introduction</h2>

<p>
    Welcome! This article is a deep dive into how I built a CUDA ray tracer that outperforms a
    Vulkan/RTX implementation on the same hardware, sometimes by more than 3x. If you're interested in
    GPU programming or performance optimization, or just want to see how far a path tracer can be
    pushed, you're in the right place.
</p>
<p>
    The comparison is with <a href="https://github.com/GPSnoopy/RayTracingInVulkan"
        target="_blank">RayTracingInVulkan</a> by GPSnoopy, a well-known Vulkan/RTX renderer. My goal
    wasn't just to port <em>Ray Tracing in One Weekend</em> to CUDA, but to squeeze every last
    millisecond out of it: profiling, analyzing, and optimizing until the numbers surprised even me.
    This is also how I learned CUDA.
</p>
<p>
    In this write-up, I'll walk you through the journey: what worked, what didn't, and the tricks
    that made the biggest difference. Whether you're a graphics programmer, a CUDA enthusiast, or
    just curious about real-world GPU optimization, I hope you'll find something useful here.
</p>

<table class="perf-table">
    </thead>
    <tbody>
        <tr>
            <td class="spec-value"><a href="https://github.com/GPSnoopy/RayTracingInVulkan"
                    target="_blank">RayTracingInVulkan</a></td>
            <td class="spec-value">Vulkan</td>
            <td class="spec-value">RTX acceleration</td>
            <td class="spec-value">Procedural sphere tracing + triangle modes</td>
            </td>
        </tr>
        <tr>
            <td class="spec-value"><a
                    href="https://github.com/karimsayedre/CUDA-Ray-Tracing-In-One-Weekend"
                    target="_blank">CUDA-Ray-Tracing-In-One-Weekend</a> (Mine)</td>
            <td class="spec-value">CUDA</td>
            <td class="spec-value">No hardware RT cores</td>
            <td class="spec-value">Procedural spheres only</td>
            <td class="spec-value">
                <ul>
                    <li>Same resolution and settings</li>
                    <li>Different sphere locations and materials</li>
                    <li>Implements the equivalent of "inline ray tracing" (though without the hardware
                        RT pipeline)</li>
                </ul>
            </td>
        </tr>
    </tbody>
</table>

<p>
    Why is the Vulkan/RTX version slower? While there are many contributing factors, one probable
    reason, pointed out to me by GPSnoopy, is that procedural geometry in hardware-accelerated ray
    tracing is typically slower on NVIDIA GPUs. Unlike triangle meshes, procedural primitives (such as
    spheres or AABBs) rely on intersection shaders that must run in software on the SMs rather than on
    the fixed-function RT core pipeline optimized for triangle traversal and intersection. This
    introduces extra scheduling overhead and limits how much work the GPU can offload to dedicated
    hardware.
</p>
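To make that cost concrete, here is the kind of analytic sphere test that procedural primitives force into software: on RTX hardware this math runs in an intersection shader on the SMs, while triangle hits would be resolved in fixed function. This is an illustrative plain-C++ sketch (in the CUDA tracer it would be an inlined __device__ function); the names are mine, not from either repository.

```cpp
#include <cmath>

// Stand-in for the per-candidate sphere test that procedural geometry needs.
struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Returns the nearest hit distance t along a normalized direction,
// or a negative value if the ray misses the sphere.
float hitSphere(Vec3 origin, Vec3 dir, Vec3 center, float radius) {
    Vec3 oc = sub(origin, center);
    float halfB = dot(oc, dir);              // dir is assumed unit length
    float c = dot(oc, oc) - radius * radius;
    float disc = halfB * halfB - c;          // discriminant of the quadratic
    if (disc < 0.0f) return -1.0f;           // no real root: miss
    return -halfB - std::sqrt(disc);         // nearest intersection
}
```

Every ray-primitive candidate pair pays for this arithmetic on the SMs, which is exactly the work the RT cores would otherwise hide for triangles.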

<p>
    Supporting this theory, <strong>RayTracingInVulkan</strong> consistently benchmarks better on AMD
    cards, which treat triangle and procedural geometry more uniformly in their ray tracing pipeline.
    The disparity suggests that procedural shader performance is a weak point of NVIDIA's RT core
    architecture, at least in the current generation.
</p>

<p>
    Another reason might be the ray tracing pipeline itself. While powerful and flexible, the hardware
    RT

<section class="section-header">

    <h2 class="optimization-title">Opt #1 — Aggressive Inlining via Header-Only CUDA Design</h2>

    <p>
        In CUDA, performance often hinges on inlining. Unlike traditional C++, CUDA's
</section>

<section class="section-header">
    <h2>Opt #2 — Killing Recursion with an Explicit Stack</h2>

    <p>To eliminate recursion and cut down register pressure, I rewrote the BVH traversal to use an
        <strong>explicit stack in registers</strong>. The old code relied on a clean recursive structure
</section>

<section class="section-header">
    <h2 class="optimization-title">Opt #3 — Don't Recompute What You Already Know</h2>
    <p>
        Here's a simple but powerful axiom in real-time ray tracing:
        <strong>Precompute what doesn't change.</strong> If you know you're going to need a value frequently
</section>

<section class="section-header">
    <h2>Opt #4 — Early Termination for Low-Contributing Rays</h2>
    <p>
        This one's simple but powerful: if a ray's contribution becomes negligible, we just stop tracing
        it.
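The termination test itself can be a one-line max-channel comparison. Here is a minimal plain-C++ sketch (a __device__ helper in the real kernel); the 1e-3 cutoff and all the names are my illustrative assumptions, not values taken from the renderer.

```cpp
#include <algorithm>

// Per-ray accumulated attenuation (throughput), one float per channel.
struct Color { float r, g, b; };

// A ray's remaining contribution is the max channel of its throughput; once
// that drops below epsilon, further bounces cannot visibly change the pixel.
inline bool negligible(const Color& throughput, float eps = 1e-3f) {
    return std::max({throughput.r, throughput.g, throughput.b}) < eps;
}

// Example bounce loop: attenuate throughput by each surface's albedo and
// bail out early instead of always tracing to the full depth limit.
inline int traceDepthUsed(Color albedo, int maxDepth) {
    Color t{1.0f, 1.0f, 1.0f};
    for (int bounce = 0; bounce < maxDepth; ++bounce) {
        t = {t.r * albedo.r, t.g * albedo.g, t.b * albedo.b};
        if (negligible(t)) return bounce + 1;  // terminated early
    }
    return maxDepth;                           // survived to the limit
}
```

With a dark albedo of 0.05 per channel, the loop above stops after three bounces instead of running to the depth limit, which is where the savings come from.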
</section>

<section class="section-header">
    <h2>Opt #5 — Russian Roulette</h2>
    <p>
        Early termination is good — but we can go further with <strong>Russian Roulette</strong>. After a
        few bounces, we probabilistically decide whether a ray should continue or not, based on its current


<section class="section-header">
    <h2>Opt #6 — Structure of Arrays (SoA)</h2>
    <p>
        Our original implementation leaned on inheritance and virtual dispatch, with every
        object—spheres, BVH nodes,


<section class="section-header">
    <h2>Opt #7 — Four Levels of Ray-Slab Intersection Refinement</h2>
    <p>We improved BVH slab testing in four progressive steps, each trading complexity for fewer operations
        inside the hot loop:</p>
    <ol>
</section>

<section class="section-header">
    <h2>Opt #8 — Surface Area Heuristic (SAH) BVH Construction</h2>
    <p>
        Constructing a BVH by simply splitting primitives in half along an axis is easy—but not optimal. The
        <strong>Surface Area Heuristic (SAH)</strong> chooses split planes based on minimizing the expected
</section>

<section class="section-header">
    <h2>Opt #9 — Alignment and Cacheline Efficiency</h2>
    <p>
        Closely related to our Structure of Arrays (SoA) optimization, I found that <strong>data
            alignment</strong> plays a massive role in


<section class="section-header" id="optimization-10">
    <h2>Opt #10 — Using Constant Memory Instead of Global Memory</h2>

    <p>
        One of the most effective register-saving optimizations in CUDA is proper use of the
</section>

<section class="section-header" id="optimization-11">
    <h2>Opt #11 — Prefer <code>&lt;cmath&gt;</code> Intrinsics Over <code>&lt;algorithm&gt;</code>
        in CUDA</h2>

    <p>
</section>

<section class="section-header" id="optimization-12">
    <h2>Opt #12 — Roll Your Own RNG (LCG + Hash) Instead of <code>curand</code></h2>

    <p>
        When working with real-time GPU workloads like path tracing, CUDA's


<section class="section-header" id="optimization-13">
    <h2>Opt #13 — Branchless Material Sampling &amp; Evaluation</h2>

    <p>
        My old implementation uses a <code>switch</code> over material types (Lambert, Metal,
</section>

<section class="section-header" id="optimization-14">
    <h2>Opt #14 — Bypass CPU Staging with CUDA↔OpenGL Interop</h2>

    <p>
        Early on, I rendered each frame by copying the CUDA output into an <code>sf::Image</code> (CPU
</p>


<div class="image-container" data-preview="true">

    <img src="images/RTIOW/Screenshot 2025-06-12 104519.png" class="preview-image"
<p>
    These are from my RTX 3080 run, if you'd like to explore the kernel analysis yourself.
</p>

<table class="perf-table">
    <thead>
        <tr>
        </tr>
        <tr>
            <td>Latest/Improved</td>
            <td>8 ms</td>
            <td><a href="Reports/Latest.ncu-rep" download>Download Latest Report (.ncu-rep)</a></td>
        </tr>
    </tbody>
</table>


<h3>Benchmark Results</h3>


<div class="image-container" data-preview="true">

    <img src="images/RTIOW/Screenshot 2025-06-13 192959.png" class="preview-image"
        alt="CUDA Scheduler Performance">

    <div class="gotcha-card">
        <div class="gotcha-marker pro-tip-marker"></div>

        <div class="image-comments">
            <p>
                The SASS instructions in the capture are mostly FFMA, FMUL, and FMNMX—fast math
                operations ideal for compute-bound CUDA kernels. This shows that the ray tracing inner
                loops map efficiently to the hardware, with minimal branching and few memory stalls;
                most remaining stalls come from scheduler choices or instruction dependencies rather
                than memory. This efficient math mapping is a direct result of the optimizations
                described above and explains the speedup in the final benchmarks.
            </p>
        </div>
    </div>
</div>

<p>
    Below are the measured frame times in milliseconds. Lower is better:
</p>