<!-- Open Graph Meta (for Facebook, LinkedIn, etc.) -->
<meta property="og:title" content="Karim Sayed - Rendering Engineer">
<meta property="og:description" content="A showcase of my projects and portfolio.">
<meta property="og:image"
    content="https://karimsayedre.github.io/images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">
<meta property="og:url" content="https://karimsayedre.github.io/">
<meta property="og:type" content="website">

<!-- Twitter Card Meta -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Karim Sayed - Rendering Engineer">
<meta name="twitter:description" content="A showcase of my projects and portfolio.">
<meta name="twitter:image"
    content="https://karimsayedre.github.io/images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">

<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/js/bootstrap.bundle.min.js"></script>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.3/dist/css/bootstrap.min.css" rel="stylesheet"
<article>
    <div class="collapsible">
        <h1>CUDA Ray Tracing 3.6x Faster Than RTX: My CUDA Ray Tracing Journey</h1>
        <!-- <p><em>Note: This is a draft version. Final edits are still in progress. Feedback is welcome
            while final edits are underway.</em></p> -->

    </div>

<h2>Introduction</h2>

<p>
    Welcome! This article is a deep dive into how I built a CUDA ray tracer that outperforms a
    Vulkan/RTX implementation on the same hardware, sometimes by more than 3x. If you're interested in
    GPU programming or performance optimization, or just want to see how far a path tracer can be
    pushed, you're in the right place.
</p>
<p>
    The comparison is with <a href="https://github.com/GPSnoopy/RayTracingInVulkan"
        target="_blank">RayTracingInVulkan</a> by GPSnoopy, a well-known Vulkan/RTX renderer. My goal
    wasn't just to port <em>Ray Tracing in One Weekend</em> to CUDA, but to squeeze every last
    millisecond out of it: profiling, analyzing, and optimizing until the numbers surprised even me.
    This is also how I learned CUDA.
</p>
<p>
    In this write-up, I'll walk you through the journey: what worked, what didn't, and the tricks
    that made the biggest difference. Whether you're a graphics programmer, a CUDA enthusiast, or
    just curious about real-world GPU optimization, I hope you'll find something useful here.
</p>

<table class="perf-table">
    </thead>
    <tbody>
        <tr>
            <td class="spec-value"><a href="https://github.com/GPSnoopy/RayTracingInVulkan"
                    target="_blank">RayTracingInVulkan</a></td>
            <td class="spec-value">Vulkan</td>
            <td class="spec-value">RTX acceleration</td>
            <td class="spec-value">Procedural sphere tracing + triangle modes</td>
            </td>
        </tr>
        <tr>
            <td class="spec-value"><a
                    href="https://github.com/karimsayedre/CUDA-Ray-Tracing-In-One-Weekend"
                    target="_blank">CUDA-Ray-Tracing-In-One-Weekend</a> (Mine)</td>
            <td class="spec-value">CUDA</td>
            <td class="spec-value">No hardware RT cores</td>
            <td class="spec-value">Procedural spheres only</td>
            <td class="spec-value">
                <ul>
                    <li>Same resolution and settings</li>
                    <li>Different sphere locations and materials</li>
                    <li>Implements the equivalent of "inline ray tracing" (though without the hardware
                        RT pipeline)</li>
                </ul>
            </td>
        </tr>
    </tbody>
</table>

<p>
    Why is the Vulkan/RTX version slower? While there are many contributing factors, one probable
    reason, pointed out to me by GPSnoopy, is that procedural geometry in hardware-accelerated ray
    tracing is typically slower on NVIDIA GPUs. Unlike triangle meshes, procedural primitives (such as
    spheres or AABBs) rely on intersection shaders that must run in software on the SMs rather than on
    the fixed-function RT core pipeline optimized for triangle traversal and intersection. This
    introduces extra scheduling overhead and limits how much work the GPU can offload to dedicated
    hardware.
</p>
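To make that cost concrete, here is the kind of analytic sphere test that procedural primitives force into software: on RTX hardware this math runs in an intersection shader on the SMs, while triangle hits would be resolved in fixed function. This is an illustrative plain-C++ sketch (in the CUDA tracer it would be an inlined __device__ function); the names are mine, not from either repository.

```cpp
#include <cmath>

// Stand-in for the per-candidate sphere test that procedural geometry needs.
struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Returns the nearest hit distance t along a normalized direction,
// or a negative value if the ray misses the sphere.
float hitSphere(Vec3 origin, Vec3 dir, Vec3 center, float radius) {
    Vec3 oc = sub(origin, center);
    float halfB = dot(oc, dir);              // dir is assumed unit length
    float c = dot(oc, oc) - radius * radius;
    float disc = halfB * halfB - c;          // discriminant of the quadratic
    if (disc < 0.0f) return -1.0f;           // no real root: miss
    return -halfB - std::sqrt(disc);         // nearest intersection
}
```

Every ray-primitive candidate pair pays for this arithmetic on the SMs, which is exactly the work the RT cores would otherwise hide for triangles.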

<p>
    Supporting this theory, <strong>RayTracingInVulkan</strong> consistently benchmarks better on AMD
    cards, which treat triangle and procedural geometry more uniformly in their ray tracing pipeline.
    The disparity suggests that procedural shader performance is a weak point of NVIDIA's RT core
    architecture, at least in the current generation.
</p>

<p>
    Another reason might be the ray tracing pipeline itself. While powerful and flexible, the hardware
    RT

<section class="section-header">

    <h2 class="optimization-title">Opt #1 — Aggressive Inlining via Header-Only CUDA Design</h2>

    <p>
        In CUDA, performance often hinges on inlining. Unlike traditional C++, CUDA's
</section>

<section class="section-header">
    <h2>Opt #2 — Killing Recursion with an Explicit Stack</h2>

    <p>To eliminate recursion and cut down register pressure, I rewrote the BVH traversal to use an
        <strong>explicit stack in registers</strong>. The old code relied on a clean recursive structure
</section>

<section class="section-header">
    <h2 class="optimization-title">Opt #3 — Don't Recompute What You Already Know</h2>
    <p>
        Here's a simple but powerful axiom in real-time ray tracing:
        <strong>Precompute what doesn't change.</strong> If you know you're going to need a value frequently
</section>

<section class="section-header">
    <h2>Opt #4 — Early Termination for Low-Contributing Rays</h2>
    <p>
        This one's simple but powerful: if a ray's contribution becomes negligible, we just stop tracing
        it.
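The termination test itself can be a one-line max-channel comparison. Here is a minimal plain-C++ sketch (a __device__ helper in the real kernel); the 1e-3 cutoff and all the names are my illustrative assumptions, not values taken from the renderer.

```cpp
#include <algorithm>

// Per-ray accumulated attenuation (throughput), one float per channel.
struct Color { float r, g, b; };

// A ray's remaining contribution is the max channel of its throughput; once
// that drops below epsilon, further bounces cannot visibly change the pixel.
inline bool negligible(const Color& throughput, float eps = 1e-3f) {
    return std::max({throughput.r, throughput.g, throughput.b}) < eps;
}

// Example bounce loop: attenuate throughput by each surface's albedo and
// bail out early instead of always tracing to the full depth limit.
inline int traceDepthUsed(Color albedo, int maxDepth) {
    Color t{1.0f, 1.0f, 1.0f};
    for (int bounce = 0; bounce < maxDepth; ++bounce) {
        t = {t.r * albedo.r, t.g * albedo.g, t.b * albedo.b};
        if (negligible(t)) return bounce + 1;  // terminated early
    }
    return maxDepth;                           // survived to the limit
}
```

With a dark albedo of 0.05 per channel, the loop above stops after three bounces instead of running to the depth limit, which is where the savings come from.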
</section>

<section class="section-header">
    <h2>Opt #5 — Russian Roulette</h2>
    <p>
        Early termination is good — but we can go further with <strong>Russian Roulette</strong>. After a
        few bounces, we probabilistically decide whether a ray should continue or not, based on its current


<section class="section-header">
    <h2>Opt #6 — Structure of Arrays (SoA)</h2>
    <p>
        Our original implementation leaned on inheritance and virtual dispatch, with every
        object—spheres, BVH nodes,


<section class="section-header">
    <h2>Opt #7 — Four Levels of Ray-Slab Intersection Refinement</h2>
    <p>We improved BVH slab testing in four progressive steps, each trading complexity for fewer operations
        inside the hot loop:</p>
    <ol>
</section>

<section class="section-header">
    <h2>Opt #8 — Surface Area Heuristic (SAH) BVH Construction</h2>
    <p>
        Constructing a BVH by simply splitting primitives in half along an axis is easy—but not optimal. The
        <strong>Surface Area Heuristic (SAH)</strong> chooses split planes based on minimizing the expected
</section>

<section class="section-header">
    <h2>Opt #9 — Alignment and Cacheline Efficiency</h2>
    <p>
        Closely related to our Structure of Arrays (SoA) optimization, I found that <strong>data
            alignment</strong> plays a massive role in


<section class="section-header" id="optimization-10">
    <h2>Opt #10 — Using Constant Memory Instead of Global Memory</h2>

    <p>
        One of the most effective register-saving optimizations in CUDA is proper use of the
</section>

<section class="section-header" id="optimization-11">
    <h2>Opt #11 — Prefer <code>&lt;cmath&gt;</code> Intrinsics Over <code>&lt;algorithm&gt;</code>
        in CUDA</h2>

    <p>
</section>

<section class="section-header" id="optimization-12">
    <h2>Opt #12 — Roll Your Own RNG (LCG + Hash) Instead of <code>curand</code></h2>

    <p>
        When working with real-time GPU workloads like path tracing, CUDA's


<section class="section-header" id="optimization-13">
    <h2>Opt #13 — Branchless Material Sampling &amp; Evaluation</h2>

    <p>
        My old implementation uses a <code>switch</code> over material types (Lambert, Metal,
</section>

<section class="section-header" id="optimization-14">
    <h2>Opt #14 — Bypass CPU Staging with CUDA↔OpenGL Interop</h2>

    <p>
        Early on, I rendered each frame by copying the CUDA output into an <code>sf::Image</code> (CPU
</p>


<div class="image-container" data-preview="true">

    <img src="images/RTIOW/Screenshot 2025-06-12 104519.png" class="preview-image"
<p>
    These are from my RTX 3080 run, if you'd like to explore the kernel analysis yourself.
</p>

<table class="perf-table">
    <thead>
        <tr>
        </tr>
        <tr>
            <td>Latest/Improved</td>
            <td>8 ms</td>
            <td><a href="Reports/Latest.ncu-rep" download>Download Latest Report (.ncu-rep)</a></td>
        </tr>
    </tbody>
</table>


<h3>Benchmark Results</h3>


<div class="image-container" data-preview="true">

    <img src="images/RTIOW/Screenshot 2025-06-13 192959.png" class="preview-image"
        alt="CUDA Scheduler Performance">

    <div class="gotcha-card">
        <div class="gotcha-marker pro-tip-marker"></div>

        <div class="image-comments">
            <p>
                The SASS instructions in the capture are mostly FFMA, FMUL, and FMNMX—fast math
                operations ideal for compute-bound CUDA kernels. This shows that the ray tracing inner
                loops map efficiently to the hardware, with minimal branching and few memory stalls;
                most remaining stalls come from scheduler choices or instruction dependencies rather
                than memory. This efficient math mapping is a direct result of the optimizations
                described above and explains the speedup in the final benchmarks.
            </p>
        </div>
    </div>
</div>

<p>
    Below are the measured frame times in milliseconds. Lower is better:
</p>