You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Scale Sweep (A100, long kernel, default first-hit, 10k steps)
Paths
Time
Thesis 1070
10,000
0.012s
—
50,000
0.016s
—
100,000
0.026s
—
500,000
0.101s
—
1,000,000
0.192s
—
2,500,000
0.462s
—
5,000,000
0.914s
5.08s
Throughput: 5.5M paths/sec (A100) vs thesis 1M paths/sec (GTX 1070).
Gold Standard (A100, long kernel, dt=1E-8)
Test
Particles
Steps
Time
KS p-value
1D first-hit
10,000,000
1,000,000
143s
0.047 (PASS)
10 trillion particle-steps in under 3 minutes.
Validation
Test
Backend
Result
Notes
1D first-hit KS (1k paths)
GPU
PASS
1D first-hit KS (1k paths)
CPU
PASS
1D first-hit KS (10k paths)
GPU
PASS
After Brownian bridge correction
3D spherical KS (10k paths)
GPU
PASS
3D spherical binomial (10k)
GPU
PASS
After Brownian bridge correction
3D spherical (1k paths)
CPU
PASS
CPU/GPU agreement (5k paths)
Both
PASS
KS p > 0.01
GPU RNG quality (Philox)
GPU
PASS
CPU RNG quality (Box-Muller)
CPU
PASS
GPU reproducibility (seed=42)
GPU
PASS
Identical output
CPU reproducibility (seed=42)
CPU
PASS
Identical output
Wall reflection (10k paths)
GPU
PASS
Performance regression
GPU
PASS
No regression vs baseline
Optimization History
Change
GPU Impact
Bug fixes (gridSize, abs, float/double)
Correctness
CPU cleanup (stack vars, sqrt reduction)
~2x CPU speedup
Device-side hit detection
25-224x — eliminated per-timestep cudaMemcpy
Precompute constants + curand_normal4 + block 256
~5x on top of above
Long kernel (d_simulate_isolated)
~3x on top of above
Brownian bridge correction
Correctness (fixes boundary-crossing bias)
Wide kernel (d_update) with same optimizations
Same improvements, global memory positions
CUDA Graphs for wide kernel
Reverted — added overhead, no improvement
Total vs thesis wide baseline
~2,000x
Known Limitations
--use_fast_math disabled: caused wall reflection test failures due to reduced
trig precision. Could be re-enabled selectively for non-reflection code paths.