Summary
The Nash cascade subsurface routing in nash_cascade.c can run on a GPU with one thread per catchment. For the NWM's 2.7M catchments, the GPU version produces results that are bit-identical to the CPU implementation.
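To make the one-thread-per-catchment mapping concrete, here is a minimal C sketch of a per-catchment cascade step. It is not the code from nash_cascade.c: the function names, signatures, and the catchment-major storage layout are assumptions chosen for illustration. Each iteration of the outer loop touches only its own catchment's data, so on a GPU that loop index could be replaced by the thread index (in CUDA, `blockIdx.x * blockDim.x + threadIdx.x`).

```c
#include <stddef.h>

/* Hypothetical helper: advance one catchment's cascade by one timestep.
 * storage[] holds n_res reservoir levels and is updated in place.
 * Returns the outflow of the last reservoir. */
static double nash_step(double inflow, int n_res, double k, double *storage)
{
    double q_prev = inflow; /* flow entering the current reservoir */
    double q = 0.0;         /* flow leaving the current reservoir  */

    for (int i = 0; i < n_res; i++) {
        q = k * storage[i];   /* linear-reservoir outflow (multiply) */
        storage[i] -= q;      /* drain this reservoir (subtract)     */
        storage[i] += q_prev; /* add flow from upstream (add)        */
        q_prev = q;
    }
    return q;
}

/* Hypothetical driver: one independent cascade per catchment.
 * Assumed layout: storage[c * n_res + i] is reservoir i of catchment c.
 * No iteration reads or writes another catchment's data, which is what
 * allows a one-thread-per-catchment GPU mapping. */
void nash_step_all(size_t n_catch, int n_res,
                   const double *inflow, const double *k,
                   double *storage, double *outflow)
{
    for (size_t c = 0; c < n_catch; c++) {
        outflow[c] = nash_step(inflow[c], n_res, k[c],
                               &storage[c * (size_t)n_res]);
    }
}
```

Keeping the subtract and the add as separate statements fixes the order of floating-point operations, which matters if the goal is bit-identical output between two implementations.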
Benchmark (RTX 3060)
| Catchments | CPU | GPU | Speedup | Max error |
|---|---|---|---|---|
| 10K | <1 ms | 0.107 ms | — | 0 |
| 100K | 1 ms | 0.464 ms | 2.2x | 0 |
| 1M | 10 ms | 5.1 ms | 2.0x | 0 |
| 2.7M (full NWM) | 28 ms | 14.2 ms | 2.0x | 0 |
Zero error across all test configurations. The GPU kernel is algorithmically identical to the CPU code.
Why the Speedup is Modest
The Nash cascade is very simple arithmetic (multiply, add, subtract per reservoir). At ~6 FLOPs per reservoir per catchment, the kernel is memory-bandwidth bound, not compute bound. The achievable speedup is therefore limited by the ratio of memory bandwidths, roughly 360 GB/s for the RTX 3060 versus about 50 GB/s for the CPU. The measured 2x sits well below that ratio of about 7x.
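The memory-bound claim can be checked with a rough roofline-style comparison: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) is below the device's ratio of peak compute to bandwidth. The sketch below is illustrative only. The function names are hypothetical, the 16 bytes per reservoir assumes one double read and one double written with all other traffic ignored, and the peak compute figure is a placeholder parameter, not a measured or vendor-quoted value.

```c
/* Arithmetic intensity: FLOPs performed per byte of memory traffic. */
double arithmetic_intensity(double flops, double bytes)
{
    return flops / bytes;
}

/* Returns 1 if a kernel with the given intensity is memory-bound on a
 * device with the given peak compute (FLOP/s) and bandwidth (bytes/s),
 * i.e. if its intensity is below the device's FLOP-per-byte balance. */
int is_memory_bound(double intensity, double peak_flops, double bandwidth)
{
    return intensity < peak_flops / bandwidth;
}
```

With the issue's ~6 FLOPs per reservoir and the assumed 16 bytes of traffic, `arithmetic_intensity(6.0, 16.0)` is 0.375 FLOP/byte. At 360 GB/s, any device with peak compute above about 135 GFLOP/s has a higher balance point than that, so the kernel would be classified as memory-bound.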
The larger benefit of a GPU port would likely come from eliminating the Python/Cython overhead in the full NWM pipeline, where the per-timestep kernel call overhead likely dominates the actual computation.