GPU batched Nash cascade — bit-identical results at 2.7M catchments #169

@consigcody94

Description

Summary

The Nash cascade subsurface routing in nash_cascade.c can run on GPU with one thread per catchment. For the NWM's 2.7M catchments, this produces bit-identical results to the CPU implementation.

Benchmark (RTX 3060)

| Catchments | CPU | GPU | Speedup | Max error |
|---|---|---|---|---|
| 10K | <1 ms | 0.107 ms | | 0 |
| 100K | 1 ms | 0.464 ms | 2.2x | 0 |
| 1M | 10 ms | 5.1 ms | 2.0x | 0 |
| 2.7M (full NWM) | 28 ms | 14.2 ms | 2.0x | 0 |

Zero error across all test configurations. The GPU kernel is algorithmically identical to the CPU code.

Why the Speedup is Modest

The Nash cascade is very simple arithmetic (a multiply, an add, and a subtract per reservoir). At ~6 FLOPs per reservoir per catchment, the kernel is memory-bandwidth bound, not compute bound, so the achievable speedup is set by the RTX 3060's ~360 GB/s memory bandwidth versus the CPU's ~50 GB/s, less fixed per-launch overhead.

The real benefit of GPU porting would come from eliminating the Python/Cython overhead in the full NWM pipeline, where the per-timestep kernel call overhead likely dominates the actual computation.

Code

https://github.com/consigcody94/parallel-prefix-rt/blob/master/benchmarks/cuda/owp_batched_kernels.cu
