GPU batched Nash cascade — bit-identical results at 2.7M catchments #169

@consigcody94

Description

Summary

The Nash cascade subsurface routing in nash_cascade.c can run on GPU with one thread per catchment. For the NWM's 2.7M catchments, this produces bit-identical results to the CPU implementation.

Benchmark (RTX 3060)

| Catchments | CPU | GPU | Speedup | Max error |
|---|---|---|---|---|
| 10K | <1 ms | 0.107 ms | | 0 |
| 100K | 1 ms | 0.464 ms | 2.2x | 0 |
| 1M | 10 ms | 5.1 ms | 2.0x | 0 |
| 2.7M (full NWM) | 28 ms | 14.2 ms | 2.0x | 0 |

Zero error across all test configurations. The GPU kernel is algorithmically identical to the CPU code.

Why the Speedup is Modest

The Nash cascade is very simple arithmetic (a multiply, an add, and a subtract per reservoir). At ~6 FLOPs per reservoir per catchment, the kernel is memory-bandwidth bound, not compute bound, so the achievable speedup is set by the RTX 3060's ~360 GB/s memory bandwidth versus the CPU's ~50 GB/s, less fixed per-launch overhead.

The real benefit of GPU porting would come from eliminating the Python/Cython overhead in the full NWM pipeline, where the per-timestep kernel call overhead likely dominates the actual computation.

Code

https://github.com/consigcody94/parallel-prefix-rt/blob/master/benchmarks/cuda/owp_batched_kernels.cu
