
Add NCCL extension and implement NCCLFourierTridiagonalPoissonSolver #5445

Draft

glwagner wants to merge 29 commits into main from glw/nccl-distributed-solver

Conversation

@glwagner
Member

Experimenting with NCCL to accelerate multi-GPU Poisson solver.

Gregory Wagner and others added 8 commits March 26, 2026 12:47
Avoid calling `sync_device!` (CUDA.synchronize) in `fill_corners!` when
all corner connectivity is `nothing`. This occurs for slab decompositions
and single-rank distributed configurations where no diagonal neighbors exist.

Each `sync_device!` call flushes the GPU's async execution pipeline. With
~300 halo fills per 10 timesteps in a typical Breeze.jl anelastic benchmark,
the unnecessary syncs cause a 2.95x overhead on a single distributed GPU.
This fix reduces the overhead to 1.19x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
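A minimal sketch of the guard described above, assuming a `connectivity` container with `nothing` entries where no diagonal neighbor exists (all names here are illustrative, not the PR's exact signatures):

```julia
function maybe_fill_corners!(c, connectivity, arch)
    # Slab decompositions and single-rank configs have no diagonal neighbors:
    # skip the expensive device sync entirely.
    all(isnothing, values(connectivity)) && return nothing

    sync_device!(arch)                        # flush the async pipeline before the corner exchange
    fill_corner_halos!(c, connectivity, arch) # hypothetical corner-exchange routine
    return nothing
end
```

The point is that the sync is only needed when an exchange actually follows; when every connectivity entry is `nothing`, returning early avoids flushing the GPU pipeline.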
For doubly-periodic grids on GPU, plan the y-FFT along dimension 2
directly instead of reshaping to (Ny, Nx, Nz) and planning along
dimension 1. This eliminates two permutedims! calls per y-FFT, which
were the dominant cost in the distributed solver (69% of solve time).

The old approach was: reshape → plan on dim 1 → permutedims → FFT → permutedims
The new approach is: plan on dim 2 → FFT (cuFFT handles strided access natively)

For Bounded y-topology, the old reshape approach is preserved because
twiddle factors and index permutations assume a dim-1 layout.

Also skip y-buffer allocation for Periodic y (no index permutation needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
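The two planning strategies can be sketched with CUDA.jl's FFT interface (requires a CUDA-capable device; note a later commit in this PR revisits this choice after benchmarking showed strided dim-2 plans decompose into many small kernels):

```julia
using CUDA

A = CUDA.rand(ComplexF32, 16, 16, 4)  # (Nx, Ny, Nz)

# Old path: reshape to (Ny, Nx, Nz), plan along contiguous dim 1,
# and pay two permutedims! calls per transform.
# New path for Periodic y: plan directly along the strided dim 2 and
# let cuFFT handle the strided access.
plan_y = CUDA.CUFFT.plan_fft!(A, 2)
plan_y * A  # in-place y-FFT, no permutedims
```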
For slab-x decomposition (Ry=1), the z-direction is always fully local
in x-local space. This means the tridiagonal solve can be performed
directly after the forward FFTs in x-local space, without transposing
back to z-local space first.

The optimized algorithm:
  y-FFT → transpose_y_to_x(MPI) → x-FFT → tridiag_solve → x-IFFT → transpose_x_to_y(MPI) → y-IFFT

instead of the general algorithm:
  y-FFT → transpose_y_to_x(MPI) → x-FFT → transpose_x_to_y(MPI) → tridiag_solve → transpose_y_to_x(MPI) → x-IFFT → transpose_x_to_y(MPI) → y-IFFT

This halves the number of MPI all-to-all operations per pressure solve
from 4 to 2, reducing both communication volume and pipeline stalls
from GPU synchronization.

The tridiagonal solver is set up on the x-local grid with correctly
partitioned eigenvalues. The general pencil decomposition path is
preserved for Ry > 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
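The optimized path can be sketched as follows (all function names are hypothetical stand-ins for the solver's actual routines):

```julia
# Ry == 1 (slab-x): z is fully local in x-local space, so the tridiagonal
# solve happens between the forward and inverse x-FFTs with no extra transposes.
function slab_x_solve!(s)
    y_fft!(s)
    transpose_y_to_x!(s)   # MPI all-to-all #1
    x_fft!(s)
    tridiagonal_solve!(s)  # no round-trip to z-local space needed
    x_ifft!(s)
    transpose_x_to_y!(s)   # MPI all-to-all #2
    y_ifft!(s)
end
```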
Cray MPICH's Alltoallv is catastrophically slow for GPU buffers
(28x slower than Alltoall on A100 GPUs over NVLink). Since most
distributed ocean/atmosphere simulations use equal partition sizes,
use Alltoall when all chunk counts are equal, falling back to
Alltoallv for unequal partitions.

Benchmark results (transpose round-trip, 200×200×80 ComplexF32):
  2 GPUs: Alltoallv 52.0 ms → Alltoall 1.85 ms (28x faster)
  4 GPUs: Alltoallv 46.8 ms → Alltoall 2.20 ms (21x faster)
  8 GPUs: Alltoallv 70.5 ms → Alltoall 8.69 ms (8x faster)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
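The fast-path selection can be sketched with MPI.jl's buffer types (illustrative, not the PR's exact code):

```julia
using MPI

# Use uniform-count Alltoall when every rank exchanges equal-sized chunks,
# falling back to Alltoallv only for unequal partitions.
function fast_alltoall!(send, recv, counts, comm)
    if all(==(first(counts)), counts)
        # Equal chunks: avoids Cray MPICH's slow GPU Alltoallv path.
        n = first(counts)
        MPI.Alltoall!(MPI.UBuffer(send, n), MPI.UBuffer(recv, n), comm)
    else
        MPI.Alltoallv!(MPI.VBuffer(send, counts), MPI.VBuffer(recv, counts), comm)
    end
    return recv
end
```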
Add OceananigansNCCLExt that replaces MPI Alltoall with NCCL
Send/Recv for distributed FFT transposes. NCCL operations are
GPU-stream-native, eliminating sync_device! pipeline stalls that
dominate the current MPI-based solver cost.

The extension provides NCCLDistributedFFTSolver which wraps the
existing DistributedFourierTridiagonalPoissonSolver and overrides
solve! to use NCCL grouped Send/Recv instead of MPI Alltoall.

Benchmark results (A100-SXM4-80GB, NV12 NVLink, 200x200x80/GPU):
  2 GPUs: 2.82 ms/solve (NCCL)
  4 GPUs: 2.95 ms/solve (NCCL)

Extension files:
  - nccl_communicator.jl: NCCL comm creation from MPI subcommunicator
  - nccl_transpose.jl: NCCL-based alltoall (no sync_device!)
  - nccl_solver.jl: NCCLDistributedFFTSolver with solve! for all
    stretched directions (Z, Y, X) plus slab-x optimized path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NCCLDistributed replaces MPI Isend/Irecv with NCCL grouped Send/Recv
for all halo communication (sides + corners), eliminating sync_device!
stalls throughout the time-stepping loop.

Usage:
  arch = NCCLDistributed(GPU(); partition = Partition(Ngpus, 1))
  grid = RectilinearGrid(arch, ...)
  # All halo fills now use NCCL automatically

The NCCLCommunicator wrapper carries both NCCL (for GPU data transfer)
and MPI (for reductions, initialization) communicators. MPI collective
operations are forwarded to the inner MPI comm transparently.

Benchmark: 87 μs/halo fill on 2x A100-80GB with NV12 NVLink.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@glwagner glwagner marked this pull request as draft March 27, 2026 21:34
- Use yfield (not xfield/zfield) for arch check in transpose overrides,
  since twin_grid creates new architectures without NCCLCommunicator
- Add __precompile__(false) to allow method overwriting of base transposes
- Add NCCL subcommunicator cache for TransposableField MPI subcommunicators
- Add MPI.Allreduce, Comm_split_type, Isend, Irecv forwarding
- sync_device! is now a no-op for NCCLDistributed architecture

Full simulation tested: 10 time steps at 76.3 ms/step on 2x A100-80GB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
##### cache them in a global dictionary keyed by the MPI comm.
#####

const _nccl_subcomm_cache = Dict{MPI.Comm, NCCL.Communicator}()
Collaborator

This needs a lock to guard concurrent writes (and reads).
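One way to address this (a sketch, not the PR's code): guard the global cache with a `ReentrantLock` so concurrent tasks cannot race on the `Dict`; `create_nccl_communicator` below is a hypothetical constructor.

```julia
const _nccl_subcomm_cache = Dict{MPI.Comm, NCCL.Communicator}()
const _nccl_subcomm_lock  = ReentrantLock()

function nccl_subcomm(comm::MPI.Comm)
    # get! under the lock makes lookup-or-create atomic for both reads and writes.
    lock(_nccl_subcomm_lock) do
        get!(_nccl_subcomm_cache, comm) do
            create_nccl_communicator(comm)  # hypothetical: wraps the MPI subcommunicator
        end
    end
end
```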

@giordano added the performance 🏍️ (So we can get the wrong answer even faster), GPU 👾 (Where Oceananigans gets its powers from), and distributed 🕸️ (Our plan for total cluster domination) labels on Mar 27, 2026
glwagner and others added 9 commits March 27, 2026 23:53
Instead of one NCCL groupStart/groupEnd per field per side, batch all
distributed Send/Recv for a field into a single NCCL group call.
This reduces NCCL kernel launches from ~120/timestep to ~30/timestep.

Override fill_halo_regions! for DistributedField with NCCLCommunicator
to use the batched path: pack all sides → one NCCL group with all
Send/Recv → unpack all sides + fill local halos.

Performance (200×200×80/GPU, A100-80GB, NV12 NVLink):
  2 GPUs: 53.5 → 17.4 ms/timestep (3.1x improvement)
  4 GPUs: 21.9 → 19.5 ms/timestep (1.1x improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Mosè Giordano <765740+giordano@users.noreply.github.com>
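The batched exchange can be sketched like this (`pack_all_sides!` and `unpack_all_sides!` are hypothetical helpers, and the NCCL.jl call spellings are approximate):

```julia
# One groupStart/groupEnd around every side's Send/Recv, so NCCL fuses the
# whole field's halo exchange into a single launch instead of one group per side.
function batched_halo_exchange!(field, neighbors, comm)
    buffers = pack_all_sides!(field)   # pack west/east/south/north/... send buffers
    NCCL.groupStart()
    for (side, rank) in neighbors
        NCCL.Send(buffers[side].send, comm; dest = rank)
        NCCL.Recv!(buffers[side].recv, comm; source = rank)
    end
    NCCL.groupEnd()
    unpack_all_sides!(field, buffers)  # copy received halos back into the field
    return nothing
end
```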
When fill_halo_regions! is called with a tuple/NamedTuple of fields
(as in update_state!), batch ALL fields' distributed Send/Recv into
a single NCCL groupStart/groupEnd call. This reduces NCCL kernel
launches from ~N*sides per call to 1 per call.

Performance (200x200x80/GPU, A100-80GB, NV12 NVLink):
  2 GPUs: 17.4 -> 16.0 ms/timestep
  4 GPUs: 19.5 -> 16.0 ms/timestep (near-perfect weak scaling!)
  1 GPU (non-distributed reference): 8.9 ms/timestep

Total speedup from all NCCL optimizations vs initial:
  2 GPUs: 53.5 -> 16.0 ms (3.3x)
  4 GPUs: 21.9 -> 16.0 ms (1.4x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NCCLDistributedArch, NCCLDistributedGrid, NCCLDistributedField type aliases
- No underscore-prefixed functions
- All multi-line functions have explicit return
- Dispatch on NCCLDistributedArch/Field instead of runtime isa checks
- Remove broken async overlap (will implement properly in next PR)
- Revert NCCLCommunicator to simple nccl+mpi (no comm_stream yet)

Performance (200x200x80/GPU, A100-80GB, NV12 NVLink, WENO5):
  1 GPU:  13.37 ms/step (baseline)
  2 GPUs: 23.46 ms/step (57% efficiency)
  4 GPUs: 21.58 ms/step (62% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dispatch on NCCLDistributedGrid instead of overwriting the base method.
This eliminates type piracy and removes the need for __precompile__(false).

The NCCL extension now cleanly extends:
  - distributed_fill_halo_event!(c, k, bcs, loc, grid::NCCLDistributedGrid, ...)
  - fill_corners!(c, conn, idx, loc, arch::NCCLDistributedArch, ...)
  - sync_device!(::NCCLDistributedArch) = nothing

ERF-like weak scaling (50x400x80/GPU, Centered+diffusion):
  1 GPU:  6.85 ms/step
  2 GPUs: 18.22 ms/step (37.6% efficiency)
  4 GPUs: 16.49 ms/step (41.5% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Statistics.mean() on CuArray{ComplexF32} triggers a broken mapreduce
scalar fallback path on CUDA 13 (KernelException at thread 241,
block 97). Replace with sum(ϕ)/length(ϕ) which uses the GPU-native
sum reduction that handles ComplexF32 correctly.

This fixes the ERF anelastic benchmark crash on Perlmutter with
cudatoolkit/13.0 + nccl/2.29.2-cu13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
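The workaround is a one-line substitution (sketch; the crash itself is specific to the CUDA 13 + ComplexF32 combination described above):

```julia
using CUDA, Statistics

ϕ = CUDA.rand(ComplexF32, 32, 32, 8)
# mean(ϕ) hits the broken mapreduce scalar fallback on CUDA 13;
# the GPU-native sum reduction handles ComplexF32 correctly:
ϕ̄ = sum(ϕ) / length(ϕ)
```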
glwagner and others added 5 commits March 29, 2026 00:12
When async=true, NCCL Send/Recv runs on a dedicated comm_stream while
interior tendency computation proceeds on the default stream.
synchronize_communication! waits via CUDA event (no CPU-GPU sync).

Improvement (WENO5, 200x200x80/GPU, A100-80GB):
  2 GPUs: 23.46 -> 21.66 ms/step (8% improvement)
  4 GPUs: similar (~21-23 ms, some variability)

ERF-like (50x400x80/GPU):
  4 GPUs: 16.49 -> 14.71 ms/step (11% improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All NCCL operations (halo fills + solver transposes) now execute on
a dedicated comm_stream. For synchronous calls, CUDA events synchronize
the streams before unpack. This moves NCCL off the default stream,
freeing it for compute kernels.

Nsight verification (2 GPUs, 1024x1024x128):
  Before: 250.8 ms NCCL on default stream, 152.0 ms on comm_stream
  After:   20.4 ms NCCL on default stream, 194.9 ms on comm_stream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
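The stream/event handshake can be sketched with CUDA.jl (illustrative; the PR's actual plumbing lives in the extension, and the exact `wait` semantics should be checked against the CUDA.jl version in use):

```julia
using CUDA

comm_stream = CuStream()
done = CuEvent(CUDA.EVENT_DISABLE_TIMING)  # timing disabled: cheaper event

CUDA.stream!(comm_stream) do
    # ... enqueue NCCL Send/Recv here; they execute on comm_stream ...
end
CUDA.record(done, comm_stream)  # mark completion of the communication work
CUDA.wait(done)                 # default stream waits on the event (cuStreamWaitEvent);
                                # the CPU is not blocked, unlike sync_device!
```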
Replace GPU broadcast kernels with cuMemcpy2DAsync for slab-x halo
pack/unpack. Uses DMA engine instead of compute cores.

Nsight analysis (2 GPUs, 1024x1024x128 F32, 3 timesteps):
- broadcast_kernel_cartesian calls: 630 → 342 (DMA replaces x-direction)
- Total kernel time: 2927 → 2574 ms (12% reduction)
- NCCL total: 486 → 165 ms (DMA pack overlaps better with NCCL)
- Wall-clock: ~435 ms/step (unchanged — DMA already overlapped)

Root cause of remaining overhead: 33 gaps of 1-5ms on default stream
totaling 70 ms — GPU idle while waiting for NCCL via cuStreamWaitEvent.
These are the fundamental synchronization points where the algorithm
requires communication results before proceeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Start tracer halo communication immediately after tracer field advance,
before the velocity step and pressure solver. Start velocity halos right
after pressure correction, before cache_tendencies and update_state.

This overlaps tracer NCCL transfers with the entire pressure solver
pipeline (FFTs, tridiag, transposes), hiding ~15ms of latency.

Large grid results (1024x1024x128/GPU F32, 2 GPUs): 435 → 420 ms (85%)

The pipelined rk3_substep! overrides the base method for NCCLNonhydrostaticModel.
update_state! is also overridden to skip fill_halo when halos are already pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This directive prevented the NCCL extension from being cached,
forcing every job to recompile it from scratch (~30 min on multi-node).
Extensions can safely precompile even when they add methods to the
parent module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gregory LeClaire Wagner and others added 2 commits March 29, 2026 08:59
Without this directive, 8 ranks all try to precompile the NCCL
extension simultaneously. One grabs the pidfile lock, attempts
precompilation, fails (method overwriting not permitted), and
retries indefinitely. Other ranks block on the lock forever.
This caused 2-hour timeouts on multi-node Perlmutter jobs.

The method overwriting that prevents precompilation needs to be
refactored to use proper dispatch instead of type piracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The NCCL extension transpose functions previously overwrote base methods
with identical signatures (e.g. `transpose_y_to_x!(pf::TransposableField)`)
and used runtime `if arch.communicator isa NCCLCommunicator` branching.
This forced `__precompile__(false)`, causing every multi-node job to
recompile the extension from scratch (~2h on Perlmutter Lustre).

Fix: add architecture dispatch to base transpose functions:
  transpose_y_to_x!(pf) → transpose_y_to_x!(arch, pf)

The NCCL extension now adds methods for `NCCLDistributedArch` —
proper dispatch, no overwriting:
  DC.transpose_y_to_x!(::NCCLDistributedArch, pf::TransposableField)

With `__precompile__(false)` removed, the extension caches normally
and multi-node jobs start in seconds instead of hours.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
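The before/after shape of the refactor, sketched (method bodies and the `mpi_`/`nccl_` helpers are illustrative):

```julia
# Base package (Oceananigans.DistributedComputations): dispatch on architecture.
transpose_y_to_x!(pf::TransposableField) = transpose_y_to_x!(architecture(pf.xfield), pf)
transpose_y_to_x!(arch, pf::TransposableField) = mpi_transpose_y_to_x!(pf)  # default MPI path

# Extension (OceananigansNCCLExt): a *new* method, no overwriting, so the
# extension precompiles and caches normally.
function DC.transpose_y_to_x!(arch::NCCLDistributedArch, pf::TransposableField)
    return nccl_transpose_y_to_x!(pf)  # hypothetical NCCL implementation
end
```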
@@ -0,0 +1,148 @@
#####
##### Pipelined RK3: start halo communication early, overlap with pressure solver
Collaborator

This is not specific to NCCL, right? It applies to all the communication models. Also, I think the halo fills of tracers are already pipelined with the computation of tendencies.

Member Author

this is overlap with pressure solver, not tendencies

I agree though that it's not specific to NCCL.

glwagner and others added 3 commits March 31, 2026 19:31
cuFFT decomposes dim-2 (strided) FFTs into many small kernels
(one per z-level). Reshaping to (Ny, Nx, Nz) and planning along
contiguous dim 1 is 3.4x faster despite the permutedims overhead.

Changes:
  - discrete_transforms.jl: set transpose_dims=(2,1,3) for ALL
    GPU dim-2 transforms, not just Bounded
  - plan_distributed_transforms.jl: use reshape + dim-1 plan for
    all GPU y-FFTs
  - distributed_fft_tridiagonal_solver.jl: allocate y-buffer for
    all GPU topologies (needed for permutedims workspace)

Results (50x400x80/GPU, x-distributed, 2 GPUs):
  NonhydrostaticModel: 12.60 → 8.40 ms (1.50x speedup, 37% → 56% eff)
  AnelasticDynamics:   14.68 → 11.32 ms (1.30x speedup, 40% → 52% eff)
  CompressibleDynamics: unchanged (no FFT solver)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove cuMemcpy2D halo pack/unpack — it crashed on 2D barotropic
fields (z-size 1 doesn't match grid Nz). The broadcast kernel
handles all field types correctly and has equivalent performance.

HFSM with NCCL now works (SplitExplicitFreeSurface, SeawaterBuoyancy):
  200x200x50/GPU, WENOVectorInvariant, WENO(order=7), 30 substeps
  1 GPU: 4.4 ms/step
  2 GPU: 5.4 ms/step (81% efficiency)
  4 GPU: 6.1 ms/step (72% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract alltoall_transpose!() as an overridable dispatch point in the
distributed FFT transpose. The NCCL extension implements it using
grouped NCCL.Send/Recv pairs on the global NCCL communicator, mapping
sub-communicator ranks to global ranks.

This enables NonhydrostaticModel distributed on machines without
GPU-aware MPI (HFSM already worked).

Also removes cuMemcpy2D halo pack/unpack that crashed on 2D fields.

Strong scaling results (8x A100 SXM4, NCCL):
  HFSM 2800x2800x100 SplitExplicit: 92-95% at 8 GPUs
  NHM 1600x1600x100 FFT solver: 34% at 8 GPUs (transpose-limited)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
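The extension hook can be sketched as follows (`global_nccl_communicator`, `subrank_to_global`, and `chunk` are hypothetical helpers, and the NCCL.jl call spellings are approximate):

```julia
# Base code calls an overridable alltoall_transpose!; the NCCL method maps
# sub-communicator ranks to global NCCL ranks before the grouped Send/Recv.
function alltoall_transpose!(arch::NCCLDistributedArch, recv, send, subcomm)
    nccl = global_nccl_communicator(arch)  # hypothetical accessor
    n = MPI.Comm_size(subcomm)
    NCCL.groupStart()
    for r in 0:n-1
        g = subrank_to_global(subcomm, r)  # hypothetical rank translation
        NCCL.Send(chunk(send, r, n), nccl; dest = g)    # chunk: view of rank r's slice
        NCCL.Recv!(chunk(recv, r, n), nccl; source = g)
    end
    NCCL.groupEnd()
    return nothing
end
```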
@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 2.28311% with 428 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.49%. Comparing base (f70a4e4) to head (56d454a).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
ext/OceananigansNCCLExt/nccl_distributed.jl 0.00% 158 Missing ⚠️
ext/OceananigansNCCLExt/nccl_solver.jl 0.00% 85 Missing ⚠️
ext/OceananigansNCCLExt/nccl_transpose.jl 0.00% 64 Missing ⚠️
ext/OceananigansNCCLExt/nccl_pipelined_rk3.jl 0.00% 49 Missing ⚠️
...Computations/distributed_fft_tridiagonal_solver.jl 0.00% 34 Missing ⚠️
ext/OceananigansNCCLExt/nccl_zero_copy_halos.jl 0.00% 20 Missing ⚠️
ext/OceananigansNCCLExt/nccl_communicator.jl 0.00% 14 Missing ⚠️
...ributedComputations/plan_distributed_transforms.jl 25.00% 3 Missing ⚠️
...c/DistributedComputations/distributed_transpose.jl 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5445      +/-   ##
==========================================
- Coverage   73.59%   72.49%   -1.10%     
==========================================
  Files         400      406       +6     
  Lines       22867    23325     +458     
==========================================
+ Hits        16829    16910      +81     
- Misses       6038     6415     +377     
Flag Coverage Δ
buildkite 67.71% <0.22%> (-1.30%) ⬇️
julia 67.71% <0.22%> (-1.30%) ⬇️
reactant_1 6.33% <0.00%> (-0.13%) ⬇️
reactant_2 10.25% <0.00%> (-0.20%) ⬇️
reactant_3 9.39% <0.00%> (-0.18%) ⬇️



Fix: dispatch all 4 transpose directions on architecture(pf.zfield)
instead of architecture(pf.fromfield). The fromfield's grid uses MPI
sub-communicators (not NCCLCommunicator), so only transpose_z_to_y!
was dispatching to the NCCL path. The other 3 transposes fell through
to CPU-staging fallback, causing 8.4x slowdown.

Tested native ncclAlltoAll (NCCL ≥ 2.20) vs grouped Send/Recv:
  NHM 8x1 1600x1600x100: 444.7 ms vs 441.5 ms (identical).
  Grouped Send/Recv is simpler and equally fast — reverted to it.

Also cleaned up dead code in nccl_distributed.jl (alltoall_transpose!
is now a simple CPU-staging fallback for non-NCCL CuArray buffers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

- distributed 🕸️ Our plan for total cluster domination
- extensions 🧬
- GPU 👾 Where Oceananigans gets its powers from
- performance 🏍️ So we can get the wrong answer even faster


4 participants