
Add NCCL extension and implement NCCLFourierTridiagonalPoissonSolver #5445

Draft

glwagner wants to merge 29 commits into main from glw/nccl-distributed-solver

Conversation

@glwagner
Member

Experimenting with NCCL to accelerate multi-GPU Poisson solver.

Gregory Wagner and others added 8 commits March 26, 2026 12:47
Avoid calling `sync_device!` (CUDA.synchronize) in `fill_corners!` when
all corner connectivity is `nothing`. This occurs for slab decompositions
and single-rank distributed configurations where no diagonal neighbors exist.

Each `sync_device!` call flushes the GPU's async execution pipeline. With
~300 halo fills per 10 timesteps in a typical Breeze.jl anelastic benchmark,
the unnecessary syncs cause a 2.95x overhead on a single distributed GPU.
This fix reduces the overhead to 1.19x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
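A minimal sketch of the guard described above, assuming a `connectivity` container with `nothing` entries where no diagonal neighbor exists (all names here are illustrative, not the PR's exact signatures):

```julia
function maybe_fill_corners!(c, connectivity, arch)
    # Slab decompositions and single-rank configs have no diagonal neighbors:
    # skip the expensive device sync entirely.
    all(isnothing, values(connectivity)) && return nothing

    sync_device!(arch)                        # flush the async pipeline before the corner exchange
    fill_corner_halos!(c, connectivity, arch) # hypothetical corner-exchange routine
    return nothing
end
```

The point is that the sync is only needed when an exchange actually follows; when every connectivity entry is `nothing`, returning early avoids flushing the GPU pipeline.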
For doubly-periodic grids on GPU, plan the y-FFT along dimension 2
directly instead of reshaping to (Ny, Nx, Nz) and planning along
dimension 1. This eliminates two permutedims! calls per y-FFT, which
were the dominant cost in the distributed solver (69% of solve time).

The old approach was: reshape → plan on dim 1 → permutedims → FFT → permutedims
The new approach is: plan on dim 2 → FFT (cuFFT handles strided access natively)

For Bounded y-topology, the old reshape approach is preserved because
twiddle factors and index permutations assume a dim-1 layout.

Also skip y-buffer allocation for Periodic y (no index permutation needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
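The two planning strategies can be sketched with CUDA.jl's FFT interface (requires a CUDA-capable device; note a later commit in this PR revisits this choice after benchmarking showed strided dim-2 plans decompose into many small kernels):

```julia
using CUDA

A = CUDA.rand(ComplexF32, 16, 16, 4)  # (Nx, Ny, Nz)

# Old path: reshape to (Ny, Nx, Nz), plan along contiguous dim 1,
# and pay two permutedims! calls per transform.
# New path for Periodic y: plan directly along the strided dim 2 and
# let cuFFT handle the strided access.
plan_y = CUDA.CUFFT.plan_fft!(A, 2)
plan_y * A  # in-place y-FFT, no permutedims
```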
For slab-x decomposition (Ry=1), the z-direction is always fully local
in x-local space. This means the tridiagonal solve can be performed
directly after the forward FFTs in x-local space, without transposing
back to z-local space first.

The optimized algorithm:
  y-FFT → transpose_y_to_x(MPI) → x-FFT → tridiag_solve → x-IFFT → transpose_x_to_y(MPI) → y-IFFT

instead of the general algorithm:
  y-FFT → transpose_y_to_x(MPI) → x-FFT → transpose_x_to_y(MPI) → tridiag_solve → transpose_y_to_x(MPI) → x-IFFT → transpose_x_to_y(MPI) → y-IFFT

This halves the number of MPI all-to-all operations per pressure solve
from 4 to 2, reducing both communication volume and pipeline stalls
from GPU synchronization.

The tridiagonal solver is set up on the x-local grid with correctly
partitioned eigenvalues. The general pencil decomposition path is
preserved for Ry > 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
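The optimized path can be sketched as follows (all function names are hypothetical stand-ins for the solver's actual routines):

```julia
# Ry == 1 (slab-x): z is fully local in x-local space, so the tridiagonal
# solve happens between the forward and inverse x-FFTs with no extra transposes.
function slab_x_solve!(s)
    y_fft!(s)
    transpose_y_to_x!(s)   # MPI all-to-all #1
    x_fft!(s)
    tridiagonal_solve!(s)  # no round-trip to z-local space needed
    x_ifft!(s)
    transpose_x_to_y!(s)   # MPI all-to-all #2
    y_ifft!(s)
end
```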
Cray MPICH's Alltoallv is catastrophically slow for GPU buffers
(28x slower than Alltoall on A100 GPUs over NVLink). Since most
distributed ocean/atmosphere simulations use equal partition sizes,
use Alltoall when all chunk counts are equal, falling back to
Alltoallv for unequal partitions.

Benchmark results (transpose round-trip, 200×200×80 ComplexF32):
  2 GPUs: Alltoallv 52.0 ms → Alltoall 1.85 ms (28x faster)
  4 GPUs: Alltoallv 46.8 ms → Alltoall 2.20 ms (21x faster)
  8 GPUs: Alltoallv 70.5 ms → Alltoall 8.69 ms (8x faster)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
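The fast-path selection can be sketched with MPI.jl's buffer types (illustrative, not the PR's exact code):

```julia
using MPI

# Use uniform-count Alltoall when every rank exchanges equal-sized chunks,
# falling back to Alltoallv only for unequal partitions.
function fast_alltoall!(send, recv, counts, comm)
    if all(==(first(counts)), counts)
        # Equal chunks: avoids Cray MPICH's slow GPU Alltoallv path.
        n = first(counts)
        MPI.Alltoall!(MPI.UBuffer(send, n), MPI.UBuffer(recv, n), comm)
    else
        MPI.Alltoallv!(MPI.VBuffer(send, counts), MPI.VBuffer(recv, counts), comm)
    end
    return recv
end
```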
Add OceananigansNCCLExt that replaces MPI Alltoall with NCCL
Send/Recv for distributed FFT transposes. NCCL operations are
GPU-stream-native, eliminating sync_device! pipeline stalls that
dominate the current MPI-based solver cost.

The extension provides NCCLDistributedFFTSolver which wraps the
existing DistributedFourierTridiagonalPoissonSolver and overrides
solve! to use NCCL grouped Send/Recv instead of MPI Alltoall.

Benchmark results (A100-SXM4-80GB, NV12 NVLink, 200x200x80/GPU):
  2 GPUs: 2.82 ms/solve (NCCL)
  4 GPUs: 2.95 ms/solve (NCCL)

Extension files:
  - nccl_communicator.jl: NCCL comm creation from MPI subcommunicator
  - nccl_transpose.jl: NCCL-based alltoall (no sync_device!)
  - nccl_solver.jl: NCCLDistributedFFTSolver with solve! for all
    stretched directions (Z, Y, X) plus slab-x optimized path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NCCLDistributed replaces MPI Isend/Irecv with NCCL grouped Send/Recv
for all halo communication (sides + corners), eliminating sync_device!
stalls throughout the time-stepping loop.

Usage:
  arch = NCCLDistributed(GPU(); partition = Partition(Ngpus, 1))
  grid = RectilinearGrid(arch, ...)
  # All halo fills now use NCCL automatically

The NCCLCommunicator wrapper carries both NCCL (for GPU data transfer)
and MPI (for reductions, initialization) communicators. MPI collective
operations are forwarded to the inner MPI comm transparently.

Benchmark: 87 μs/halo fill on 2x A100-80GB with NV12 NVLink.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@glwagner glwagner marked this pull request as draft March 27, 2026 21:34
- Use yfield (not xfield/zfield) for arch check in transpose overrides,
  since twin_grid creates new architectures without NCCLCommunicator
- Add __precompile__(false) to allow method overwriting of base transposes
- Add NCCL subcommunicator cache for TransposableField MPI subcommunicators
- Add MPI.Allreduce, Comm_split_type, Isend, Irecv forwarding
- sync_device! is now a no-op for NCCLDistributed architecture

Full simulation tested: 10 time steps at 76.3 ms/step on 2x A100-80GB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
##### cache them in a global dictionary keyed by the MPI comm.
#####

const _nccl_subcomm_cache = Dict{MPI.Comm, NCCL.Communicator}()
Collaborator

This needs a lock to guard concurrent writes (and reads).
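One way to address this (a sketch, not the PR's code): guard the global cache with a `ReentrantLock` so concurrent tasks cannot race on the `Dict`; `create_nccl_communicator` below is a hypothetical constructor.

```julia
const _nccl_subcomm_cache = Dict{MPI.Comm, NCCL.Communicator}()
const _nccl_subcomm_lock  = ReentrantLock()

function nccl_subcomm(comm::MPI.Comm)
    # get! under the lock makes lookup-or-create atomic for both reads and writes.
    lock(_nccl_subcomm_lock) do
        get!(_nccl_subcomm_cache, comm) do
            create_nccl_communicator(comm)  # hypothetical: wraps the MPI subcommunicator
        end
    end
end
```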

@giordano added the performance 🏍️ (So we can get the wrong answer even faster), GPU 👾 (Where Oceananigans gets its powers from), and distributed 🕸️ (Our plan for total cluster domination) labels on Mar 27, 2026
glwagner and others added 9 commits March 27, 2026 23:53
Instead of one NCCL groupStart/groupEnd per field per side, batch all
distributed Send/Recv for a field into a single NCCL group call.
This reduces NCCL kernel launches from ~120/timestep to ~30/timestep.

Override fill_halo_regions! for DistributedField with NCCLCommunicator
to use the batched path: pack all sides → one NCCL group with all
Send/Recv → unpack all sides + fill local halos.

Performance (200×200×80/GPU, A100-80GB, NV12 NVLink):
  2 GPUs: 53.5 → 17.4 ms/timestep (3.1x improvement)
  4 GPUs: 21.9 → 19.5 ms/timestep (1.1x improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Mosè Giordano <765740+giordano@users.noreply.github.com>
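The batched exchange can be sketched like this (`pack_all_sides!` and `unpack_all_sides!` are hypothetical helpers, and the NCCL.jl call spellings are approximate):

```julia
# One groupStart/groupEnd around every side's Send/Recv, so NCCL fuses the
# whole field's halo exchange into a single launch instead of one group per side.
function batched_halo_exchange!(field, neighbors, comm)
    buffers = pack_all_sides!(field)   # pack west/east/south/north/... send buffers
    NCCL.groupStart()
    for (side, rank) in neighbors
        NCCL.Send(buffers[side].send, comm; dest = rank)
        NCCL.Recv!(buffers[side].recv, comm; source = rank)
    end
    NCCL.groupEnd()
    unpack_all_sides!(field, buffers)  # copy received halos back into the field
    return nothing
end
```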
When fill_halo_regions! is called with a tuple/NamedTuple of fields
(as in update_state!), batch ALL fields' distributed Send/Recv into
a single NCCL groupStart/groupEnd call. This reduces NCCL kernel
launches from ~N*sides per call to 1 per call.

Performance (200x200x80/GPU, A100-80GB, NV12 NVLink):
  2 GPUs: 17.4 -> 16.0 ms/timestep
  4 GPUs: 19.5 -> 16.0 ms/timestep (near-perfect weak scaling!)
  1 GPU (non-distributed reference): 8.9 ms/timestep

Total speedup from all NCCL optimizations vs initial:
  2 GPUs: 53.5 -> 16.0 ms (3.3x)
  4 GPUs: 21.9 -> 16.0 ms (1.4x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NCCLDistributedArch, NCCLDistributedGrid, NCCLDistributedField type aliases
- No underscore-prefixed functions
- All multi-line functions have explicit return
- Dispatch on NCCLDistributedArch/Field instead of runtime isa checks
- Remove broken async overlap (will implement properly in next PR)
- Revert NCCLCommunicator to simple nccl+mpi (no comm_stream yet)

Performance (200x200x80/GPU, A100-80GB, NV12 NVLink, WENO5):
  1 GPU:  13.37 ms/step (baseline)
  2 GPUs: 23.46 ms/step (57% efficiency)
  4 GPUs: 21.58 ms/step (62% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dispatch on NCCLDistributedGrid instead of overwriting the base method.
This eliminates type piracy and removes the need for __precompile__(false).

The NCCL extension now cleanly extends:
  - distributed_fill_halo_event!(c, k, bcs, loc, grid::NCCLDistributedGrid, ...)
  - fill_corners!(c, conn, idx, loc, arch::NCCLDistributedArch, ...)
  - sync_device!(::NCCLDistributedArch) = nothing

ERF-like weak scaling (50x400x80/GPU, Centered+diffusion):
  1 GPU:  6.85 ms/step
  2 GPUs: 18.22 ms/step (37.6% efficiency)
  4 GPUs: 16.49 ms/step (41.5% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Statistics.mean() on CuArray{ComplexF32} triggers a broken mapreduce
scalar fallback path on CUDA 13 (KernelException at thread 241,
block 97). Replace with sum(ϕ)/length(ϕ) which uses the GPU-native
sum reduction that handles ComplexF32 correctly.

This fixes the ERF anelastic benchmark crash on Perlmutter with
cudatoolkit/13.0 + nccl/2.29.2-cu13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
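The workaround is a one-line substitution (sketch; the crash itself is specific to the CUDA 13 + ComplexF32 combination described above):

```julia
using CUDA, Statistics

ϕ = CUDA.rand(ComplexF32, 32, 32, 8)
# mean(ϕ) hits the broken mapreduce scalar fallback on CUDA 13;
# the GPU-native sum reduction handles ComplexF32 correctly:
ϕ̄ = sum(ϕ) / length(ϕ)
```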
glwagner and others added 5 commits March 29, 2026 00:12
When async=true, NCCL Send/Recv runs on a dedicated comm_stream while
interior tendency computation proceeds on the default stream.
synchronize_communication! waits via CUDA event (no CPU-GPU sync).

Improvement (WENO5, 200x200x80/GPU, A100-80GB):
  2 GPUs: 23.46 -> 21.66 ms/step (8% improvement)
  4 GPUs: similar (~21-23 ms, some variability)

ERF-like (50x400x80/GPU):
  4 GPUs: 16.49 -> 14.71 ms/step (11% improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All NCCL operations (halo fills + solver transposes) now execute on
a dedicated comm_stream. For synchronous calls, CUDA events synchronize
the streams before unpack. This moves NCCL off the default stream,
freeing it for compute kernels.

Nsight verification (2 GPUs, 1024x1024x128):
  Before: 250.8 ms NCCL on default stream, 152.0 ms on comm_stream
  After:   20.4 ms NCCL on default stream, 194.9 ms on comm_stream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
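The stream/event handshake can be sketched with CUDA.jl (illustrative; the PR's actual plumbing lives in the extension, and the exact `wait` semantics should be checked against the CUDA.jl version in use):

```julia
using CUDA

comm_stream = CuStream()
done = CuEvent(CUDA.EVENT_DISABLE_TIMING)  # timing disabled: cheaper event

CUDA.stream!(comm_stream) do
    # ... enqueue NCCL Send/Recv here; they execute on comm_stream ...
end
CUDA.record(done, comm_stream)  # mark completion of the communication work
CUDA.wait(done)                 # default stream waits on the event (cuStreamWaitEvent);
                                # the CPU is not blocked, unlike sync_device!
```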
Replace GPU broadcast kernels with cuMemcpy2DAsync for slab-x halo
pack/unpack. Uses DMA engine instead of compute cores.

Nsight analysis (2 GPUs, 1024x1024x128 F32, 3 timesteps):
- broadcast_kernel_cartesian calls: 630 → 342 (DMA replaces x-direction)
- Total kernel time: 2927 → 2574 ms (12% reduction)
- NCCL total: 486 → 165 ms (DMA pack overlaps better with NCCL)
- Wall-clock: ~435 ms/step (unchanged — DMA already overlapped)

Root cause of remaining overhead: 33 gaps of 1-5ms on default stream
totaling 70 ms — GPU idle while waiting for NCCL via cuStreamWaitEvent.
These are the fundamental synchronization points where the algorithm
requires communication results before proceeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Start tracer halo communication immediately after tracer field advance,
before the velocity step and pressure solver. Start velocity halos right
after pressure correction, before cache_tendencies and update_state.

This overlaps tracer NCCL transfers with the entire pressure solver
pipeline (FFTs, tridiag, transposes), hiding ~15ms of latency.

Large grid results (1024x1024x128/GPU F32, 2 GPUs): 435 → 420 ms (85%)

The pipelined rk3_substep! overrides the base method for NCCLNonhydrostaticModel.
update_state! is also overridden to skip fill_halo when halos are already pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This directive prevented the NCCL extension from being cached,
forcing every job to recompile it from scratch (~30 min on multi-node).
Extensions can safely precompile even when they add methods to the
parent module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gregory LeClaire Wagner and others added 2 commits March 29, 2026 08:59
Without this directive, 8 ranks all try to precompile the NCCL
extension simultaneously. One grabs the pidfile lock, attempts
precompilation, fails (method overwriting not permitted), and
retries indefinitely. Other ranks block on the lock forever.
This caused 2-hour timeouts on multi-node Perlmutter jobs.

The method overwriting that prevents precompilation needs to be
refactored to use proper dispatch instead of type piracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The NCCL extension transpose functions previously overwrote base methods
with identical signatures (e.g. `transpose_y_to_x!(pf::TransposableField)`)
and used runtime `if arch.communicator isa NCCLCommunicator` branching.
This forced `__precompile__(false)`, causing every multi-node job to
recompile the extension from scratch (~2h on Perlmutter Lustre).

Fix: add architecture dispatch to base transpose functions:
  transpose_y_to_x!(pf) → transpose_y_to_x!(arch, pf)

The NCCL extension now adds methods for `NCCLDistributedArch` —
proper dispatch, no overwriting:
  DC.transpose_y_to_x!(::NCCLDistributedArch, pf::TransposableField)

With `__precompile__(false)` removed, the extension caches normally
and multi-node jobs start in seconds instead of hours.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
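The before/after shape of the refactor, sketched (method bodies and the `mpi_`/`nccl_` helpers are illustrative):

```julia
# Base package (Oceananigans.DistributedComputations): dispatch on architecture.
transpose_y_to_x!(pf::TransposableField) = transpose_y_to_x!(architecture(pf.xfield), pf)
transpose_y_to_x!(arch, pf::TransposableField) = mpi_transpose_y_to_x!(pf)  # default MPI path

# Extension (OceananigansNCCLExt): a *new* method, no overwriting, so the
# extension precompiles and caches normally.
function DC.transpose_y_to_x!(arch::NCCLDistributedArch, pf::TransposableField)
    return nccl_transpose_y_to_x!(pf)  # hypothetical NCCL implementation
end
```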
@@ -0,0 +1,148 @@
#####
##### Pipelined RK3: start halo communication early, overlap with pressure solver
Collaborator

This is not specific to NCCL, right? It applies to all the communication models. Also, I think the halo fills of tracers are already pipelined with the computation of tendencies.

Member Author

this is overlap with pressure solver, not tendencies

I agree though that it's not specific to NCCL.

glwagner and others added 3 commits March 31, 2026 19:31
cuFFT decomposes dim-2 (strided) FFTs into many small kernels
(one per z-level). Reshaping to (Ny, Nx, Nz) and planning along
contiguous dim 1 is 3.4x faster despite the permutedims overhead.

Changes:
  - discrete_transforms.jl: set transpose_dims=(2,1,3) for ALL
    GPU dim-2 transforms, not just Bounded
  - plan_distributed_transforms.jl: use reshape + dim-1 plan for
    all GPU y-FFTs
  - distributed_fft_tridiagonal_solver.jl: allocate y-buffer for
    all GPU topologies (needed for permutedims workspace)

Results (50x400x80/GPU, x-distributed, 2 GPUs):
  NonhydrostaticModel: 12.60 → 8.40 ms (1.50x speedup, 37% → 56% eff)
  AnelasticDynamics:   14.68 → 11.32 ms (1.30x speedup, 40% → 52% eff)
  CompressibleDynamics: unchanged (no FFT solver)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove cuMemcpy2D halo pack/unpack — it crashed on 2D barotropic
fields (z-size 1 doesn't match grid Nz). The broadcast kernel
handles all field types correctly and has equivalent performance.

HFSM with NCCL now works (SplitExplicitFreeSurface, SeawaterBuoyancy):
  200x200x50/GPU, WENOVectorInvariant, WENO(order=7), 30 substeps
  1 GPU: 4.4 ms/step
  2 GPU: 5.4 ms/step (81% efficiency)
  4 GPU: 6.1 ms/step (72% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract alltoall_transpose!() as an overridable dispatch point in the
distributed FFT transpose. The NCCL extension implements it using
grouped NCCL.Send/Recv pairs on the global NCCL communicator, mapping
sub-communicator ranks to global ranks.

This enables NonhydrostaticModel distributed on machines without
GPU-aware MPI (HFSM already worked).

Also removes cuMemcpy2D halo pack/unpack that crashed on 2D fields.

Strong scaling results (8x A100 SXM4, NCCL):
  HFSM 2800x2800x100 SplitExplicit: 92-95% at 8 GPUs
  NHM 1600x1600x100 FFT solver: 34% at 8 GPUs (transpose-limited)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
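The extension hook can be sketched as follows (`global_nccl_communicator`, `subrank_to_global`, and `chunk` are hypothetical helpers, and the NCCL.jl call spellings are approximate):

```julia
# Base code calls an overridable alltoall_transpose!; the NCCL method maps
# sub-communicator ranks to global NCCL ranks before the grouped Send/Recv.
function alltoall_transpose!(arch::NCCLDistributedArch, recv, send, subcomm)
    nccl = global_nccl_communicator(arch)  # hypothetical accessor
    n = MPI.Comm_size(subcomm)
    NCCL.groupStart()
    for r in 0:n-1
        g = subrank_to_global(subcomm, r)  # hypothetical rank translation
        NCCL.Send(chunk(send, r, n), nccl; dest = g)    # chunk: view of rank r's slice
        NCCL.Recv!(chunk(recv, r, n), nccl; source = g)
    end
    NCCL.groupEnd()
    return nothing
end
```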
@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 2.28311% with 428 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.49%. Comparing base (f70a4e4) to head (56d454a).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
ext/OceananigansNCCLExt/nccl_distributed.jl 0.00% 158 Missing ⚠️
ext/OceananigansNCCLExt/nccl_solver.jl 0.00% 85 Missing ⚠️
ext/OceananigansNCCLExt/nccl_transpose.jl 0.00% 64 Missing ⚠️
ext/OceananigansNCCLExt/nccl_pipelined_rk3.jl 0.00% 49 Missing ⚠️
...Computations/distributed_fft_tridiagonal_solver.jl 0.00% 34 Missing ⚠️
ext/OceananigansNCCLExt/nccl_zero_copy_halos.jl 0.00% 20 Missing ⚠️
ext/OceananigansNCCLExt/nccl_communicator.jl 0.00% 14 Missing ⚠️
...ributedComputations/plan_distributed_transforms.jl 25.00% 3 Missing ⚠️
...c/DistributedComputations/distributed_transpose.jl 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5445      +/-   ##
==========================================
- Coverage   73.59%   72.49%   -1.10%     
==========================================
  Files         400      406       +6     
  Lines       22867    23325     +458     
==========================================
+ Hits        16829    16910      +81     
- Misses       6038     6415     +377     
Flag Coverage Δ
buildkite 67.71% <0.22%> (-1.30%) ⬇️
julia 67.71% <0.22%> (-1.30%) ⬇️
reactant_1 6.33% <0.00%> (-0.13%) ⬇️
reactant_2 10.25% <0.00%> (-0.20%) ⬇️
reactant_3 9.39% <0.00%> (-0.18%) ⬇️



Fix: dispatch all 4 transpose directions on architecture(pf.zfield)
instead of architecture(pf.fromfield). The fromfield's grid uses MPI
sub-communicators (not NCCLCommunicator), so only transpose_z_to_y!
was dispatching to the NCCL path. The other 3 transposes fell through
to CPU-staging fallback, causing 8.4x slowdown.

Tested native ncclAlltoAll (NCCL ≥ 2.20) vs grouped Send/Recv:
  NHM 8x1 1600x1600x100: 444.7 ms vs 441.5 ms (identical).
  Grouped Send/Recv is simpler and equally fast — reverted to it.

Also cleaned up dead code in nccl_distributed.jl (alltoall_transpose!
is now a simple CPU-staging fallback for non-NCCL CuArray buffers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

- distributed 🕸️ Our plan for total cluster domination
- extensions 🧬
- GPU 👾 Where Oceananigans gets its powers from
- performance 🏍️ So we can get the wrong answer even faster


4 participants