Add NCCL extension and implement NCCLFourierTridiagonalPoissonSolver #5445
Conversation
Avoid calling `sync_device!` (CUDA.synchronize) in `fill_corners!` when all corner connectivity is `nothing`. This occurs for slab decompositions and single-rank distributed configurations where no diagonal neighbors exist. Each `sync_device!` call flushes the GPU's async execution pipeline. With ~300 halo fills per 10 timesteps in a typical Breeze.jl anelastic benchmark, the unnecessary syncs cause a 2.95x overhead on a single distributed GPU. This fix reduces the overhead to 1.19x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
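The early-exit guard described above could look roughly like the following. This is a sketch, not the actual Oceananigans implementation: the connectivity field names and the `sync_device!`/`device`/`architecture` helpers are modeled on the names used in this PR discussion.

```julia
# Sketch: skip the device sync entirely when no diagonal neighbors exist.
# Field names on `connectivity` are hypothetical.
function fill_corners!(c, connectivity, args...)
    # All four corner connections absent ⇒ slab decomposition or single rank.
    if all(isnothing, (connectivity.southwest, connectivity.southeast,
                       connectivity.northwest, connectivity.northeast))
        return nothing  # no sync_device! call; the GPU pipeline stays full
    end

    # Only when a diagonal neighbor exists do we flush async kernels so that
    # MPI can safely touch the GPU send buffers.
    sync_device!(device(architecture(c)))
    # ... exchange corner halos with diagonal neighbors ...
    return nothing
end
```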
For doubly-periodic grids on GPU, plan the y-FFT along dimension 2 directly instead of reshaping to (Ny, Nx, Nz) and planning along dimension 1. This eliminates two permutedims! calls per y-FFT, which were the dominant cost in the distributed solver (69% of solve time).

The old approach was: reshape → plan on dim 1 → permutedims → FFT → permutedims
The new approach is: plan on dim 2 → FFT (cuFFT handles strided access natively)

For Bounded y-topology, the old reshape approach is preserved because twiddle factors and index permutations assume a dim-1 layout. Also skip y-buffer allocation for Periodic y (no index permutation needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
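The two plan layouts can be sketched with CUDA.jl's CUFFT wrapper; the array and plan names here are illustrative, not the solver's actual fields.

```julia
using CUDA, CUDA.CUFFT

A = CUDA.rand(ComplexF32, 64, 64, 8)   # (Nx, Ny, Nz)

# New approach: plan directly along dim 2; cuFFT handles the stride.
plan_y = plan_fft!(A, 2)
plan_y * A                              # in-place y-FFT, no permutedims!

# Old approach: permute so y is the contiguous first dimension.
B = permutedims(A, (2, 1, 3))           # (Ny, Nx, Nz)
plan_y1 = plan_fft!(B, 1)
plan_y1 * B
permutedims!(A, B, (2, 1, 3))           # copy back: two extra kernels per FFT
```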
For slab-x decomposition (Ry=1), the z-direction is always fully local in x-local space. This means the tridiagonal solve can be performed directly after the forward FFTs in x-local space, without transposing back to z-local space first.

The optimized algorithm:
    y-FFT → transpose_y_to_x(MPI) → x-FFT → tridiag_solve → x-IFFT → transpose_x_to_y(MPI) → y-IFFT
instead of the general algorithm:
    y-FFT → transpose_y_to_x(MPI) → x-FFT → transpose_x_to_y(MPI) → tridiag_solve → transpose_y_to_x(MPI) → x-IFFT → transpose_x_to_y(MPI) → y-IFFT

This halves the number of MPI all-to-all operations per pressure solve from 4 to 2, reducing both communication volume and pipeline stalls from GPU synchronization. The tridiagonal solver is set up on the x-local grid with correctly partitioned eigenvalues. The general pencil decomposition path is preserved for Ry > 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
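The slab-x fast path above, written out as a solve sequence. The step functions are hypothetical names modeled on the transposes named in the commit message; the real solver's internals differ.

```julia
# Sketch of the slab-x (Ry = 1) fast path: only two MPI all-to-alls.
function solve_slab_x!(ϕ, solver)
    y_fft!(solver)                 # forward FFT in y (y-local layout)
    transpose_y_to_x!(solver)      # MPI all-to-all #1
    x_fft!(solver)                 # forward FFT in x; z is fully local here
    tridiagonal_solve!(solver)     # solve directly in x-local space
    x_ifft!(solver)
    transpose_x_to_y!(solver)      # MPI all-to-all #2
    y_ifft!(solver)
    return ϕ
end
```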
Cray MPICH's Alltoallv is catastrophically slow for GPU buffers (28x slower than Alltoall on A100 GPUs over NVLink). Since most distributed ocean/atmosphere simulations use equal partition sizes, use Alltoall when all chunk counts are equal, falling back to Alltoallv for unequal partitions.

Benchmark results (transpose round-trip, 200×200×80 ComplexF32):
    2 GPUs: Alltoallv 52.0 ms → Alltoall 1.85 ms (28x faster)
    4 GPUs: Alltoallv 46.8 ms → Alltoall 2.20 ms (21x faster)
    8 GPUs: Alltoallv 70.5 ms → Alltoall 8.69 ms (8x faster)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
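The equal-counts fast path can be sketched with MPI.jl's `UBuffer`/`VBuffer` collectives; the function and buffer names here are illustrative.

```julia
using MPI

# Use the uniform Alltoall! when every rank exchanges equal-sized chunks;
# otherwise fall back to the vector variant.
function alltoall_maybe_uniform!(sendbuf, recvbuf, counts::Vector{Cint}, comm)
    if all(==(first(counts)), counts)
        n = Int(first(counts))
        MPI.Alltoall!(MPI.UBuffer(sendbuf, n), MPI.UBuffer(recvbuf, n), comm)
    else
        MPI.Alltoallv!(MPI.VBuffer(sendbuf, counts),
                       MPI.VBuffer(recvbuf, counts), comm)
    end
    return recvbuf
end
```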
Add OceananigansNCCLExt that replaces MPI Alltoall with NCCL
Send/Recv for distributed FFT transposes. NCCL operations are
GPU-stream-native, eliminating sync_device! pipeline stalls that
dominate the current MPI-based solver cost.
The extension provides NCCLDistributedFFTSolver which wraps the
existing DistributedFourierTridiagonalPoissonSolver and overrides
solve! to use NCCL grouped Send/Recv instead of MPI Alltoall.
Benchmark results (A100-SXM4-80GB, NV12 NVLink, 200x200x80/GPU):
2 GPUs: 2.82 ms/solve (NCCL)
4 GPUs: 2.95 ms/solve (NCCL)
Extension files:
- nccl_communicator.jl: NCCL comm creation from MPI subcommunicator
- nccl_transpose.jl: NCCL-based alltoall (no sync_device!)
- nccl_solver.jl: NCCLDistributedFFTSolver with solve! for all
stretched directions (Z, Y, X) plus slab-x optimized path
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NCCLDistributed replaces MPI Isend/Irecv with NCCL grouped Send/Recv for all halo communication (sides + corners), eliminating sync_device! stalls throughout the time-stepping loop.

Usage:

    arch = NCCLDistributed(GPU(); partition = Partition(Ngpus, 1))
    grid = RectilinearGrid(arch, ...)
    # All halo fills now use NCCL automatically

The NCCLCommunicator wrapper carries both NCCL (for GPU data transfer) and MPI (for reductions, initialization) communicators. MPI collective operations are forwarded to the inner MPI comm transparently.

Benchmark: 87 μs/halo fill on 2x A100-80GB with NV12 NVLink.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use yfield (not xfield/zfield) for arch check in transpose overrides, since twin_grid creates new architectures without NCCLCommunicator
- Add __precompile__(false) to allow method overwriting of base transposes
- Add NCCL subcommunicator cache for TransposableField MPI subcommunicators
- Add MPI.Allreduce, Comm_split_type, Isend, Irecv forwarding
- sync_device! is now a no-op for NCCLDistributed architecture

Full simulation tested: 10 time steps at 76.3 ms/step on 2x A100-80GB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    ##### cache them in a global dictionary keyed by the MPI comm.
    #####

    const _nccl_subcomm_cache = Dict{MPI.Comm, NCCL.Communicator}()
This needs a lock to guard concurrent writes (and reads).
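A lock-guarded version of the quoted cache could look like this. `ReentrantLock` and `get!` are from Base; `get_nccl_subcomm` and `create_nccl_communicator` are hypothetical names for illustration.

```julia
# Sketch: guard both lookup and insertion with one lock, as the review asks.
const _nccl_subcomm_lock  = ReentrantLock()
const _nccl_subcomm_cache = Dict{MPI.Comm, NCCL.Communicator}()

function get_nccl_subcomm(mpi_comm::MPI.Comm)
    lock(_nccl_subcomm_lock) do
        # get! creates and caches a new NCCL communicator only on a miss.
        get!(() -> create_nccl_communicator(mpi_comm),
             _nccl_subcomm_cache, mpi_comm)
    end
end
```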
Instead of one NCCL groupStart/groupEnd per field per side, batch all distributed Send/Recv for a field into a single NCCL group call. This reduces NCCL kernel launches from ~120/timestep to ~30/timestep.

Override fill_halo_regions! for DistributedField with NCCLCommunicator to use the batched path: pack all sides → one NCCL group with all Send/Recv → unpack all sides + fill local halos.

Performance (200×200×80/GPU, A100-80GB, NV12 NVLink):
    2 GPUs: 53.5 → 17.4 ms/timestep (3.1x improvement)
    4 GPUs: 21.9 → 19.5 ms/timestep (1.1x improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
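A minimal sketch of the batched exchange, assuming NCCL.jl's `groupStart`/`groupEnd` and `Send`/`Recv!`; the side, buffer, and neighbor names are illustrative.

```julia
using CUDA, NCCL

# All sides of one field go into a single NCCL group, so the sends and
# receives fuse into one launch instead of one group per side.
function batched_halo_exchange!(send_bufs, recv_bufs, neighbors, comm, stream)
    NCCL.groupStart()
    for (side, rank) in pairs(neighbors)
        rank === nothing && continue                     # no neighbor there
        NCCL.Send(send_bufs[side], comm; dest = rank, stream)
        NCCL.Recv!(recv_bufs[side], comm; source = rank, stream)
    end
    NCCL.groupEnd()
    return nothing
end
```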
Co-authored-by: Mosè Giordano <765740+giordano@users.noreply.github.com>
When fill_halo_regions! is called with a tuple/NamedTuple of fields (as in update_state!), batch ALL fields' distributed Send/Recv into a single NCCL groupStart/groupEnd call. This reduces NCCL kernel launches from ~N*sides per call to 1 per call.

Performance (200x200x80/GPU, A100-80GB, NV12 NVLink):
    2 GPUs: 17.4 -> 16.0 ms/timestep
    4 GPUs: 19.5 -> 16.0 ms/timestep (near-perfect weak scaling!)
    1 GPU (non-distributed reference): 8.9 ms/timestep

Total speedup from all NCCL optimizations vs initial:
    2 GPUs: 53.5 -> 16.0 ms (3.3x)
    4 GPUs: 21.9 -> 16.0 ms (1.4x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NCCLDistributedArch, NCCLDistributedGrid, NCCLDistributedField type aliases
- No underscore-prefixed functions
- All multi-line functions have explicit return
- Dispatch on NCCLDistributedArch/Field instead of runtime isa checks
- Remove broken async overlap (will implement properly in next PR)
- Revert NCCLCommunicator to simple nccl+mpi (no comm_stream yet)

Performance (200x200x80/GPU, A100-80GB, NV12 NVLink, WENO5):
    1 GPU:  13.37 ms/step (baseline)
    2 GPUs: 23.46 ms/step (57% efficiency)
    4 GPUs: 21.58 ms/step (62% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dispatch on NCCLDistributedGrid instead of overwriting the base method. This eliminates type piracy and removes the need for __precompile__(false). The NCCL extension now cleanly extends:

- distributed_fill_halo_event!(c, k, bcs, loc, grid::NCCLDistributedGrid, ...)
- fill_corners!(c, conn, idx, loc, arch::NCCLDistributedArch, ...)
- sync_device!(::NCCLDistributedArch) = nothing

ERF-like weak scaling (50x400x80/GPU, Centered+diffusion):
    1 GPU:  6.85 ms/step
    2 GPUs: 18.22 ms/step (37.6% efficiency)
    4 GPUs: 16.49 ms/step (41.5% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
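The dispatch-instead-of-piracy pattern, reduced to a self-contained toy: the base package dispatches on an architecture type, and the extension adds a method for its own type rather than replacing an existing one. Type names here are invented for illustration.

```julia
# Toy model of extension by dispatch (no Oceananigans types involved).
abstract type AbstractArchitecture end
struct MPIDistributed  <: AbstractArchitecture end
struct NCCLDistributed <: AbstractArchitecture end

# "Base package": synchronize the device before MPI touches GPU buffers.
sync_device!(::AbstractArchitecture) = :cuda_synchronize

# "Extension": NCCL is stream-ordered, so no device sync is needed.
# This is a *new* method on the extension's own type, so no method is
# overwritten and precompilation caching keeps working.
sync_device!(::NCCLDistributed) = nothing

sync_device!(MPIDistributed())   # → :cuda_synchronize
sync_device!(NCCLDistributed())  # → nothing
```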
Statistics.mean() on CuArray{ComplexF32} triggers a broken mapreduce
scalar fallback path on CUDA 13 (KernelException at thread 241,
block 97). Replace with sum(ϕ)/length(ϕ) which uses the GPU-native
sum reduction that handles ComplexF32 correctly.
This fixes the ERF anelastic benchmark crash on Perlmutter with
cudatoolkit/13.0 + nccl/2.29.2-cu13.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
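The workaround in code form, as a sketch: compute the mean as an explicit sum-over-length, which routes through the GPU-native `sum` reduction instead of `Statistics.mean`'s mapreduce path. The function name is illustrative.

```julia
using CUDA

# Mathematically identical to Statistics.mean(ϕ), but avoids the broken
# scalar mapreduce fallback seen with CuArray{ComplexF32} on CUDA 13.
function gpu_mean(ϕ::CuArray{<:Complex})
    return sum(ϕ) / length(ϕ)   # single GPU-native reduction kernel
end
```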
When async=true, NCCL Send/Recv runs on a dedicated comm_stream while interior tendency computation proceeds on the default stream. synchronize_communication! waits via CUDA event (no CPU-GPU sync).

Improvement (WENO5, 200x200x80/GPU, A100-80GB):
    2 GPUs: 23.46 -> 21.66 ms/step (8% improvement)
    4 GPUs: similar (~21-23 ms, some variability)
ERF-like (50x400x80/GPU):
    4 GPUs: 16.49 -> 14.71 ms/step (11% improvement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
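The event-based handoff between streams can be sketched with CUDA.jl's stream and event API; `comm_stream` and `done` are illustrative names, and the exact event calls in the extension may differ.

```julia
using CUDA

comm_stream = CuStream()           # dedicated stream for NCCL traffic
done = CuEvent()

# ... enqueue NCCL Send/Recv on comm_stream here ...
CUDA.record(done, comm_stream)     # mark completion of communication

# Make the default stream wait on the event before unpacking halos.
# This is a GPU-side dependency (cuStreamWaitEvent): the CPU never blocks.
CUDA.wait(done, stream())
```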
All NCCL operations (halo fills + solver transposes) now execute on a dedicated comm_stream. For synchronous calls, CUDA events synchronize the streams before unpack. This moves NCCL off the default stream, freeing it for compute kernels.

Nsight verification (2 GPUs, 1024x1024x128):
    Before: 250.8 ms NCCL on default stream, 152.0 ms on comm_stream
    After:  20.4 ms NCCL on default stream, 194.9 ms on comm_stream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace GPU broadcast kernels with cuMemcpy2DAsync for slab-x halo pack/unpack. Uses DMA engine instead of compute cores.

Nsight analysis (2 GPUs, 1024x1024x128 F32, 3 timesteps):
- broadcast_kernel_cartesian calls: 630 → 342 (DMA replaces x-direction)
- Total kernel time: 2927 → 2574 ms (12% reduction)
- NCCL total: 486 → 165 ms (DMA pack overlaps better with NCCL)
- Wall-clock: ~435 ms/step (unchanged — DMA already overlapped)

Root cause of remaining overhead: 33 gaps of 1-5 ms on the default stream totaling 70 ms — GPU idle while waiting for NCCL via cuStreamWaitEvent. These are the fundamental synchronization points where the algorithm requires communication results before proceeding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Start tracer halo communication immediately after tracer field advance, before the velocity step and pressure solver. Start velocity halos right after pressure correction, before cache_tendencies and update_state. This overlaps tracer NCCL transfers with the entire pressure solver pipeline (FFTs, tridiag, transposes), hiding ~15 ms of latency.

Large grid results (1024x1024x128/GPU F32, 2 GPUs): 435 → 420 ms (85%)

The pipelined rk3_substep! overrides the base method for NCCLNonhydrostaticModel. update_state! is also overridden to skip fill_halo when halos are already pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This directive prevented the NCCL extension from being cached, forcing every job to recompile it from scratch (~30 min on multi-node). Extensions can safely precompile even when they add methods to the parent module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this directive, 8 ranks all try to precompile the NCCL extension simultaneously. One grabs the pidfile lock, attempts precompilation, fails (method overwriting not permitted), and retries indefinitely. Other ranks block on the lock forever. This caused 2-hour timeouts on multi-node Perlmutter jobs.

The method overwriting that prevents precompilation needs to be refactored to use proper dispatch instead of type piracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The NCCL extension transpose functions previously overwrote base methods with identical signatures (e.g. `transpose_y_to_x!(pf::TransposableField)`) and used runtime `if arch.communicator isa NCCLCommunicator` branching. This forced `__precompile__(false)`, causing every multi-node job to recompile the extension from scratch (~2h on Perlmutter Lustre).

Fix: add architecture dispatch to base transpose functions:

    transpose_y_to_x!(pf) → transpose_y_to_x!(arch, pf)

The NCCL extension now adds methods for `NCCLDistributedArch` — proper dispatch, no overwriting:

    DC.transpose_y_to_x!(::NCCLDistributedArch, pf::TransposableField)

With `__precompile__(false)` removed, the extension caches normally and multi-node jobs start in seconds instead of hours.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    @@ -0,0 +1,148 @@
    #####
    ##### Pipelined RK3: start halo communication early, overlap with pressure solver
This is not specific to NCCL right? It applies to all the communication models. Also, I think the fill halos of tracers is already pipelined with the computation of tendencies
This is overlap with the pressure solver, not with tendencies.
I agree, though, that it's not specific to NCCL.
cuFFT decomposes dim-2 (strided) FFTs into many small kernels
(one per z-level). Reshaping to (Ny, Nx, Nz) and planning along
contiguous dim 1 is 3.4x faster despite the permutedims overhead.
Changes:
- discrete_transforms.jl: set transpose_dims=(2,1,3) for ALL
GPU dim-2 transforms, not just Bounded
- plan_distributed_transforms.jl: use reshape + dim-1 plan for
all GPU y-FFTs
- distributed_fft_tridiagonal_solver.jl: allocate y-buffer for
all GPU topologies (needed for permutedims workspace)
Results (50x400x80/GPU, x-distributed, 2 GPUs):
NonhydrostaticModel: 12.60 → 8.40 ms (1.50x speedup, 37% → 56% eff)
AnelasticDynamics: 14.68 → 11.32 ms (1.30x speedup, 40% → 52% eff)
CompressibleDynamics: unchanged (no FFT solver)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove cuMemcpy2D halo pack/unpack — it crashed on 2D barotropic fields (z-size 1 doesn't match grid Nz). The broadcast kernel handles all field types correctly and has equivalent performance.

HFSM with NCCL now works (SplitExplicitFreeSurface, SeawaterBuoyancy):
200x200x50/GPU, WENOVectorInvariant, WENO(order=7), 30 substeps
    1 GPU: 4.4 ms/step
    2 GPU: 5.4 ms/step (81% efficiency)
    4 GPU: 6.1 ms/step (72% efficiency)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract alltoall_transpose!() as an overridable dispatch point in the distributed FFT transpose. The NCCL extension implements it using grouped NCCL.Send/Recv pairs on the global NCCL communicator, mapping sub-communicator ranks to global ranks. This enables NonhydrostaticModel distributed on machines without GPU-aware MPI (HFSM already worked). Also removes cuMemcpy2D halo pack/unpack that crashed on 2D fields.

Strong scaling results (8x A100 SXM4, NCCL):
    HFSM 2800x2800x100 SplitExplicit: 92-95% at 8 GPUs
    NHM 1600x1600x100 FFT solver: 34% at 8 GPUs (transpose-limited)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
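An all-to-all built from grouped Send/Recv on the global NCCL communicator could be sketched as below. This assumes NCCL.jl's `Send`/`Recv!`/group API; the rank-mapping trick (allgather each rank's world rank over the sub-communicator) and all names are illustrative, not the extension's exact code.

```julia
using CUDA, MPI, NCCL

# All-to-all over a sub-communicator, expressed as grouped point-to-point
# operations on the *global* NCCL communicator.
function nccl_alltoall!(recvbuf, sendbuf, chunk, subcomm::MPI.Comm,
                        global_comm::NCCL.Communicator, stream)
    nranks = MPI.Comm_size(subcomm)

    # to_global[i] = world rank of sub-communicator rank i-1, which matches
    # the global NCCL communicator's rank numbering.
    to_global = MPI.Allgather(Int32(MPI.Comm_rank(MPI.COMM_WORLD)), subcomm)

    NCCL.groupStart()
    for r in 1:nranks
        sendview = view(sendbuf, (r-1)*chunk + 1 : r*chunk)
        recvview = view(recvbuf, (r-1)*chunk + 1 : r*chunk)
        NCCL.Send(sendview, global_comm; dest = to_global[r], stream)
        NCCL.Recv!(recvview, global_comm; source = to_global[r], stream)
    end
    NCCL.groupEnd()
    return recvbuf
end
```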
Codecov Report

    @@ Coverage Diff @@
    ## main #5445 +/- ##
    ==========================================
    - Coverage   73.59%   72.49%   -1.10%
    ==========================================
      Files         400      406       +6
      Lines       22867    23325     +458
    ==========================================
    + Hits        16829    16910      +81
    - Misses       6038     6415     +377
Fix: dispatch all 4 transpose directions on architecture(pf.zfield) instead of architecture(pf.fromfield). The fromfield's grid uses MPI sub-communicators (not NCCLCommunicator), so only transpose_z_to_y! was dispatching to the NCCL path. The other 3 transposes fell through to the CPU-staging fallback, causing an 8.4x slowdown.

Tested native ncclAlltoAll (NCCL ≥ 2.20) vs grouped Send/Recv: NHM 8x1 1600x1600x100: 444.7 ms vs 441.5 ms (identical). Grouped Send/Recv is simpler and equally fast — reverted to it.

Also cleaned up dead code in nccl_distributed.jl (alltoall_transpose! is now a simple CPU-staging fallback for non-NCCL CuArray buffers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Experimenting with NCCL to accelerate multi-GPU Poisson solver.