Skip to content

Fix distributed tripolar fold#5439

Open
briochemc wants to merge 25 commits intomainfrom
bp-claude/fix-distributed-tripolar-fold
Open

Fix distributed tripolar fold#5439
briochemc wants to merge 25 commits intomainfrom
bp-claude/fix-distributed-tripolar-fold

Conversation

@briochemc
Copy link
Copy Markdown
Collaborator

@briochemc briochemc commented Mar 27, 2026

This PR fixes the distributed tripolar fold with no extra MPI passes, for both UPivot and FPivot topologies. The current distributed UPivot on main is broken (halos don't match the serial case), so this PR addresses that and adds FPivot support.

Key changes

New distributed fold topologies

Doing this without extra MPI passes requires local y-topologies that encode more information than just FullyConnected. This is accomplished by introducing 4 new topology types for distributed tripolar grid fold boundaries:

  • LeftConnectedRightCenterFolded (1xN slab UPivot)
  • LeftConnectedRightFaceFolded (1xN slab FPivot)
  • LeftConnectedRightCenterConnected (MxN pencil UPivot)
  • LeftConnectedRightFaceConnected (MxN pencil FPivot)

These replace the previous FullyConnected y-topology on northernmost distributed ranks.

Fold-aware MPI communication buffers

Three buffer types encode fold geometry for the distributed zipper for pencil partitions (slab partitions use the serial fold BC directly):

  • TwoDZipperBuffer — interior-width x-reversal with FL/WFL flags controlling fold-line inclusion and write ownership based on pivot position.
  • ZipperCornerBuffer — NW/NE corner halos where the fold intersects x-boundaries between ranks.
  • TripolarXBuffer — X-direction halos adjacent to the fold with matching fold-line awareness.

Worksize-based kernel coverage for FPivot

Added worksize override for distributed FPivot grids returning (Nx, Ny+1, Nz), matching the serial override from #5408. Extended the async distributed buffer tendency kernel parameters to also use worksize(grid) instead of size(grid), closing a coverage gap at the fold line that caused serial/distributed divergence after 2 time steps.

TODO: extend testing

I have a bunch of extra tests that I have run locally, and which seem to pass. I intend to include some here, in particular to cover the FPivot topology. Future commits could add to CI:

  • FPivot versions of existing distributed tripolar tests (grid/field reconstruction, boundary conditions, simulations)
  • Index-tracing halo verification for all field locations × partitions × fold topologies
  • Per-rank local field + halo comparison against serial parent array. This is a bit like the index-tracing vizualizations for each rank (see example below).

Pivot point mismatch

There is one single point where serial and distributed halo-filling don't match: the pivot points. That's because it is a very special case that would need special handling, and I don't think we should "waste" lines of codes for it since it will always be covered by land. Take the ranks A and B "touching" the central pivot point pictured below (for a FPivot pencil partition; the star represents the pivot point, which is at FF location for this topology). When you fill the halos of A around the pivot, you use the north, corner, and x buffers. The buffers all have nice "rectangle like" shapes. But the pivot point must come from B, despite being on a row coming from A itself through the corner buffer. And this only happens at the pivot, so for any partition (Rx,Ry) where Rx>2 it would need an extra type/encoding to correctly fill that single point only for this rank next to the pivot. So I think it's not worth it, but happy to be convinced otherwise.

IMG_233B6A9D9E52-1

Visual verification

The fold is verified visually using index-tracking fields with half-circles representing 180° rotation around pivot points. For a 4x2 UPivot 20x20 grid with Hx=Hy=3:

image

The fold is correct if all halos are filled and the circles are centered around the stars (= the pivot points).


For a 4x2 FPivot:

image

And I also checked the halos for each rank, e.g., 4x2 UPivot rank 7:

image

For the record, it currently looks like that for the main branch, missing top halo row and fold line for example + some incorrectly pivoted cells (non-concentric circles):

image

So overall I think that the src/ part is ready for review.


Follows from #5408.
Supersedes #5381.

briochemc and others added 11 commits March 27, 2026 21:28
Introduce 4 new topology types for distributed tripolar grid fold boundaries:
- LeftConnectedRightCenterFolded (1xN UPivot)
- LeftConnectedRightFaceFolded (1xN FPivot, Face-extended)
- LeftConnectedRightCenterConnected{WestOfPivot/EastOfPivot} (MxN UPivot)
- LeftConnectedRightFaceConnected{WestOfPivot/EastOfPivot} (MxN FPivot, Face-extended)

These replace the previous FullyConnected y-topology on northernmost distributed
ranks, which lost fold information and caused FPivot CF/FF fields to crash (OOB)
because they need Ny+1 Face points.

Key changes:
- Grids.jl: new types, WestOfPivot/EastOfPivot, global_fold_topology()
- grid_utils.jl: BoundedTopology union includes Face-extended fold types
- distributed_grids.jl: insert_connected_topology 5-arg methods for fold topologies
- distributed_zipper.jl: complete rewrite with topology-based buffer dispatch,
  corrected y-ranges for Face-extended grids, fold-line WoP/EoP handling
- Tripolar struct simplified to 3 type params (fold info now in grid topology)
- Split-explicit, advection, and BC unions updated for new types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lo fill

- Make Tripolar{N,F,S,FT} carry the fold topology as a phantom type
  parameter (isbits for GPU). Add fold_topology() getter, remove
  global_fold_topology() which could not resolve the fold type from
  non-fold rank topologies (FullyConnected, RightConnected).

- Fix reconstruct_global_grid and with_halo for distributed TripolarGrid:
  pass fold_topology from conformal_mapping so FPivot grids reconstruct
  correctly (was defaulting to RightCenterFolded).

- Fix distributed zipper corner buffers: use has_fold_line to size
  corner buffers (Hy+1 when fold line present), dispatch on
  TwoDZipperBuffer only (corners not needed for 1D partitions),
  remove redundant arch::Distributed constraint, rename H→Hy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…atch

Remove the WestOfPivot/EastOfPivot type parameter from topology types
and replace with per-buffer fold-line control via two new type parameters
on TwoDZipperBuffer and ZipperCornerBuffer:

- FL (fold line in buffer): true when the buffer contains the fold-line
  row (Hy+1 rows). Set to has_fold_line() for ALL buffers to ensure MPI
  size matching between mirror partners.

- WFL (writes fold line): true when the recv should write the fold-line
  row. Computed per-buffer from the rank's x-position relative to the
  pivot (Nx/2), accounting for periodic wrap at domain boundaries.

Helper functions north_writes_fold_line, northwest_writes_fold_line,
northeast_writes_fold_line determine WFL based on rx and Rx.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…atch

- Remove WestOfPivot/EastOfPivot type parameters and pivot_side function
- Add FL (fold-line in buffer) and WFL (writes fold line) type parameters
  to TwoDZipperBuffer and ZipperCornerBuffer
- Per-buffer fold-line helpers: north/northwest/northeast_writes_fold_line
  accounting for both pivot points (x-periodicity)
- TripolarXBuffer with location-aware y-size using length(loc_y, topo, Ny)
  instead of has_fold_line (only FPivot Face-y gets Ny+1, not UPivot)
- Temporary @info diagnostics for send/recv tracing (to be removed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add FL/WFL type parameters to TripolarXBuffer, mirroring the north/corner
buffer pattern. The west/east x-buffers complement the adjacent NW/NE
corner: if the corner writes the fold line, the x-buffer skips it (and
vice versa), ensuring no overlapping writes with async MPI.

Separate west_tripolar_buffer and east_tripolar_buffer constructors since
they complement different corners (NW vs NE).

Remaining: Face-x pivot point (global x = Nx/2+1) at the fold line is
unfilled in distributed because it falls in the gap between the NE corner
(Hx-1 columns) and the east x-buffer (WFL=false). This only affects the
exact pivot point which is under land in real ocean configurations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t ZBC

- Remove local zipper_bc function (duplicated north_fold_boundary_condition)
- Use north_fold_boundary_condition(fold_topology(...))(sign) directly
- Remove local const ZBC (already defined in BoundaryConditions)
- Import ZBC from BoundaryConditions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… types

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
briochemc and others added 4 commits March 30, 2026 23:58
Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
- Refactor distributed_zipper.jl: replace ~65 combinatorial send/recv methods
  with ~20 using helper functions for y-ranges, x-ranges, and type-parameter
  accessors. Move FL/WFL to front of type parameter list for clean dispatch.
  Consolidate 4 corner buffer constructors into 2.

- Fix FPivot distributed simulation mismatch: extend async buffer tendency
  kernel parameters to use worksize(grid) instead of size(grid), ensuring the
  fold line at Ny+1 is covered by the buffer pass for RightFaceFolded grids.

- Add worksize override for distributed FPivot grids (DRFTRG) so kernels
  correctly operate on the extra Face-y row at Ny+1.

- Remove stale explicit imports (FZBC, UZBC, RightConnected,
  instantiated_location) and fix self-qualified fold_topology access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The overrides were modifying the serial fold BC to skip the fold-line
consistency substitution, which is no longer needed now that the
distributed fold handles this correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
briochemc and others added 6 commits April 1, 2026 19:18
OneDZipperBuffer was never constructed because slab partitions (Rx=1) use
the serial fold BC directly, not a DistributedZipper. The OneDFoldTopology
+ DistributedZipper combination was impossible. Re-add loc_id import needed
by distributed_zipper_north_tags.jl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@briochemc briochemc added distributed 🕸️ Our plan for total cluster domination boundary conditions 🏓 benchmark performance runs preconfigured benchamarks and spits out timing labels Apr 1, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 1, 2026

Benchmark Comparison

Benchmark Comparison: PR vs Main

Benchmark PR (pts/s) Main (pts/s) Change
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 35291886.688 35908244.039 -1.7%
EarthOcean_tripolar_180x90x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 27694397.961 24486548.568 +13.1%
EarthOcean_tripolar_720x360x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 36849186.421 36839517.392 +0%
EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr 49338262.613 50037677.104 -1.4%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_nothing_2tr 65666788.328 65448194.401 +0.3%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+Biharmonic_2tr 27359839.867 27390169.548 -0.1%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+GM+Biharmonic_2tr 10803072.717 10826485.918 -0.2%
EarthOcean_tripolar_360x180x50_F64_nothing_nothing_CATKE_2tr 67279367.909 67250327.498 +0%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant5_WENO5_CATKE_2tr 42582316.608 42588272.110 -0%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant9_WENO9_CATKE_2tr 25151592.996 25129587.668 +0.1%
EarthOcean_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 32208582.452 32163668.995 +0.1%
EarthOcean_immersed_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 33231776.255 33296784.703 -0.2%
EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 33874726.588 33863642.960 +0%
EarthOcean_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 37481573.780 37462272.400 +0.1%
EarthOcean_immersed_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 36515742.125 36498669.701 +0%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_3tr 32151827.796 32406966.460 -0.8%

NSYS Kernel Profiling

EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr

Kernel Median (ms) Main (ms) Change Instances Avg (ms) Min (ms) Max (ms)
gpu__rk_substep_turbulent_kinetic_energy_ 4.013 4.013 -0.0% 315 4.018 4.007 4.334
gpu_compute_hydrostatic_free_surface_Gu_ 3.955 3.954 +0.0% 318 3.941 2.683 4.138
gpu_compute_hydrostatic_free_surface_Gv_ 3.779 3.779 -0.0% 318 3.775 2.575 4.072
gpu_compute_hydrostatic_free_surface_Gc_ 2.559 2.524 +1.4% 315 2.565 2.506 2.936
gpu_compute_hydrostatic_free_surface_Gc_ 2.529 2.524 +0.2% 315 2.538 2.486 2.905
gpu_compute_hydrostatic_free_surface_Gc_ 2.519 2.524 -0.2% 315 2.528 2.472 2.873
gpu_compute_CATKE_closure_fields_ 1.616 1.619 -0.2% 318 1.618 1.607 1.738
gpu_compute_TKE_diffusivity_ 1.269 1.271 -0.2% 315 1.271 1.264 1.374
gpu__compute_w_from_continuity_ 0.339 0.341 -0.6% 633 0.339 0.334 0.345
gpu_solve_batched_tridiagonal_system_kernel_ 0.526 0.526 +0.1% 315 0.527 0.519 0.543

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

Benchmark Comparison

Benchmark Comparison: PR vs Main

Benchmark PR (pts/s) Main (pts/s) Change
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 35820258.812 35869840.139 -0.1%
EarthOcean_tripolar_180x90x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 24832847.680 26374010.075 -5.8%
EarthOcean_tripolar_720x360x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 36864084.262 36844249.055 +0.1%
EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr 51655390.024 49482270.863 +4.4%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_nothing_2tr 65655484.662 65411664.527 +0.4%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+Biharmonic_2tr 27362704.574 27384523.066 -0.1%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+GM+Biharmonic_2tr 10808958.323 10806080.088 +0%
EarthOcean_tripolar_360x180x50_F64_nothing_nothing_CATKE_2tr 67310180.383 66995242.379 +0.5%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant5_WENO5_CATKE_2tr 42603704.615 42618577.948 -0%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant9_WENO9_CATKE_2tr 25031423.775 25025735.701 +0%
EarthOcean_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 32187434.920 32196789.208 -0%
EarthOcean_immersed_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 33266998.028 33302413.775 -0.1%
EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 33755447.904 33856226.545 -0.3%
EarthOcean_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 37479187.637 37467534.294 +0%
EarthOcean_immersed_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 36511914.769 36436837.053 +0.2%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_3tr 32464217.419 32469296.700 -0%

NSYS Kernel Profiling

EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr

Kernel Median (ms) Main (ms) Change Instances Avg (ms) Min (ms) Max (ms)
gpu__rk_substep_turbulent_kinetic_energy_ 4.012 4.013 -0.0% 315 4.018 4.007 4.334
gpu_compute_hydrostatic_free_surface_Gu_ 3.948 3.966 -0.5% 318 3.935 2.689 4.120
gpu_compute_hydrostatic_free_surface_Gv_ 3.783 3.781 +0.1% 318 3.782 2.580 4.083
gpu_compute_hydrostatic_free_surface_Gc_ 2.556 2.527 +1.1% 315 2.564 2.513 2.928
gpu_compute_hydrostatic_free_surface_Gc_ 2.527 2.527 -0.0% 315 2.535 2.476 2.927
gpu_compute_hydrostatic_free_surface_Gc_ 2.520 2.527 -0.3% 315 2.526 2.474 2.864
gpu_compute_CATKE_closure_fields_ 1.618 1.618 +0.0% 318 1.620 1.611 1.740
gpu_compute_TKE_diffusivity_ 1.271 1.271 -0.0% 315 1.272 1.265 1.377
gpu__compute_w_from_continuity_ 0.340 0.340 +0.1% 633 0.340 0.335 0.349
gpu_solve_batched_tridiagonal_system_kernel_ 0.526 0.526 -0.0% 315 0.526 0.519 0.542

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmark performance runs preconfigured benchamarks and spits out timing boundary conditions 🏓 distributed 🕸️ Our plan for total cluster domination

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants