MPI boundary issue #38

Description

@streeve

From Mark on Slack:


I've been using ExaMPM (DamBreak) for performance and profiling on one Crusher node and continue to see what look like communication errors leading to spurious new velocities at processor boundaries, which then lead to numerical blow-up and crashes (almost always in g2p->scatter->packBuffer).

This only occurs for problems over about 50^3 cells and 4 or more MPI ranks (I test with srun -N1 -n8 -S16 --exclusive -t30:00 --cpus-per-task=1 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest ./DamBreak 0.01 2 3 0.00004 10.0 2500 hip), and it does not go away with wider halos, an evenly divisible ny, different Y boundary conditions (periodic, slip, no-slip), or different versions of Cabana (0.5.0, head).

It seems to be suppressed somewhat with fewer particles per cell and by setting the AMD_SANITIZE_KERNEL and AMD_SANITIZE_COPY environment variables, but it never goes away. For a while it seemed to always happen 2 or 3 time steps after a Silo write, but it still occurs without any Silo writes. It can occur anywhere between steps 5000 and 100000.
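For reference, a minimal Slurm sketch of the failing configuration described above. The srun flags and DamBreak arguments are copied verbatim from the report; the #SBATCH header lines and the values assigned to the AMD_SANITIZE_* variables are assumptions (the report names the variables but not their values).

    #!/bin/bash
    #SBATCH -N 1        # single Crusher node, as in the report
    #SBATCH -t 30:00

    # Environment variables named in the report as partially suppressing
    # (but not eliminating) the failure; the values here are assumed.
    export AMD_SANITIZE_KERNEL=1
    export AMD_SANITIZE_COPY=1

    srun -N1 -n8 -S16 --exclusive -t30:00 \
         --cpus-per-task=1 --threads-per-core=1 \
         --gpus-per-task=1 --gpu-bind=closest \
         ./DamBreak 0.01 2 3 0.00004 10.0 2500 hip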
