
AOTInductor model crashes with multiple MPI ranks on a single multi-GPU node #82

@matr1x-1

Description:

When running LAMMPS with pair_allegro using an AOTInductor-compiled model (.pt2) on a single node with multiple GPUs, I get a CUDA illegal memory access error. The same setup works perfectly when each MPI rank is on a separate node with its own GPU.

Environment:

  • LAMMPS: 12 Jun 2025
  • System: CSCS Alps (GH200 nodes, 4 GPUs per node)
  • Model: AOTInductor-compiled .pt2
  • MPI: cray-mpich

Working setup (4 nodes, 1 GPU each):

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

srun lmp -in test.in

→ 358 timesteps/s, 222 atoms, processors 2 2 1

Failing setup (1 node, 4 GPUs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# with CUDA_VISIBLE_DEVICES=0,1,2,3
srun ./wrapper.sh lmp -in test.in
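(The contents of wrapper.sh are not shown in this issue; a minimal sketch of the per-rank GPU pinning such wrappers typically do, assuming srun exports SLURM_LOCALID, would be:)

```shell
#!/bin/bash
# Hypothetical wrapper.sh (assumption: actual contents not shown above).
# Pin the launching rank to one GPU using SLURM's node-local rank index,
# then replace this shell with the wrapped command.
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}
exec "$@"
```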

Error:

NequIP/Allegro is using input precision d and output precision d
Error: CUDA driver error: an illegal memory access was encountered
Exception: run_func_( container_handle_, input_handles.data(), input_handles.size(),
  output_handles.data(), output_handles.size(),
  reinterpret_cast<AOTInductorStreamHandle>(stream_handle), proxy_executor_handle_)
  API call failed at /pytorch/torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 145

All ranks crash with this error. Each rank should be assigned to a separate GPU via the MPI_Comm_split_type shared memory communicator logic in pair_nequip_allegro.cpp (line 96).
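One way to check what each rank is actually given before the model loads is to run a diagnostic in place of lmp under the same srun invocation (a sketch, assuming SLURM sets these variables):

```shell
# Diagnostic sketch: print each rank's global rank, node-local rank, and
# GPU visibility as the wrapped process would see them.
echo "rank ${SLURM_PROCID:-?} local ${SLURM_LOCALID:-?} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
```

If every rank prints the full device list (0,1,2,3), all four AOTInductor instances are initializing against the same set of devices rather than one each.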

Additional observations:

  • The GPU assignment code passes MPI_COMM_WORLD to MPI_Comm_split_type, which also breaks LAMMPS -partition mode: ranks beyond the GPU count trigger the "my rank is bigger than the number of visible devices" error.
  • TorchScript models may not have this issue, but AOTInductor is preferred for performance.
  • The problem seems specific to multiple processes loading the same AOTInductor .pt2 on the same node.
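For the -partition observation above, mapping the node-local rank to a device index with a modulo (instead of erroring when the local rank exceeds the device count) is one possible fix. A sketch of that arithmetic, assuming the 4-GPU node from this issue (this is not what the current code does):

```shell
# Sketch: wrap the node-local rank onto the available devices instead of
# failing when local_rank >= num_gpus.
local_rank=${SLURM_LOCALID:-0}   # rank index within the node
num_gpus=4                       # GH200 node in this issue (assumption)
device=$(( local_rank % num_gpus ))
echo "local rank $local_rank -> device $device"
```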

Question:
Is there a known workaround for running multiple MPI ranks with AOTInductor models on a single multi-GPU node? Would using world (the LAMMPS per-partition communicator) instead of MPI_COMM_WORLD in the shared-memory split help, or is this a fundamental PyTorch AOTInductor limitation?
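(One workaround I have seen suggested, untested here and assuming the site's SLURM is built with GPU affinity support, is to let SLURM restrict each task to a single GPU before launch, so every AOTInductor instance initializes on what it sees as device 0:)

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# SLURM sets CUDA_VISIBLE_DEVICES per task; each rank sees exactly one GPU.
srun lmp -in test.in
```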
