
AOTInductor model crashes with multiple MPI ranks on a single multi-GPU node #82

@matr1x-1

Description:

When running LAMMPS with pair_allegro using an AOTInductor-compiled model (.pt2) on a single node with multiple GPUs, I get a CUDA illegal memory access error. The same setup works perfectly when each MPI rank is on a separate node with its own GPU.

Environment:

  • LAMMPS: 12 Jun 2025
  • System: CSCS Alps (GH200 nodes, 4 GPUs per node)
  • Model: AOTInductor-compiled .pt2
  • MPI: cray-mpich

Working setup (4 nodes, 1 GPU each):

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

srun lmp -in test.in

→ 358 timesteps/s, 222 atoms, processors 2 2 1

Failing setup (1 node, 4 GPUs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# with CUDA_VISIBLE_DEVICES=0,1,2,3
srun ./wrapper.sh lmp -in test.in
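(The contents of wrapper.sh are not shown in this issue; a minimal sketch of the per-rank GPU pinning such wrappers typically do, assuming srun exports SLURM_LOCALID, would be:)

```shell
#!/bin/bash
# Hypothetical wrapper.sh (assumption: actual contents not shown above).
# Pin the launching rank to one GPU using SLURM's node-local rank index,
# then replace this shell with the wrapped command.
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}
exec "$@"
```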

Error:

NequIP/Allegro is using input precision d and output precision d
Error: CUDA driver error: an illegal memory access was encountered
Exception: run_func_( container_handle_, input_handles.data(), input_handles.size(),
  output_handles.data(), output_handles.size(),
  reinterpret_cast<AOTInductorStreamHandle>(stream_handle), proxy_executor_handle_)
  API call failed at /pytorch/torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 145

All ranks crash with this error. Each rank should be assigned to a separate GPU via the MPI_Comm_split_type shared memory communicator logic in pair_nequip_allegro.cpp (line 96).
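One way to check what each rank is actually given before the model loads is to run a diagnostic in place of lmp under the same srun invocation (a sketch, assuming SLURM sets these variables):

```shell
# Diagnostic sketch: print each rank's global rank, node-local rank, and
# GPU visibility as the wrapped process would see them.
echo "rank ${SLURM_PROCID:-?} local ${SLURM_LOCALID:-?} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
```

If every rank prints the full device list (0,1,2,3), all four AOTInductor instances are initializing against the same set of devices rather than one each.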

Additional observations:

  • The GPU assignment code passes MPI_COMM_WORLD to MPI_Comm_split_type, which also breaks LAMMPS -partition mode: ranks beyond the GPU count trigger the "my rank is bigger than the number of visible devices" error.
  • TorchScript models may not have this issue, but AOTInductor is preferred for performance.
  • The problem seems specific to multiple processes loading the same AOTInductor .pt2 on the same node.
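For the -partition observation above, mapping the node-local rank to a device index with a modulo (instead of erroring when the local rank exceeds the device count) is one possible fix. A sketch of that arithmetic, assuming the 4-GPU node from this issue (this is not what the current code does):

```shell
# Sketch: wrap the node-local rank onto the available devices instead of
# failing when local_rank >= num_gpus.
local_rank=${SLURM_LOCALID:-0}   # rank index within the node
num_gpus=4                       # GH200 node in this issue (assumption)
device=$(( local_rank % num_gpus ))
echo "local rank $local_rank -> device $device"
```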

Question:
Is there a known workaround for running multiple MPI ranks with AOTInductor models on a single multi-GPU node? Would using world (the LAMMPS per-partition communicator) instead of MPI_COMM_WORLD in the shared-memory split help, or is this a fundamental PyTorch AOTInductor limitation?
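(One workaround I have seen suggested, untested here and assuming the site's SLURM is built with GPU affinity support, is to let SLURM restrict each task to a single GPU before launch, so every AOTInductor instance initializes on what it sees as device 0:)

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

# SLURM sets CUDA_VISIBLE_DEVICES per task; each rank sees exactly one GPU.
srun lmp -in test.in
```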
