Description:
When running LAMMPS with pair_allegro using an AOTInductor-compiled model (.pt2) on a single node with multiple GPUs, I get a CUDA illegal memory access error. The same setup works perfectly when each MPI rank is on a separate node with its own GPU.
Environment:
- LAMMPS: 12 Jun 2025
- System: CSCS Alps (GH200 nodes, 4 GPUs per node)
- Model: AOTInductor-compiled `.pt2`
- MPI: cray-mpich
Working setup (4 nodes, 1 GPU each):
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
srun lmp -in test.in
→ 358 timesteps/s, 222 atoms, processors 2 2 1
Failing setup (1 node, 4 GPUs):
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
# with CUDA_VISIBLE_DEVICES=0,1,2,3
srun ./wrapper.sh lmp -in test.in
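The contents of `wrapper.sh` are not shown here; a minimal sketch of the usual per-rank GPU pinning wrapper on SLURM systems would look like the following (hypothetical; `SLURM_LOCALID` is the rank's index on its own node):

```shell
#!/bin/bash
# Hypothetical wrapper.sh: pin each MPI rank to one GPU before CUDA is
# initialized, so every process sees exactly one device.
pin_gpu() {
  # SLURM_LOCALID = this rank's index on its node (0-3 on a GH200 node)
  export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID:-0}
}
pin_gpu
exec "$@"
```

Note that with this kind of pinning each rank sees its GPU as device 0, so device-selection logic inside the plugin must not additionally offset by local rank.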
Error:
NequIP/Allegro is using input precision d and output precision d
Error: CUDA driver error: an illegal memory access was encountered
Exception: run_func_( container_handle_, input_handles.data(), input_handles.size(),
output_handles.data(), output_handles.size(),
reinterpret_cast<AOTInductorStreamHandle>(stream_handle), proxy_executor_handle_)
API call failed at /pytorch/torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 145
All ranks crash with this error. Each rank should be assigned to a separate GPU via the MPI_Comm_split_type shared memory communicator logic in pair_nequip_allegro.cpp (line 96).
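The shared-memory-split assignment amounts to: a rank's index within its node's communicator selects the CUDA device. A shell model of that arithmetic (illustrative only; the real logic is C++ using `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED`, and `assign_device` is a hypothetical helper, not code from the plugin):

```shell
# Model of the per-node GPU assignment: the local rank within a node's
# shared-memory communicator picks the device; ranks beyond the visible
# device count fail with the "my rank is bigger than the number of
# visible devices" error.
assign_device() {
  local rank=$1 ranks_per_node=$2 visible=$3
  local local_rank=$(( rank % ranks_per_node ))  # rank's index on its node
  if [ "$local_rank" -ge "$visible" ]; then
    echo "error"
  else
    echo "$local_rank"
  fi
}
# Working setup: 4 nodes x 1 rank -> every rank uses device 0
assign_device 0 1 4   # -> 0
# Failing setup: 1 node x 4 ranks -> ranks get devices 0,1,2,3
assign_device 3 4 4   # -> 3
```

In both setups each rank resolves to a distinct device on its node, which is why the crash is surprising: the device mapping itself appears correct, and only the single-node case fails.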
Additional observations:
- The GPU assignment code uses `MPI_COMM_WORLD` for `MPI_Comm_split_type`, which also causes issues with LAMMPS `-partition` mode (ranks beyond the GPU count trigger the "my rank is bigger than the number of visible devices" error).
- TorchScript models may not have this issue, but AOTInductor is preferred for performance.
- The problem seems specific to multiple processes loading the same AOTInductor `.pt2` on the same node.
Question:
Is there a known workaround for running multiple MPI ranks with AOTInductor models on a single multi-GPU node? Would using `world` (the LAMMPS partition communicator) instead of `MPI_COMM_WORLD` in the shared-memory split help, or is this a fundamental PyTorch AOTInductor limitation?