Especially note the presence of the CUDA directory in this path. This indicates that the loaded OpenMPI library is [CUDA-aware](https://www.open-mpi.org/faq/?category=runcuda).
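As a further check, the linked OpenMPI FAQ suggests querying the build configuration directly (the output format may differ slightly between OpenMPI versions):

```bash
# Should report a value of "true" if this OpenMPI was built with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
```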
If you `conda activate` the Anaconda environment **after** loading the OpenMPI library, your application would be built with the MPI library from Anaconda, which has worse performance on this cluster and could lead to errors. See [On Computing Well: Installing and Running ‘mpi4py’ on the Cluster](https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/) for a related discussion.
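A quick way to see which MPI implementation an existing `mpi4py` build is linked against is to print its MPI library version string (a generic sanity check, not specific to this cluster):

```bash
# Expect an Open MPI version string here, not an Intel MPI / MPICH one
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```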
The workflow is to request an interactive session:
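A minimal sketch of such a request (the node count, tasks per node, GPU count, and time limit below are illustrative placeholders; adapt them to your allocation):

```bash
# Request an interactive allocation: 1 node, 4 tasks, 4 GPUs, 30 minutes
# (all values are placeholders for illustration)
salloc --nodes=1 --ntasks-per-node=4 --gres=gpu:4 --time=00:30:00
```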
[//]: # (Note, the modules might not/are not inherited from the shell that spawns the interactive Slurm session. Need to reload anaconda module, activate environment, and reload other compiler/library modules)
Re-load the above modules and reactivate your conda environment. Confirm that the correct CUDA-aware OpenMPI library is in your interactive Slurm session's shell search path:
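For example (the exact paths depend on the loaded module versions):

```bash
# Both should resolve to the OpenMPI module's installation directory,
# which contains the CUDA directory noted above, not to anything inside
# the conda environment
which mpirun
which mpicc
```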
Then, launch the application from the command line:
```bash
mpirun -N 4 python mpi_learn.py
```
where `-N` is a synonym for `-npernode` in OpenMPI. Do **not** use `srun` to launch the job inside an interactive session. If you encounter an error such as "unrecognized argument N", it is likely that your modules are incorrect and point to an Intel MPI distribution instead of CUDA-aware OpenMPI. Intel MPI is based on MPICH, which does not offer the `-npernode` option. You can confirm this by checking:
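For example (version strings are indicative only and vary by release):

```bash
# Open MPI identifies itself as e.g. "mpirun (Open MPI) x.y.z";
# an Intel MPI / MPICH-based mpirun reports a different banner
mpirun --version
```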
[//]: # (This option appears to be redundant given the salloc options; "mpirun python mpi_learn.py" appears to work just the same.)
[//]: # (HOWEVER, "srun python mpi_learn.py", "srun --ntasks-per-node python mpi_learn.py", etc. NEVER works--- it just hangs without any output. Why?)
[//]: # (Consistent with https://www.open-mpi.org/faq/?category=slurm ?)
[//]: # (certain output seems to be repeated by ntasks-per-node, e.g. echoing the conf.yaml. Expected? Or, replace the print calls with print_unique)