Commit 80f3a72

Update PrincetonUTutorial.md
1 parent 1ed2ca0 commit 80f3a72

docs/PrincetonUTutorial.md

Lines changed: 16 additions & 2 deletions
@@ -56,6 +56,7 @@ you should see something like this:
$ which mpicc
/usr/local/openmpi/cuda-8.0/3.0.0/intel170/x86_64/bin/mpicc
```
+Especially note the presence of the CUDA directory in this path. This indicates that the loaded OpenMPI library is [CUDA-aware](https://www.open-mpi.org/faq/?category=runcuda).
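
A quick way to double-check CUDA-awareness, per the Open MPI FAQ linked above (a minimal sketch, assuming the `ompi_info` binary from the same Open MPI installation is on your `PATH`):

```bash
# Query the Open MPI build parameters; a CUDA-aware build reports "true".
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# Expected output for a CUDA-aware build:
# mca:mpi:base:param:mpi_built_with_cuda_support:value:true
```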

If you `conda activate` the Anaconda environment **after** loading the OpenMPI library, your application would be built with the MPI library from Anaconda, which has worse performance on this cluster and could lead to errors. See [On Computing Well: Installing and Running ‘mpi4py’ on the Cluster](https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/) for a related discussion.
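
To see which MPI library your Python code will actually use at run time, you can ask `mpi4py` directly (a sketch, assuming `mpi4py` is installed in the active conda environment):

```bash
# Print the MPI implementation mpi4py is linked against; on this cluster it
# should identify the system Open MPI (e.g. "Open MPI v3.0.0..."), not an
# MPICH-derived library pulled in from Anaconda.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```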

@@ -142,20 +143,33 @@ The workflow is to request an interactive session:
```bash
salloc -N [X] --ntasks-per-node=4 --ntasks-per-socket=2 --gres=gpu:4 -c 4 --mem-per-cpu=0 -t 0-6:00
```
+
+[//]: # (Note: the modules are not necessarily inherited from the shell that spawns the interactive Slurm session. Need to reload the anaconda module, activate the environment, and reload other compiler/library modules)
+
+Re-load the above modules and reactivate your conda environment. Confirm that the correct CUDA-aware OpenMPI library is in your interactive Slurm session's shell search path:
+```bash
+$ which mpirun
+/usr/local/openmpi/cuda-8.0/3.0.0/intel170/x86_64/bin/mpirun
+```
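
As a sketch of what "re-load the above modules" might look like inside the interactive session (the module and environment names below are placeholders; substitute whichever ones you loaded before requesting the allocation):

```bash
# Placeholder names -- use the anaconda/Open MPI modules and conda environment
# from earlier in this tutorial.
module load anaconda3                # hypothetical Anaconda module name
conda activate my_mpi_env            # hypothetical conda environment name
module load openmpi/cuda-8.0/3.0.0   # hypothetical CUDA-aware Open MPI module
which mpirun                         # should resolve to /usr/local/openmpi/cuda-8.0/...
```
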
Then, launch the application from the command line:

```bash
mpirun -N 4 python mpi_learn.py
```
-where `-N` is a synonym for `-npernode` in OpenMPI. Do **not** use `srun` to launch the job inside an interactive session.
+where `-N` is a synonym for `-npernode` in OpenMPI. Do **not** use `srun` to launch the job inside an interactive session. If
+you encounter an error such as "unrecognized argument N", it is likely that your modules are incorrect and point to an Intel MPI distribution instead of CUDA-aware OpenMPI. Intel MPI is based on MPICH, which does not offer the `-npernode` option. You can confirm this by checking:
+```bash
+$ which mpirun
+/opt/intel/compilers_and_libraries_2019.3.199/linux/mpi/intel64/bin/mpirun
+```
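
Another quick check (a sketch, assuming your launcher accepts the standard `--version` flag) is to ask `mpirun` itself which distribution it belongs to:

```bash
# A CUDA-aware Open MPI build reports something like "mpirun (Open MPI) 3.0.0",
# while an Intel MPI launcher identifies itself as the Intel(R) MPI Library.
mpirun --version
```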

[//]: # (This option appears to be redundant given the salloc options; "mpirun python mpi_learn.py" appears to work just the same.)

[//]: # (HOWEVER, "srun python mpi_learn.py", "srun --ntasks-per-node python mpi_learn.py", etc. NEVER works--- it just hangs without any output. Why?)

[//]: # (Consistent with https://www.open-mpi.org/faq/?category=slurm ?)

-[//]: # (certain output seems to be repeated by ntasks-per-node, e.g. echoing the conf.yaml. Expected?)
+[//]: # (certain output seems to be repeated by ntasks-per-node, e.g. echoing the conf.yaml. Expected? Or, replace the print calls with print_unique)
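
The repeated output noted in the comment above is expected: every MPI task runs the full script, so each unguarded `print` fires once per task. A minimal sketch of the idea behind a `print_unique`-style guard (illustrative only, using plain `mpi4py` rather than the project's helper):

```bash
# Each of the 4 tasks per node executes the script, so an unguarded print repeats;
# guarding on rank 0 emits the line exactly once per job.
mpirun -N 4 python -c "
from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()
print('unguarded: printed by rank', rank)   # appears once per task
if rank == 0:
    print('guarded: printed once')          # appears once per job
"
```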


### Understanding the data
