For psiflow configurations where multiple apps run in parallel on the same worker node, independent MPI launches appear to bind their processes to the same cores, leading to heavy oversubscription and very poor performance.
Default behaviour
Launching two GPAW evaluations on the same node
[CONFIG]
cores_per_worker: 8
launch_command: 'apptainer exec -e --no-init oras://ghcr.io/molmod/gpaw:24.1 /opt/entry.sh mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw python DUMMY.py'
slurm:
nodes_per_block: 1
cores_per_node: 16
[OUTPUT JOB 1]
User: ???@node3521.doduo.os
pid: 771665, CPU affinity: {75}
pid: 771662, CPU affinity: {2}
pid: 771667, CPU affinity: {76}
pid: 771673, CPU affinity: {79}
pid: 771675, CPU affinity: {80}
pid: 771671, CPU affinity: {78}
pid: 771660, CPU affinity: {0}
pid: 771669, CPU affinity: {77}
======================== JOB MAP ========================
Data for node: node3521 Num slots: 16 Max slots: 0 Num procs: 8
Process OMPI jobid: [44168,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./././././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./././././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/././././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B/./././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[./.][././B/././././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[./.][./././B/./././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[./.][././././B/././././././././.]
Process OMPI jobid: [44168,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[./.][./././././B/./././././././.]
=============================================================
[OUTPUT JOB 2]
User: ???@node3521.doduo.os
pid: 771664, CPU affinity: {75}
pid: 771663, CPU affinity: {2}
pid: 771666, CPU affinity: {76}
pid: 771661, CPU affinity: {0}
pid: 771670, CPU affinity: {78}
pid: 771674, CPU affinity: {80}
pid: 771672, CPU affinity: {79}
pid: 771668, CPU affinity: {77}
======================== JOB MAP ========================
Data for node: node3521 Num slots: 16 Max slots: 0 Num procs: 8
Process OMPI jobid: [44169,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/.][./././././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B][./././././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[./.][B/././././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[./.][./B/./././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[./.][././B/././././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[./.][./././B/./././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[./.][././././B/././././././././.]
Process OMPI jobid: [44169,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[./.][./././././B/./././././././.]
=============================================================
Both calculations run on the same 8 cores; they do not crash, but they are extremely slow.
This is not unexpected: independent mpirun calls have no way of coordinating which cores they use. It can be avoided by setting cores_per_worker = cores_per_node in the psiflow config, but that defeats the purpose of parsl's blocks.
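The collision is easy to see by intersecting the affinity sets reported in the two job outputs above; a minimal sketch (the core IDs are copied verbatim from the [OUTPUT JOB 1] and [OUTPUT JOB 2] logs):

```python
# Affinity sets copied from the two job logs above (default behaviour).
job1 = {0, 2, 75, 76, 77, 78, 79, 80}
job2 = {0, 2, 75, 76, 77, 78, 79, 80}

# Every rank of job 2 lands on a core that job 1 is already using.
shared = job1 & job2
print(f"shared cores: {sorted(shared)} ({len(shared)} of {len(job1 | job2)})")
```

With the srun wrapper below, the same check on the two affinity sets yields an empty intersection.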
A hacky workaround I might have found is to wrap the MPI call inside a Slurm job step:
srun -n 1 -c 8 -v --cpu-bind=v,cores apptainer exec [...] mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw [...]
where we have to ask for 1 task with 8 cores, because otherwise mpirun complains about the available slots. Very elegant.
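One quick way to confirm that the job step really restricts the visible cores is to print the process affinity from inside it; a minimal check, assuming Linux (os.sched_getaffinity is Linux-only):

```python
import os

# Run inside `srun -n 1 -c 8 --cpu-bind=cores ...`: this should report 8
# cores. Run outside any step, it reports everything the allocation sees.
visible = os.sched_getaffinity(0)
print(f"{len(visible)} visible cores: {sorted(visible)}")
```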
Hacky behaviour
Launching two GPAW evaluations on the same node
[CONFIG]
cores_per_worker: 8
launch_command: 'srun -n 1 -c 8 -v --cpu-bind=v,cores apptainer exec -e --no-init oras://ghcr.io/molmod/gpaw:24.1 /opt/entry.sh mpirun -np 8 --map-by CORE --bind-to CORE --display-map gpaw python DUMMY.py'
slurm:
nodes_per_block: 1
cores_per_node: 16
[OUTPUT JOB 1]
User: ???@node4217.shinx.os
pid: 1935915, CPU affinity: {102}
pid: 1935897, CPU affinity: {94}
pid: 1935922, CPU affinity: {104}
pid: 1935857, CPU affinity: {93}
pid: 1935902, CPU affinity: {95}
pid: 1935907, CPU affinity: {99}
pid: 1935918, CPU affinity: {103}
pid: 1935911, CPU affinity: {100}
======================== JOB MAP ========================
Data for node: node4217 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [64657,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/./.][././././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/.][././././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B][././././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[././.][B/./././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[././.][./B/././.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[././.][././B/./.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[././.][./././B/.]
Process OMPI jobid: [64657,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[././.][././././B]
=============================================================
[OUTPUT JOB 2]
User: ???@node4217.shinx.os
pid: 1935906, CPU affinity: {114}
pid: 1935912, CPU affinity: {116}
pid: 1935908, CPU affinity: {115}
pid: 1935901, CPU affinity: {112}
pid: 1935903, CPU affinity: {113}
pid: 1935896, CPU affinity: {111}
pid: 1935914, CPU affinity: {117}
pid: 1935864, CPU affinity: {105}
======================== JOB MAP ========================
Data for node: node4217 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [64663,1] App: 0 Process rank: 0 Bound: socket 1[core 0[hwt 0]]:[][B/././././././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 1 Bound: socket 1[core 1[hwt 0]]:[][./B/./././././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 2 Bound: socket 1[core 2[hwt 0]]:[][././B/././././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 3 Bound: socket 1[core 3[hwt 0]]:[][./././B/./././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 4 Bound: socket 1[core 4[hwt 0]]:[][././././B/././.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 5 Bound: socket 1[core 5[hwt 0]]:[][./././././B/./.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 6 Bound: socket 1[core 6[hwt 0]]:[][././././././B/.]
Process OMPI jobid: [64663,1] App: 0 Process rank: 7 Bound: socket 1[core 7[hwt 0]]:[][./././././././B]
=============================================================
Here, the JOB MAP info seems to contradict the CPU affinity logs, but I suspect this is because srun restricts which cores MPI can see (note Num slots: 8 instead of the 16 that are physically available on the node). Also, both calculations appear to run fine.
Generally, I think this issue would present itself for most bash apps running on the same node (all reference calculations, but also i-PI simulations). Why does psiflow not use srun (or an equivalent) to separate app resources, e.g. through one of parsl's launchers?
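For reference, parsl does ship an SrunLauncher; a sketch of what that would look like in a plain parsl config (whether psiflow's YAML config exposes the launcher option is an assumption I haven't checked). One caveat: parsl launchers wrap the launch of a block's worker pool, not each individual app, so this alone may not give per-app core separation, which is perhaps why wrapping the app's own launch_command in srun (as above) is still needed:

```python
# Sketch only: plain-parsl equivalent of running workers inside a job step.
# Whether psiflow exposes `launcher` in its YAML config is an assumption.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=8,
            provider=SlurmProvider(
                nodes_per_block=1,
                cores_per_node=16,
                # SrunLauncher prefixes the worker launch with `srun`, so the
                # whole worker pool runs inside one Slurm job step; apps
                # launched by those workers still share that step's cores.
                launcher=SrunLauncher(),
            ),
        )
    ]
)
```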
On a side note, it is apparently possible to launch GPAW directly with srun (see), but that does not seem to work with the container sandwiched in between. That is an issue for the GPAW repo, however.