@smoors
Collaborator

@smoors smoors commented Jan 3, 2026

this runs a short test in a prerun cmd to get the process binding, which is checked with the check_process_binding.py script. the results are written into the job error file.

fixes #307

Important

the test currently doesn't fail on binding errors, as we don't yet have a bullet-proof solution for setting the binding in all cases (see also the discussion in #305). so, for now, both the errors and the warnings are printed as warnings on screen. sanity checks that make the test fail can be added in a follow-up PR.

example output:

PROCESS BINDING ERROR: wrong number of processes: expected 3, found 5
PROCESS BINDING ERROR: wrong number of nodes: expected 2, found 1
PROCESS BINDING ERROR: wrong number of cpus per process: expected 4, found Counter({2: 5})
PROCESS BINDING WARNING: processes spanning multiple packages: Counter({2: 1})
PROCESS BINDING WARNING: processes spanning multiple numanodes: Counter({2: 3})
PROCESS BINDING WARNING: processes with cores shared by processing units, indicating hyperthreading: Counter({2: 2}),
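For illustration, messages of this shape could be produced roughly as follows. This is only a sketch and not the code in check_process_binding.py: the function names, the tuple layout `(package, numanode, core, pu)` and the meaning of the `Counter` keys (number of objects touched by a process, mapped to the number of such processes) are assumptions made for this example, and only a subset of the checks is shown.

```python
from collections import Counter


def span_counter(bindings, index):
    """Count processes by how many distinct objects they touch at tuple
    position `index`, keeping only processes that touch more than one."""
    spans = (len({loc[index] for loc in locs}) for locs in bindings)
    return Counter(n for n in spans if n > 1)


def check_bindings(bindings, expected_procs, expected_cpus):
    """bindings: one list per process of (package, numanode, core, pu) tuples
    (hypothetical layout). Returns (errors, warnings) as lists of strings."""
    errors, warnings = [], []

    if len(bindings) != expected_procs:
        errors.append(
            f"wrong number of processes: expected {expected_procs}, "
            f"found {len(bindings)}"
        )

    # Core indices can repeat across packages, so identify a core by (package, core).
    cpus = Counter(len({(pkg, core) for pkg, _, core, _ in locs}) for locs in bindings)
    if set(cpus) != {expected_cpus}:
        errors.append(
            f"wrong number of cpus per process: expected {expected_cpus}, found {cpus}"
        )

    multi_pkg = span_counter(bindings, 0)
    if multi_pkg:
        warnings.append(f"processes spanning multiple packages: {multi_pkg}")

    multi_numa = span_counter(bindings, 1)
    if multi_numa:
        warnings.append(f"processes spanning multiple numanodes: {multi_numa}")

    return errors, warnings


if __name__ == "__main__":
    # Two processes with two cores each; the first one straddles two packages.
    demo = [[(0, 0, 0, 0), (1, 1, 8, 8)], [(0, 0, 1, 1), (0, 0, 2, 2)]]
    errors, warnings = check_bindings(demo, expected_procs=2, expected_cpus=2)
    for msg in errors:
        print("PROCESS BINDING ERROR:", msg)
    for msg in warnings:
        print("PROCESS BINDING WARNING:", msg)
```

Returning errors and warnings as separate lists keeps the decision of whether an error should fail the test outside the check itself.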

Note

i managed to get the correct launcher run command by updating the job resources in the assign_tasks_per_compute_unit function. this also allowed simplifying the openfoam test and making it more robust.

@satishskamath
Collaborator

satishskamath commented Jan 9, 2026

I think I found a problem.

[satishk@tcn3 ~]$ hwloc-calc -p -H package.numanode core:0-5
Package:0.NUMANode:0
[satishk@tcn3 ~]$ module unload hwloc/2.9.1-GCCcore-12.3.0
[satishk@tcn3 ~]$ module load hwloc/2.8.0-GCCcore-12.2.0

The following have been reloaded with a version change:
  1) GCCcore/12.3.0 => GCCcore/12.2.0     2) libpciaccess/0.17-GCCcore-12.3.0 => libpciaccess/0.17-GCCcore-12.2.0     3) libxml2/2.11.4-GCCcore-12.3.0 => libxml2/2.10.3-GCCcore-12.2.0     4) numactl/2.0.16-GCCcore-12.3.0 => numactl/2.0.16-GCCcore-12.2.0

[satishk@tcn3 ~]$ hwloc-calc -p -H package.numanode core:0-5
unsupported (non-normal) --hierarchical type numanode

Somewhere between hwloc versions 2.8.0 and 2.9.1 there was a change in how the numanode object type is reported, so this may not work for all OpenMPI versions. The rest of the options still get reported.

[satishk@tcn3 ~]$ hwloc-calc -p -H package.core.pu core:0-5
Package:0.Core:0.PU:0 Package:0.Core:1.PU:1 Package:0.Core:2.PU:2 Package:0.Core:3.PU:3 Package:0.Core:4.PU:4 Package:0.Core:5.PU:5
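The hierarchical output above is easy to split into structured records. A minimal sketch, assuming the `Type:index` tokens are separated by dots and locations by whitespace, as in the output shown here; `parse_hierarchical` is a hypothetical helper name, not something from the PR.

```python
def parse_hierarchical(output):
    """Parse `hwloc-calc -p -H ...` output into a list of dicts, one per
    location, e.g. 'Package:0.Core:1.PU:1' -> {'package': 0, 'core': 1, 'pu': 1}."""
    locations = []
    for token in output.split():
        fields = {}
        for part in token.split("."):
            # Each part looks like 'Core:1'; lowercase the type for stable keys.
            name, _, index = part.partition(":")
            fields[name.lower()] = int(index)
        locations.append(fields)
    return locations


if __name__ == "__main__":
    sample = "Package:0.Core:0.PU:0 Package:0.Core:1.PU:1"
    print(parse_hierarchical(sample))
```

Because the keys come from the output itself, the same parser handles both the `package.numanode` and the `package.core.pu` hierarchies.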

@smoors
Collaborator Author

smoors commented Jan 9, 2026

good catch.
hwloc 2.8.0 is already quite old (2022b toolchain), so i'll add a fallback that skips the NUMA node if not supported.
the NUMA check is nice to have but not super critical imho. the Package check is more important.
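One way such a fallback could look is sketched below. It is not the implementation in this PR: the helper names are hypothetical, and since it is not known here whether older hwloc versions signal the problem through the exit code, stdout or stderr, the sketch treats any of a non-zero exit code, empty output, or the word "unsupported" in either stream as "NUMA not available".

```python
import subprocess


def run_hwloc_calc(hierarchy, location):
    """Run hwloc-calc with physical indices (-p) and a hierarchical (-H) layout.
    Raises FileNotFoundError if hwloc-calc is not on PATH."""
    proc = subprocess.run(
        ["hwloc-calc", "-p", "-H", hierarchy, location],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip(), proc.stderr.strip()


def numa_locations(location, run=run_hwloc_calc):
    """Return the package.numanode output for `location`, or None when this
    hwloc version cannot report NUMA nodes, so the caller can skip that check."""
    returncode, stdout, stderr = run("package.numanode", location)
    if returncode != 0 or not stdout or "unsupported" in stdout + stderr:
        return None
    return stdout


if __name__ == "__main__":
    result = numa_locations("core:0-5")
    if result is None:
        print("NUMA node check skipped: not supported by this hwloc version")
    else:
        print(result)
```

The `run` parameter exists only so that the logic can be exercised without hwloc installed; the Package check would still run unconditionally.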

@smoors
Collaborator Author

smoors commented Jan 10, 2026

@satishskamath fallback added. i also added a check for the number of nodes.

Development

Successfully merging this pull request may close these issues.

Amend mixin class to run with mpirun ... --report-bindings and do a sanity check on the binding