Skip to content

Conversation

@rubber-duck-debug
Copy link
Collaborator

Re-opening this PR. Will address the previous comments in future commits.

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

github-actions bot commented May 9, 2025

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@RMeli RMeli requested a review from msimberg May 15, 2025 12:02
@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor formatting issue, which I think is not intentional?

The other question about launcher scripts can be discussed after merging or even offline.


export MPICH_GPU_SUPPORT_ENABLED=1

numactl --cpunodebind=$NUMA_NODE --membind=$NUMA_NODE "$@"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity: have you found using numactl necessary/useful to actually improve performance or is it there "just to be sure"? Not asking for a change, just asking to understand if we need to align other launcher scripts.

This launcher script is basically the same as the "single rank per gpu" case in the slurm docs, except for the memory binding. The CPU binding is already set by slurm, so should be redundant to set it with numactl. The visible devices should be equivalent to what we set with --gpus-per-task=1 (assuming we stick with four tasks per node). If the additional --membind seems useful, we might want to recommend it in the slurm docs. Conversely, if it doesn't seem to help, simply referring to the slurm docs would be simpler. First touch mostly takes care of the memory binding, except --membind obviously has the added benefit of disallowing overallocating on a numa node, which may also be good default to recommend for other applications.

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think its probably possible to avoid using this wrapper with sbatch/srun commands. I'll have a look into this today or tomorrow depending on availability!

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@rubber-duck-debug
Copy link
Collaborator Author

Please wait on a merge - I want to first test to see if numactl can be removed.

@RMeli RMeli marked this pull request as draft May 19, 2025 07:49
@RMeli
Copy link
Member

RMeli commented May 19, 2025

Please wait on a merge

I converted it to a draft, so that it can't accidentally be merged.

@msimberg
Copy link
Collaborator

Ping @nickjbrowning? Reminder that if you need more time to check the numactl stuff, I'd say we merge as is. We can always update it later, and the numactl instructions don't hurt even if they may turn out to be unnecessary.

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@rubber-duck-debug
Copy link
Collaborator Author

@msimberg I've just checked the numactl stuff and we can remove it. Recently had a ticket where something related came up and the latest commit here has the current best-practice wrt. kokkos + GPUs.

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@rubber-duck-debug
Copy link
Collaborator Author

Ready to go from my perspective, maybe @RMeli and @msimberg check my changes for the eiger documentation (specifically the proc binding). I'm using whats recommended by LAMMPS here, but I don't have much insight as to what we should be recommending in terms of binding.

Comment on lines +84 to +86
#SBATCH --gpus-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=per_task:1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the --gpus-per-task=1 alone covers this. I've never seen the --gpus-per-node=4 + --gpu-bind=per_task:1 form before so can't say for sure if that does something different, but if the goal is to have four ranks and one GPU per task, then in my experience just having --gpus-per-task=1 is sufficient.

Suggested change
#SBATCH --gpus-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=per_task:1
#SBATCH --gpus-per-task=1

If you're unsure and would like to leave the other options there for now that's also ok by me. @RMeli?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also never seen this combination. If they are equivalent, I'd go for --gpus-per-task=1 for consistency in our documentation. If it does something different, maybe it is worth commenting/adding a note explaining the difference?

Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nickjbrowning for pushing this through. I added a couple of minor comments, but not blocking in my opinion.

@github-actions
Copy link

preview available: https://docs.tds.cscs.ch/96

@sekelle
Copy link
Member

sekelle commented Jun 2, 2025

Hey guys, can we finally merge this?

@RMeli
Copy link
Member

RMeli commented Jun 2, 2025

@sekelle, @nickjbrowning was on holidays last week. I was waiting for the last conversation to be resolved before merging.

@RMeli RMeli self-assigned this Jun 2, 2025
@RMeli RMeli marked this pull request as ready for review June 2, 2025 12:02
@RMeli
Copy link
Member

RMeli commented Jun 2, 2025

Since the documentation is now live, I'll merge this and then we can go back to refining the last few details.

@RMeli RMeli merged commit 0f8e2d1 into eth-cscs:main Jun 2, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants