Conversation

msimberg (Collaborator) commented Jun 6, 2025

Draft. It's not clear whether all of these variables are needed.

Do we need to start separating the recommended environment variables by NCCL, libfabric, etc. version, or is it sufficient to recommend the best practices for the latest versions?

msimberg requested review from Madeeks and boeschf on June 6, 2025 12:39
github-actions bot commented Jun 6, 2025

preview available: https://docs.tds.cscs.ch/146

Comment on lines 21 to 24
export NCCL_CROSS_NIC=1
export NCCL_NET_FORCE_FLUSH=1
export NCCL_NET_GDR_LEVEL=PHB # (2)!
export NCCL_SOCKET_IFNAME=hsn
msimberg (Collaborator, Author) commented

NCCL_CROSS_NIC and NCCL_SOCKET_IFNAME are set in the CE hook and seem safe to recommend for most users (the latter seems to be just a sanity setting to avoid using the wrong network anyway).
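
A minimal sketch of how a job script could apply these two settings without overriding values that something else (for example the CE hook) has already set. The values are the ones from the diff above; the `:=` default-assignment pattern is an illustrative assumption and not something this PR proposes.

```shell
# Sketch only: assign each variable its recommended value unless it is
# already set to a non-empty value (e.g. by the CE hook).
# The ":=" pattern is an assumption for illustration, not part of the PR.
: "${NCCL_CROSS_NIC:=1}"
: "${NCCL_SOCKET_IFNAME:=hsn}"
export NCCL_CROSS_NIC NCCL_SOCKET_IFNAME

# Print the effective values so they show up in the job log.
env | grep -E '^NCCL_(CROSS_NIC|SOCKET_IFNAME)=' | sort
```

With this pattern a value set earlier in the environment takes precedence, so the printed values show which setting is actually in effect.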

boeschf (Contributor) commented Jun 6, 2025

we should also change https://github.com/msimberg/cscs-docs/blob/63149793755baedc9052b2e4aad920f01b266f33/docs/software/ml/pytorch.md?plain=1#L318

msimberg (Collaborator, Author) commented Jun 6, 2025

> we should also change https://github.com/msimberg/cscs-docs/blob/63149793755baedc9052b2e4aad920f01b266f33/docs/software/ml/pytorch.md?plain=1#L318

Good catch. I wonder if we can avoid copy-pasting and having to manually keep them synchronized. I like that the PyTorch submission script is standalone. Do you think it would be bad if we just linked to the NCCL page from there? I'd imagine many users would miss copying the NCCL variables in that case...

From a quick search this also seems to exist: https://squidfunk.github.io/mkdocs-material/reference/code-blocks/#embedding-external-files. That might allow defining these in one place and including them in many.

That said, that might be overkill at the moment so I might just copy them over for now.

Any comments on which variables we can safely recommend now and which we should hold off on? Or on whether we have to add a warning that some variables are only good/useful with NCCL 2.26 and libfabric 1.22?

msimberg (Collaborator, Author) commented

@RMeli I opened #152 for the snippets idea. I'll reserve this PR to actually add environment variables that haven't been recommended yet (at least not in the docs).

msimberg force-pushed the more-nccl-env-vars branch from 6314979 to ef3019e on June 11, 2025 15:55
msimberg (Collaborator, Author) commented

The added environment variables are now in ef3019e.

msimberg force-pushed the more-nccl-env-vars branch from 959ec3a to ef4ec10 on June 13, 2025 14:39
msimberg (Collaborator, Author) commented

I'm closing this for now. These additional variables are not understood well enough to warrant adding them to the docs without explanation.

msimberg closed this on Jul 11, 2025