* Added support for handling the differences between the AWS OFI plugin
for use on slingshot systems.
* Update the version number.
* Addressed reviewer feedback.
* Added a note about forcing libfabric.
logger.warning(f"WARNING: invalid path provided in LBANN_USE_THIS_OFI_PLUGIN: {different_ofi_plugin}. Ensure one is loaded or performance will be degraded.")
logger.warning(f"WARNING: using RCCL communication protocol and no default AWS_OFI_RCCL plugin was detected. Checked {aws_ofi_plugin}. Ensure one is loaded or performance will be degraded.")
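The two warnings above suggest a lookup order: an explicit override from `LBANN_USE_THIS_OFI_PLUGIN` first, then a default plugin location. The following is a minimal sketch of that order, not the code from this commit. The helper name `resolve_ofi_plugin`, the use of `os.path.exists` as the validity check, and the choice to return `None` after an invalid override are all assumptions made for illustration.

```python
import logging
import os

logger = logging.getLogger(__name__)


def resolve_ofi_plugin(aws_ofi_plugin, env=None):
    """Hypothetical helper: pick the OFI plugin path, or None if none is usable.

    Assumption: a plugin location counts as valid if the path exists.
    """
    env = os.environ if env is None else env
    different_ofi_plugin = env.get("LBANN_USE_THIS_OFI_PLUGIN")

    if different_ofi_plugin:
        # An explicit override takes priority over the default location.
        if os.path.exists(different_ofi_plugin):
            return different_ofi_plugin
        logger.warning(
            "Invalid path provided in LBANN_USE_THIS_OFI_PLUGIN: %s",
            different_ofi_plugin,
        )
        # Assumption: an invalid override does not fall back to the default.
        return None

    if os.path.exists(aws_ofi_plugin):
        return aws_ofi_plugin

    logger.warning("No default OFI plugin was detected. Checked %s", aws_ofi_plugin)
    return None
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.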
# Unless overridden by an external environment variable, set NCCL_NET to ensure that the libfabric interface is used (e.g., libfabric, IB, Socket)
msg = "By default HPC-launcher will force slingshot systems to use the libfabric NCCL/RCCL plugin or fail. This behavior can be overridden by setting NCCL_NET=Socket in the calling environment."
if rocm_major >= 7 and rocm_minor >= 1:
# Add AWS_OFI_NCCL for ROCm 7.1 - Ensure that it picks up the correct library object
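Two behaviors in this change are easy to get subtly wrong: defaulting `NCCL_NET` only when the caller has not set it, and gating on a minimum ROCm version. The sketch below illustrates both under stated assumptions and is not the code from this commit. The helper names `default_nccl_net` and `rocm_at_least` are hypothetical, and the value `"libfabric"` is taken from the comment in the diff rather than verified against the plugin.

```python
def default_nccl_net(env, is_slingshot):
    """Hypothetical helper: force libfabric on Slingshot unless NCCL_NET is already set.

    Returns the effective NCCL_NET value, or None if it is left unset.
    """
    if is_slingshot and "NCCL_NET" not in env:
        # Assumption: "libfabric" is the network name the plugin registers.
        env["NCCL_NET"] = "libfabric"
    return env.get("NCCL_NET")


def rocm_at_least(rocm_major, rocm_minor, minimum=(7, 1)):
    """Hypothetical helper: compare (major, minor) lexicographically against a minimum."""
    return (rocm_major, rocm_minor) >= minimum
```

If the intent of the version gate is "ROCm 7.1 or newer", the tuple comparison may be the safer form: `rocm_major >= 7 and rocm_minor >= 1` evaluates to false for a hypothetical 8.0 release, whereas `rocm_at_least(8, 0)` is true.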