Update allgather fallback algo #476

Merged
Binyang2014 merged 8 commits into main from binyli/allgather-fallback
Mar 14, 2025

@Binyang2014 Binyang2014 commented Mar 4, 2025

Enhancements to the all-gather operation, as a temporary fix for the memory overhead seen when integrating MSCCL++ with PyTorch.
With this change, the input and output buffers are not registered with MSCCL++, so PyTorch can automatically reuse the temporary output buffer used for allgather.

  • Introduced a new allgather8 kernel function in apps/nccl/src/allgather.hpp for larger data sizes. It uses double buffering to hide synchronization overhead and supports both in-place and out-of-place operations.
  • Modified the allgather function to choose between allgather6 and allgather8 based on data size and platform, so that large data sizes go through the new kernel.
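
The size-and-platform dispatch described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the enum, the `DispatchPolicy` struct, the `chooseAllgatherKernel` function, and the 32 MiB cutoff are all hypothetical and are not taken from apps/nccl/src/allgather.hpp, and the PR description does not say which platforms take which path.

```cpp
#include <cstddef>

// Hypothetical names: these only label the two kernels mentioned in the PR.
enum class AllgatherKernel { Allgather6, Allgather8 };

// Hypothetical per-platform policy. The real cutoff and platform rule are
// not stated in the PR description, so both fields are assumptions.
struct DispatchPolicy {
  std::size_t largeMessageBytes;  // assumed size cutoff for the large-message path
  bool largePathEnabled;          // assumed platform switch for allgather8
};

// Pick the large-message kernel only when the platform allows it and the
// message is at least as large as the cutoff; otherwise keep the old kernel.
inline AllgatherKernel chooseAllgatherKernel(std::size_t totalBytes,
                                             const DispatchPolicy& policy) {
  if (policy.largePathEnabled && totalBytes >= policy.largeMessageBytes) {
    return AllgatherKernel::Allgather8;
  }
  return AllgatherKernel::Allgather6;
}
```

Keeping the policy in a small struct means a platform that should not use allgather8 can simply set `largePathEnabled` to false, without touching the size comparison.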

Configuration and environment improvements:

  • Added a new environment variable, MSCCLPP_DISABLE_CHANNEL_CACHE, that controls whether the channel cache is disabled. The variable is now part of the Env class and is logged during environment initialization.
  • Removed the redundant global variable mscclppDisableChannelCache from src/debug.cc; code that used it now reads the new environment variable instead.
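
As a rough illustration of replacing a global flag with an environment lookup, the sketch below reads MSCCLPP_DISABLE_CHANNEL_CACHE with the standard `std::getenv`. The helper names `parseBoolFlag` and `channelCacheDisabled`, the accepted spellings, and the default of false are assumptions; the PR does not show how the Env class actually parses or stores the value.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: interpret a raw environment string as a boolean.
// The accepted spellings are an assumption, not MSCCL++'s actual rules.
inline bool parseBoolFlag(const char* raw, bool defaultValue) {
  if (raw == nullptr) return defaultValue;  // variable not set
  const std::string value(raw);
  if (value == "1" || value == "true" || value == "TRUE") return true;
  if (value == "0" || value == "false" || value == "FALSE") return false;
  return defaultValue;  // unrecognized text falls back to the default
}

// Hypothetical accessor standing in for the removed global variable.
// Assumes the cache stays enabled unless the variable says otherwise.
inline bool channelCacheDisabled() {
  return parseBoolFlag(std::getenv("MSCCLPP_DISABLE_CHANNEL_CACHE"), false);
}
```

Separating the parsing from the `std::getenv` call keeps the parsing logic testable without modifying the process environment.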

@Binyang2014 Binyang2014 changed the title from "Binyli/allgather fallback" to "Update allgather fallback algo" Mar 5, 2025
@Binyang2014 Binyang2014 marked this pull request as ready for review March 5, 2025 00:25

@caiomcbr caiomcbr left a comment

LGTM

@Binyang2014

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@Binyang2014 Binyang2014 merged commit 0b840ba into main Mar 14, 2025
25 checks passed
@Binyang2014 Binyang2014 deleted the binyli/allgather-fallback branch March 14, 2025 18:18

3 participants