[AICOMRCCL-355] Enable threshold-based p2p-batching #3000
Motivation
Address the alltoall regression in the MB/GB message-size range caused by enabling p2p-batching for all message sizes.
Technical Details
P2p-batching is enabled only if sendBytes and recvBytes are both below a per-rank message-size threshold, which is 64KB by default.
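The eligibility rule in this section can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: `parseThreshold`, `p2pBatchThreshold`, `canBatchP2p`, and `kDefaultP2pBatchThreshold` are hypothetical names, and it assumes that RCCL_P2P_BATCH_THRESHOLD holds a plain integer byte count and that "below" means strictly less than the threshold.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical constant mirroring the 64KB default stated in this description.
constexpr std::size_t kDefaultP2pBatchThreshold = 64 * 1024;

// Parse a threshold string, falling back to the default when the value is
// missing or is not a plain decimal integer.
// Assumption: the value is expressed in bytes.
std::size_t parseThreshold(const char* value) {
  if (value == nullptr || *value == '\0') return kDefaultP2pBatchThreshold;
  char* end = nullptr;
  unsigned long long parsed = std::strtoull(value, &end, 10);
  if (end == value || *end != '\0') return kDefaultP2pBatchThreshold;
  return static_cast<std::size_t>(parsed);
}

// Read the threshold from the environment variable named in this PR.
std::size_t p2pBatchThreshold() {
  return parseThreshold(std::getenv("RCCL_P2P_BATCH_THRESHOLD"));
}

// A send/recv pair is eligible for batching only if both sizes match and
// are below the threshold (strict comparison is an assumption).
bool canBatchP2p(std::size_t sendBytes, std::size_t recvBytes,
                 std::size_t threshold) {
  return sendBytes == recvBytes && sendBytes < threshold;
}
```

With the default threshold, `canBatchP2p(4096, 4096, p2pBatchThreshold())` would allow batching, while a 1MB transfer or a pair with mismatched sizes would take the unbatched path.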
The alltoallv hang was addressed by adding a sendBytes == recvBytes requirement for p2p-batching and by adding batchP2Pentry in wipBatch, which prevents ops that are below and above the threshold from being batched together.
The user can specify a threshold through the RCCL_P2P_BATCH_THRESHOLD env. variable.
JIRA ID
AICOMRCCL-355
Test Plan
Correctness tests on 4/8-node gfx950 and 4/8/16/32/33-node gfx942.
Test Result
gfx950 tests show improved performance.
Submission Checklist