@isaki001
Contributor

Motivation

Address the alltoall regression in the MB/GB message-size range that was caused by enabling p2p-batching for all message sizes.

Technical Details

P2p-batching is now enabled only if sendBytes and recvBytes are both below a per-rank message-size threshold, which is 64KB by default.
The alltoallv hang was addressed by requiring sendBytes == recvBytes for p2p-batching and by adding a batchP2P entry to wipBatch, which prevents ops below the threshold from being batched together with ops above it.
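
For reviewers, the eligibility rule and the batch-separation rule described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: `canBatchP2p`, `WipBatchSketch`, `tryAppend`, and `kDefaultP2pBatchThreshold` are hypothetical names, and treating "below" as a strict comparison is an assumption.

```cpp
#include <cstddef>

// Hypothetical constant: the 64KB per-rank default mentioned in the description.
constexpr std::size_t kDefaultP2pBatchThreshold = 64 * 1024;

// Hypothetical helper: a send/recv pair qualifies for p2p-batching only when
// both sizes are equal and below the threshold (strict "<" is an assumption).
constexpr bool canBatchP2p(std::size_t sendBytes, std::size_t recvBytes,
                           std::size_t threshold = kDefaultP2pBatchThreshold) {
  return sendBytes == recvBytes && sendBytes < threshold;
}

// Hypothetical stand-in for the work-in-progress batch state; only the
// batchP2P flag is taken from the description.
struct WipBatchSketch {
  bool hasOps = false;    // whether any op has been added yet
  bool batchP2P = false;  // batching decision shared by every op in the batch
};

// Returns true if the op may join the batch. The first op fixes the batch's
// batchP2P value; later ops are accepted only if their decision matches, so
// below-threshold and above-threshold ops never share a batch.
inline bool tryAppend(WipBatchSketch& batch, bool opBatchP2P) {
  if (!batch.hasOps) {
    batch.hasOps = true;
    batch.batchP2P = opBatchP2P;
    return true;
  }
  return batch.batchP2P == opBatchP2P;
}
```

In this sketch a caller that gets `false` from `tryAppend` would finish the current batch and start a new one for the rejected op.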

The user can override the threshold through the RCCL_P2P_BATCH_THRESHOLD environment variable.
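
As a rough illustration of how such an override could be read, the sketch below parses the variable and falls back to the 64KB default. It assumes the value is a plain decimal byte count, which the description does not state, and `parseThreshold` and `p2pBatchThresholdFromEnv` are hypothetical helpers rather than RCCL functions; RCCL's own parameter handling may differ.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: parse a decimal byte count, returning `fallback` when
// the text is missing, empty, negative, non-numeric, or out of range.
inline std::size_t parseThreshold(const char* text, std::size_t fallback) {
  if (text == nullptr || *text == '\0') return fallback;
  // strtoull silently wraps negative input, so reject a minus sign up front.
  if (std::strchr(text, '-') != nullptr) return fallback;
  char* end = nullptr;
  errno = 0;
  const unsigned long long value = std::strtoull(text, &end, 10);
  if (end == text || *end != '\0' || errno == ERANGE) return fallback;
  return static_cast<std::size_t>(value);
}

// Hypothetical helper: read RCCL_P2P_BATCH_THRESHOLD, assuming it is given
// in bytes, and default to 64KB when it is unset or malformed.
inline std::size_t p2pBatchThresholdFromEnv() {
  return parseThreshold(std::getenv("RCCL_P2P_BATCH_THRESHOLD"), 64 * 1024);
}
```

Under the same assumption about units, exporting RCCL_P2P_BATCH_THRESHOLD=131072 before launching a job would raise the threshold to 128KB.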

JIRA ID

AICOMRCCL-355

Test Plan

Correctness tests on 4/8 node gfx950 and 4/8/16/32/33 node gfx942.

Test Result

gfx950 tests show improved performance.
