@isaki001
Contributor

Motivation

This feature makes it possible to investigate whether balancing channels across XCCs improves performance.

Technical Details

Adds complementary ncclP2pChannelForPart / ncclP2pChannelToPart functions that take a shiftSize parameter, which adjusts how work is assigned.
Modifies the two existing functions named above and extends ncclComm / ncclDevComm to carry the new value.
Adds a new RCCL environment variable, RCCL_P2P_SHIFT_SIZE, for tuning the shift, along with a function that sets the value and establishes the default shiftSize.
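
For readers unfamiliar with the part-to-channel mapping, here is a minimal, illustrative sketch of the idea. It is not the code in this PR. The baseline (bit-reversing the part index over a power-of-two channel count) is modeled on the upstream-style mapping as I understand it. How shiftSize enters the formula is purely an assumption: here it rotates the reversed bits. All `sketch*` names, the default of 0, and the parsing rules are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: read RCCL_P2P_SHIFT_SIZE and fall back to a default
// when the variable is unset, empty, non-numeric, or out of range.
inline int sketchP2pShiftSize(int defaultShift = 0) {
  const char* raw = std::getenv("RCCL_P2P_SHIFT_SIZE");
  if (raw == nullptr || *raw == '\0') return defaultShift;
  char* end = nullptr;
  long v = std::strtol(raw, &end, 10);
  if (end == raw || *end != '\0' || v < 0 || v > 31) return defaultShift;
  return static_cast<int>(v);
}

// Reverse the low nBits bits of x.
inline uint32_t sketchReverseBits(uint32_t x, int nBits) {
  uint32_t r = 0;
  for (int i = 0; i < nBits; ++i) r = (r << 1) | ((x >> i) & 1u);
  return r;
}

// Illustrative mapping only. With shiftSize == 0 this is the plain
// bit-reversal mapping; a non-zero shiftSize rotates the reversed bits
// left (ASSUMPTION: the PR may apply the shift differently).
inline int sketchChannelForPart(int nP2pChannels, int base, int part, int shiftSize) {
  // The mask arithmetic below requires a power-of-two channel count.
  assert(nP2pChannels > 0 && (nP2pChannels & (nP2pChannels - 1)) == 0);
  int nBits = 0;
  while ((1 << nBits) < nP2pChannels) ++nBits;
  const uint32_t mask = static_cast<uint32_t>(nP2pChannels - 1);
  uint32_t delta = sketchReverseBits(static_cast<uint32_t>(part) & mask, nBits);
  if (nBits > 0) {
    const int s = shiftSize % nBits;
    delta = ((delta << s) | (delta >> (nBits - s))) & mask;
  }
  return static_cast<int>((static_cast<uint32_t>(base) + delta) & mask);
}
```

In this sketch, with 8 channels and base 0, part 1 lands on channel 4 when the shift is 0 and on channel 1 when the shift is 1, so neighboring parts are spread across channels differently depending on the shift.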

This PR was originally created by Gilbert Lee: PR2067

JIRA ID

AICOMRCCL-350

Test Plan

Correctness tests on gfx942 with MX and AINIC.
Performance comparisons with RCCL_P2P_SHIFT_SIZE set to 1, 2, 3, 4, and 5, against a default run and the RCCL develop branch.

Test Result

On 4-node AINIC runs, RCCL_P2P_SHIFT_SIZE=1 improved peak bandwidth.

Submission Checklist
