
Conversation

@kmehant (Collaborator) commented Feb 23, 2025

The current implementation uses the global rank of the process to compute the device index, which does not work in a multi-node setting: devices are not indexed continuously across nodes (each node numbers its own devices from 0), so we need to use the local rank instead.
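
For illustration, a minimal sketch of the failure mode (variable names are illustrative, not taken from this PR): on a 2-node job with 8 GPUs per node, global ranks run 0-15, but each node only exposes CUDA device indices 0-7, so indexing devices by global rank fails on every node after the first.

    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")

    # Global rank is unique across the whole job (0..world_size-1); on a
    # 2-node x 8-GPU job, rank 9 would map to cuda:9, which does not exist
    # on an 8-GPU node.
    global_rank = dist.get_rank()

    # Local rank restarts at 0 on every node, so it is always a valid device
    # index on that node. get_node_local_rank() requires a recent PyTorch
    # release; LOCAL_RANK is set per process by launchers such as torchrun.
    local_rank = dist.get_node_local_rank()
    torch.cuda.set_device(torch.device("cuda", local_rank))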

@kmehant changed the title from "Allow for multi node training for accelerated moe" to "fix: Allow for multi node training for accelerated moe" on Feb 23, 2025
@kmehant marked this pull request as ready for review February 23, 2025 19:14
@kmehant requested a review from fabianlim as a code owner February 23, 2025 19:14
@fabianlim (Contributor) commented:

@kmehant I understand the fix, but can you update the description for record-keeping purposes?

@fabianlim requested a review from willmj February 24, 2025 02:55
@kmehant (Collaborator, Author) commented Feb 24, 2025

#129 (comment)

@fabianlim Apologies for missing that; I have added it.

@fabianlim (Contributor) left a comment:

LGTM, but one suggestion:

    if torch.distributed.is_initialized():
        world_size = torch.distributed.get_world_size()
    -   rank = torch.distributed.get_rank()
    +   rank = int(os.environ["LOCAL_RANK"])
@fabianlim (Contributor) commented on the diff:

Can we make it consistent and follow the new style?

Suggested change:

    -   rank = int(os.environ["LOCAL_RANK"])
    +   # we do not need to use the fallback as this is wrapped in an `is_initialized` block
    +   rank = torch.distributed.get_node_local_rank()
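
For context (not part of the PR): torch.distributed.get_node_local_rank() reads the LOCAL_RANK environment variable that launchers such as torchrun set per process, and takes an optional fallback_rank used when that variable is absent, which is why the comment above notes the fallback is unnecessary inside an is_initialized() block. A minimal usage sketch:

    import torch.distributed as dist

    # Under torchrun, LOCAL_RANK is set for each process, so no fallback is
    # needed; fallback_rank=0 would cover single-process runs started
    # without a launcher.
    local_rank = dist.get_node_local_rank(fallback_rank=0)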

@kmehant (Collaborator, Author) replied Feb 24, 2025

@fabianlim I have included this suggestion, thanks.

@kmehant force-pushed the mn-sharedmoe-final branch 3 times, most recently from 1bb2f8c to 548b710 on February 24, 2025 06:43
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant (Collaborator, Author) commented Feb 24, 2025

@fabianlim requesting your merge.

@fabianlim (Contributor) commented:

@kmehant let's have @willmj look at it first.

@fabianlim merged commit 791bdd9 into foundation-model-stack:main Feb 27, 2025
7 checks passed