
Conversation

@kmehant (Collaborator) commented Feb 23, 2025

The current implementation uses the global rank of the process to compute the device index, which does not work in a multi-node setting: devices are not indexed continuously across nodes (each node numbers its own devices from 0), so we need to use the local rank instead.
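
For illustration, a minimal sketch of the failure mode (variable names are illustrative, not taken from this PR): on a 2-node job with 8 GPUs per node, global ranks run 0-15, but each node only exposes CUDA device indices 0-7, so indexing devices by global rank fails on every node after the first.

    import torch
    import torch.distributed as dist

    dist.init_process_group("nccl")

    # Global rank is unique across the whole job (0..world_size-1); on a
    # 2-node x 8-GPU job, rank 9 would map to cuda:9, which does not exist
    # on an 8-GPU node.
    global_rank = dist.get_rank()

    # Local rank restarts at 0 on every node, so it is always a valid device
    # index on that node. get_node_local_rank() requires a recent PyTorch
    # release; LOCAL_RANK is set per process by launchers such as torchrun.
    local_rank = dist.get_node_local_rank()
    torch.cuda.set_device(torch.device("cuda", local_rank))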

@kmehant changed the title from "Allow for multi node training for accelerated moe" to "fix: Allow for multi node training for accelerated moe" on Feb 23, 2025
@kmehant marked this pull request as ready for review February 23, 2025 19:14
@kmehant requested a review from fabianlim as a code owner February 23, 2025 19:14
@fabianlim (Contributor) commented:

@kmehant I understand the fix, but can you update the description for record-keeping purposes?

@fabianlim requested a review from willmj February 24, 2025 02:55
@kmehant (Collaborator, Author) commented Feb 24, 2025

#129 (comment)

@fabianlim Apologies for missing that; I have added it.

@fabianlim (Contributor) left a comment:

LGTM, but one suggestion:

    if torch.distributed.is_initialized():
        world_size = torch.distributed.get_world_size()
    -   rank = torch.distributed.get_rank()
    +   rank = int(os.environ["LOCAL_RANK"])
@fabianlim (Contributor) commented on the diff:

Can we make it consistent and follow the new style?

Suggested change:

    -   rank = int(os.environ["LOCAL_RANK"])
    +   # we do not need to use the fallback as this is wrapped in an `is_initialized` block
    +   rank = torch.distributed.get_node_local_rank()
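
For context (not part of the PR): torch.distributed.get_node_local_rank() reads the LOCAL_RANK environment variable that launchers such as torchrun set per process, and takes an optional fallback_rank used when that variable is absent, which is why the comment above notes the fallback is unnecessary inside an is_initialized() block. A minimal usage sketch:

    import torch.distributed as dist

    # Under torchrun, LOCAL_RANK is set for each process, so no fallback is
    # needed; fallback_rank=0 would cover single-process runs started
    # without a launcher.
    local_rank = dist.get_node_local_rank(fallback_rank=0)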

@kmehant (Collaborator, Author) replied Feb 24, 2025

@fabianlim I have included this suggestion, thanks.

@kmehant force-pushed the mn-sharedmoe-final branch 3 times, most recently from 1bb2f8c to 548b710 on February 24, 2025 06:43
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant (Collaborator, Author) commented Feb 24, 2025

@fabianlim requesting your merge.

@fabianlim (Contributor) commented:

@kmehant let's have @willmj look at it first.

@fabianlim merged commit 791bdd9 into foundation-model-stack:main Feb 27, 2025
7 checks passed