Conversation

@kmehant
Collaborator

@kmehant kmehant commented Mar 6, 2025

Description of the change

This change adds automatic Hugging Face (HF) checkpointing for MoE expert-parallel (EP) training runs that use fms-acceleration.

Depends on foundation-model-stack/fms-acceleration#133

Related issue number

#480

@github-actions

github-actions bot commented Mar 6, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Mar 6, 2025
@kmehant kmehant force-pushed the moe-hf-chkpt-wf branch 2 times, most recently from 12f2345 to 9bd210f Compare March 6, 2025 10:37
@kmehant kmehant marked this pull request as ready for review March 6, 2025 10:37
@dushyantbehl
Collaborator

@kmehant, can you fix the lint error?

@kmehant kmehant force-pushed the moe-hf-chkpt-wf branch 4 times, most recently from f049c36 to e2ad0f0 Compare March 6, 2025 12:51
Collaborator

@dushyantbehl dushyantbehl left a comment

LGTM

Collaborator

@willmj willmj left a comment

Thanks Mehant! This looks good to me, but could we update the notes section on Fast MoE to include that this conversion now happens automatically?

@willmj
Collaborator

willmj commented Mar 6, 2025

Also, it would be good to add a small unit test here (maybe similar to the e2e tests we have in test_sft_trainer) to make sure this conversion happens correctly and inference can be run on the model, though it may be hard to find a small MoE model to test with.

@kmehant kmehant force-pushed the moe-hf-chkpt-wf branch 2 times, most recently from d658990 to 33bbc92 Compare March 7, 2025 08:07
@kmehant
Collaborator Author

kmehant commented Mar 7, 2025

@willmj, I've addressed your comments. Thanks!

@kmehant kmehant force-pushed the moe-hf-chkpt-wf branch 2 times, most recently from a55356e to 573f7a4 Compare March 7, 2025 08:25
Collaborator

@willmj willmj left a comment

LGTM, thanks!

kmehant added 4 commits March 8, 2025 07:45
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
@kmehant kmehant merged commit c3aa25c into foundation-model-stack:main Mar 10, 2025
9 checks passed
@willmj willmj mentioned this pull request Mar 18, 2025
2 tasks

3 participants