
Updating support of Megatron-LM #3842

Merged
SunMarc merged 31 commits into huggingface:main from pengdurice:peng-fix-using-megatron-oss-v1 on Dec 3, 2025

Conversation

@pengdurice
Contributor

@pengdurice pengdurice commented Nov 21, 2025

What does this PR do?

This PR ensures that Accelerate continues to support Megatron-LM as a backend for training large-scale LLMs.
In detail:

  1. FP8 quantization of the model is bypassed when Megatron is used, since the model will be re-initialized later on.
  2. Added a set of parameters needed to enable Megatron-LM features, e.g. gradient checkpointing and CPU optimizer offloading (see the sketch after this list).
  3. Updated calls to Megatron's functions, since their signatures have changed.
  4. Removed megatron_generate, since it is no longer supported.
  5. Updated the user guide doc.
  6. The change has been tested by fine-tuning the GLM4.6 model on 64 H200 GPUs with a 70K sequence length.
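
As a rough illustration of the configuration surface this PR touches, here is a minimal sketch of enabling the Megatron-LM backend through Accelerate. The option names for gradient checkpointing and CPU optimizer offloading below are assumptions, not necessarily the exact fields added by this PR; see the updated user guide for the real names.

    # Minimal sketch, assuming MegatronLMPlugin exposes these knobs.
    # The script must be launched via `accelerate launch` with Megatron-LM
    # enabled in the accelerate config.
    from accelerate import Accelerator
    from accelerate.utils import MegatronLMPlugin

    plugin = MegatronLMPlugin(
        tp_degree=8,            # tensor parallel size
        pp_degree=4,            # pipeline parallel size
        num_micro_batches=8,
        gradient_clipping=1.0,
        # Megatron arguments without a dedicated plugin field can be
        # forwarded through other_megatron_args (names below are assumed):
        other_megatron_args={
            "recompute_granularity": "full",  # gradient checkpointing
            "optimizer_cpu_offload": True,    # CPU optimizer offloading (assumed)
        },
    )

    accelerator = Accelerator(megatron_lm_plugin=plugin)
    # The model, optimizer, and dataloaders are then wrapped as usual:
    # model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)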

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pengdurice
Contributor Author

@SunMarc This is Peng. I am working on using Accelerate and Megatron-LM for fine-tuning GLM4.6 models. This PR includes some updates to Accelerate to better support new features and configurations from Megatron. Would you please take a look? Thank you!

@SunMarc
Member

SunMarc commented Nov 24, 2025

Of course, I will have a look this week!

@SunMarc SunMarc self-requested a review November 24, 2025 15:34
@pengdurice
Contributor Author

> Of course, I will have a look this week!

Thank you! Take your time and have a nice holiday ahead if you are in the US!

Member

@SunMarc SunMarc left a comment


Thanks a lot! We don't maintain the Megatron integration very actively, as it is quite complex for users, but we're happy to have this PR!

"attention_dropout": self.attention_dropout,
"hidden_dropout": self.hidden_dropout,
"attention_softmax_in_fp32": self.attention_softmax_in_fp32,
# "expert_tensor_parallel_size": self.expert_tensor_parallel_size,
Member


Don't we need to pass it?

Contributor Author


I didn't find this config particularly useful, so I left it commented out. Let me uncomment it. Thanks for the comment!
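
For reference, the resolved lines presumably read as follows, with the key passed through like its neighbors:

"attention_dropout": self.attention_dropout,
"hidden_dropout": self.hidden_dropout,
"attention_softmax_in_fp32": self.attention_softmax_in_fp32,
"expert_tensor_parallel_size": self.expert_tensor_parallel_size,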

Contributor Author


@SunMarc, fixed it with a new commit. Would you please take another look? Thanks!

@SunMarc
Member

SunMarc commented Dec 3, 2025

@bot /style

@github-actions
Contributor

github-actions bot commented Dec 3, 2025

Style bot fixed some files and pushed the changes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit 7e38469 into huggingface:main on Dec 3, 2025