
Updating support of Megatron-LM #3842

Merged
SunMarc merged 31 commits into huggingface:main from pengdurice:peng-fix-using-megatron-oss-v1 on Dec 3, 2025

Conversation

@pengdurice
Contributor

@pengdurice pengdurice commented Nov 21, 2025

What does this PR do?

This PR ensures that Accelerate continues to support Megatron-LM as a backend for training large-scale LLMs.
In detail:

  1. FP8 quantization of the model is bypassed when Megatron is used, since the model will be re-initialized later on.
  2. Added a set of parameters needed to enable Megatron-LM features, e.g. gradient checkpointing and CPU optimizer offloading (see the sketch after this list).
  3. Updated calls to Megatron's functions, since their signatures have changed.
  4. Removed megatron_generate, since it is no longer supported.
  5. Updated the user guide doc.
  6. The change has been tested by fine-tuning the GLM4.6 model on 64 H200 GPUs with a 70K sequence length.
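
As a rough illustration of the configuration surface this PR touches, here is a minimal sketch of enabling the Megatron-LM backend through Accelerate. The option names for gradient checkpointing and CPU optimizer offloading below are assumptions, not necessarily the exact fields added by this PR; see the updated user guide for the real names.

    # Minimal sketch, assuming MegatronLMPlugin exposes these knobs.
    # The script must be launched via `accelerate launch` with Megatron-LM
    # enabled in the accelerate config.
    from accelerate import Accelerator
    from accelerate.utils import MegatronLMPlugin

    plugin = MegatronLMPlugin(
        tp_degree=8,            # tensor parallel size
        pp_degree=4,            # pipeline parallel size
        num_micro_batches=8,
        gradient_clipping=1.0,
        # Megatron arguments without a dedicated plugin field can be
        # forwarded through other_megatron_args (names below are assumed):
        other_megatron_args={
            "recompute_granularity": "full",  # gradient checkpointing
            "optimizer_cpu_offload": True,    # CPU optimizer offloading (assumed)
        },
    )

    accelerator = Accelerator(megatron_lm_plugin=plugin)
    # The model, optimizer, and dataloaders are then wrapped as usual:
    # model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)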

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pengdurice
Contributor Author

@SunMarc This is Peng. I am working on using Accelerate and Megatron-LM for fine-tuning GLM4.6 models. This PR includes some updates to Accelerate to better support new features and configurations from Megatron. Would you please take a look? Thank you!

@SunMarc
Member

SunMarc commented Nov 24, 2025

Of course, I will have a look this week!

@SunMarc SunMarc self-requested a review November 24, 2025 15:34
@pengdurice
Contributor Author

> Of course, I will have a look this week!

Thank you! Take your time and have a nice holiday ahead if you are in the US!

Member

@SunMarc SunMarc left a comment


Thanks a lot! We don't maintain the Megatron integration very actively, as it is quite complex for users, but we're happy to have this PR!

"attention_dropout": self.attention_dropout,
"hidden_dropout": self.hidden_dropout,
"attention_softmax_in_fp32": self.attention_softmax_in_fp32,
# "expert_tensor_parallel_size": self.expert_tensor_parallel_size,
Member


Don't we need to pass it?

Contributor Author


I didn't find this config particularly useful, so I left it commented out. Let me uncomment it. Thanks for the comment!
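
For reference, the resolved lines presumably read as follows, with the key passed through like its neighbors:

"attention_dropout": self.attention_dropout,
"hidden_dropout": self.hidden_dropout,
"attention_softmax_in_fp32": self.attention_softmax_in_fp32,
"expert_tensor_parallel_size": self.expert_tensor_parallel_size,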

Contributor Author


@SunMarc, fixed it with a new commit. Would you please take another look? Thanks!

@SunMarc
Member

SunMarc commented Dec 3, 2025

@bot /style

@github-actions
Contributor

github-actions bot commented Dec 3, 2025

Style bot fixed some files and pushed the changes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc SunMarc merged commit 7e38469 into huggingface:main on Dec 3, 2025