Conversation

@thuwzt (Contributor) commented Dec 20, 2024

Fix for the issue #1330 : [BUG] MoE load balancing loss is accumulated twice when using activation checkpointing

@taehwakkwon

Thank you for fixing this issue. I have a question regarding the aux loss.

Although the value appears to be doubled, I believe the actual training remains the same, since the aux loss is not accumulated into the optimizer. The reason I'm asking is that we previously had recompute turned on for a period, and now we can turn it off, so I was unsure whether we need to double moe-aux-loss-coeff. I don't think it's necessary, because the doubled value is only added to aux_losses_tracker:

tracker[name]["values"][layer_number - 1] += loss.detach()  # Aggregate the loss for the layer.
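For context, here is a minimal, self-contained sketch of the behavior discussed above (illustrative names and a toy aux loss, not the actual Megatron-LM code): the detached tracker update sits inside the layer forward, so when the forward is re-executed under activation checkpointing the logged value doubles, while the gradient path is untouched.

```python
import torch
from torch.utils.checkpoint import checkpoint

NUM_LAYERS = 2  # hypothetical

# Simplified stand-in for the aux_losses_tracker (structure only, names illustrative).
tracker = {"load_balancing_loss": {"values": torch.zeros(NUM_LAYERS)}}

def moe_layer(hidden, layer_number):
    # Stand-in for the router's load-balancing aux loss.
    aux_loss = hidden.float().pow(2).mean()

    # Logging path: detached, so it never contributes a gradient.
    tracker["load_balancing_loss"]["values"][layer_number - 1] += aux_loss.detach()

    # Training path (kept trivial here): the real code attaches the scaled aux
    # loss to the graph exactly once.
    return hidden * 1.0

# Layer 1: plain forward, the logging line runs once per step.
x1 = torch.ones(4, 8, requires_grad=True)
moe_layer(x1, layer_number=1).sum().backward()

# Layer 2: forward wrapped in activation checkpointing (what --moe-layer-recompute does).
# The function body is re-executed during backward, so the detached logging line
# runs twice and the *displayed* value is doubled; the gradient path is unchanged.
x2 = torch.ones(4, 8, requires_grad=True)
checkpoint(moe_layer, x2, 2, use_reentrant=True).sum().backward()

print(tracker["load_balancing_loss"]["values"])  # tensor([1., 2.])
```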

@thuwzt (Contributor, Author) commented Jan 8, 2025

@taehwakkwon Yes, you understand correctly. This is purely a display bug in the tracker and logger; the compute flow is unaffected whether or not --moe-layer-recompute is used.

@github-actions bot commented Mar 9, 2025

Marking as stale. No activity in 60 days.

@github-actions github-actions bot added the stale No activity in 60 days on issue or PR label Mar 9, 2025
@sbhavani sbhavani added bug Something isn't working module: debugging and removed stale No activity in 60 days on issue or PR labels Jul 25, 2025
@sbhavani (Contributor) commented Jan 6, 2026

Thanks @thuwzt for identifying and fixing this MoE aux loss double accumulation bug (#1330). Your fix has been incorporated via internal MR !2574 (commit e6d56d6), where you are credited as a co-author.
As part of our new year cleanup, we're closing this PR. Thanks again for your contribution!
